V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Action-free JEPA pre-trained on 1M+ hours of video; V-JEPA 2-AC post-training on <62h robot video enables zero-shot pick-and-place on Franka arms
Abstract
A major challenge for modern AI is learning to understand the world, and to act in it, largely through observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories) to develop models capable of understanding, predicting, and planning in the physical world.
Key Contributions
- Internet-scale pre-training — action-free V-JEPA 2 trained on over 1 million hours of video and image data
- V-JEPA 2-AC — action-conditioned post-training using under 62 hours of unlabeled robot videos from the DROID dataset
- Zero-shot robotic deployment — deployed on Franka arms in two different labs for pick-and-place with image goals, no task-specific fine-tuning
- Bridge between passive observation and active control — shows SSL video models can be adapted to control without billions of robot-hours
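The action-free pre-training objective can be sketched in miniature: a student encoder predicts the embeddings of masked video patches produced by a slow-moving EMA teacher (the collapse-prevention mechanism noted under Limitations). Everything below — the linear encoders, the pooled single-vector predictor, and the dimensions — is a toy stand-in, not the paper's ViT architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the real encoder is a video ViT over spatiotemporal patches.
N_PATCH, D_IN, D_EMB = 16, 32, 8

# Student/teacher encoders (linear stand-ins) and a predictor head.
W_student = rng.normal(scale=0.1, size=(D_IN, D_EMB))
W_teacher = W_student.copy()          # teacher starts as a copy of the student
P = np.zeros((D_EMB, D_EMB))          # predictor: context embedding -> target

x = rng.normal(size=(N_PATCH, D_IN))          # one "clip" of patch features
mask = np.arange(N_PATCH) < N_PATCH // 2      # fixed mask for determinism

losses = []
for _ in range(300):
    z_ctx = (x[~mask] @ W_student).mean(axis=0)   # pooled context embedding
    targets = (x @ W_teacher)[mask]               # teacher embeds masked patches
    pred = z_ctx @ P                              # predict masked embeddings
    err = pred - targets                          # broadcasts over masked rows
    M, D = targets.shape
    losses.append(float((err ** 2).mean()))
    # Hand-derived gradients of the masked L2 loss (teacher is stop-gradient).
    s = err.sum(axis=0) * (2.0 / (M * D))
    grad_P = np.outer(z_ctx, s)
    grad_W = np.outer(x[~mask].mean(axis=0), P @ s)
    P -= 0.5 * grad_P
    W_student -= 0.5 * grad_W
    # EMA update: the teacher slowly tracks the student, which is what
    # keeps the target embeddings from collapsing to a trivial solution.
    W_teacher = 0.99 * W_teacher + 0.01 * W_student
```

The loss lives entirely in embedding space: nothing is decoded back to pixels, which is the defining property of the joint-embedding predictive setup.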
Methodology
- Pre-train a joint-embedding predictive architecture on 1M+ hours of internet video (action-free)
- Post-train an action-conditioned variant (V-JEPA 2-AC) on DROID robot videos (<62 hours)
- Plan in latent space using image goals — no pixel-level reconstruction required
- Evaluate on perception benchmarks and deploy zero-shot on Franka manipulation
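The latent-space planning step can be sketched with the cross-entropy method (CEM): encode the current frame and the goal image, roll candidate action sequences through the action-conditioned predictor, and keep the actions whose predicted latent lands nearest the goal embedding. The linear dynamics `B` below is a hypothetical placeholder for V-JEPA 2-AC's learned transformer predictor, and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D_Z, D_A, HORIZON = 8, 2, 5

# Hypothetical stand-in for the action-conditioned predictor:
# latent_next = latent + action @ B. The real predictor is learned.
B = rng.normal(scale=0.5, size=(D_A, D_Z))

def rollout(z0, actions):
    """Roll the latent state forward through the action sequence."""
    z = z0.copy()
    for a in actions:
        z = z + a @ B
    return z

def cost(z0, actions, z_goal):
    """Planning objective: L2 distance between predicted and goal latents."""
    return float(np.sum((rollout(z0, actions) - z_goal) ** 2))

def cem_plan(z0, z_goal, iters=20, pop=64, elite=8):
    """Cross-entropy method over action sequences."""
    mu = np.zeros((HORIZON, D_A))
    sigma = np.ones((HORIZON, D_A))
    for _ in range(iters):
        cand = mu + sigma * rng.normal(size=(pop, HORIZON, D_A))
        costs = np.array([cost(z0, c, z_goal) for c in cand])
        elites = cand[np.argsort(costs)[:elite]]   # refit to the best samples
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu

z0 = rng.normal(size=D_Z)      # stands in for the encoder's current-frame latent
z_goal = rng.normal(size=D_Z)  # stands in for the encoded goal image
plan = cem_plan(z0, z_goal)
```

On a real robot the first action of the plan is executed and the procedure repeats (receding horizon); because both the prediction and the goal live in embedding space, no pixel-level reconstruction is needed at any point.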
Results
| Benchmark | V-JEPA 2 | Note |
|---|---|---|
| Something-Something v2 | 77.3% top-1 | Motion understanding SOTA-tier |
| Epic-Kitchens-100 | 39.7 recall-at-5 | Human action anticipation SOTA |
| PerceptionTest | 84.0 | Video QA at 8B parameters |
| TempCompass | 76.9 | Video QA temporal reasoning |
| Franka pick-and-place | Zero-shot success | Two independent labs, image goals |
Limitations
Not explicitly enumerated in the abstract. Implicit limitations of the JEPA framework: preventing encoder collapse relies on distillation/EMA techniques; the quality of latent-space planning depends on which representational content the predictor chooses to model; and longer-horizon planning beyond pick-and-place is not demonstrated in the headline results.
Full Content
48 pages, 19 figures. Direct successor to V-JEPA (2024). Operationalizes LeCun's position that self-supervised video learning + small action-conditioned adapters can produce deployable world models without billions of robot-hours.
Source: V-JEPA 2 by Mido Assran, Yann LeCun, et al., Meta FAIR