V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Action-free JEPA pre-trained on 1M+ hours of video; V-JEPA 2-AC post-training on <62h robot video enables zero-shot pick-and-place on Franka arms
Abstract
A major challenge for modern AI is learning to understand the world, and to act in it, largely through observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories) to develop models capable of understanding, predicting, and planning in the physical world.
Key Contributions
- Internet-scale pre-training — action-free V-JEPA 2 trained on over 1 million hours of video and image data
- V-JEPA 2-AC — action-conditioned post-training using under 62 hours of unlabeled robot videos from the DROID dataset
- Zero-shot robotic deployment — deployed on Franka arms in two different labs for pick-and-place with image goals, no task-specific fine-tuning
- Bridge between passive observation and active control — shows SSL video models can be adapted to control without billions of robot-hours
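The action-free pre-training objective can be sketched in miniature: a student encoder predicts the embeddings of masked video patches produced by a slow-moving EMA teacher (the collapse-prevention mechanism noted under Limitations). Everything below — the linear encoders, the pooled single-vector predictor, and the dimensions — is a toy stand-in, not the paper's ViT architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the real encoder is a video ViT over spatiotemporal patches.
N_PATCH, D_IN, D_EMB = 16, 32, 8

# Student/teacher encoders (linear stand-ins) and a predictor head.
W_student = rng.normal(scale=0.1, size=(D_IN, D_EMB))
W_teacher = W_student.copy()          # teacher starts as a copy of the student
P = np.zeros((D_EMB, D_EMB))          # predictor: context embedding -> target

x = rng.normal(size=(N_PATCH, D_IN))          # one "clip" of patch features
mask = np.arange(N_PATCH) < N_PATCH // 2      # fixed mask for determinism

losses = []
for _ in range(300):
    z_ctx = (x[~mask] @ W_student).mean(axis=0)   # pooled context embedding
    targets = (x @ W_teacher)[mask]               # teacher embeds masked patches
    pred = z_ctx @ P                              # predict masked embeddings
    err = pred - targets                          # broadcasts over masked rows
    M, D = targets.shape
    losses.append(float((err ** 2).mean()))
    # Hand-derived gradients of the masked L2 loss (teacher is stop-gradient).
    s = err.sum(axis=0) * (2.0 / (M * D))
    grad_P = np.outer(z_ctx, s)
    grad_W = np.outer(x[~mask].mean(axis=0), P @ s)
    P -= 0.5 * grad_P
    W_student -= 0.5 * grad_W
    # EMA update: the teacher slowly tracks the student, which is what
    # keeps the target embeddings from collapsing to a trivial solution.
    W_teacher = 0.99 * W_teacher + 0.01 * W_student
```

The loss lives entirely in embedding space: nothing is decoded back to pixels, which is the defining property of the joint-embedding predictive setup.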
Methodology
- Pre-train a joint-embedding predictive architecture on 1M+ hours of internet video (action-free)
- Post-train an action-conditioned variant (V-JEPA 2-AC) on DROID robot videos (<62 hours)
- Plan in latent space using image goals — no pixel-level reconstruction required
- Evaluate on perception benchmarks and deploy zero-shot on Franka manipulation
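The latent-space planning step can be sketched with the cross-entropy method (CEM): encode the current frame and the goal image, roll candidate action sequences through the action-conditioned predictor, and keep the actions whose predicted latent lands nearest the goal embedding. The linear dynamics `B` below is a hypothetical placeholder for V-JEPA 2-AC's learned transformer predictor, and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D_Z, D_A, HORIZON = 8, 2, 5

# Hypothetical stand-in for the action-conditioned predictor:
# latent_next = latent + action @ B. The real predictor is learned.
B = rng.normal(scale=0.5, size=(D_A, D_Z))

def rollout(z0, actions):
    """Roll the latent state forward through the action sequence."""
    z = z0.copy()
    for a in actions:
        z = z + a @ B
    return z

def cost(z0, actions, z_goal):
    """Planning objective: L2 distance between predicted and goal latents."""
    return float(np.sum((rollout(z0, actions) - z_goal) ** 2))

def cem_plan(z0, z_goal, iters=20, pop=64, elite=8):
    """Cross-entropy method over action sequences."""
    mu = np.zeros((HORIZON, D_A))
    sigma = np.ones((HORIZON, D_A))
    for _ in range(iters):
        cand = mu + sigma * rng.normal(size=(pop, HORIZON, D_A))
        costs = np.array([cost(z0, c, z_goal) for c in cand])
        elites = cand[np.argsort(costs)[:elite]]   # refit to the best samples
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu

z0 = rng.normal(size=D_Z)      # stands in for the encoder's current-frame latent
z_goal = rng.normal(size=D_Z)  # stands in for the encoded goal image
plan = cem_plan(z0, z_goal)
```

On a real robot the first action of the plan is executed and the procedure repeats (receding horizon); because both the prediction and the goal live in embedding space, no pixel-level reconstruction is needed at any point.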
Results
| Benchmark | V-JEPA 2 | Note |
|---|---|---|
| Something-Something v2 | 77.3% top-1 | Motion understanding SOTA-tier |
| Epic-Kitchens-100 | 39.7 recall-at-5 | Human action anticipation SOTA |
| PerceptionTest | 84.0 | Video QA at 8B parameters |
| TempCompass | 76.9 | Video QA temporal reasoning |
| Franka pick-and-place | Zero-shot success | Two independent labs, image goals |
Limitations
Not explicitly enumerated in the abstract. Implicit limitations of the JEPA framework: preventing encoder collapse relies on distillation/EMA techniques; the quality of latent-space planning depends on which representational content the predictor chooses to model; and longer-horizon planning beyond pick-and-place is not demonstrated in the headline results.
Full Content
48 pages, 19 figures. Direct successor to V-JEPA (2024). Operationalizes LeCun's position that self-supervised video learning + small action-conditioned adapters can produce deployable world models without billions of robot-hours.
Source: V-JEPA 2 by Mido Assran, Yann LeCun, et al., Meta FAIR