Joint Embedding Predictive Architecture (JEPA)
Active Frontier
JEPA is Yann LeCun's proposed architecture for world models. Rather than reconstructing future states pixel-by-pixel (as generative models do), JEPA runs both the current and future states through encoders, then trains a predictor to operate entirely in the resulting abstract representation space.
The core insight: the world contains unpredictable detail (the exact motion of every leaf on a tree, the precise trajectory of every dust particle). A model that tries to predict every pixel wastes capacity on noise. JEPA sidesteps this by letting the encoder learn which details to discard — the predictor only needs to match the embedding of the future state, not reconstruct the future image.
The central engineering challenge is representational collapse: if the encoder is free to produce any representation, the optimal solution under a pure predictive loss is to output a constant (zero-information) vector that trivially matches. Every JEPA variant is fundamentally a different answer to "how do we prevent collapse?"
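The latent-space objective and the collapse failure mode can both be seen in a few lines. The sketch below is purely illustrative (linear maps standing in for the encoder and predictor; all names and shapes are made up, not from any JEPA paper): the loss compares the predicted embedding to the embedding of the future state, and a constant (here, zero) encoder trivially drives it to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "encoder" and "predictor": linear maps, for illustration only.
def encode(W, x):
    return W @ x

def predict(P, z):
    return P @ z

def jepa_loss(W, P, x_now, x_future):
    """Predictive loss computed entirely in embedding space: the
    predictor must match the *embedding* of the future state, never
    the future observation itself."""
    z_future = encode(W, x_future)
    z_pred = predict(P, encode(W, x_now))
    return np.mean((z_pred - z_future) ** 2)

d_obs, d_emb = 16, 4
x_now = rng.normal(size=d_obs)
x_future = rng.normal(size=d_obs)

W = rng.normal(size=(d_emb, d_obs))
P = rng.normal(size=(d_emb, d_emb))
print(jepa_loss(W, P, x_now, x_future))  # some positive value

# Collapse: an encoder that maps everything to the zero vector
# matches every "future embedding" perfectly while carrying no
# information about the input.
W_collapsed = np.zeros((d_emb, d_obs))
print(jepa_loss(W_collapsed, P, x_now, x_future))  # 0.0
```

This is why a pure predictive loss is not enough on its own: nothing in the objective distinguishes the collapsed encoder from a useful one.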
Key Claims
- JEPA predicts abstract representations, not pixels — the predictor operates in latent space, freeing it from modeling unpredictable detail. Evidence: strong (V-JEPA 2)
- Pre-training on 1M+ hours of video produces deployable world models — V-JEPA 2 achieves 77.3% top-1 on Something-Something v2 and enables zero-shot robotic manipulation after <62h of post-training. Evidence: strong (V-JEPA 2)
- Stable end-to-end pixel training is now possible — LeWM achieves stable JEPA training with only two loss terms, removing the EMA/distillation teacher networks that previous variants relied on. Evidence: moderate (LeWM)
- Object-level masking can induce counterfactual structure — Causal-JEPA extends masked prediction to object-centric representations, generating latent interventions. Evidence: moderate (C-JEPA)
Collapse Prevention Techniques
Different JEPA variants handle collapse differently:
- Distillation / EMA teacher (I-JEPA, V-JEPA, DINO) — the target encoder is an exponential moving average of the online encoder, breaking the trivial solution
- Information maximization (SigReg / VICReg-family) — add loss terms that maximize variance and decorrelate output dimensions to prevent constant outputs
- Two-loss simplification (LeWM) — minimal loss recipe that trains stably without auxiliary tricks
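The first two mechanisms above can be sketched concretely. This is a hedged illustration, not code from any of the cited systems: the momentum value, hinge target, and loss coefficients are placeholders, and real implementations apply these to deep network parameters and large batches.

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.996):
    """Distillation / EMA teacher (I-JEPA / V-JEPA style): the target
    encoder's parameters track an exponential moving average of the
    online encoder's, so the target cannot instantly collapse to
    match whatever the student outputs."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

def variance_covariance_penalty(z, eps=1e-4, gamma=1.0):
    """VICReg-family regularizer: a hinge pushes each embedding
    dimension's std toward gamma, and an off-diagonal covariance
    term decorrelates dimensions, so constant outputs are heavily
    penalized."""
    z = z - z.mean(axis=0)
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, gamma - std))
    n, d = z.shape
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = np.sum(off_diag ** 2) / d
    return var_loss + cov_loss

rng = np.random.default_rng(0)
z_healthy = rng.normal(size=(256, 8))    # varied embeddings
z_collapsed = np.ones((256, 8))          # constant embeddings
print(variance_covariance_penalty(z_healthy))    # small
print(variance_covariance_penalty(z_collapsed))  # close to gamma
```

The two approaches are complementary in spirit: the EMA teacher breaks collapse implicitly through the training dynamics, while information-maximization terms penalize it explicitly in the loss, which is what makes them candidates for removal in a two-loss recipe like LeWM's.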
Benchmarks & Data
| Metric | Value | System | Source |
|---|---|---|---|
| Something-Something v2 | 77.3% top-1 | V-JEPA 2 | V-JEPA 2 |
| Epic-Kitchens-100 | 39.7 recall-at-5 | V-JEPA 2 | V-JEPA 2 |
| PerceptionTest | 84.0 | V-JEPA 2 (8B) | V-JEPA 2 |
| TempCompass | 76.9 | V-JEPA 2 | V-JEPA 2 |
Open Questions
- Does end-to-end pixel JEPA (LeWM-style) scale to V-JEPA 2 training budgets, or do the tricks come back at scale?
- How much of V-JEPA 2's robotic success is transferable beyond tabletop pick-and-place?
- Can JEPA representations be composed hierarchically (see Hierarchical Planning)?
- What's the right predictor architecture — transformer, MLP, diffusion in latent space?
Related Concepts
- World Models — JEPA is a specific architectural choice for implementing world models
- Self-Supervised Learning — JEPA is trained via SSL
- Generative World Models — the alternative approach JEPA positions itself against
Backlinks
- Yann LeCun — JEPA is central to LeCun's research program
- Meta FAIR — developer of the V-JEPA series
Changelog
- 2026-04-22 — Initial compilation from V-JEPA 2, LeWM, C-JEPA papers