World Models
A world model is a learned internal simulator that predicts how the environment will evolve in response to actions. The concept has become the central alternative to autoregressive LLMs as a route to general intelligence: rather than predicting the next token, a world model predicts the next state of the world, enabling the system to plan by "imagining" the consequences of candidate actions.
The field is split between two competing philosophies that both claim the "world model" label: generative models (predict every pixel/frame of the future, e.g., Sora, Genie 3) and joint-embedding predictive architectures (predict abstract representations of the future, skipping unpredictable detail, e.g., V-JEPA 2). This schism defines the frontier — both camps have strong 2026 deployments and neither has decisively won.
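The split between the two camps comes down to where the prediction loss is computed. A minimal numpy sketch of the two objectives, under toy assumptions (linear encoder/predictor, 16-pixel "frames"; `W_enc` and `W_pred` are hypothetical toy parameters, not any published architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def jepa_loss(x_context, x_target, W_enc, W_pred):
    """JEPA-style objective: encode both frames, predict the target's
    representation from the context's representation, compare in latent
    space. Pixel detail the encoder discards never enters the loss."""
    z_ctx = W_enc @ x_context
    z_tgt = W_enc @ x_target
    z_hat = W_pred @ z_ctx                    # predictor runs in latent space
    return float(np.mean((z_hat - z_tgt) ** 2))

def generative_loss(x_context, x_target, W_dec):
    """Generative objective: regress every pixel of the target frame,
    paying for noise and texture even when irrelevant to planning."""
    x_hat = W_dec @ x_context
    return float(np.mean((x_hat - x_target) ** 2))

# Two toy 16-pixel frames: a small predictable shift plus unpredictable noise.
x0 = rng.normal(size=16)
x1 = x0 + 0.1 + 0.5 * rng.normal(size=16)
W_enc = rng.normal(size=(4, 16)) / 4.0        # 4-dim abstract representation
latent_mse = jepa_loss(x0, x1, W_enc, np.eye(4))
pixel_mse = generative_loss(x0, x1, np.eye(16))
```

Untrained, the numbers mean nothing; the point is only that the generative loss is taken over all 16 pixels (noise included) while the JEPA loss is taken over a 4-dim representation where unpredictable detail can be dropped.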
World models are the core mechanism required for System 2 reasoning in embodied agents: perceive state → use the world model to simulate candidate action sequences → evaluate against an objective → execute the best sequence. Without a world model, an agent is limited to reactive (System 1) behavior.
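The perceive → simulate → evaluate → execute loop above can be sketched as random-shooting model-predictive control over a learned dynamics model. Everything here is a hypothetical stand-in (`world_model`, `cost`, the toy additive dynamics), not any deployed system's planner:

```python
import numpy as np

def plan(world_model, cost, state, horizon=5, n_candidates=64,
         action_dim=2, rng=None):
    """Random-shooting MPC: imagine candidate action sequences with the
    world model, score each imagined rollout against an objective, and
    return the first action of the best sequence."""
    rng = np.random.default_rng(rng)
    candidates = rng.uniform(-1, 1, size=(n_candidates, horizon, action_dim))
    best_action, best_cost = None, np.inf
    for actions in candidates:
        s, total = state, 0.0
        for a in actions:
            s = world_model(s, a)       # imagined next state, not a real step
            total += cost(s)
        if total < best_cost:
            best_cost, best_action = total, actions[0]
    return best_action                  # execute one action, then replan

# Toy usage: dynamics s' = s + a, objective = distance to the origin.
action = plan(lambda s, a: s + a,
              lambda s: float(np.sum(s ** 2)),
              state=np.array([1.0, -1.0]), rng=0)
```

A purely reactive (System 1) policy would map `state` to `action` directly; the deliberation here lives entirely in the imagined rollouts through `world_model`.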
Key Claims
- Two-function taxonomy splits the literature — world models either build internal representations to understand current state or predict future states to guide decisions. Most deployed systems instantiate both. Evidence: strong (Tsinghua Survey)
- Pixel-level prediction is computationally wasteful for planning — StructVLA demonstrates that sparse, physically meaningful keyframe prediction outperforms dense rollouts for robotic manipulation. Supports LeCun's long-standing critique of generative video as a planning substrate. Evidence: moderate (StructVLA)
- JEPA can be trained stably end-to-end from raw pixels — LeWorldModel achieves stable JEPA training with only two loss terms, removing the EMA/distillation tricks earlier JEPAs required. Evidence: moderate (LeWM)
- V-JEPA 2 achieves zero-shot robotic planning from passive video pre-training — 1M+ hours of internet video + <62 hours of robot videos enables zero-shot pick-and-place on Franka arms across two different labs. The strongest evidence to date that SSL on video produces deployable world models. Evidence: strong (V-JEPA 2)
- Generative world models have reached real-time interactive scale — Genie 3 (11B params, autoregressive transformer) generates 720p interactive worlds at 24fps with ~1 minute visual memory. Evidence: strong, vendor source (Genie 3)
- Hierarchical (symbolic + visual) world models mitigate error accumulation — H-WM demonstrates that combining a high-level logical predictor with a low-level visual predictor reduces drift in long-horizon TAMP problems. Evidence: moderate (H-WM)
- Object-centric masking can induce causal structure in JEPA — Causal-JEPA extends masked prediction with object-level interventions to produce counterfactual-like effects. Evidence: moderate (C-JEPA)
- Autonomous driving is the first major commercial deployment — every major AV company runs some form of world model internally for sim-to-real, counterfactual evaluation, and data augmentation. Evidence: strong (AD Survey)
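The error-accumulation claim behind the hierarchical approach can be illustrated with a toy drift model (this is only an illustration of the argument, assuming additive per-step error, not H-WM's actual method):

```python
import numpy as np

def rollout_drift(horizon, step_err, resync_every=None, rng=None):
    """Accumulated error of an imagined rollout. A flat one-step visual
    predictor compounds error at every step; a hierarchical model can
    re-anchor its low-level state at symbolic subgoals, bounding drift."""
    rng = np.random.default_rng(rng)
    err, errs = 0.0, []
    for t in range(1, horizon + 1):
        err += abs(rng.normal(scale=step_err))   # each imagined step adds error
        if resync_every and t % resync_every == 0:
            err = 0.0                            # subgoal reached: re-anchor
        errs.append(err)
    return errs

flat = rollout_drift(50, 0.1, rng=0)                  # drift grows without bound
hier = rollout_drift(50, 0.1, resync_every=10, rng=0) # drift bounded per subgoal
```

With identical per-step errors, the flat rollout's drift grows with the horizon while the hierarchical rollout's drift is bounded by the subgoal interval, which is the intuition behind combining a logical predictor with a visual one for long-horizon TAMP.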
Benchmarks & Data
| Metric | Value | System | Source |
|---|---|---|---|
| Something-Something v2 (motion) | 77.3% top-1 | V-JEPA 2 | V-JEPA 2 |
| Epic-Kitchens-100 (anticipation) | 39.7 recall-at-5 | V-JEPA 2 | V-JEPA 2 |
| Video pre-training scale | 1M+ hours | V-JEPA 2 | V-JEPA 2 |
| Robot post-training | <62 hours video | V-JEPA 2-AC | V-JEPA 2 |
| Interactive generation | 11B params, 720p/24fps | Genie 3 | Genie 3 |
| Genie 3 consistency horizon | ~minutes; ~1 min visual memory | Genie 3 | Genie 3 |
Open Questions
- Can generative world models scale to hours-long horizons? Genie 3 maintains consistency for minutes. Whether the pixel-prediction approach scales to hour-long coherent simulation — and at what compute cost — is unresolved.
- Does LeWM's simplicity hold at V-JEPA 2 scale? Stable end-to-end pixel JEPA is elegant at small scale; the crucial test is whether it holds at billion-parameter / million-video-hour regimes.
- Which approach produces better robotic control? V-JEPA 2-AC shows strong pick-and-place, but multi-step long-horizon manipulation with either JEPA or generative world models is still largely unsolved.
- How do world models integrate with LLMs? H-WM's hierarchical symbolic-plus-visual approach hints at a fusion; whether the symbolic layer should be an LLM, a rule-based planner, or a learned discrete latent is open.
- What's the right evaluation metric? Pixel fidelity measures generative quality but not planning quality. Physical consistency metrics (as called for in the Embodied AI Survey) don't yet have a standard.
Related Concepts
- Joint Embedding Predictive Architecture (JEPA) — LeCun's proposed world-model recipe; predicts abstract representations rather than pixels
- Self-Supervised Learning — the learning paradigm that makes world models trainable on unlabeled video at scale
- System 2 Reasoning — deliberative planning via optimization; world models are the inner simulator it uses
- Hierarchical Planning — multi-level abstraction (logical + visual) for long-horizon control
- Generative World Models — the opposing camp: pixel-space prediction (Sora, Genie 3) as a world-model approach
- Vision-Language-Action Models — VLAs increasingly adopt world-model capabilities for look-ahead planning
Cross-Topic Links
- Robotics — world models are the mechanism behind modern robotic foundation models; V-JEPA 2-AC and H-WM are dual-citizens of AI and robotics research.
- Optical computing / hardware — world-model scaling (especially video-based training) drives demand for high-memory-bandwidth compute.
Backlinks
Pages that reference this concept:
- Vision-Language-Action Models — VLAs integrate RL, world models, and human video learning
- frontier.md — world models listed as an active frontier area
Changelog
- 2026-04-22 — Initial compilation from 11 sources across surveys, JEPA, generative, and robotic-application papers