World Models
A world model is a learned internal simulator that predicts how the environment will evolve in response to actions. The concept has become the central alternative to autoregressive LLMs as a route to general intelligence: rather than predicting the next token, a world model predicts the next state of the world, enabling the system to plan by "imagining" the consequences of candidate actions.
The field is split between two competing philosophies that both claim the "world model" label: generative models (predict every pixel/frame of the future, e.g., Sora, Genie 3) and joint-embedding predictive architectures (predict abstract representations of the future, skipping unpredictable detail, e.g., V-JEPA 2). This schism defines the frontier — both camps have strong 2026 deployments and neither has decisively won.
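The split between the two camps comes down to where the prediction loss is computed. A minimal numpy sketch of the two objectives, under toy assumptions (linear encoder/predictor, 16-pixel "frames"; `W_enc` and `W_pred` are hypothetical toy parameters, not any published architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def jepa_loss(x_context, x_target, W_enc, W_pred):
    """JEPA-style objective: encode both frames, predict the target's
    representation from the context's representation, compare in latent
    space. Pixel detail the encoder discards never enters the loss."""
    z_ctx = W_enc @ x_context
    z_tgt = W_enc @ x_target
    z_hat = W_pred @ z_ctx                    # predictor runs in latent space
    return float(np.mean((z_hat - z_tgt) ** 2))

def generative_loss(x_context, x_target, W_dec):
    """Generative objective: regress every pixel of the target frame,
    paying for noise and texture even when irrelevant to planning."""
    x_hat = W_dec @ x_context
    return float(np.mean((x_hat - x_target) ** 2))

# Two toy 16-pixel frames: a small predictable shift plus unpredictable noise.
x0 = rng.normal(size=16)
x1 = x0 + 0.1 + 0.5 * rng.normal(size=16)
W_enc = rng.normal(size=(4, 16)) / 4.0        # 4-dim abstract representation
latent_mse = jepa_loss(x0, x1, W_enc, np.eye(4))
pixel_mse = generative_loss(x0, x1, np.eye(16))
```

Untrained, the numbers mean nothing; the point is only that the generative loss is taken over all 16 pixels (noise included) while the JEPA loss is taken over a 4-dim representation where unpredictable detail can be dropped.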
World models are the core mechanism required for System 2 reasoning in embodied agents: perceive state → use the world model to simulate candidate action sequences → evaluate against an objective → execute the best sequence. Without a world model, an agent is limited to reactive (System 1) behavior.
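The perceive → simulate → evaluate → execute loop above can be sketched as random-shooting model-predictive control over a learned dynamics model. Everything here is a hypothetical stand-in (`world_model`, `cost`, the toy additive dynamics), not any deployed system's planner:

```python
import numpy as np

def plan(world_model, cost, state, horizon=5, n_candidates=64,
         action_dim=2, rng=None):
    """Random-shooting MPC: imagine candidate action sequences with the
    world model, score each imagined rollout against an objective, and
    return the first action of the best sequence."""
    rng = np.random.default_rng(rng)
    candidates = rng.uniform(-1, 1, size=(n_candidates, horizon, action_dim))
    best_action, best_cost = None, np.inf
    for actions in candidates:
        s, total = state, 0.0
        for a in actions:
            s = world_model(s, a)       # imagined next state, not a real step
            total += cost(s)
        if total < best_cost:
            best_cost, best_action = total, actions[0]
    return best_action                  # execute one action, then replan

# Toy usage: dynamics s' = s + a, objective = distance to the origin.
action = plan(lambda s, a: s + a,
              lambda s: float(np.sum(s ** 2)),
              state=np.array([1.0, -1.0]), rng=0)
```

A purely reactive (System 1) policy would map `state` to `action` directly; the deliberation here lives entirely in the imagined rollouts through `world_model`.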
Key Claims
- Two-function taxonomy splits the literature — world models either build internal representations to understand current state or predict future states to guide decisions. Most deployed systems instantiate both. Evidence: strong (Tsinghua Survey)
- Pixel-level prediction is computationally wasteful for planning — StructVLA demonstrates that sparse, physically meaningful keyframe prediction outperforms dense rollouts for robotic manipulation. Supports LeCun's long-standing critique of generative video as a planning substrate. Evidence: moderate (StructVLA)
- JEPA can be trained stably end-to-end from raw pixels — LeWorldModel achieves stable JEPA training with only two loss terms, removing the EMA/distillation tricks earlier JEPAs required. Evidence: moderate (LeWM)
- V-JEPA 2 achieves zero-shot robotic planning from passive video pre-training — 1M+ hours of internet video + <62 hours of robot videos enables zero-shot pick-and-place on Franka arms across two different labs. The strongest evidence to date that SSL on video produces deployable world models. Evidence: strong (V-JEPA 2)
- Generative world models have reached real-time interactive scale — Genie 3 (11B params, autoregressive transformer) generates 720p interactive worlds at 24fps with ~1 minute visual memory. Evidence: strong, vendor source (Genie 3)
- Hierarchical (symbolic + visual) world models mitigate error accumulation — H-WM demonstrates that combining a high-level logical predictor with a low-level visual predictor reduces drift in long-horizon TAMP problems. Evidence: moderate (H-WM)
- Object-centric masking can induce causal structure in JEPA — Causal-JEPA extends masked prediction with object-level interventions to produce counterfactual-like effects. Evidence: moderate (C-JEPA)
- Autonomous driving is the first major commercial deployment — every major AV company runs some form of world model internally for sim-to-real, counterfactual evaluation, and data augmentation. Evidence: strong (AD Survey)
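The error-accumulation claim behind the hierarchical approach can be illustrated with a toy drift model (this is only an illustration of the argument, assuming additive per-step error, not H-WM's actual method):

```python
import numpy as np

def rollout_drift(horizon, step_err, resync_every=None, rng=None):
    """Accumulated error of an imagined rollout. A flat one-step visual
    predictor compounds error at every step; a hierarchical model can
    re-anchor its low-level state at symbolic subgoals, bounding drift."""
    rng = np.random.default_rng(rng)
    err, errs = 0.0, []
    for t in range(1, horizon + 1):
        err += abs(rng.normal(scale=step_err))   # each imagined step adds error
        if resync_every and t % resync_every == 0:
            err = 0.0                            # subgoal reached: re-anchor
        errs.append(err)
    return errs

flat = rollout_drift(50, 0.1, rng=0)                  # drift grows without bound
hier = rollout_drift(50, 0.1, resync_every=10, rng=0) # drift bounded per subgoal
```

With identical per-step errors, the flat rollout's drift grows with the horizon while the hierarchical rollout's drift is bounded by the subgoal interval, which is the intuition behind combining a logical predictor with a visual one for long-horizon TAMP.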
Benchmarks & Data
| Metric | Value | System | Source |
|---|---|---|---|
| Something-Something v2 (motion) | 77.3% top-1 | V-JEPA 2 | V-JEPA 2 |
| Epic-Kitchens-100 (anticipation) | 39.7 recall-at-5 | V-JEPA 2 | V-JEPA 2 |
| Video pre-training scale | 1M+ hours | V-JEPA 2 | V-JEPA 2 |
| Robot post-training | <62 hours video | V-JEPA 2-AC | V-JEPA 2 |
| Interactive generation | 11B params, 720p/24fps | Genie 3 | Genie 3 |
| Genie 3 consistency horizon | ~minutes; ~1 min visual memory | Genie 3 | Genie 3 |
Open Questions
- Can generative world models scale to hours-long horizons? Genie 3 maintains consistency for minutes. Whether the pixel-prediction approach scales to hour-long coherent simulation — and at what compute cost — is unresolved.
- Does LeWM's simplicity hold at V-JEPA 2 scale? Stable end-to-end pixel JEPA is elegant at small scale; the crucial test is whether it holds at billion-parameter / million-video-hour regimes.
- Which approach produces better robotic control? V-JEPA 2-AC shows strong pick-and-place, but multi-step long-horizon manipulation with either JEPA or generative world models is still largely unsolved.
- How do world models integrate with LLMs? H-WM's hierarchical symbolic-plus-visual approach hints at a fusion; whether the symbolic layer should be an LLM, a rule-based planner, or a learned discrete latent is open.
- What's the right evaluation metric? Pixel fidelity measures generative quality but not planning quality. Physical consistency metrics (as called for in the Embodied AI Survey) don't yet have a standard.
Related Concepts
- Joint Embedding Predictive Architecture (JEPA) — LeCun's proposed world-model recipe; predicts abstract representations rather than pixels
- Self-Supervised Learning — the learning paradigm that makes world models trainable on unlabeled video at scale
- System 2 Reasoning — deliberative planning via optimization; world models are the inner simulator it uses
- Hierarchical Planning — multi-level abstraction (logical + visual) for long-horizon control
- Generative World Models — the opposing camp: pixel-space prediction (Sora, Genie 3) as a world-model approach
- Vision-Language-Action Models — VLAs increasingly adopt world-model capabilities for look-ahead planning
Cross-Topic Links
- Robotics — world models are the mechanism behind modern robotic foundation models; V-JEPA 2-AC and H-WM are dual-citizens of AI and robotics research.
- Optical computing / hardware — world-model scaling (especially video-based training) drives demand for high-memory-bandwidth compute.
Backlinks
Pages that reference this concept:
- Vision-Language-Action Models — VLAs integrate RL, world models, and human video learning
- frontier.md — world models listed as an active frontier area
Changelog
- 2026-04-22 — Initial compilation from 11 sources across surveys, JEPA, generative, and robotic-application papers