World Models
Active Frontier
World models are internal representations that allow robots to predict how their environment will change in response to actions. Unlike reactive control — where the robot responds to current sensor readings — world models enable planning by mentally simulating the consequences of potential actions before executing them. This is analogous to how humans can imagine the outcome of reaching for a cup before moving their arm.
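The idea of mentally simulating actions before executing them can be sketched with a random-shooting planner. This is a minimal illustration, not any particular robot's method: `world_model` is a hypothetical stand-in for a learned dynamics model (here a trivial 1-D integrator), and `plan`, `horizon`, and `n_candidates` are names chosen for this example.

```python
import random

def world_model(state, action):
    """Hypothetical learned dynamics: predicts the next 1-D position.
    A real world model would be a trained network; this is a toy stand-in."""
    return state + action

def plan(state, goal, horizon=5, n_candidates=200, seed=0):
    """Random-shooting planner: imagine many action sequences with the
    world model and keep the one whose predicted end state is nearest the goal."""
    rng = random.Random(seed)
    best_cost, best_seq = float("inf"), None
    for _ in range(n_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s = state
        for a in seq:
            s = world_model(s, a)  # imagined step, nothing is executed
        cost = abs(goal - s)
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq, best_cost

actions, predicted_error = plan(state=0.0, goal=2.0)
```

A reactive controller would only map the current reading to one action; the planner above instead compares whole imagined futures and commits to the best one. In practice only the first action is usually executed before replanning.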
1X Technologies released a visual perception world model for its NEO humanoid robot. The model lets NEO observe its environment and predict future states, which 1X describes as the basis for robots to "teach themselves new tasks" through observation rather than explicit programming. This represents a shift from scripted or imitation-based behaviors toward autonomous skill acquisition.
NVIDIA's Isaac Sim 5.1 takes a complementary approach: rather than building an internal world model, it creates a high-fidelity external simulation that serves a similar function. The deliberate injection of sensor imperfections (noise, latency, miscalibration) forces training policies to develop internal robustness — effectively requiring the learned policy to maintain its own implicit world model that accounts for uncertainty.
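The three imperfections named above (noise, latency, miscalibration) can be illustrated with a small sensor wrapper. This sketch does not use the Isaac Sim API; `ImperfectSensor`, `randomized_sensor`, and all parameter ranges are hypothetical names and values chosen for illustration. Miscalibration is modelled here as a constant bias.

```python
import random
from collections import deque

class ImperfectSensor:
    """Wraps a ground-truth scalar reading with bias, delay, and Gaussian noise.
    Illustrative only; not an Isaac Sim interface."""

    def __init__(self, noise_std=0.01, latency_steps=2, bias=0.0, seed=0):
        self.rng = random.Random(seed)
        self.noise_std = noise_std
        self.bias = bias
        # Keeps the last latency_steps + 1 true values; index 0 is the delayed one.
        self.buffer = deque(maxlen=latency_steps + 1)

    def read(self, true_value):
        self.buffer.append(true_value)
        delayed = self.buffer[0]  # oldest value still buffered
        return delayed + self.bias + self.rng.gauss(0.0, self.noise_std)

def randomized_sensor(rng):
    """Samples new imperfection parameters, e.g. once per training episode
    (assumed ranges)."""
    return ImperfectSensor(
        noise_std=rng.uniform(0.0, 0.05),
        latency_steps=rng.randint(0, 3),
        bias=rng.uniform(-0.1, 0.1),
        seed=rng.randrange(10**6),
    )
```

Because the parameters change between episodes, a policy trained on these readings cannot rely on any single sensor profile and has to tolerate the whole range.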
The distinction between internal world models (learned by the robot) and external simulation (built by engineers) is blurring. As learned models improve, they may eventually replace hand-crafted simulations for certain tasks, while simulation remains essential for domains where prediction errors are costly.
The AI-Side Research Anchor (2026-04-22)
The robotics community's world-models work (1X NEO, Isaac Sim 5.1) is being joined by a parallel push from AI labs, which are producing architecturally rigorous world-model recipes that translate to robotic control. The AI KB tracks this lineage in detail; see ai/wiki/concepts/world-models.md. The key results for robotics:
- V-JEPA 2-AC (Meta FAIR, Jun 2025): action-conditioned JEPA post-trained on <62h of Droid robot videos on top of 1M+ hours of internet video pre-training. Deployed zero-shot on Franka arms in two different labs for pick-and-place with image goals. This is the strongest current evidence that passive video pre-training plus minimal interaction data is a viable path to robotic control. (V-JEPA 2)
- H-WM (Mar 2026): hierarchical world model jointly predicting symbolic (logical) and visual state transitions, mitigating error accumulation in task-and-motion planning. Makes long-horizon manipulation tractable by letting a symbolic layer stabilize visual rollout. (H-WM)
- StructVLA (Mar 2026): rejects dense pixel rollouts for planning. Predicts sparse, physically meaningful keyframes derived from kinematic cues. Supports the bet that abstract/structured representation prediction outperforms pixel-level generation for control. (StructVLA)
- Wayve GAIA-2 (Mar 2025): latent diffusion multi-view generative world model for AV. Commercial-scale proof that pixel-space world models are usable for simulation and data augmentation (not control) in autonomous driving. GAIA-3 launched in 2026 advancing "from simulation to evaluation." (GAIA-2)
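Planning toward an image goal with an action-conditioned latent predictor, as in the V-JEPA 2-AC entry above, can be sketched as a search over action sequences that minimizes the distance between the predicted and goal representations. The text above does not specify the planner, so the cross-entropy method (CEM) used here is an assumption, and `encode`, `predict`, and `energy` are toy stand-ins rather than anything from the V-JEPA 2 codebase.

```python
import random
import statistics

def encode(observation):
    """Hypothetical frozen encoder: maps an (x, y) observation to a latent."""
    x, y = observation
    return [2.0 * x, 2.0 * y]

def predict(latent, action):
    """Hypothetical action-conditioned predictor that works purely in latent space."""
    return [latent[0] + 2.0 * action[0], latent[1] + 2.0 * action[1]]

def energy(latent, goal_latent):
    """L1 distance between a predicted latent and the goal latent."""
    return sum(abs(a - b) for a, b in zip(latent, goal_latent))

def cem_plan(obs, goal_obs, horizon=3, pop=64, n_elite=8, iters=15, seed=0):
    rng = random.Random(seed)
    z0, z_goal = encode(obs), encode(goal_obs)

    def rollout_energy(flat_actions):
        z = z0
        for t in range(horizon):
            z = predict(z, flat_actions[2 * t: 2 * t + 2])
        return energy(z, z_goal)

    dim = 2 * horizon
    mean, std = [0.0] * dim, [1.0] * dim
    for _ in range(iters):
        # Sample candidate action sequences, keep the lowest-energy ones,
        # and refit the sampling distribution to those elites.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        elites = sorted(samples, key=rollout_energy)[:n_elite]
        mean = [statistics.fmean(col) for col in zip(*elites)]
        std = [statistics.pstdev(col) + 1e-6 for col in zip(*elites)]
    plan = [mean[2 * t: 2 * t + 2] for t in range(horizon)]
    return plan, rollout_energy(mean)

plan, final_energy = cem_plan(obs=(0.0, 0.0), goal_obs=(1.0, -0.5))
```

No pixels are generated at any point: both the rollout and the cost live in representation space, which is the property the JEPA-style entries rely on.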
The JEPA vs. Generative Schism (Robotics Lens)
The world-models field has split into two architectural camps that map cleanly onto robotics use cases:
| Camp | Architecture | Robotics Fit |
|---|---|---|
| JEPA (Meta FAIR, V-JEPA 2, LeWM) | Predict abstract representations, not pixels | Control — pick-and-place, manipulation, closed-loop behavior |
| Generative (DeepMind Genie, Wayve GAIA, Sora-family) | Predict pixels / latent pixels directly | Simulation — sim-to-real data, rare-event augmentation, counterfactual evaluation |
Physics benchmarks (PhyWorldBench, VideoScience-Bench) show generative video models fail systematically on rigid-body collisions, fluid dynamics, and multi-object scenes — tolerable for simulation data, catastrophic for control. This is the quantitative basis for the specialization prediction. See ai/wiki/concepts/generative-world-models.md for the empirical critique.
Key Claims
- V-JEPA 2-AC achieves zero-shot Franka pick-and-place after <62h of robot data on top of passive video pre-training. Strongest public evidence that SSL video models transfer to robotic control with minimal interaction data. Evidence: strong (V-JEPA 2)
- Hierarchical world models enable long-horizon TAMP — H-WM's symbolic + visual architecture reduces error accumulation versus single-level predictors. Evidence: moderate (H-WM)
- Sparse structured prediction beats dense pixel rollouts for manipulation planning (StructVLA). Supports the JEPA/abstract-prediction bet for control applications. Evidence: moderate (StructVLA)
- Visual perception world model enables autonomous learning — 1X's NEO world model allows the robot to observe and predict environment states, supporting self-directed skill acquisition. Evidence: strong (1X NEO World Model)
- Deliberate imperfection forces implicit world modeling — Isaac Sim 5.1 injects sensor noise and latency during training, requiring policies to develop internal robustness to real-world conditions. Evidence: strong (ABB/NVIDIA RobotStudio HyperReality)
- Internal and external world models are converging — Learned predictive models and high-fidelity simulations serve overlapping functions, with the boundary between them increasingly fluid. The research-lab approaches (V-JEPA 2, H-WM) and industrial approaches (1X, Isaac Sim) are still running on parallel tracks but will likely merge. Evidence: moderate
- LeCun's 2022 blueprint has aged well — the proposed architecture (JEPA + hierarchical planning + SSL) has produced working implementations (I-JEPA → V-JEPA → V-JEPA 2 → LeWM → C-JEPA) rather than remaining theoretical. Evidence: strong (LeCun 2022)
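The sparse-keyframe claim above can be made concrete with a toy selector that picks physically meaningful frames from a trajectory instead of keeping every frame. This is not StructVLA's procedure. The two kinematic cues used here (the end effector coming to rest, and the gripper toggling) are assumptions for illustration, as are the function name and the 1-D positions.

```python
def select_keyframes(positions, gripper_closed, speed_eps=0.01):
    """Returns indices of frames where the arm comes to rest or the gripper
    opens/closes, plus the first and last frame. Hypothetical cues."""
    if not positions:
        return []
    keyframes = [0]
    prev_speed = 0.0
    for t in range(1, len(positions)):
        speed = abs(positions[t] - positions[t - 1])
        came_to_rest = speed < speed_eps <= prev_speed  # moving -> stopped
        gripper_toggled = gripper_closed[t] != gripper_closed[t - 1]
        if came_to_rest or gripper_toggled:
            keyframes.append(t)
        prev_speed = speed
    if keyframes[-1] != len(positions) - 1:
        keyframes.append(len(positions) - 1)
    return keyframes

# Reach, stop, grasp, carry, stop.
positions = [0.0, 1.0, 2.0, 2.0, 2.0, 3.0, 4.0, 4.0]
gripper = [False, False, False, False, True, True, True, True]
print(select_keyframes(positions, gripper))  # -> [0, 3, 4, 7]
```

A planner that predicts only these four states has far fewer prediction steps over which errors can compound than one that rolls out all eight frames.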
Open Questions
- Can world models scale to unstructured home environments where object types, arrangements, and interactions are highly variable?
- What are the real-time inference constraints — can world models predict fast enough for reactive manipulation tasks?
- How do world models handle novel objects and materials not seen during training?
- What is the right architecture for world models — JEPA (abstract), generative (pixel), or hybrid (LeWM, StructVLA)?
- Does V-JEPA 2-AC's tabletop success transfer to long-horizon manipulation at deployment scale?
- Can the industrial track (1X NEO, Isaac Sim) and the research track (V-JEPA 2, H-WM) merge, or will they stay on parallel tracks?
Related Concepts
- Sim-to-Real Transfer — External simulation as a form of world modeling
- Foundation Models for Robotics — Language models encode world knowledge that complements perceptual world models
Related Entities
- 1X Technologies — Visual perception world model for NEO
- NVIDIA — Isaac Sim as high-fidelity external world model