A Comprehensive Survey on World Models for Embodied AI
Three-axis taxonomy (Functionality × Temporal × Spatial) for embodied AI world models
Abstract
Embodied AI requires agents that perceive, act, and anticipate how actions reshape future world states. The survey presents a unified framework for world models, formalizing problem settings and learning objectives while proposing a three-axis taxonomy for classification.
Three-Axis Taxonomy
Axis 1 — Functionality:
- Decision-Coupled (world model trained jointly with a policy, optimized for control)
- General-Purpose (world model learned independently, reusable across tasks)
Axis 2 — Temporal Modeling:
- Sequential Simulation and Inference (step-by-step autoregressive rollout)
- Global Difference Prediction (predict changes between distant states)
Axis 3 — Spatial Representation:
- Global Latent Vector (single-vector world state)
- Token Feature Sequence (transformer-style tokenized world)
- Spatial Latent Grid (2D/3D latent grids preserving spatial structure)
- Decomposed Rendering Representation (Gaussian splats, neural fields, etc.)
Key Contributions
- Unified framework covering robotics, autonomous driving, and video world models
- Dataset and evaluation-metric systematization across pixel quality, state-level understanding, and task performance
- Identifies "physical consistency metrics over pixel fidelity" as an open problem
- Curated bibliography at Li-Zn-H/AwesomeWorldModels
Main Conclusions
Critical challenges across embodied AI world models:
- Lack of unified datasets
- Need for physical consistency metrics over pixel fidelity
- Balancing model performance with computational efficiency for real-time control
- Long-horizon temporal consistency without error accumulation
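The last challenge can be illustrated with a toy calculation. This is a minimal sketch, assuming a hypothetical scalar system x_{t+1} = a * x_t and a learned model whose coefficient is off by 1%; the system, the numbers, and the function names are illustrative and do not come from the survey. It shows how a one-step error that looks negligible grows when a model is rolled out on its own predictions.

```python
def rollout(coef: float, x0: float, horizon: int) -> list[float]:
    """Step-by-step rollout of x_{t+1} = coef * x_t, feeding each prediction back in."""
    states = [x0]
    for _ in range(horizon):
        states.append(coef * states[-1])
    return states


def rollout_errors(true_coef: float, model_coef: float, x0: float, horizon: int) -> list[float]:
    """Absolute gap between the true and the model trajectory at every step."""
    truth = rollout(true_coef, x0, horizon)
    pred = rollout(model_coef, x0, horizon)
    return [abs(t - p) for t, p in zip(truth, pred)]


# Hypothetical values: the model's coefficient is 1% too large.
errors = rollout_errors(true_coef=1.0, model_coef=1.01, x0=1.0, horizon=100)
for step in (1, 10, 100):
    print(f"error after {step:>3} steps: {errors[step]:.4f}")
```

In this toy setting the gap starts at about 0.01 after one step and exceeds 1.0 (more than the size of the state itself) after 100 steps, even though the per-step model error never changes.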
Why This Matters
This survey complements the Tsinghua survey by focusing specifically on embodied applications. The three-axis taxonomy is a useful filter when comparing papers. For example, V-JEPA 2 is Decision-Coupled × Sequential × Global Latent Vector, and Genie 3 is General-Purpose × Sequential × Token Feature Sequence.
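The filtering use described above can be sketched as a small data structure. This is a minimal sketch: the enum, class, and function names are hypothetical, and the two entries simply restate the placements this note gives for V-JEPA 2 and Genie 3 rather than classifications checked against the survey's own tables.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Functionality(Enum):
    DECISION_COUPLED = "Decision-Coupled"
    GENERAL_PURPOSE = "General-Purpose"


class Temporal(Enum):
    SEQUENTIAL = "Sequential Simulation and Inference"
    GLOBAL_DIFFERENCE = "Global Difference Prediction"


class Spatial(Enum):
    GLOBAL_LATENT_VECTOR = "Global Latent Vector"
    TOKEN_FEATURE_SEQUENCE = "Token Feature Sequence"
    SPATIAL_LATENT_GRID = "Spatial Latent Grid"
    DECOMPOSED_RENDERING = "Decomposed Rendering Representation"


@dataclass(frozen=True)
class TaxonomyEntry:
    """One paper or model placed on the three axes."""
    name: str
    functionality: Functionality
    temporal: Temporal
    spatial: Spatial


# Placements as stated in this note (not independently verified).
ENTRIES = [
    TaxonomyEntry("V-JEPA 2", Functionality.DECISION_COUPLED,
                  Temporal.SEQUENTIAL, Spatial.GLOBAL_LATENT_VECTOR),
    TaxonomyEntry("Genie 3", Functionality.GENERAL_PURPOSE,
                  Temporal.SEQUENTIAL, Spatial.TOKEN_FEATURE_SEQUENCE),
]


def filter_entries(
    entries: list[TaxonomyEntry],
    functionality: Optional[Functionality] = None,
    temporal: Optional[Temporal] = None,
    spatial: Optional[Spatial] = None,
) -> list[TaxonomyEntry]:
    """Keep entries matching every axis value that was given; None means 'any'."""
    return [
        e for e in entries
        if (functionality is None or e.functionality is functionality)
        and (temporal is None or e.temporal is temporal)
        and (spatial is None or e.spatial is spatial)
    ]


general = filter_entries(ENTRIES, functionality=Functionality.GENERAL_PURPOSE)
print([e.name for e in general])
```

With two functionality values, two temporal modes, and four spatial representations, the taxonomy has 2 × 2 × 4 = 16 cells, so a filter on one or two axes narrows a reading list quickly.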
Source: A Comprehensive Survey on World Models for Embodied AI by Li et al.