Understanding World or Predicting Future? A Comprehensive Survey of World Models
Two-function taxonomy separating world models that build internal representations (understanding) from those that predict future states (simulation/decision guidance); ACM CSUR extended version
Abstract
The concept of world models has garnered significant attention due to advancements in multimodal large language models such as GPT-4 and video generation models such as Sora, which are central to the pursuit of artificial general intelligence. The survey systematically categorizes world models into two primary functions: (1) constructing internal representations to understand world mechanisms, and (2) predicting future states to simulate and guide decision-making. Applications are explored across generative games, autonomous driving, robotics, and social simulacra.
Key Contributions
- Two-function taxonomy: separates "understanding" world models (internal representation of mechanisms) from "prediction" world models (future-state simulation)
- Cross-domain review: systematic coverage of generative games, autonomous driving, robotics, and social simulacra applications
- Challenges and future directions: identifies open problems across both functional categories
- Curated bibliography: companion GitHub repo at tsinghua-fib-lab/World-Model
Methodology
Two-dimensional classification:
- Function axis: understanding (encoder-centric) vs. prediction (simulator-centric)
- Application axis: generative games / autonomous driving / robotics / social simulacra
Each application domain is analyzed for how the two functions are instantiated (e.g., driving world models lean prediction-heavy for trajectory forecasting; game world models lean understanding-heavy for physics consistency).
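The two-dimensional classification above can be pictured as a small grid. The following is a minimal Python sketch of that grid, not code from the survey: the names `Function`, `Application`, `Cell`, `grid`, and `DOMINANT` are hypothetical, and only the two domain leanings stated in the paragraph above are recorded.

```python
from dataclasses import dataclass
from enum import Enum


class Function(Enum):
    UNDERSTANDING = "understanding"  # encoder-centric: internal representation
    PREDICTION = "prediction"        # simulator-centric: future-state simulation


class Application(Enum):
    GENERATIVE_GAMES = "generative games"
    AUTONOMOUS_DRIVING = "autonomous driving"
    ROBOTICS = "robotics"
    SOCIAL_SIMULACRA = "social simulacra"


@dataclass(frozen=True)
class Cell:
    """One cell of the function x application grid (hypothetical helper)."""
    function: Function
    application: Application


def grid() -> list[Cell]:
    """Enumerate every cell of the two-dimensional classification."""
    return [Cell(f, a) for f in Function for a in Application]


# Dominant function per domain, limited to the two examples named in the text.
# The remaining domains are left out on purpose rather than guessed.
DOMINANT = {
    Application.AUTONOMOUS_DRIVING: Function.PREDICTION,
    Application.GENERATIVE_GAMES: Function.UNDERSTANDING,
}

cells = grid()
```

Two functions crossed with four application domains give eight cells. A paper can occupy more than one cell, since the summary describes each domain as leaning toward one function rather than belonging to it exclusively.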
Results
This is a survey rather than an empirical study, so it reports no new experimental results. Its key comparative observation is that generative video models (Sora-class) and JEPA-style encoders represent fundamentally different bets on whether the "world model" is a pixel-space simulator or a representation-space predictor. The survey treats both as valid instantiations of the broader concept.
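The pixel-space versus representation-space distinction comes down to where the prediction error is measured. The toy sketch below illustrates that one point in pure Python. It is an illustration under stated assumptions, not the architecture of Sora, JEPA, or any model in the survey: the random linear maps stand in for learned networks, and all names and sizes (`N`, `D`, `W_pix`, `W_enc`, `W_lat`) are hypothetical.

```python
import random

random.seed(0)

N, D = 16, 4  # flattened frame size and latent width (toy values, assumed)


def rand_matrix(rows, cols, scale=0.1):
    """Random matrix standing in for learned weights (hypothetical)."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(vec, mat):
    """Row vector times matrix; returns a list of length len(mat[0])."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


frame_t = [random.random() for _ in range(N)]      # current observation
frame_next = [random.random() for _ in range(N)]   # next observation (target)

W_pix = rand_matrix(N, N)  # pixel-space dynamics: frame -> next frame
W_enc = rand_matrix(N, D)  # encoder: frame -> latent
W_lat = rand_matrix(D, D)  # latent dynamics: latent -> next latent

# Pixel-space simulator: predict the next frame, compare against pixels.
pixel_pred = matvec(frame_t, W_pix)
pixel_loss = mse(pixel_pred, frame_next)

# Representation-space predictor: predict the next frame's embedding,
# compare against the encoded target. Target-encoder details are omitted.
latent_pred = matvec(matvec(frame_t, W_enc), W_lat)
latent_target = matvec(frame_next, W_enc)
latent_loss = mse(latent_pred, latent_target)
```

In the first branch the model must account for every pixel of the next frame. In the second it only has to match a `D`-dimensional summary, which is the sense in which the two approaches are different bets about what a world model needs to capture.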
Limitations
- Published November 2024, revised through December 2025 — some 2026 developments (V-JEPA 2, Genie 3 production release) post-date even the revised version
- Taxonomy is descriptive rather than predictive about which approach wins
Full Content
49 pages, 6 figures, 8 tables. Extended version of the ACM CSUR paper. Creative Commons BY 4.0.
Source: Understanding World or Predicting Future? A Comprehensive Survey of World Models by Jingtao Ding et al., Tsinghua University