PAPER · 2024-11-21 · Tsinghua University (FIB Lab) · arXiv 2411.14499

Understanding World or Predicting Future? A Comprehensive Survey of World Models

Jingtao Ding et al.

COMPILED NOTES

Two-function taxonomy separating world models that build internal representations (understanding) from those that predict future states (simulation and decision guidance). Extended version of the ACM CSUR paper.


Abstract

The concept of world models has garnered significant attention due to advancements in multimodal large language models such as GPT-4 and video generation models such as Sora, which are central to the pursuit of artificial general intelligence. The survey systematically categorizes world models into two primary functions: (1) constructing internal representations to understand world mechanisms, and (2) predicting future states to simulate and guide decision-making. Applications are explored across generative games, autonomous driving, robotics, and social simulacra.

Key Contributions

Two-function taxonomy — separates "understanding" world models (internal representation of mechanisms) from "prediction" world models (future-state simulation)
Cross-domain review — systematic coverage of generative games, autonomous driving, robotics, and social simulacra applications
Challenges and future directions — identifies open problems across both functional categories
Curated bibliography — companion GitHub repo at tsinghua-fib-lab/World-Model

Methodology

Two-dimensional classification:

Function axis: understanding (encoder-centric) vs. prediction (simulator-centric)
Application axis: generative games / autonomous driving / robotics / social simulacra

Each application domain is analyzed for how the two functions are instantiated. For example, driving world models emphasize prediction (trajectory forecasting), while game world models emphasize understanding (physics consistency).
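The two axes can be pictured as a small lookup structure. The sketch below is only an illustration of these notes, not anything taken from the survey itself: the names `Function`, `Application`, `TaxonomyCell`, `CELLS`, and `dominant_function` are hypothetical, and only the two examples mentioned above are filled in.

```python
from dataclasses import dataclass
from enum import Enum


class Function(Enum):
    UNDERSTANDING = "understanding"  # encoder-centric: internal representation
    PREDICTION = "prediction"        # simulator-centric: future-state simulation


class Application(Enum):
    GENERATIVE_GAMES = "generative games"
    AUTONOMOUS_DRIVING = "autonomous driving"
    ROBOTICS = "robotics"
    SOCIAL_SIMULACRA = "social simulacra"


@dataclass(frozen=True)
class TaxonomyCell:
    """One (application, dominant function) pairing with a short rationale."""
    application: Application
    dominant_function: Function
    rationale: str


# Assumption: only the two examples given in these notes are recorded here.
CELLS = [
    TaxonomyCell(Application.AUTONOMOUS_DRIVING, Function.PREDICTION,
                 "trajectory forecasting"),
    TaxonomyCell(Application.GENERATIVE_GAMES, Function.UNDERSTANDING,
                 "physics consistency"),
]


def dominant_function(app):
    """Return the dominant function recorded for an application, or None."""
    for cell in CELLS:
        if cell.application is app:
            return cell.dominant_function
    return None
```

Calling `dominant_function(Application.ROBOTICS)` returns `None` because these notes do not state which function dominates in that domain; the gap is left open rather than guessed.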

Results

This is a survey rather than an empirical study, so it reports no new experimental results. Key comparative observation: generative video models (Sora-class) and JEPA-style encoders represent fundamentally different bets on whether the "world model" is a pixel-space simulator or a representation-space predictor. The survey treats both as valid instantiations of the broader concept.
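A toy sketch can make the pixel-space versus representation-space distinction concrete. Everything here is an assumption made for illustration: the 16-value "frame", the average-pooling `encode` function, and both loss helpers are hypothetical stand-ins, not the objective of Sora, JEPA, or any model covered by the survey.

```python
import random


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def encode(frame, block=4):
    """Toy encoder (assumption): average-pool the frame in fixed-size blocks."""
    return [sum(frame[i:i + block]) / block for i in range(0, len(frame), block)]


def pixel_space_loss(predicted_frame, target_frame):
    """Simulator-style objective: error is measured on every pixel."""
    return mse(predicted_frame, target_frame)


def representation_space_loss(predicted_latent, target_frame):
    """Representation-style objective: error is measured against the encoded target."""
    return mse(predicted_latent, encode(target_frame))


rng = random.Random(0)
target = [rng.random() for _ in range(16)]

# A prediction that captures coarse structure (block averages) but no fine detail.
coarse_frame = [value for value in encode(target) for _ in range(4)]

pixel_loss = pixel_space_loss(coarse_frame, target)
latent_loss = representation_space_loss(encode(coarse_frame), target)
```

In this toy setup the coarse prediction is penalized in pixel space for the detail it misses, while its latent loss is essentially zero because the encoder discards that detail. That is the trade-off the two bets weigh differently: fidelity to every pixel versus fidelity to an abstract state.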

Limitations

Published November 2024, revised through December 2025 — some 2026 developments (V-JEPA 2, Genie 3 production release) post-date even the revised version
Taxonomy is descriptive rather than predictive about which approach wins

Full Content

49 pages, 6 figures, 8 tables. Extended version of the ACM CSUR paper. Creative Commons BY 4.0.

Source: Understanding World or Predicting Future? A Comprehensive Survey of World Models by Jingtao Ding et al., Tsinghua University
