PAPER · 2025-10-19 · Multiple · arXiv 2510.16732

A Comprehensive Survey on World Models for Embodied AI

Xinqing Li et al.

COMPILED NOTES

Three-axis taxonomy (Functionality × Temporal × Spatial) for embodied AI world models

Abstract

Embodied AI requires agents that perceive, act, and anticipate how actions reshape future world states. The survey presents a unified framework for world models, formalizing problem settings and learning objectives while proposing a three-axis taxonomy for classification.

Three-Axis Taxonomy

Axis 1 — Functionality:

Decision-Coupled (world model trained jointly with a policy, optimized for control)
General-Purpose (world model learned independently, reusable across tasks)

Axis 2 — Temporal Modeling:

Sequential Simulation and Inference (step-by-step autoregressive rollout)
Global Difference Prediction (predict changes between distant states)
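
The two temporal strategies can be contrasted on a toy problem. This is a minimal sketch, not the survey's formulation: the scalar linear dynamics, the constants `A` and `B`, and the function names are assumptions made for illustration, and the closed-form jump stands in for what would be a learned long-range predictor.

```python
# Toy linear dynamics s' = A*s + B (assumed for illustration only).
A, B = 0.5, 1.0


def one_step(state: float) -> float:
    """Hypothetical one-step transition model."""
    return A * state + B


def sequential_rollout(state: float, horizon: int) -> list[float]:
    """Sequential simulation: feed each prediction back in as the next input."""
    trajectory = [state]
    for _ in range(horizon):
        state = one_step(state)
        trajectory.append(state)
    return trajectory


def global_difference(state: float, horizon: int) -> float:
    """Global difference prediction: return the change s_H - s_0 in one call.

    Uses the closed form of the toy dynamics as a stand-in for a learned model.
    """
    a_h = A ** horizon
    future = a_h * state + B * (1 - a_h) / (1 - A)
    return future - state


start, horizon = 4.0, 6
rolled = sequential_rollout(start, horizon)          # every intermediate state
jumped = start + global_difference(start, horizon)   # only the distant state
print(rolled[-1], jumped)
```

The rollout exposes every intermediate state at the cost of `horizon` model calls, while the difference predictor makes one call and returns only the net change between the two endpoints.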

Axis 3 — Spatial Representation:

Global Latent Vector (single-vector world state)
Token Feature Sequence (transformer-style tokenized world)
Spatial Latent Grid (2D/3D latent grids preserving spatial structure)
Decomposed Rendering Representation (Gaussian splats, neural fields, etc.)

Key Contributions

Unified framework covering robotics, autonomous driving, and video world models
Dataset and evaluation-metric systematization across pixel quality, state-level understanding, and task performance
Identifies "physical consistency metrics over pixel fidelity" as an open problem
Curated bibliography at Li-Zn-H/AwesomeWorldModels

Main Conclusions

Critical challenges across embodied AI world models:

Lack of unified datasets
Need for physical consistency metrics over pixel fidelity
Balancing model performance with computational efficiency for real-time control
Long-horizon temporal consistency without error accumulation
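
The error-accumulation challenge can be illustrated with a toy open-loop rollout. This is a sketch under stated assumptions: the constant-velocity dynamics, the 10% velocity error in `model_step`, and all function names are hypothetical and chosen only to make the drift visible.

```python
def true_step(position: float) -> float:
    """Ground-truth dynamics: constant velocity of 0.10 per step (assumed)."""
    return position + 0.10


def model_step(position: float) -> float:
    """Hypothetical learned model with a small velocity error (0.11 vs 0.10)."""
    return position + 0.11


def rollout_errors(start: float, horizon: int) -> list[float]:
    """Roll both systems forward open-loop and record the absolute error."""
    truth = predicted = start
    errors = []
    for _ in range(horizon):
        truth = true_step(truth)
        predicted = model_step(predicted)  # prediction is never re-observed
        errors.append(abs(predicted - truth))
    return errors


errors = rollout_errors(0.0, 50)
print(f"error after 1 step:   {errors[0]:.3f}")
print(f"error after 50 steps: {errors[-1]:.3f}")
```

A per-step error of 0.01 looks negligible, but because each prediction becomes the next input, the gap grows with every step of the rollout.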

Why This Matters

Complements the Tsinghua survey by focusing specifically on embodied applications. The three-axis taxonomy is a useful filter when comparing papers. For example, V-JEPA 2 is Decision-Coupled × Sequential × Global Latent Vector, and Genie 3 is General-Purpose × Sequential × Token Feature Sequence.
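
Using the taxonomy as a filter might look like the following sketch. The class names, the `select` helper, and the `PAPERS` list are hypothetical. The two labels are the ones given in the note above, not an independent check against the survey.

```python
from dataclasses import dataclass
from enum import Enum


class Functionality(Enum):
    DECISION_COUPLED = "Decision-Coupled"
    GENERAL_PURPOSE = "General-Purpose"


class Temporal(Enum):
    SEQUENTIAL = "Sequential Simulation and Inference"
    GLOBAL_DIFFERENCE = "Global Difference Prediction"


class Spatial(Enum):
    GLOBAL_LATENT_VECTOR = "Global Latent Vector"
    TOKEN_FEATURE_SEQUENCE = "Token Feature Sequence"
    SPATIAL_LATENT_GRID = "Spatial Latent Grid"
    DECOMPOSED_RENDERING = "Decomposed Rendering Representation"


@dataclass(frozen=True)
class Entry:
    """One paper placed on the three axes."""
    name: str
    functionality: Functionality
    temporal: Temporal
    spatial: Spatial


# Labels copied from the note above (assumed, not re-derived from the papers).
PAPERS = [
    Entry("V-JEPA 2", Functionality.DECISION_COUPLED,
          Temporal.SEQUENTIAL, Spatial.GLOBAL_LATENT_VECTOR),
    Entry("Genie 3", Functionality.GENERAL_PURPOSE,
          Temporal.SEQUENTIAL, Spatial.TOKEN_FEATURE_SEQUENCE),
]


def select(entries: list[Entry], **axes: Enum) -> list[str]:
    """Return names of entries whose axis values match every given keyword."""
    return [
        entry.name
        for entry in entries
        if all(getattr(entry, axis) == value for axis, value in axes.items())
    ]


print(select(PAPERS, temporal=Temporal.SEQUENTIAL))
print(select(PAPERS, spatial=Spatial.TOKEN_FEATURE_SEQUENCE))
```

Filtering on one axis at a time shows where two papers agree (both roll out sequentially) and where they diverge (functionality and spatial representation).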

Source: A Comprehensive Survey on World Models for Embodied AI by Li et al.
