PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models
A 12,600-video empirical benchmark quantifying systematic physics failures in Sora and peer generative text-to-video models
Key Claims
- 12,600 generated videos evaluated across state-of-the-art text-to-video models
- Systematic physics failures — models frequently violate rigid-body collisions, fluid dynamics, and simple gravity
- Multi-object prompts break generative world models — Sora and similar models fail more frequently as prompt complexity increases
- Benchmark organizes failures by physics category, difficulty, and scenario type
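The benchmark's organization by physics category, difficulty, and scenario type implies a per-cell aggregation of violation rates. A minimal sketch of that aggregation, assuming a hypothetical record schema of `(category, difficulty, passed_physics_check)` tuples (the names and data format are illustrative, not PhyWorldBench's actual schema):

```python
from collections import defaultdict

# Hypothetical evaluation records; PhyWorldBench's real data format may differ.
results = [
    ("rigid_body", "easy", True),
    ("rigid_body", "hard", False),
    ("fluid", "easy", False),
    ("gravity", "easy", True),
    ("gravity", "hard", False),
]

def failure_rates(records):
    """Aggregate physics-violation rates per (category, difficulty) cell."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for category, difficulty, passed in records:
        key = (category, difficulty)
        totals[key] += 1
        if not passed:
            fails[key] += 1
    return {key: fails[key] / totals[key] for key in totals}

rates = failure_rates(results)
# e.g. rates[("gravity", "hard")] → 1.0 on this toy data
```

Slicing by cell rather than reporting one global score is what lets the benchmark show that failure rates climb with prompt complexity.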
Why This Matters
This is the strongest empirical support in the KB so far for LeCun's critique of generative world models. The critique had been theoretical ("pixel prediction produces blurry averages of unpredictable futures"); PhyWorldBench converts it into quantitative failure rates. If Genie 3, Sora, and similar generative systems are supposed to function as world models (i.e., support planning, counterfactual reasoning, and embodied control), their systematic physics failures undermine that claim.
Companion datapoint from VideoScience-Bench (Dec 2025): Sora-2 scores ~64% on "Phenomenon Congruency" and Veo-3 ~58.7%, both derived from Likert-scale ratings. Even the strongest closed-source systems fall well short of ground-truth physical realism.
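A percentage derived from Likert ratings implies some normalization convention. A minimal sketch, assuming a linear map from the 1..max rating range onto 0-100% (VideoScience-Bench's actual normalization is not specified here and may differ):

```python
def likert_to_percent(ratings, scale_max=5):
    """Map a mean Likert rating (1..scale_max) onto a 0-100% score.

    The linear mapping is an assumption for illustration: a unanimous
    lowest rating gives 0%, a unanimous highest rating gives 100%.
    """
    mean = sum(ratings) / len(ratings)
    return 100 * (mean - 1) / (scale_max - 1)

# A mean rating of 3 on a 1-5 scale lands at the 50% midpoint.
midpoint = likert_to_percent([3])
```

Under this convention, a ~64% congruency score corresponds to a mean rating of roughly 3.6 out of 5: well above chance agreement, but far from consistent physical plausibility.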
Positioning in the Debate
- Undermines the generative camp's strongest claim that pixel-space video models can serve as world models
- Supports Thesis 4, specifically the prediction that generative models will specialize in simulation/content while JEPA-style models dominate control, because physics failures are catastrophic for control but tolerable for entertainment
- Raises an open problem: can the generative camp close the physics gap via architectural changes, or does it cap their utility as planning substrates?
Notes
Part of a growing 2025-2026 benchmark family: T2VPhysBench (2025), PhyWorldBench (2025), VideoScience-Bench (Dec 2025), Physion-Eval (Mar 2026), PhyAVBench (Dec 2025). The benchmark infrastructure for empirical critique of generative world models matured quickly in the last 12 months.
Source: PhyWorldBench