PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models
A 12,600-video empirical benchmark quantifying systematic physics failures in Sora and peer generative text-to-video models
Key Claims
- 12,600 generated videos evaluated across state-of-the-art text-to-video models
- Systematic physics failures — models frequently violate rigid-body collisions, fluid dynamics, and simple gravity
- Multi-object prompts break generative world models — Sora and similar models fail more frequently as prompt complexity increases
- Benchmark organizes failures by physics category, difficulty, and scenario type
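The benchmark's organization by physics category, difficulty, and scenario type implies a per-cell aggregation of violation rates. A minimal sketch of that aggregation, assuming a hypothetical record schema of `(category, difficulty, passed_physics_check)` tuples (the names and data format are illustrative, not PhyWorldBench's actual schema):

```python
from collections import defaultdict

# Hypothetical evaluation records; PhyWorldBench's real data format may differ.
results = [
    ("rigid_body", "easy", True),
    ("rigid_body", "hard", False),
    ("fluid", "easy", False),
    ("gravity", "easy", True),
    ("gravity", "hard", False),
]

def failure_rates(records):
    """Aggregate physics-violation rates per (category, difficulty) cell."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for category, difficulty, passed in records:
        key = (category, difficulty)
        totals[key] += 1
        if not passed:
            fails[key] += 1
    return {key: fails[key] / totals[key] for key in totals}

rates = failure_rates(results)
# e.g. rates[("gravity", "hard")] → 1.0 on this toy data
```

Slicing by cell rather than reporting one global score is what lets the benchmark show that failure rates climb with prompt complexity.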
Why This Matters
This is the strongest empirical support in the KB so far for LeCun's critique of generative world models. The critique had been theoretical ("pixel prediction produces blurry averages of unpredictable futures"); PhyWorldBench converts it into quantitative failure rates. If Genie 3, Sora, and similar generative systems are supposed to function as world models (i.e., support planning, counterfactual reasoning, and embodied control), their systematic physics failures undermine that claim.
Companion datapoint from VideoScience-Bench (Dec 2025): Sora-2 scores ~64% on "Phenomenon Congruency" and Veo-3 ~58.7%, both derived from Likert-scale ratings. Even the strongest closed-source systems fall well short of ground-truth physical realism.
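A percentage derived from Likert ratings implies some normalization convention. A minimal sketch, assuming a linear map from the 1..max rating range onto 0-100% (VideoScience-Bench's actual normalization is not specified here and may differ):

```python
def likert_to_percent(ratings, scale_max=5):
    """Map a mean Likert rating (1..scale_max) onto a 0-100% score.

    The linear mapping is an assumption for illustration: a unanimous
    lowest rating gives 0%, a unanimous highest rating gives 100%.
    """
    mean = sum(ratings) / len(ratings)
    return 100 * (mean - 1) / (scale_max - 1)

# A mean rating of 3 on a 1-5 scale lands at the 50% midpoint.
midpoint = likert_to_percent([3])
```

Under this convention, a ~64% congruency score corresponds to a mean rating of roughly 3.6 out of 5: well above chance agreement, but far from consistent physical plausibility.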
Positioning in the Debate
- Undermines the generative camp's strongest claim that pixel-space video models can serve as world models
- Supports Thesis 4, specifically the prediction that generative models will specialize in simulation/content while JEPA-style models dominate control, because physics failures are catastrophic for control but tolerable for entertainment
- Raises an open problem: can the generative camp close the physics gap via architectural changes, or does it cap their utility as planning substrates?
Notes
Part of a growing 2025-2026 benchmark family: T2VPhysBench (2025), PhyWorldBench (2025), VideoScience-Bench (Dec 2025), Physion-Eval (Mar 2026), PhyAVBench (Dec 2025). The benchmark infrastructure for empirical critique of generative world models matured quickly in the last 12 months.
Source: PhyWorldBench