VideoScience-Bench: Benchmarking Scientific Understanding and Reasoning for Video Generation
Sora-2 ~64% / Veo-3 ~58.7% on Phenomenon Congruency — quantifies how far frontier video models are from ground-truth physical realism
Key Claims
- Sora-2 scores ≈ 64% and Veo-3 ≈ 58.7% on Phenomenon Congruency, the headline Likert-based metric
- Scientific understanding and physical reasoning remain far from ground-truth performance even in frontier closed-source video generators
- Introduces a benchmark specifically designed around scientific phenomena (not general visual realism)
Why This Matters
Along with PhyWorldBench, this is the most recent empirical evidence that generative video models do not yet meet the criteria needed to serve as world models for embodied or scientific reasoning tasks. Phenomenon Congruency is the metric the generative camp needs to win on if Sora/Veo/Genie are to credibly act as world-model substrates for robotics or autonomous vehicles.
At roughly 58-64% on a Likert-based scale, these models are better than random but far from ground truth. This is the quantitative counterpart of LeCun's rhetorical claim that generative models produce blurry averages.
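The exact scoring protocol behind the headline percentage is not given here, but a minimal sketch (assuming a 1-5 Likert scale, mean pooling over ratings, and min-max normalization to 0-100%; the function name is hypothetical) illustrates why scores in the high-50s to mid-60s sit well above the ~50% a uniform-random rater would land on, yet far below the 100% ceiling:

```python
# Hypothetical aggregation sketch: VideoScience-Bench's actual scoring protocol
# is not specified here; the 1-5 scale, mean pooling, and min-max normalization
# are all assumptions for illustration.
from statistics import mean

def congruency_percent(ratings: list[int], scale_min: int = 1, scale_max: int = 5) -> float:
    """Normalize the mean of Likert ratings (assumed 1-5) to a 0-100% score."""
    avg = mean(ratings)
    return 100.0 * (avg - scale_min) / (scale_max - scale_min)

# Ratings clustered around 3-4 out of 5 land near the reported band,
# while a uniform-random rater (mean rating 3) would sit at 50%.
sample = [4, 3, 4, 3, 4, 3, 3, 4, 2, 4]
print(f"{congruency_percent(sample):.1f}%")  # 60.0%
```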
Key Data Point for the Schism
The pro-generative argument has been: "Scale solves everything; if we scale enough, physics will emerge." VideoScience-Bench says: at frontier 2025 scale, physics has not emerged. This strengthens the JEPA camp's counterargument: the representation-space bet may be not just elegant but necessary, because pixel-space models may have architectural limits on physical consistency that scaling alone cannot fix.
Notes
Published December 2025 — one of the most recent empirical anchors. Future compile passes should track whether 2026 releases close the gap or plateau.
Source: VideoScience-Bench