VideoScience-Bench: Benchmarking Scientific Understanding and Reasoning for Video Generation
Sora-2 ~64% / Veo-3 ~58.7% on Phenomenon Congruency — quantifies how far frontier video models are from ground-truth physical realism
Key Claims
- Sora-2 scores ≈ 64% and Veo-3 ≈ 58.7% on Phenomenon Congruency, the headline Likert-based metric
- Scientific understanding and physical reasoning remain far from ground-truth performance even in frontier closed-source video generators
- Introduces a benchmark specifically designed around scientific phenomena (not general visual realism)
Why This Matters
Along with PhyWorldBench, this is the most recent empirical evidence that generative video models do not yet meet the criteria needed to serve as world models for embodied or scientific reasoning tasks. Phenomenon Congruency is the metric the generative camp needs to win on if Sora/Veo/Genie are to credibly act as world-model substrates for robotics or autonomous vehicles.
At roughly 58-64% on a Likert-based scale, these models are better than random but far from ground truth. This is the quantitative counterpart of LeCun's rhetorical claim that generative models produce blurry averages.
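The exact scoring protocol behind the headline percentage is not given here, but a minimal sketch (assuming a 1-5 Likert scale, mean pooling over ratings, and min-max normalization to 0-100%; the function name is hypothetical) illustrates why scores in the high-50s to mid-60s sit well above the ~50% a uniform-random rater would land on, yet far below the 100% ceiling:

```python
# Hypothetical aggregation sketch: VideoScience-Bench's actual scoring protocol
# is not specified here; the 1-5 scale, mean pooling, and min-max normalization
# are all assumptions for illustration.
from statistics import mean

def congruency_percent(ratings: list[int], scale_min: int = 1, scale_max: int = 5) -> float:
    """Normalize the mean of Likert ratings (assumed 1-5) to a 0-100% score."""
    avg = mean(ratings)
    return 100.0 * (avg - scale_min) / (scale_max - scale_min)

# Ratings clustered around 3-4 out of 5 land near the reported band,
# while a uniform-random rater (mean rating 3) would sit at 50%.
sample = [4, 3, 4, 3, 4, 3, 3, 4, 2, 4]
print(f"{congruency_percent(sample):.1f}%")  # 60.0%
```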
Key Data Point for the Schism
The pro-generative argument has been: "Scale solves everything; if we scale enough, physics will emerge." VideoScience-Bench says: at frontier 2025 scale, physics has not emerged. This strengthens the JEPA camp's counterargument: the representation-space bet may be not just elegant but necessary, because pixel-space models may have architectural limits on physical consistency that scaling alone cannot fix.
Notes
Published December 2025 — one of the most recent empirical anchors. Future compile passes should track whether 2026 releases close the gap or plateau.
Source: VideoScience-Bench