JEPA and generative world models will specialize to different use cases (control vs. simulation), not converge on a single architecture
Conviction: 6.0/10
Trajectory: no history yet
Last reviewed: —
Split from original Thesis 4 on 2026-04-22. This is the more opinionated, higher-risk half.
The JEPA camp (Meta FAIR) and the generative camp (DeepMind Genie, Sora-family, Wayve GAIA) are architecturally incompatible in ways that map onto different commercial use cases. JEPA's abstract-representation prediction is efficient for planning and control; generative models' pixel-space output is native to simulation, content creation, and data augmentation. The split will persist rather than resolve: each approach will specialize toward where its failure mode is tolerable (pixel failures are funny in video, catastrophic in robots; missing visual detail is acceptable for planning, unacceptable for entertainment).
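The architectural split can be made concrete by contrasting where the two training objectives compute their loss. Below is a minimal sketch, assuming toy convolutional encoder/decoder modules, a linear latent predictor, and MSE losses; the module names, shapes, 4-dim action, and EMA-style target encoder are illustrative assumptions, not the actual V-JEPA or Genie/GAIA designs. The only point is the location of the error signal: JEPA-style training penalizes prediction error in the learned representation space, while a generative world model decodes back to pixels and penalizes every visual detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Toy encoder: (B, 3, 64, 64) frame -> (B, 256) latent."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=4), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 4 * 4, dim),
        )

    def forward(self, x):
        return self.net(x)


class Decoder(nn.Module):
    """Toy decoder: (B, 256) latent -> (B, 3, 64, 64) frame."""
    def __init__(self, dim=256):
        super().__init__()
        self.fc = nn.Linear(dim, 64 * 4 * 4)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=4), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=4),
        )

    def forward(self, z):
        return self.net(self.fc(z).view(-1, 64, 4, 4))


def jepa_style_loss(encoder, target_encoder, predictor, frame_t, frame_t1, action):
    """Predict the *latent* of the next frame; detail the encoder
    discards is never penalized."""
    z_t = encoder(frame_t)
    with torch.no_grad():                      # target branch, e.g. an EMA copy
        z_target = target_encoder(frame_t1)
    z_pred = predictor(torch.cat([z_t, action], dim=-1))
    return F.mse_loss(z_pred, z_target)


def generative_style_loss(encoder, decoder, predictor, frame_t, frame_t1, action):
    """Predict the next frame in *pixel space*; every visual detail
    carries loss, so nothing can be safely ignored."""
    z_t = encoder(frame_t)
    z_pred = predictor(torch.cat([z_t, action], dim=-1))
    return F.mse_loss(decoder(z_pred), frame_t1)


if __name__ == "__main__":
    enc, tgt_enc, dec = Encoder(), Encoder(), Decoder()
    predictor = nn.Linear(256 + 4, 256)        # latent + 4-dim action -> next latent
    f_t, f_t1 = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
    a = torch.randn(8, 4)
    print(jepa_style_loss(enc, tgt_enc, predictor, f_t, f_t1, a).item())
    print(generative_style_loss(enc, dec, predictor, f_t, f_t1, a).item())
```

The commercial mapping falls out of that choice: the latent objective can ignore texture and lighting it never encodes (fine for planning and control, useless for content), while the pixel objective must model them (necessary for simulation and data generation, wasteful and failure-prone for control).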
Confidence: 6/10
Supporting evidence:
- StructVLA explicitly argues against dense pixel rollouts for planning: sparse structured keyframes win for control. Evidence: moderate (StructVLA)
- PhyWorldBench and VideoScience-Bench quantify the systematic physics failures in generative models (Sora-2 ~64%, Veo-3 ~58.7%); these failures are tolerable for entertainment but catastrophic for embodied control. Evidence: strong (PhyWorldBench, VideoScience-Bench)
- GAIA-2's deployment at Wayve is explicitly for simulation and data augmentation, not end-to-end control. Evidence: strong (GAIA-2)
- V-JEPA 2-AC's deployment is explicitly for robotic control, not content generation. Evidence: strong (V-JEPA 2)
Challenging evidence:
- LeWM is the strongest counter-signal: if stable end-to-end pixel JEPA scales, the JEPA/generative distinction partially collapses, because LeWM is architecturally both. Evidence: moderate (LeWM)
- GAIA-2 uses latent diffusion — i.e., prediction in learned latent space, not naive pixel space. The "generative = pixel-space" caricature is already inaccurate
- StructVLA straddles the boundary: it is billed as a VLA extension that borrows world-model capabilities. The "two camps" framing is rhetorical; the research is hybridizing
- Diffusion in latent space (broadly) is a plausible unifying architecture that could subsume both camps if one lab executes well (see the sketch after this list)
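To make the diffusion-in-latent-space unifier concrete, here is a hedged toy sketch of an action-conditioned latent denoiser, loosely DDPM-style. The class name LatentDenoiser, the linear beta schedule, the timestep embedding, and all shapes are hypothetical illustrations, not GAIA-2's or any lab's actual design. The reason this family blurs the two-camp framing is that the sampled next-frame latent can either feed a planner directly (the JEPA-style use) or be pushed through a pixel decoder such as the toy Decoder in the earlier sketch (the generative use).

```python
import torch
import torch.nn as nn


class LatentDenoiser(nn.Module):
    """Toy DDPM-style denoiser over next-frame latents, conditioned on the
    current latent and the action; purely illustrative."""
    def __init__(self, dim=256, act_dim=4, steps=50):
        super().__init__()
        self.steps = steps
        self.net = nn.Sequential(
            nn.Linear(dim * 2 + act_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, dim),
        )
        betas = torch.linspace(1e-4, 0.02, steps)          # linear noise schedule
        self.register_buffer("betas", betas)
        self.register_buffer("alphas_cumprod", torch.cumprod(1.0 - betas, dim=0))

    def forward(self, noisy_z1, z_t, action, t):
        """Predict the noise that was added to the next-frame latent."""
        t_embed = t.float().unsqueeze(-1) / self.steps
        return self.net(torch.cat([noisy_z1, z_t, action, t_embed], dim=-1))

    @torch.no_grad()
    def sample(self, z_t, action):
        """Run the reverse process to draw a next-frame latent."""
        z = torch.randn_like(z_t)
        for i in reversed(range(self.steps)):
            t = torch.full((z.shape[0],), i, device=z.device)
            eps = self(z, z_t, action, t)
            alpha, acp = 1.0 - self.betas[i], self.alphas_cumprod[i]
            # standard DDPM mean update, then add noise except at the last step
            z = (z - self.betas[i] / torch.sqrt(1.0 - acp) * eps) / torch.sqrt(alpha)
            if i > 0:
                z = z + torch.sqrt(self.betas[i]) * torch.randn_like(z)
        return z


if __name__ == "__main__":
    model = LatentDenoiser()
    z_t, action = torch.randn(8, 256), torch.randn(8, 4)
    z_next = model.sample(z_t, action)   # usable by a planner, or decodable to pixels
    print(z_next.shape)                  # torch.Size([8, 256])
```

Whether this family actually subsumes both camps is exactly what the end-2028 condition below tracks; the sketch only shows why the architectural boundary is softer than the two-camp rhetoric suggests.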
Evolution:
- Apr 22, 2026: Thesis 4b split from the original compound Thesis 4 at 6/10. The debate exposed this as the opinionated prediction the compound version was papering over. 6/10 honestly reports that specialization is more likely than architectural convergence, but not by much; the hybrids (LeWM, GAIA-2, StructVLA) are already blurring the line.
Depends on: joint-embedding-predictive-architecture, generative-world-models, hierarchical-planning
Would change if:
- A unified architecture (e.g., scaled-up LeWM, or large-scale diffusion-in-latent-space JEPA) matches or beats both camps on both control and simulation by end-2028 — would lower to 3/10
- Generative models close the physics gap (>85% on PhyWorldBench-equivalent metrics) and take over robotic control use cases — would lower to 4/10 and invert the thesis
- JEPA extends to high-quality video generation (not just planning) at competitive fidelity to Genie 3 — would lower to 4/10