Beyond Dense Futures: World Models as Structured Planners for Robotic Manipulation (StructVLA)
COMPILED NOTES
Structured sparse frame prediction for planning — avoids dense pixel rollouts by predicting physically meaningful keyframes
StructVLA: Beyond Dense Futures
Key Claims
- Rejects dense pixel rollouts for planning — argues they are expensive and unnecessary
- Structured frame prediction — predicts sparse, physically meaningful keyframes derived from intrinsic kinematic cues rather than every frame
- Cast as a VLA (vision-language-action) extension that borrows world-model capabilities selectively
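The keyframe idea in the claims above can be illustrated with a toy sketch. This is an assumed heuristic, not the paper's actual method: it selects sparse keyframes from a manipulation trajectory using two intrinsic kinematic cues (gripper-state flips and near-zero end-effector motion); the `Frame` class, cue choices, and thresholds are all hypothetical.

```python
# Hypothetical sketch of "structured sparse frame prediction" targets:
# pick physically meaningful keyframes from a trajectory via kinematic cues.
from dataclasses import dataclass

@dataclass
class Frame:
    t: int                # timestep index
    ee_pos: tuple         # end-effector xyz position
    gripper_open: bool    # binary gripper state

def speed(a, b):
    # Euclidean distance between consecutive end-effector positions
    return sum((x - y) ** 2 for x, y in zip(a.ee_pos, b.ee_pos)) ** 0.5

def select_keyframes(traj, speed_eps=1e-3):
    """Keep frames where the gripper state flips or motion nearly stops."""
    keys = [traj[0]]  # always keep the initial frame
    for prev, cur in zip(traj, traj[1:]):
        if cur.gripper_open != prev.gripper_open:  # grasp/release event
            keys.append(cur)
        elif speed(prev, cur) < speed_eps:         # pause -> likely subgoal
            if keys[-1].t != cur.t:
                keys.append(cur)
    if keys[-1].t != traj[-1].t:
        keys.append(traj[-1])                      # always keep the final frame
    return keys

traj = [
    Frame(0, (0.0, 0.0, 0.0), True),
    Frame(1, (0.1, 0.0, 0.0), True),
    Frame(2, (0.1, 0.0, 0.0), False),  # gripper closes: keyframe
    Frame(3, (0.2, 0.1, 0.0), False),
    Frame(4, (0.2, 0.1, 0.0), False),  # motion stops: keyframe
    Frame(5, (0.2, 0.1, 0.1), True),   # gripper opens: keyframe
]
print([f.t for f in select_keyframes(traj)])  # → [0, 2, 4, 5]
```

The point of the sketch: a planner that predicts only these few frames sidesteps generating every intermediate pixel frame, which is the cost argument the claims make.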
Why This Matters
Empirical support for LeCun's argument against pixel prediction: if you can plan at the level of structured, sparse, physically-meaningful frames, you get the benefits of world-model planning without the compute cost of generating every frame. Sits in the same conceptual family as JEPA (abstract representations > pixels) but applied within the VLA lineage.
Notes
Companion to H-WM. Together these two are the robotics-manipulation answer to "how does world-model theory cash out in deployed control?"
Source: StructVLA