Beyond Dense Futures: World Models as Structured Planners for Robotic Manipulation (StructVLA)
COMPILED NOTES
Structured sparse frame prediction for planning — avoids dense pixel rollouts by predicting physically meaningful keyframes
StructVLA: Beyond Dense Futures
Key Claims
- Rejects dense pixel rollouts for planning — argues they are expensive and unnecessary
- Structured frame prediction — predicts sparse, physically meaningful keyframes derived from intrinsic kinematic cues rather than every frame
- Cast as a VLA (vision-language-action) extension that borrows world-model capabilities selectively
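The keyframe idea in the claims above can be illustrated with a toy sketch. This is an assumed heuristic, not the paper's actual method: it selects sparse keyframes from a manipulation trajectory using two intrinsic kinematic cues (gripper-state flips and near-zero end-effector motion); the `Frame` class, cue choices, and thresholds are all hypothetical.

```python
# Hypothetical sketch of "structured sparse frame prediction" targets:
# pick physically meaningful keyframes from a trajectory via kinematic cues.
from dataclasses import dataclass

@dataclass
class Frame:
    t: int                # timestep index
    ee_pos: tuple         # end-effector xyz position
    gripper_open: bool    # binary gripper state

def speed(a, b):
    # Euclidean distance between consecutive end-effector positions
    return sum((x - y) ** 2 for x, y in zip(a.ee_pos, b.ee_pos)) ** 0.5

def select_keyframes(traj, speed_eps=1e-3):
    """Keep frames where the gripper state flips or motion nearly stops."""
    keys = [traj[0]]  # always keep the initial frame
    for prev, cur in zip(traj, traj[1:]):
        if cur.gripper_open != prev.gripper_open:  # grasp/release event
            keys.append(cur)
        elif speed(prev, cur) < speed_eps:         # pause -> likely subgoal
            if keys[-1].t != cur.t:
                keys.append(cur)
    if keys[-1].t != traj[-1].t:
        keys.append(traj[-1])                      # always keep the final frame
    return keys

traj = [
    Frame(0, (0.0, 0.0, 0.0), True),
    Frame(1, (0.1, 0.0, 0.0), True),
    Frame(2, (0.1, 0.0, 0.0), False),  # gripper closes: keyframe
    Frame(3, (0.2, 0.1, 0.0), False),
    Frame(4, (0.2, 0.1, 0.0), False),  # motion stops: keyframe
    Frame(5, (0.2, 0.1, 0.1), True),   # gripper opens: keyframe
]
print([f.t for f in select_keyframes(traj)])  # → [0, 2, 4, 5]
```

The point of the sketch: a planner that predicts only these few frames sidesteps generating every intermediate pixel frame, which is the cost argument the claims make.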
Why This Matters
Empirical support for LeCun's argument against pixel prediction: if you can plan at the level of structured, sparse, physically-meaningful frames, you get the benefits of world-model planning without the compute cost of generating every frame. Sits in the same conceptual family as JEPA (abstract representations > pixels) but applied within the VLA lineage.
Notes
Companion to H-WM. Together these two are the robotics-manipulation answer to "how does world-model theory cash out in deployed control?"
Source: StructVLA