LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
First JEPA to train stably end-to-end from raw pixels using only two loss terms, removing the EMA/distillation tricks earlier JEPAs required
Key Claims
- First stable end-to-end JEPA from raw pixels — previous JEPAs (I-JEPA, V-JEPA) relied on stabilization tricks (an EMA teacher, distillation targets) to prevent representational collapse
- Only two loss terms — a major simplification over the multi-objective stacks used in DINO-style SSL
- Open source — code at lucas-maes/le-wm
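The notes don't specify what LeWM's two loss terms actually are. As an illustrative sketch only (assuming a plain predictive MSE plus a VICReg-style variance hinge as the anti-collapse term, which may differ entirely from the paper's formulation), a two-term JEPA objective could look like:

```python
import numpy as np

def jepa_two_term_loss(pred, target, embeddings, var_weight=1.0, eps=1e-4):
    """Hypothetical two-term JEPA objective: prediction error plus an
    anti-collapse variance penalty. Illustrative, NOT LeWM's actual losses."""
    # Term 1: predictive loss, predictor output vs. target-patch embedding.
    pred_loss = np.mean((pred - target) ** 2)
    # Term 2: hinge on the per-dimension std of the batch embeddings; it
    # pushes each dimension's std toward 1, so a constant (collapsed)
    # encoder output pays a penalty of roughly 1 per dimension.
    std = np.sqrt(embeddings.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, 1.0 - std))
    return pred_loss + var_weight * var_loss
```

Note the trade-off this toy version exposes: a collapsed encoder zeroes term 1 but maxes out term 2, so the sum can no longer be trivially minimized by a constant representation.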
Why This Matters
The representational collapse problem is the central technical challenge for JEPAs: without careful engineering, the encoder learns to output a constant (zero-information) representation that trivially satisfies the predictive loss. Every JEPA variant to date has been a different answer to "how do we prevent collapse?" LeWM is a compelling answer because it's the simplest — if this scales, it removes one of the biggest complaints about JEPA as a recipe.
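The collapse failure mode is easy to see in a few lines: a constant encoder drives a bare predictive loss to zero while carrying no information (illustrative numbers, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def predictive_loss(z_pred, z_target):
    # Plain MSE between predicted and target embeddings.
    return np.mean((z_pred - z_target) ** 2)

# A collapsed encoder maps every input to the same vector...
collapsed = np.zeros((16, 64))
# ...so predicting the target embedding is trivially perfect:
print(predictive_loss(collapsed, collapsed))  # 0.0: zero loss, zero information

# A healthy encoder produces varied embeddings, so the predictor
# must actually model structure in the data to reduce the loss.
healthy_pred = rng.standard_normal((16, 64))
healthy_tgt = rng.standard_normal((16, 64))
print(predictive_loss(healthy_pred, healthy_tgt) > 0.0)  # True
```

This is why every JEPA needs *some* second mechanism, whether an EMA teacher, a stop-gradient, or an explicit loss term, to make the constant solution unattractive.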
Key research question going forward: does LeWM's simplicity hold at V-JEPA 2 scale, or does the simplification break down at billion-parameter / million-hour-video regimes?
Notes
Recent paper (March 2026). Deeper read deferred — frontier-tracker candidate.
Source: LeWorldModel by Lucas Maes et al.