Self-Supervised Learning (SSL)
Status: Active Frontier
Self-supervised learning trains models on unlabeled data by having the model predict some part of the input from the rest — masked patches in images, future frames in video, next tokens in text. For world-model research, SSL is the learning paradigm that makes training possible at all: no one can label a million hours of video, but you can train a model to predict masked portions of that video.
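The masked-prediction idea can be sketched in a few lines. This is a toy numpy illustration, not any particular published method: the "frame" is a random vector, the "model" is a single linear map, and the point is only that the training targets come from the input itself rather than from human labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video frame": 8x8 grid of intensities, flattened to 64 values.
frame = rng.standard_normal(64)

# Randomly hide ~25% of positions; the model must predict them
# from the visible remainder (the essence of masked-prediction SSL).
mask = rng.random(64) < 0.25
mask[0] = True                    # guarantee at least one masked position
hidden = frame[mask]              # ground truth, taken from the input itself
visible = frame.copy()
visible[mask] = 0.0               # masked positions zeroed out

# Hypothetical one-layer "model": a linear map from visible context to a
# full reconstruction (a real system would be a deep network).
W = rng.standard_normal((64, 64)) * 0.01
reconstruction = W @ visible

# Self-supervised loss: error only on the masked positions --
# no human annotation anywhere in the pipeline.
loss = np.mean((reconstruction[mask] - hidden) ** 2)
print(loss)
```

The same recipe scales from this toy to masked patches in images and masked spatiotemporal blocks in video; only the encoder and the masking scheme change.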
The Bandwidth Argument
A central data point behind LeCun's argument for world models: a 4-year-old child has processed roughly 10^14 bytes of sensory data, the same order of magnitude as the largest LLM training corpora. But the child's data is high-bandwidth sensorimotor signal, while the LLM's data is discrete text that would take a single human roughly 400,000 years of continuous reading to consume. The bandwidth gap, not the total volume, is the argument for why text alone cannot deliver human-level understanding.
| Entity | Data Source | Volume | Timeframe |
|---|---|---|---|
| Large Language Model | Public internet text | ~10^14 bytes (30T tokens) | ~400,000 yrs of reading |
| 4-year-old child | Sensory data (vision, etc.) | ~10^14 bytes | ~16,000 wake hours |
The conclusion: world models must learn from high-bandwidth sensory data (video, audio, tactile) via SSL to acquire physical common sense.
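The table's numbers are back-of-envelope estimates, and the arithmetic is easy to check. The per-channel rates below are assumptions (order-of-magnitude figures in the spirit of LeCun's talks: ~2 MB/s through the visual system, ~3.3 bytes per text token, ~250 tokens/min reading speed), not measured constants:

```python
SECONDS_PER_HOUR = 3600

# Child: ~16,000 waking hours at an assumed ~2 MB/s of visual input.
child_bytes = 16_000 * SECONDS_PER_HOUR * 2e6   # ~1.2e14 bytes

# LLM: ~30T training tokens at an assumed ~3.3 bytes per token.
llm_bytes = 30e12 * 3.3                         # ~1.0e14 bytes

# Reading-time equivalent: assume ~250 tokens/min, 12 reading-hours/day.
reading_minutes = 30e12 / 250
reading_years = reading_minutes / 60 / 12 / 365  # order of 400,000 years

print(f"child ~ {child_bytes:.1e} B, LLM ~ {llm_bytes:.1e} B, "
      f"reading ~ {reading_years:,.0f} years")
```

Both totals land near 10^14 bytes, which is the point of the table: equal volume, wildly unequal bandwidth.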
Key Claims
- SSL on video is now producing deployable world models — V-JEPA 2's 1M+ hours of internet video pre-training yields models that transfer to robotic control with minimal interaction data. Evidence: strong (V-JEPA 2)
- Biological systems learn intuitive physics via observation before language — infants show surprise at physics violations (e.g., a floating car) by ~10 months, well before language acquisition. Evidence: strong (developmental psychology)
- SSL + small interaction dataset beats pure imitation learning for robotics — V-JEPA 2-AC uses <62h of robot video post-training, versus hundreds or thousands of hours typically required for imitation-learned policies. Evidence: moderate (V-JEPA 2)
Open Questions
- What's the right SSL objective for each modality? Masked prediction, contrastive, JEPA, or diffusion-in-latent?
- How do we combine multimodal SSL (video + audio + tactile + proprioception)?
- At what point does SSL on observation saturate, requiring interaction data for further gains?
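To make the first question concrete: masked prediction (sketched above) regresses hidden inputs directly, whereas contrastive objectives such as InfoNCE instead learn which pairs of views belong together. A toy sketch with random embeddings stands in for a trained encoder here; `temp` and the batch shape are illustrative choices, not values from any paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def info_nce(z_a, z_b, temp=0.1):
    """Contrastive objective: embedding i of view A should match
    embedding i of view B better than any mismatched pair."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temp                     # pairwise similarities
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))                  # NLL of correct pairing

# Toy batch: 4 samples, 8-dim embeddings of two augmented "views"
# (random here; a real pipeline would encode two crops of the same clip).
z1 = rng.standard_normal((4, 8))
z2 = rng.standard_normal((4, 8))
loss = info_nce(z1, z2)
print(loss)
```

The open question is which family of objectives (reconstruction, contrastive, JEPA-style latent prediction, latent diffusion) is right for each sensory modality, and the answer may differ across them.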
Related Concepts
- World Models — SSL is the training paradigm
- Joint Embedding Predictive Architecture (JEPA) — a specific SSL architecture
- Chain-of-Thought Reasoning — LLM-side SSL on text; different bandwidth regime
Changelog
- 2026-04-22 — Initial compilation. Includes the bandwidth argument as the load-bearing claim.