Self-Supervised Learning (SSL)
Status: Active Frontier
Self-supervised learning trains models on unlabeled data by having the model predict some part of the input from the rest — masked patches in images, future frames in video, next tokens in text. For world-model research, SSL is the learning paradigm that makes training possible at all: no one can label a million hours of video, but you can train a model to predict masked portions of that video.
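The masked-prediction idea can be sketched in a few lines. This is a toy numpy illustration, not any particular published method: the "frame" is a random vector, the "model" is a single linear map, and the point is only that the training targets come from the input itself rather than from human labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video frame": 8x8 grid of intensities, flattened to 64 values.
frame = rng.standard_normal(64)

# Randomly hide ~25% of positions; the model must predict them
# from the visible remainder (the essence of masked-prediction SSL).
mask = rng.random(64) < 0.25
mask[0] = True                    # guarantee at least one masked position
hidden = frame[mask]              # ground truth, taken from the input itself
visible = frame.copy()
visible[mask] = 0.0               # masked positions zeroed out

# Hypothetical one-layer "model": a linear map from visible context to a
# full reconstruction (a real system would be a deep network).
W = rng.standard_normal((64, 64)) * 0.01
reconstruction = W @ visible

# Self-supervised loss: error only on the masked positions --
# no human annotation anywhere in the pipeline.
loss = np.mean((reconstruction[mask] - hidden) ** 2)
print(loss)
```

The same recipe scales from this toy to masked patches in images and masked spatiotemporal blocks in video; only the encoder and the masking scheme change.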
The Bandwidth Argument
A central data point behind LeCun's argument for world models: a 4-year-old child has processed roughly 10^14 bytes of sensory data, the same order of magnitude as the largest LLM training corpora. But the child's data is high-bandwidth sensorimotor signal, while the LLM's data is discrete text that would take a single human roughly 400,000 years of continuous reading to consume. The bandwidth gap, not the total volume, is the argument for why text alone cannot deliver human-level understanding.
| Entity | Data Source | Volume | Timeframe |
|---|---|---|---|
| Large Language Model | Public internet text | ~10^14 bytes (30T tokens) | ~400,000 yrs of reading |
| 4-year-old child | Sensory data (vision, etc.) | ~10^14 bytes | ~16,000 wake hours |
The conclusion: world models must learn from high-bandwidth sensory data (video, audio, tactile) via SSL to acquire physical common sense.
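The table's numbers are back-of-envelope estimates, and the arithmetic is easy to check. The per-channel rates below are assumptions (order-of-magnitude figures in the spirit of LeCun's talks: ~2 MB/s through the visual system, ~3.3 bytes per text token, ~250 tokens/min reading speed), not measured constants:

```python
SECONDS_PER_HOUR = 3600

# Child: ~16,000 waking hours at an assumed ~2 MB/s of visual input.
child_bytes = 16_000 * SECONDS_PER_HOUR * 2e6   # ~1.2e14 bytes

# LLM: ~30T training tokens at an assumed ~3.3 bytes per token.
llm_bytes = 30e12 * 3.3                         # ~1.0e14 bytes

# Reading-time equivalent: assume ~250 tokens/min, 12 reading-hours/day.
reading_minutes = 30e12 / 250
reading_years = reading_minutes / 60 / 12 / 365  # order of 400,000 years

print(f"child ~ {child_bytes:.1e} B, LLM ~ {llm_bytes:.1e} B, "
      f"reading ~ {reading_years:,.0f} years")
```

Both totals land near 10^14 bytes, which is the point of the table: equal volume, wildly unequal bandwidth.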
Key Claims
- SSL on video is now producing deployable world models — V-JEPA 2's 1M+ hours of internet video pre-training yields models that transfer to robotic control with minimal interaction data. Evidence: strong (V-JEPA 2)
- Biological systems learn intuitive physics via observation before language — infants show surprise at physics violations (e.g., a floating car) by ~10 months, well before language acquisition. Evidence: strong (developmental psychology)
- SSL + small interaction dataset beats pure imitation learning for robotics — V-JEPA 2-AC uses <62h of robot video post-training, versus hundreds or thousands of hours typically required for imitation-learned policies. Evidence: moderate (V-JEPA 2)
Open Questions
- What's the right SSL objective for each modality? Masked prediction, contrastive, JEPA, or diffusion-in-latent?
- How do we combine multimodal SSL (video + audio + tactile + proprioception)?
- At what point does SSL on observation saturate, requiring interaction data for further gains?
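To make the first question concrete: masked prediction (sketched above) regresses hidden inputs directly, whereas contrastive objectives such as InfoNCE instead learn which pairs of views belong together. A toy sketch with random embeddings stands in for a trained encoder here; `temp` and the batch shape are illustrative choices, not values from any paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def info_nce(z_a, z_b, temp=0.1):
    """Contrastive objective: embedding i of view A should match
    embedding i of view B better than any mismatched pair."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temp                     # pairwise similarities
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))                  # NLL of correct pairing

# Toy batch: 4 samples, 8-dim embeddings of two augmented "views"
# (random here; a real pipeline would encode two crops of the same clip).
z1 = rng.standard_normal((4, 8))
z2 = rng.standard_normal((4, 8))
loss = info_nce(z1, z2)
print(loss)
```

The open question is which family of objectives (reconstruction, contrastive, JEPA-style latent prediction, latent diffusion) is right for each sensory modality, and the answer may differ across them.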
Related Concepts
- World Models — SSL is the training paradigm
- Joint Embedding Predictive Architecture (JEPA) — a specific SSL architecture
- Chain-of-Thought Reasoning — LLM-side SSL on text; different bandwidth regime
Changelog
- 2026-04-22 — Initial compilation. Includes the bandwidth argument as the load-bearing claim.