Anthropic Transformer Circuits Thread & Circuit Tracing
Tech Report. Running research thread: features, circuits, superposition, attribution graphs, circuit tracing tools.
Overview
The Transformer Circuits Thread is Anthropic's ongoing research program in mechanistic interpretability, spanning 2021 to 2026. The thread progressively builds understanding of how transformer language models compute internally, moving from theoretical frameworks through empirical feature discovery to full computational graph tracing in production-scale models.
Foundational Work (2021-2022)
A Mathematical Framework for Transformer Circuits (Elhage et al., 2021)
- Established rigorous mathematical framework for reverse-engineering transformer models
- Analyzed how attention heads compose to form computational circuits
- Identified key motifs: skip-trigrams in one-layer models, induction heads in two-layer models, and Q/K/V-composition of attention heads
In-Context Learning and Induction Heads (Olsson et al., 2022)
- Discovered induction heads as the mechanistic basis for in-context learning
- Demonstrated phase transitions in model training corresponding to circuit formation
- Linked specific attention head patterns to macro-level learning behaviors
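The induction-head pattern described above can be sketched without any model at all: for each position, look back for the most recent earlier occurrence of the current token and predict the token that followed it (the `[A][B] ... [A] -> [B]` completion). This is a minimal illustration of the behavior, not an implementation of an attention head; the example sequence is hypothetical.

```python
def induction_prediction(tokens):
    """Mimic the induction-head pattern: for each position t, find the most
    recent earlier occurrence of tokens[t] and predict the token that
    followed it. Returns one prediction per position (None if no earlier
    occurrence of the current token exists)."""
    predictions = []
    for t, tok in enumerate(tokens):
        pred = None
        # Scan backwards for a previous occurrence of the current token.
        for s in range(t - 1, -1, -1):
            if tokens[s] == tok and s + 1 < len(tokens):
                pred = tokens[s + 1]
                break
        predictions.append(pred)
    return predictions

# The repeated token "Mr" lets the pattern complete the earlier bigram.
seq = ["Mr", "Dursley", "was", "happy", ".", "Mr"]
print(induction_prediction(seq))  # last entry: "Dursley"
```

The phase transition documented by Olsson et al. corresponds to real attention heads abruptly acquiring this lookup-and-copy behavior during training.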
Toy Models of Superposition (Elhage et al., 2022)
- Investigated how neural networks represent more features than they have dimensions
- Demonstrated polysemanticity: individual neurons encoding multiple unrelated concepts
- Established theoretical foundations for why interpretability is fundamentally difficult
Feature Discovery (2023-2024)
Towards Monosemanticity (Bricken et al., 2023)
- Applied sparse autoencoders to extract interpretable features from transformer activations
- Moved from polysemantic neurons to monosemantic features
- Demonstrated features corresponding to recognizable concepts (code, languages, entities)
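The sparse autoencoder approach can be sketched in a few lines: encode an activation vector into an overcomplete set of non-negative feature activations, reconstruct the input linearly, and train against a reconstruction loss plus an L1 sparsity penalty. The dimensions, initialization, and L1 coefficient below are hypothetical placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_features = 16, 64  # hypothetical sizes; the dictionary is overcomplete
W_enc = rng.normal(size=(d_model, d_features)) * 0.1
b_enc = np.zeros(d_features)
W_dec = rng.normal(size=(d_features, d_model)) * 0.1
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coeff=1e-3):
    """One forward pass of a sparse autoencoder: ReLU-encoded feature
    activations, linear reconstruction, and the L2 + L1 training loss."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # sparse feature activations
    x_hat = f @ W_dec + b_dec                # reconstruction of the input
    loss = np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).sum()
    return f, x_hat, loss

x = rng.normal(size=d_model)                 # stand-in for an MLP activation
f, x_hat, loss = sae_forward(x)
print(f"{(f > 0).mean():.0%} of features active, loss = {loss:.3f}")
```

The L1 penalty is what pushes most feature activations to exactly zero, so that each input is explained by a small number of (hopefully monosemantic) dictionary elements.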
Scaling Monosemanticity (Templeton et al., 2024)
- Extended sparse autoencoder techniques to Claude 3 Sonnet (production-scale model)
- Extracted millions of interpretable features across diverse conceptual domains
- Discovered specific features (e.g., Golden Gate Bridge, safety-relevant concepts) that could be artificially activated to steer model behavior
Circuit Tracing (2025-2026)
Circuit Tracing: Revealing Computational Graphs (Ameisen et al., 2025)
- Developed attribution graphs revealing step-by-step computational paths in language models
- Traced how models integrate attention patterns, feature interactions, and layer computations
- Open-sourced circuit tracing tools for the research community
On the Biology of a Large Language Model (Lindsey et al., 2025)
- Applied circuit tracing to study internal mechanisms of Claude 3.5 Haiku
- Drew analogies between neural network circuits and biological neural systems
- Revealed emergent organizational principles in trained language models
Emotion Concepts and their Function (Sofroniew et al., 2026)
- Studied causal influence of emotion representation features on model behavior
- Demonstrated that emotion features are not merely descriptive but functionally influence downstream computations
- Bridged interpretability research with questions about model cognition
Core Concepts
Features
Interpretable directions in activation space corresponding to human-recognizable concepts. Extracted via sparse autoencoders. Individual features can be artificially activated (feature steering) or suppressed to modulate model behavior.
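Feature steering reduces to simple vector arithmetic: add a scaled copy of a feature's direction to the activations at every position. The sketch below uses hypothetical shapes and random vectors purely to show the mechanics.

```python
import numpy as np

def steer(activations, feature_direction, strength):
    """Feature steering sketch: add a scaled, unit-normalized feature
    direction to every position's activation vector."""
    direction = feature_direction / np.linalg.norm(feature_direction)
    return activations + strength * direction

# Hypothetical numbers: 5 sequence positions, 8-dimensional residual stream.
rng = np.random.default_rng(1)
acts = rng.normal(size=(5, 8))
feat = rng.normal(size=8)
steered = steer(acts, feat, strength=4.0)

# The projection onto the feature direction rises by `strength` everywhere.
delta = (steered - acts) @ (feat / np.linalg.norm(feat))
print(np.allclose(delta, 4.0))  # True
```

A negative `strength` suppresses the feature instead; this is the mechanism behind demonstrations like the Golden Gate Bridge feature in Scaling Monosemanticity.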
Circuits
Computational subgraphs within the model implementing specific behaviors. Composed of attention heads, MLP layers, and feature interactions. Can be traced and analyzed to understand how models produce specific outputs.
Superposition
The phenomenon where models represent more conceptual features than they have available dimensions by packing features into shared, non-orthogonal directions, leading to polysemanticity. Superposition makes interpretability challenging because concepts share neural substrate.
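A toy illustration of superposition, in the spirit of (but much simpler than) Elhage et al.'s toy models: assign each of many features a random direction in a lower-dimensional space, then observe that reading one feature back out picks up "crosstalk" from the others. All sizes here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n_features, n_dims = 20, 5   # more features than dimensions -> superposition

# Assign each feature a random unit direction in the smaller space.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input where only feature 0 is active.
x = np.zeros(n_features)
x[0] = 1.0
hidden = x @ W               # compressed representation

# Reading every feature back off with its own direction shows interference:
readout = hidden @ W.T
print(readout[0])                  # 1.0 for the active feature
print(np.abs(readout[1:]).max())   # nonzero crosstalk from shared dimensions
```

The nonzero off-target readout is exactly why individual neurons (or directions) look polysemantic: multiple features interfere on the same substrate.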
Attribution Graphs
Directed graphs showing how information flows through a model to produce a specific output. Nodes represent features or attention heads; edges represent computational dependencies. Enable end-to-end tracing of model reasoning.
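A miniature version of this structure can be written as a weighted directed graph, scoring each source-to-target path by the product of its edge attributions. The node names and weights below are invented for illustration (loosely modeled on the two-hop "Dallas, Texas, Austin" example from the circuit tracing work), and path-product scoring is a common heuristic, not the papers' exact method.

```python
from collections import defaultdict

# Hypothetical miniature attribution graph: nodes are features or attention
# heads, weighted edges are attribution scores between them.
edges = {
    ("feat:Dallas", "feat:Texas"): 0.9,
    ("feat:capital", "head:L5H3"): 0.7,
    ("feat:Texas", "feat:Austin"): 0.8,
    ("head:L5H3", "feat:Austin"): 0.6,
    ("feat:Austin", "logit:Austin"): 0.95,
}

def path_attributions(edges, source, target):
    """Enumerate all paths from source to target, scoring each path as the
    product of its edge attributions (a simple path-tracing heuristic)."""
    graph = defaultdict(list)
    for (u, v), w in edges.items():
        graph[u].append((v, w))
    paths = []
    def walk(node, path, score):
        if node == target:
            paths.append((path, score))
            return
        for nxt, w in graph[node]:
            walk(nxt, path + [nxt], score * w)
    walk(source, [source], 1.0)
    return paths

for path, score in path_attributions(edges, "feat:Dallas", "logit:Austin"):
    print(" -> ".join(path), f"(score {score:.3f})")
```

Tracing paths like this, from input features through intermediate features and heads to output logits, is what lets attribution graphs expose multi-step internal computations end to end.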