Anthropic Transformer Circuits Thread & Circuit Tracing

Tech Report
Anthropic Interpretability Team (Elhage, Olsson, Bricken, Templeton, Ameisen, Lindsey et al.), Anthropic, March 27, 2025
Key Contribution

Running research thread: features, circuits, superposition, attribution graphs, circuit tracing tools

Overview

The Transformer Circuits Thread is Anthropic's ongoing research program in mechanistic interpretability, spanning 2021 to 2026. The thread progressively builds understanding of how transformer language models compute internally, moving from theoretical frameworks through empirical feature discovery to full computational graph tracing in production-scale models.

Foundational Work (2021-2022)

A Mathematical Framework for Transformer Circuits (Elhage et al., 2021)

  • Established rigorous mathematical framework for reverse-engineering transformer models
  • Analyzed how attention heads compose to form computational circuits
  • Identified key motifs: skip-trigram circuits in one-layer models, and induction heads built from attention-head composition (Q-, K-, and V-composition)

In-Context Learning and Induction Heads (Olsson et al., 2022)

  • Discovered induction heads as the mechanistic basis for in-context learning
  • Demonstrated phase transitions in model training corresponding to circuit formation
  • Linked specific attention head patterns to macro-level learning behaviors
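The induction-head mechanism described above can be stated as a simple algorithm: at each position, look back for the most recent earlier occurrence of the current token and predict the token that followed it. A minimal sketch (illustrative only; the function name is hypothetical, not from the paper):

```python
def induction_predict(tokens):
    """Predict the next token at each position using the induction-head
    rule: find the most recent earlier occurrence of the current token
    and copy the token that followed it. None where no match exists."""
    preds = []
    for i, tok in enumerate(tokens):
        pred = None
        # scan backwards for a previous occurrence of the current token
        for j in range(i - 1, -1, -1):
            if tokens[j] == tok and j + 1 < len(tokens):
                pred = tokens[j + 1]  # copy the token that followed it
                break
        preds.append(pred)
    return preds

# On a repeated sequence, the rule completes the pattern:
print(induction_predict(["A", "B", "C", "A", "B"]))
# -> [None, None, None, 'B', 'C']
```

This copy-from-context behavior is what lets a model continue a pattern it has seen earlier in the prompt, which is why induction heads are tied to in-context learning.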

Toy Models of Superposition (Elhage et al., 2022)

  • Investigated how neural networks represent more features than they have dimensions
  • Demonstrated polysemanticity: individual neurons encoding multiple unrelated concepts
  • Established theoretical foundations for why interpretability is fundamentally difficult
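The core superposition phenomenon can be seen in a few lines of NumPy: assign more feature directions than dimensions, and reading any one feature back out picks up interference from the others. A toy sketch (not the paper's training setup, just an illustration of non-orthogonal feature directions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 6, 2  # more features than dimensions

# Each feature gets a unit direction in the small space; with
# n_features > n_dims these directions cannot all be orthogonal.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only feature 0 is active.
x = np.zeros(n_features)
x[0] = 1.0

h = W.T @ x       # compress into 2 dimensions
readout = W @ h   # read every feature back out

# Feature 0 is recovered exactly, but the inactive features show
# nonzero interference terms.
print(np.round(readout, 3))
```

If features are active sparsely, this interference is tolerable on average, which is the intuition for why trained networks exploit superposition.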

Feature Discovery (2023-2024)

Towards Monosemanticity (Bricken et al., 2023)

  • Applied sparse autoencoders to extract interpretable features from transformer activations
  • Moved from polysemantic neurons to monosemantic features
  • Demonstrated features corresponding to recognizable concepts (code, languages, entities)
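The sparse-autoencoder encoder step amounts to projecting an activation onto an overcomplete dictionary and keeping only the positively-activated entries. A minimal sketch of that forward pass, with made-up dimensions and random weights (a real SAE learns `W_enc` and `b_enc` by training with a sparsity penalty):

```python
import numpy as np

def sae_features(activation, W_enc, b_enc):
    """Encode a model activation into sparse feature activations:
    f = ReLU(W_enc @ a + b_enc). A negative bias keeps most features off."""
    return np.maximum(0.0, W_enc @ activation + b_enc)

rng = np.random.default_rng(1)
d_model, n_feat = 8, 32            # overcomplete: 32 features for 8 dims
W_enc = rng.normal(size=(n_feat, d_model)) / np.sqrt(d_model)
b_enc = np.full(n_feat, -0.5)      # negative bias encourages sparsity

a = rng.normal(size=d_model)       # a stand-in model activation
f = sae_features(a, W_enc, b_enc)
print(f"{int((f > 0).sum())} of {n_feat} features active")
```

In the trained setting, each row of the decoder (not shown) is the direction in activation space that a feature writes back, which is what makes individual features inspectable.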

Scaling Monosemanticity (Templeton et al., 2024)

  • Extended sparse autoencoder techniques to Claude 3 Sonnet, a production-scale model
  • Extracted millions of interpretable features across diverse conceptual domains
  • Discovered specific features (e.g., Golden Gate Bridge, safety-relevant concepts) that could be artificially activated to steer model behavior

Circuit Tracing (2025-2026)

Circuit Tracing: Revealing Computational Graphs (Ameisen et al., 2025)

  • Developed attribution graphs revealing step-by-step computational paths in language models
  • Traced how models integrate attention patterns, feature interactions, and layer computations
  • Open-sourced circuit tracing tools for the research community

On the Biology of a Large Language Model (Lindsey et al., 2025)

  • Applied circuit tracing to study internal mechanisms of Claude 3.5 Haiku
  • Drew analogies between neural network circuits and biological neural systems
  • Revealed emergent organizational principles in trained language models

Emotion Concepts and their Function (Sofroniew et al., 2026)

  • Studied causal influence of emotion representation features on model behavior
  • Demonstrated that emotion features are not merely descriptive but functionally influence downstream computations
  • Bridged interpretability research with questions about model cognition

Core Concepts

Features

Interpretable directions in activation space corresponding to human-recognizable concepts. Extracted via sparse autoencoders. Individual features can be artificially activated (feature steering) or suppressed to modulate model behavior.
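Feature steering, as described above, reduces to adding a scaled copy of a feature's direction into the activation vector. A hedged sketch (the function name and dimensions are illustrative, not an Anthropic API):

```python
import numpy as np

def steer(activation, feature_dir, scale):
    """Feature steering: add a scaled feature direction into an
    activation vector (suppression uses a negative scale)."""
    return activation + scale * feature_dir

rng = np.random.default_rng(2)
d_model = 16
feature_dir = rng.normal(size=d_model)    # one feature's unit direction
feature_dir /= np.linalg.norm(feature_dir)

a = rng.normal(size=d_model)              # some activation vector
a_steered = steer(a, feature_dir, scale=5.0)

# The projection onto the feature direction increases by exactly `scale`.
print(round(float(a_steered @ feature_dir - a @ feature_dir), 1))  # -> 5.0
```

This is the operation behind demonstrations like amplifying the Golden Gate Bridge feature: the intervention is a vector addition, applied at every forward pass.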

Circuits

Computational subgraphs within the model implementing specific behaviors. Composed of attention heads, MLP layers, and feature interactions. Can be traced and analyzed to understand how models produce specific outputs.

Superposition

The phenomenon where models represent more conceptual features than they have available dimensions by packing them into overlapping directions, leading to polysemanticity. Superposition makes interpretability challenging because distinct concepts share neural substrate.

Attribution Graphs

Directed graphs showing how information flows through a model to produce a specific output. Nodes represent features or attention heads; edges represent computational dependencies. Enable end-to-end tracing of model reasoning.
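The data structure is just a weighted directed graph; a common way to read one is to follow the strongest attribution edges from input to output. A toy sketch with entirely hypothetical node names and attribution scores:

```python
# Toy attribution graph: nodes are features (or heads), edge weights
# are attribution scores. All names and weights here are invented.
graph = {
    "input":         [("feat_subject", 0.9), ("feat_syntax", 0.4)],
    "feat_subject":  [("feat_relation", 0.8)],
    "feat_syntax":   [("feat_relation", 0.2)],
    "feat_relation": [("output", 0.7)],
    "output":        [],
}

def strongest_path(graph, node="input"):
    """Greedily follow the highest-attribution edge from each node."""
    path = [node]
    while graph[node]:
        node, _ = max(graph[node], key=lambda edge: edge[1])
        path.append(node)
    return path

print(strongest_path(graph))
# -> ['input', 'feat_subject', 'feat_relation', 'output']
```

Real attribution graphs are pruned versions of much denser graphs, but the end-to-end reading, from input token through intermediate features to the output logit, works the same way.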

Tags

mechanistic-interpretability, circuit-tracing, anthropic, attribution-graphs