Anthropic Transformer Circuits Thread & Circuit Tracing

Tech Report
Anthropic Interpretability Team (Elhage, Olsson, Bricken, Templeton, Ameisen, Lindsey et al.), Anthropic, March 27, 2025
Key Contribution

Running research thread: features, circuits, superposition, attribution graphs, circuit tracing tools

Overview

The Transformer Circuits Thread is Anthropic's ongoing research program in mechanistic interpretability, spanning 2021 to 2026. The thread progressively builds understanding of how transformer language models compute internally, moving from theoretical frameworks through empirical feature discovery to full computational graph tracing in production-scale models.

Foundational Work (2021-2022)

A Mathematical Framework for Transformer Circuits (Elhage et al., 2021)

  • Established rigorous mathematical framework for reverse-engineering transformer models
  • Analyzed how attention heads compose to form computational circuits
  • Identified key motifs: skip-trigram circuits in one-layer models, and induction heads built from attention-head composition (Q-, K-, and V-composition)

In-Context Learning and Induction Heads (Olsson et al., 2022)

  • Discovered induction heads as the mechanistic basis for in-context learning
  • Demonstrated phase transitions in model training corresponding to circuit formation
  • Linked specific attention head patterns to macro-level learning behaviors
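The induction-head mechanism described above can be stated as a simple algorithm: at each position, look back for the most recent earlier occurrence of the current token and predict the token that followed it. A minimal sketch (illustrative only; the function name is hypothetical, not from the paper):

```python
def induction_predict(tokens):
    """Predict the next token at each position using the induction-head
    rule: find the most recent earlier occurrence of the current token
    and copy the token that followed it. None where no match exists."""
    preds = []
    for i, tok in enumerate(tokens):
        pred = None
        # scan backwards for a previous occurrence of the current token
        for j in range(i - 1, -1, -1):
            if tokens[j] == tok and j + 1 < len(tokens):
                pred = tokens[j + 1]  # copy the token that followed it
                break
        preds.append(pred)
    return preds

# On a repeated sequence, the rule completes the pattern:
print(induction_predict(["A", "B", "C", "A", "B"]))
# -> [None, None, None, 'B', 'C']
```

This copy-from-context behavior is what lets a model continue a pattern it has seen earlier in the prompt, which is why induction heads are tied to in-context learning.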

Toy Models of Superposition (Elhage et al., 2022)

  • Investigated how neural networks represent more features than they have dimensions
  • Demonstrated polysemanticity: individual neurons encoding multiple unrelated concepts
  • Established theoretical foundations for why interpretability is fundamentally difficult
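The core superposition phenomenon can be seen in a few lines of NumPy: assign more feature directions than dimensions, and reading any one feature back out picks up interference from the others. A toy sketch (not the paper's training setup, just an illustration of non-orthogonal feature directions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 6, 2  # more features than dimensions

# Each feature gets a unit direction in the small space; with
# n_features > n_dims these directions cannot all be orthogonal.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only feature 0 is active.
x = np.zeros(n_features)
x[0] = 1.0

h = W.T @ x       # compress into 2 dimensions
readout = W @ h   # read every feature back out

# Feature 0 is recovered exactly, but the inactive features show
# nonzero interference terms.
print(np.round(readout, 3))
```

If features are active sparsely, this interference is tolerable on average, which is the intuition for why trained networks exploit superposition.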

Feature Discovery (2023-2024)

Towards Monosemanticity (Bricken et al., 2023)

  • Applied sparse autoencoders to extract interpretable features from transformer activations
  • Moved from polysemantic neurons to monosemantic features
  • Demonstrated features corresponding to recognizable concepts (code, languages, entities)
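The sparse-autoencoder encoder step amounts to projecting an activation onto an overcomplete dictionary and keeping only the positively-activated entries. A minimal sketch of that forward pass, with made-up dimensions and random weights (a real SAE learns `W_enc` and `b_enc` by training with a sparsity penalty):

```python
import numpy as np

def sae_features(activation, W_enc, b_enc):
    """Encode a model activation into sparse feature activations:
    f = ReLU(W_enc @ a + b_enc). A negative bias keeps most features off."""
    return np.maximum(0.0, W_enc @ activation + b_enc)

rng = np.random.default_rng(1)
d_model, n_feat = 8, 32            # overcomplete: 32 features for 8 dims
W_enc = rng.normal(size=(n_feat, d_model)) / np.sqrt(d_model)
b_enc = np.full(n_feat, -0.5)      # negative bias encourages sparsity

a = rng.normal(size=d_model)       # a stand-in model activation
f = sae_features(a, W_enc, b_enc)
print(f"{int((f > 0).sum())} of {n_feat} features active")
```

In the trained setting, each row of the decoder (not shown) is the direction in activation space that a feature writes back, which is what makes individual features inspectable.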

Scaling Monosemanticity (Templeton et al., 2024)

  • Extended sparse autoencoder techniques to Claude 3 Sonnet, a production-scale model
  • Extracted millions of interpretable features across diverse conceptual domains
  • Discovered specific features (e.g., Golden Gate Bridge, safety-relevant concepts) that could be artificially activated to steer model behavior

Circuit Tracing (2025-2026)

Circuit Tracing: Revealing Computational Graphs (Ameisen et al., 2025)

  • Developed attribution graphs revealing step-by-step computational paths in language models
  • Traced how models integrate attention patterns, feature interactions, and layer computations
  • Open-sourced circuit tracing tools for the research community

On the Biology of a Large Language Model (Lindsey et al., 2025)

  • Applied circuit tracing to study internal mechanisms of Claude 3.5 Haiku
  • Drew analogies between neural network circuits and biological neural systems
  • Revealed emergent organizational principles in trained language models

Emotion Concepts and their Function (Sofroniew et al., 2026)

  • Studied causal influence of emotion representation features on model behavior
  • Demonstrated that emotion features are not merely descriptive but functionally influence downstream computations
  • Bridged interpretability research with questions about model cognition

Core Concepts

Features

Interpretable directions in activation space corresponding to human-recognizable concepts. Extracted via sparse autoencoders. Individual features can be artificially activated (feature steering) or suppressed to modulate model behavior.
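Feature steering, as described above, reduces to adding a scaled copy of a feature's direction into the activation vector. A hedged sketch (the function name and dimensions are illustrative, not an Anthropic API):

```python
import numpy as np

def steer(activation, feature_dir, scale):
    """Feature steering: add a scaled feature direction into an
    activation vector (suppression uses a negative scale)."""
    return activation + scale * feature_dir

rng = np.random.default_rng(2)
d_model = 16
feature_dir = rng.normal(size=d_model)    # one feature's unit direction
feature_dir /= np.linalg.norm(feature_dir)

a = rng.normal(size=d_model)              # some activation vector
a_steered = steer(a, feature_dir, scale=5.0)

# The projection onto the feature direction increases by exactly `scale`.
print(round(float(a_steered @ feature_dir - a @ feature_dir), 1))  # -> 5.0
```

This is the operation behind demonstrations like amplifying the Golden Gate Bridge feature: the intervention is a vector addition, applied at every forward pass.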

Circuits

Computational subgraphs within the model implementing specific behaviors. Composed of attention heads, MLP layers, and feature interactions. Can be traced and analyzed to understand how models produce specific outputs.

Superposition

The phenomenon where models represent more conceptual features than they have available dimensions by packing them into overlapping directions, leading to polysemanticity. Superposition makes interpretability challenging because distinct concepts share neural substrate.

Attribution Graphs

Directed graphs showing how information flows through a model to produce a specific output. Nodes represent features or attention heads; edges represent computational dependencies. Enable end-to-end tracing of model reasoning.
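The data structure is just a weighted directed graph; a common way to read one is to follow the strongest attribution edges from input to output. A toy sketch with entirely hypothetical node names and attribution scores:

```python
# Toy attribution graph: nodes are features (or heads), edge weights
# are attribution scores. All names and weights here are invented.
graph = {
    "input":         [("feat_subject", 0.9), ("feat_syntax", 0.4)],
    "feat_subject":  [("feat_relation", 0.8)],
    "feat_syntax":   [("feat_relation", 0.2)],
    "feat_relation": [("output", 0.7)],
    "output":        [],
}

def strongest_path(graph, node="input"):
    """Greedily follow the highest-attribution edge from each node."""
    path = [node]
    while graph[node]:
        node, _ = max(graph[node], key=lambda edge: edge[1])
        path.append(node)
    return path

print(strongest_path(graph))
# -> ['input', 'feat_subject', 'feat_relation', 'output']
```

Real attribution graphs are pruned versions of much denser graphs, but the end-to-end reading, from input token through intermediate features to the output logit, works the same way.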

Tags

mechanistic-interpretability, circuit-tracing, anthropic, attribution-graphs