Circuit Tracing
Circuit tracing is a technique for revealing the step-by-step computational paths inside transformer language models through attribution graphs — directed graphs where nodes represent features or attention heads and edges represent computational dependencies. Developed by Anthropic's interpretability team as part of their Transformer Circuits Thread (2021-2026), it builds on a progression of foundational work that moved from mathematical theory through empirical feature discovery to production-scale model analysis.
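The directed-graph structure can be sketched with a toy example. This is an illustrative miniature, not Anthropic's implementation: the feature names, edge weights, and the rule of multiplying attribution scores along each path are all assumptions chosen to show the shape of the data structure.

```python
# Minimal sketch of an attribution graph: a weighted DAG whose nodes are
# hypothetical features and whose edges carry attribution scores.
# All names and weights below are illustrative, not from a real model.

# edges[src] -> list of (dst, weight): how much src contributes to dst
edges = {
    "token: 'Dallas'":        [("feature: Texas", 0.8)],
    "feature: Texas":         [("feature: state capital", 0.6)],
    "token: 'capital'":       [("feature: state capital", 0.5)],
    "feature: state capital": [("output: 'Austin'", 0.9)],
}

def path_attributions(graph, source, target, weight=1.0, path=None):
    """Enumerate paths source -> target, multiplying edge weights along each."""
    path = (path or []) + [source]
    if source == target:
        return [(path, weight)]
    results = []
    for dst, w in graph.get(source, []):
        results += path_attributions(graph, dst, target, weight * w, path)
    return results

for path, score in path_attributions(edges, "token: 'Dallas'", "output: 'Austin'"):
    print(" -> ".join(path), f"(attribution {score:.3f})")
```

Tracing every weighted path from an input node to an output node is what makes the graph a mechanistic explanation of a single behavior rather than a global summary of the model.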
The research program advanced through distinct phases. A Mathematical Framework for Transformer Circuits (Elhage et al., 2021) established rigorous methods for reverse-engineering transformers, identifying key motifs such as skip-trigrams and induction heads. Toy Models of Superposition (Elhage et al., 2022) revealed that neural networks represent more features than they have dimensions: individual neurons encode multiple unrelated concepts (polysemanticity), establishing why interpretability is fundamentally hard.
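The superposition result can be demonstrated in a few lines. The sketch below, a deliberately tiny stand-in for the toy models in the paper, packs five feature directions into a 2-dimensional space; reading one feature back with a linear readout picks up interference from the others, which is exactly the cost of storing more features than dimensions.

```python
# Toy illustration of superposition: 5 "features" stored as unit directions
# in a 2-dimensional space. A naive linear readout of one feature picks up
# interference from the rest -- the price of packing n > d features.
import numpy as np

n_features, d = 5, 2
angles = np.linspace(0, np.pi, n_features, endpoint=False)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (5, 2)

# Activate only feature 0 with strength 1.0 and embed it.
acts = np.zeros(n_features)
acts[0] = 1.0
hidden = acts @ directions            # 2-dim superposed representation

readout = directions @ hidden         # dot each feature direction with hidden
print(readout.round(2))               # feature 0 reads ~1.0; others leak in
```

With orthogonal directions the readout would be clean, but five unit vectors cannot be orthogonal in two dimensions, so the off-target entries are nonzero: a single activation looks like several, which is polysemanticity seen from the neuron's side.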
The breakthrough came with sparse autoencoders. Towards Monosemanticity (Bricken et al., 2023) applied them to extract interpretable features from transformer activations, moving from polysemantic neurons to monosemantic features corresponding to recognizable concepts. Scaling Monosemanticity (Templeton et al., 2024) extended this to Claude 3 Sonnet, extracting millions of interpretable features and discovering specific ones — like the Golden Gate Bridge feature — that could be artificially activated to steer model behavior.
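The mechanics of a sparse autoencoder can be sketched briefly. The dimensions, random weights, and L1 coefficient below are illustrative assumptions; real SAEs are trained on residual-stream activations and use far wider dictionaries, but the structure (overcomplete dictionary, ReLU encoder, reconstruction loss plus sparsity penalty) is the same.

```python
# Minimal sketch of a sparse autoencoder for extracting features from
# model activations. Sizes and weights are illustrative, not trained.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 16, 64            # dictionary is wider than the activation space

W_enc = rng.normal(0, 0.1, (d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.1, (d_dict, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coeff=1e-3):
    """Encode to sparse features, decode, and return the loss terms."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU -> mostly-zero feature activations
    x_hat = f @ W_dec + b_dec                # reconstruction of the activation
    recon_loss = np.mean((x - x_hat) ** 2)   # fidelity term
    sparsity = l1_coeff * np.abs(f).sum()    # L1 penalty pushes features toward zero
    return f, x_hat, recon_loss + sparsity

x = rng.normal(size=d_model)                 # stand-in for a model activation
features, x_hat, loss = sae_forward(x)
print(f"{(features > 0).sum()} of {d_dict} features active, loss {loss:.3f}")
```

The L1 term is what drives monosemanticity in practice: each input activates only a handful of dictionary features, so individual features are free to specialize on single concepts, and steering amounts to clamping one of them high before decoding.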
The Circuit Tracing paper (Ameisen et al., 2025) brought these threads together: attribution graphs now trace how models integrate attention patterns, feature interactions, and layer computations from input to output. The tools were open-sourced for the research community. Subsequent work applied circuit tracing to study internal mechanisms of Claude 3.5 Haiku (drawing analogies with biological neural systems) and to investigate how emotion representation features causally influence downstream model behavior.
Key Claims
- Attribution graphs reveal end-to-end computational paths in language models — Directed graphs trace information flow from input features through attention heads and MLP layers to output, enabling mechanistic understanding of specific model behaviors. Evidence: strong (Anthropic Transformer Circuits Thread)
- Superposition makes interpretability fundamentally difficult — Models compress more features than they have dimensions, causing polysemanticity where individual neurons encode multiple unrelated concepts. Evidence: strong (Anthropic Transformer Circuits Thread)
- Sparse autoencoders extract monosemantic features at production scale — Applied to Claude 3 Sonnet, yielding millions of interpretable features that can be steered to modify model behavior. Evidence: strong (Anthropic Transformer Circuits Thread)
- Circuit tracing tools are open-sourced — Research community can apply attribution graph techniques to study their own models. Evidence: strong (Anthropic Transformer Circuits Thread)
- Induction heads are the mechanistic basis for in-context learning — Specific attention head patterns correspond to the model's ability to learn from context, with phase transitions during training. Evidence: strong (Anthropic Transformer Circuits Thread)
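The induction-head claim above describes a concrete behavioral rule, which can be stated as code. This sketch implements the prefix-matching pattern itself ([A][B] ... [A] predicts [B]), not the attention-weight mechanism that realizes it inside a transformer.

```python
# Behavioral sketch of the induction-head rule: on a repeated token, look
# back to its previous occurrence and copy the token that followed it.
def induction_predict(tokens):
    """Predict the next token by prefix matching: find the most recent
    earlier occurrence of the current token, return the token after it."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence -> induction cannot fire

seq = ["The", "cat", "sat", ".", "The"]
print(induction_predict(seq))  # completes the repeated prefix with "cat"
```

The mechanistic finding is that specific attention heads implement this rule (one head matches the prefix, another copies), and that their abrupt emergence during training coincides with a phase transition in in-context learning ability.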
Open Questions
- Can circuit tracing scale to models with trillions of parameters?
- How to trace circuits in multi-modal models (vision + language)?
- Can attribution graphs detect alignment failures before they manifest in outputs?
- What is the relationship between circuit complexity and model capability?
- Can automated circuit analysis replace manual interpretability research?
Related Concepts
- Mechanistic Interpretability — Circuit tracing is a central technique advancing mechanistic interpretability
- Chain-of-Thought Reasoning — Circuit tracing can reveal whether CoT reflects actual internal computation