Circuit Tracing
Circuit tracing is a technique for revealing the step-by-step computational paths inside transformer language models through attribution graphs — directed graphs where nodes represent features or attention heads and edges represent computational dependencies. Developed by Anthropic's interpretability team as part of their Transformer Circuits Thread (2021-2026), it builds on a progression of foundational work that moved from mathematical theory through empirical feature discovery to production-scale model analysis.
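The directed-graph structure can be sketched with a toy example. This is an illustrative miniature, not Anthropic's implementation: the feature names, edge weights, and the rule of multiplying attribution scores along each path are all assumptions chosen to show the shape of the data structure.

```python
# Minimal sketch of an attribution graph: a weighted DAG whose nodes are
# hypothetical features and whose edges carry attribution scores.
# All names and weights below are illustrative, not from a real model.

# edges[src] -> list of (dst, weight): how much src contributes to dst
edges = {
    "token: 'Dallas'":        [("feature: Texas", 0.8)],
    "feature: Texas":         [("feature: state capital", 0.6)],
    "token: 'capital'":       [("feature: state capital", 0.5)],
    "feature: state capital": [("output: 'Austin'", 0.9)],
}

def path_attributions(graph, source, target, weight=1.0, path=None):
    """Enumerate paths source -> target, multiplying edge weights along each."""
    path = (path or []) + [source]
    if source == target:
        return [(path, weight)]
    results = []
    for dst, w in graph.get(source, []):
        results += path_attributions(graph, dst, target, weight * w, path)
    return results

for path, score in path_attributions(edges, "token: 'Dallas'", "output: 'Austin'"):
    print(" -> ".join(path), f"(attribution {score:.3f})")
```

Tracing every weighted path from an input node to an output node is what makes the graph a mechanistic explanation of a single behavior rather than a global summary of the model.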
The research program advanced through distinct phases. A Mathematical Framework for Transformer Circuits (Elhage et al., 2021) established rigorous methods for reverse-engineering transformers, identifying key motifs such as skip-trigrams and induction heads. Toy Models of Superposition (Elhage et al., 2022) revealed that neural networks represent more features than they have dimensions: individual neurons encode multiple unrelated concepts (polysemanticity), establishing why interpretability is fundamentally hard.
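The superposition result can be demonstrated in a few lines. The sketch below, a deliberately tiny stand-in for the toy models in the paper, packs five feature directions into a 2-dimensional space; reading one feature back with a linear readout picks up interference from the others, which is exactly the cost of storing more features than dimensions.

```python
# Toy illustration of superposition: 5 "features" stored as unit directions
# in a 2-dimensional space. A naive linear readout of one feature picks up
# interference from the rest -- the price of packing n > d features.
import numpy as np

n_features, d = 5, 2
angles = np.linspace(0, np.pi, n_features, endpoint=False)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (5, 2)

# Activate only feature 0 with strength 1.0 and embed it.
acts = np.zeros(n_features)
acts[0] = 1.0
hidden = acts @ directions            # 2-dim superposed representation

readout = directions @ hidden         # dot each feature direction with hidden
print(readout.round(2))               # feature 0 reads ~1.0; others leak in
```

With orthogonal directions the readout would be clean, but five unit vectors cannot be orthogonal in two dimensions, so the off-target entries are nonzero: a single activation looks like several, which is polysemanticity seen from the neuron's side.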
The breakthrough came with sparse autoencoders. Towards Monosemanticity (Bricken et al., 2023) applied them to extract interpretable features from transformer activations, moving from polysemantic neurons to monosemantic features corresponding to recognizable concepts. Scaling Monosemanticity (Templeton et al., 2024) extended this to Claude 3 Sonnet, extracting millions of interpretable features and discovering specific ones — like the Golden Gate Bridge feature — that could be artificially activated to steer model behavior.
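The mechanics of a sparse autoencoder can be sketched briefly. The dimensions, random weights, and L1 coefficient below are illustrative assumptions; real SAEs are trained on residual-stream activations and use far wider dictionaries, but the structure (overcomplete dictionary, ReLU encoder, reconstruction loss plus sparsity penalty) is the same.

```python
# Minimal sketch of a sparse autoencoder for extracting features from
# model activations. Sizes and weights are illustrative, not trained.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 16, 64            # dictionary is wider than the activation space

W_enc = rng.normal(0, 0.1, (d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.1, (d_dict, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coeff=1e-3):
    """Encode to sparse features, decode, and return the loss terms."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU -> mostly-zero feature activations
    x_hat = f @ W_dec + b_dec                # reconstruction of the activation
    recon_loss = np.mean((x - x_hat) ** 2)   # fidelity term
    sparsity = l1_coeff * np.abs(f).sum()    # L1 penalty pushes features toward zero
    return f, x_hat, recon_loss + sparsity

x = rng.normal(size=d_model)                 # stand-in for a model activation
features, x_hat, loss = sae_forward(x)
print(f"{(features > 0).sum()} of {d_dict} features active, loss {loss:.3f}")
```

The L1 term is what drives monosemanticity in practice: each input activates only a handful of dictionary features, so individual features are free to specialize on single concepts, and steering amounts to clamping one of them high before decoding.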
The Circuit Tracing paper (Ameisen et al., 2025) brought these threads together: attribution graphs now trace how models integrate attention patterns, feature interactions, and layer computations from input to output. The tools were open-sourced for the research community. Subsequent work applied circuit tracing to study internal mechanisms of Claude 3.5 Haiku (drawing analogies with biological neural systems) and to investigate how emotion representation features causally influence downstream model behavior.
Key Claims
- Attribution graphs reveal end-to-end computational paths in language models — Directed graphs trace information flow from input features through attention heads and MLP layers to output, enabling mechanistic understanding of specific model behaviors. Evidence: strong (Anthropic Transformer Circuits Thread)
- Superposition makes interpretability fundamentally difficult — Models compress more features than they have dimensions, causing polysemanticity where individual neurons encode multiple unrelated concepts. Evidence: strong (Anthropic Transformer Circuits Thread)
- Sparse autoencoders extract monosemantic features at production scale — Applied to Claude 3 Sonnet, yielding millions of interpretable features that can be steered to modify model behavior. Evidence: strong (Anthropic Transformer Circuits Thread)
- Circuit tracing tools are open-sourced — Research community can apply attribution graph techniques to study their own models. Evidence: strong (Anthropic Transformer Circuits Thread)
- Induction heads are the mechanistic basis for in-context learning — Specific attention head patterns correspond to the model's ability to learn from context, with phase transitions during training. Evidence: strong (Anthropic Transformer Circuits Thread)
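The induction-head claim above describes a concrete behavioral rule, which can be stated as code. This sketch implements the prefix-matching pattern itself ([A][B] ... [A] predicts [B]), not the attention-weight mechanism that realizes it inside a transformer.

```python
# Behavioral sketch of the induction-head rule: on a repeated token, look
# back to its previous occurrence and copy the token that followed it.
def induction_predict(tokens):
    """Predict the next token by prefix matching: find the most recent
    earlier occurrence of the current token, return the token after it."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence -> induction cannot fire

seq = ["The", "cat", "sat", ".", "The"]
print(induction_predict(seq))  # completes the repeated prefix with "cat"
```

The mechanistic finding is that specific attention heads implement this rule (one head matches the prefix, another copies), and that their abrupt emergence during training coincides with a phase transition in in-context learning ability.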
Open Questions
- Can circuit tracing scale to models with trillions of parameters?
- How to trace circuits in multi-modal models (vision + language)?
- Can attribution graphs detect alignment failures before they manifest in outputs?
- What is the relationship between circuit complexity and model capability?
- Can automated circuit analysis replace manual interpretability research?
Related Concepts
- Mechanistic Interpretability — Circuit tracing is a central technique advancing mechanistic interpretability
- Chain-of-Thought Reasoning — Circuit tracing can reveal whether CoT reflects actual internal computation