Anthropic

Tags: ai-lab, safety, interpretability

Type: AI Safety Research Lab

Anthropic is an AI safety company known for Claude, its family of large language models. In this knowledge base, it is represented through its pioneering work on mechanistic interpretability and circuit tracing: the effort to understand, at a mechanistic level, how AI models produce their outputs.

Anthropic developed a "microscope" that can identify features inside Claude and trace sequences of features from prompt to response, revealing how the model reasons. The capability progressed from identifying individual features (2024) to tracing complete reasoning paths (2025-2026), making it possible to understand why a model produces a particular output.
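
To make "tracing sequences of features" concrete, here is a minimal sketch of walking such a path, assuming a toy attribution graph. The layers, feature names, and edge weights below are invented for illustration and do not come from Anthropic's tools or data.

```python
# Toy attribution graph: nodes are (layer, feature) pairs and edges carry
# attribution weights estimating how strongly an upstream feature drives a
# downstream one. Names, layers, and weights here are all invented.
EDGES = {
    ("embed", "Texas"): [(("mid", "texas_context"), 0.9)],
    ("embed", "capital"): [(("mid", "say_a_capital"), 0.8)],
    ("mid", "texas_context"): [(("late", "say_Austin"), 0.6)],
    ("mid", "say_a_capital"): [(("late", "say_Austin"), 0.7)],
    ("late", "say_Austin"): [(("output", "Austin"), 0.95)],
}

def trace(node, path=(), weight=1.0):
    """Follow attribution edges from a prompt-side feature to an output
    node, printing each complete path with its accumulated weight."""
    path = path + (node,)
    children = EDGES.get(node)
    if not children:  # no outgoing edges: this is an output node
        print(" -> ".join(f"{layer}/{feat}" for layer, feat in path),
              f"(weight {weight:.2f})")
        return
    for child, w in children:
        trace(child, path, weight * w)

for start in (("embed", "Texas"), ("embed", "capital")):
    trace(start)
```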

Their Transformer Circuits Thread (2021-2026) is the most sustained research program in mechanistic interpretability. It progressed through a mathematical framework for transformer circuits (2021), the discovery of induction heads as the basis for in-context learning (2022), toy models of superposition revealing polysemanticity (2022), sparse autoencoders that extract monosemantic features (2023), scaling to millions of features in Claude 3 Sonnet, including the Golden Gate Bridge feature (2024), and circuit tracing with attribution graphs that reveal end-to-end computational paths (2025). They have open-sourced their circuit-tracing tools for the research community.
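
The sparse-autoencoder step lends itself to a compact illustration. The sketch below trains a toy SAE with plain numpy on random vectors standing in for residual-stream activations; the dimensions, L1 coefficient, and hand-rolled SGD loop are assumptions for illustration, not Anthropic's published setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): activations of width d_model are decomposed
# into an overcomplete dictionary of n_feat candidate features.
d_model, n_feat, n = 32, 128, 1024
acts = rng.normal(size=(n, d_model)).astype(np.float32)  # stand-in activations

W_enc = rng.normal(scale=0.1, size=(d_model, n_feat)).astype(np.float32)
b_enc = np.zeros(n_feat, dtype=np.float32)
W_dec = rng.normal(scale=0.1, size=(n_feat, d_model)).astype(np.float32)
b_dec = np.zeros(d_model, dtype=np.float32)

l1_coeff, lr = 3e-3, 5e-2
for step in range(200):
    f = np.maximum(acts @ W_enc + b_enc, 0.0)      # sparse, non-negative codes
    recon = f @ W_dec + b_dec
    err = recon - acts

    # Loss = (1/n) * (||err||_F^2 + l1_coeff * sum|f|); gradients by hand.
    g_recon = 2.0 * err / n
    g_f = (g_recon @ W_dec.T + (l1_coeff / n) * np.sign(f)) * (f > 0)
    W_dec -= lr * (f.T @ g_recon)
    b_dec -= lr * g_recon.sum(axis=0)
    W_enc -= lr * (acts.T @ g_f)
    b_enc -= lr * g_f.sum(axis=0)

print(f"MSE {float((err ** 2).mean()):.4f}, "
      f"active-feature fraction {float((f > 0).mean()):.2f}")
```

In practice the random `acts` would be replaced by activations recorded from a real model, and each learned dictionary direction is interpreted by inspecting the inputs that most strongly activate it.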

They are part of a cross-lab coalition of 40 researchers warning about the growing difficulty of understanding advanced AI models.

Key Contributions

  • Mechanistic interpretability microscope: Identifies features and traces reasoning paths inside Claude (Mechanistic Interpretability)
  • Transformer Circuits Thread (2021-2026): Ongoing research program spanning the mathematical framework, superposition, monosemanticity, and circuit tracing (Anthropic Circuit Tracing)
  • Circuit Tracing with attribution graphs (2025): Directed graphs revealing step-by-step computational paths in language models, open-sourced for the community (Anthropic Circuit Tracing)
  • Sparse autoencoders at scale: Extracted millions of interpretable features from Claude 3 Sonnet; discovered steerable features (Golden Gate Bridge), as shown in the steering sketch after this list (Anthropic Circuit Tracing)
  • Feature → path progression: Advanced from individual feature identification (2024) to complete reasoning path tracing (2025-2026) (Mechanistic Interpretability)
  • Cross-lab interpretability advocacy: Part of the 40-researcher coalition calling for investigation of chain-of-thought (CoT) reasoning (Mechanistic Interpretability)
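
Steerable features also admit a small illustration. This sketch shows the basic intervention behind "Golden Gate Claude" style steering, under the assumption of an already-trained SAE decoder: add a feature's decoder direction into the residual stream to amplify the corresponding concept. The weights and feature index are invented stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented stand-ins: an SAE decoder matrix and a hypothetical feature index.
d_model, n_feat = 32, 128
W_dec = rng.normal(size=(n_feat, d_model)).astype(np.float32)
feature_id = 42        # pretend this row is the "Golden Gate Bridge" feature
strength = 5.0         # how strongly to push activations toward the concept

def steer(resid):
    """Add the chosen feature's unit-norm decoder direction to the residual
    stream at every token position."""
    direction = W_dec[feature_id]
    direction = direction / np.linalg.norm(direction)
    return resid + strength * direction

tokens = rng.normal(size=(8, d_model)).astype(np.float32)  # (seq, d_model)
print(steer(tokens).shape)  # (8, 32): same stream, nudged toward the feature
```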

Related Entities

  • Google DeepMind — Collaborator on interpretability research
  • OpenAI — Collaborator on interpretability research