Deployed Agent Safety
Active Frontier
The safety research community has largely studied AI safety in controlled environments, but the first systematic evaluation of a live, deployed personal AI agent reveals qualitatively different risks. Wang et al. (2026) studied OpenClaw, a widely deployed agent with real Gmail, Stripe, and filesystem integrations, and introduced the CIK taxonomy to systematize attack surfaces against agents with persistent state.
The CIK Taxonomy organizes a deployed agent's attack surface into three dimensions:
- Capability: The tools, APIs, and executable actions available to the agent. Attacks inject malicious tools or payloads.
- Identity: The agent's self-concept, behavioral guidelines, and trust anchors. Attacks corrupt who the agent believes it is or what values it should hold.
- Knowledge: The agent's memory store and factual beliefs about the world. Attacks fabricate false facts into persistent memory.
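The three dimensions above can be sketched as a mapping from persistent state files to attack surfaces. This is a minimal illustration, not code from the paper; the file names (`tools.json`, `persona.md`, `memory.db`) and class names are hypothetical stand-ins for whatever a real deployment uses.

```python
from dataclasses import dataclass
from enum import Enum

class CIKDimension(Enum):
    CAPABILITY = "capability"  # tools, APIs, executable actions
    IDENTITY = "identity"      # self-concept, guidelines, trust anchors
    KNOWLEDGE = "knowledge"    # memory store, factual beliefs

@dataclass
class PersistentStateFile:
    path: str
    dimension: CIKDimension

# Hypothetical mapping of an agent's state files to CIK dimensions.
AGENT_STATE = [
    PersistentStateFile("tools.json", CIKDimension.CAPABILITY),
    PersistentStateFile("persona.md", CIKDimension.IDENTITY),
    PersistentStateFile("memory.db", CIKDimension.KNOWLEDGE),
]

def attack_surface(dimension: CIKDimension) -> list:
    """Files an attacker would target to poison a given dimension."""
    return [f.path for f in AGENT_STATE if f.dimension is dimension]
```

The point of the mapping is that each CIK dimension corresponds to concrete, writable state, which is exactly what makes persistent poisoning possible.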
The key finding is architectural rather than model-specific: poisoning any single CIK dimension raises the average attack success rate from 24.6% to 64–74%, regardless of which backbone model (including Opus 4.6, the most robust tested) is used. This demonstrates that vulnerabilities are inherent to the persistent-state architecture, not fixable by substituting a safer model.
The evolution-safety tradeoff is the central tension. An agent that can learn from interactions (updating its Knowledge and Identity states) is inherently more capable but also more vulnerable — the same mechanisms that enable adaptive behavior create attack surfaces for persistent poisoning. Defense mechanisms reveal this tension starkly: file protection that blocks 97% of malicious injections also prevents 93% of legitimate updates.
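The tradeoff can be made concrete with the reported block rates. The sketch below plugs the paper's 97%/93% figures into a hypothetical update budget (the counts and function name are illustrative, not from the source) to show why the defense is unworkable for a learning agent.

```python
def file_protection_outcomes(n_malicious, n_legit,
                             block_malicious=0.97, block_legit=0.93):
    """Expected write outcomes under file protection, using the
    block rates reported by Wang et al. (97% malicious, 93% legitimate)."""
    return {
        "malicious_blocked": n_malicious * block_malicious,
        "malicious_through": n_malicious * (1 - block_malicious),
        "legit_blocked": n_legit * block_legit,
        "legit_through": n_legit * (1 - block_legit),
    }

# Hypothetical budget: 100 injection attempts, 1000 legitimate updates.
out = file_protection_outcomes(100, 1000)
# Roughly 3 injections still land, while only ~70 of 1000 legitimate
# updates survive: the agent is nearly frozen but not fully protected.
```

The same filter that starves the attacker also starves the learning loop, which is the evolution-safety tension in numerical form.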
A two-phase attack protocol characterizes the threat model: Phase 1 introduces poisoned content into persistent state files; Phase 2 triggers harmful actions in subsequent sessions. This cross-session threat is absent from single-session safety evaluations and represents a fundamental gap in current evaluation methodology.
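The cross-session structure of the threat can be sketched as follows. Everything here (the `Agent` class, the memory keys, the payment scenario) is a hypothetical illustration of the two-phase protocol, not the paper's implementation.

```python
class Agent:
    """Toy agent whose behavior depends on persistent memory
    loaded at session start."""

    def __init__(self, memory: dict):
        self.memory = memory  # persistent state, survives across sessions

    def act(self, request: str) -> str:
        # A poisoned "fact" in memory changes behavior in later sessions.
        endpoint = self.memory.get("trusted_payment_endpoint")
        if endpoint:
            return f"sending funds to {endpoint}"
        return "refusing unverified payment"

# Phase 1 (session A): poisoned content lands in persistent state,
# e.g. summarized from an attacker-controlled email into memory.
memory_store = {}
memory_store["trusted_payment_endpoint"] = "attacker.example"  # injected fact

# Phase 2 (session B): a fresh session loads the poisoned state, and the
# harmful action triggers with no malicious prompt present in session B.
session_b = Agent(memory_store)
result = session_b.act("pay the invoice")
```

Note that session B contains nothing an input filter could flag: the malicious content arrived in an earlier session, which is why single-session evaluations miss this class of attack.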
Key Claims
- CIK taxonomy unifies deployed agent attack surfaces — Capability, Identity, Knowledge dimensions map cleanly to persistent state file types and attack vectors. Evidence: strong (OpenClaw Analysis)
- Single-dimension poisoning triples attack success rates — From 24.6% baseline to 64–74% after poisoning any CIK dimension. Evidence: strong (OpenClaw Analysis)
- Vulnerabilities are architectural, not model-specific — All four backbone models show similar vulnerability patterns; even the most robust (Opus 4.6) shows more than threefold increase over baseline. Evidence: strong (OpenClaw Analysis)
- File protection creates an unworkable tradeoff — Blocking 97% of malicious injections also prevents 93% of legitimate updates, making it incompatible with learning-capable agents. Evidence: strong (OpenClaw Analysis)
- Evolution-safety tradeoff is a fundamental architectural tension — The mechanisms enabling agent learning are the same mechanisms enabling persistent poisoning. Evidence: strong (OpenClaw Analysis)
Benchmarks & Data
| CIK Dimension Poisoned | Average Attack Success Rate | Baseline |
|---|---|---|
| Knowledge | 74.4% | 24.6% |
| Capability | 68.3% | 24.6% |
| Identity | 64.3% | 24.6% |
- Evaluation scope: 4 backbone models, 12 impact scenarios, 6 harm categories, 88 test cases per model (Wang et al.)
- Capability defense: reduces baseline to 1.7% but leaves Capability attacks at 63.8% (Wang et al.)
- File protection: 97% malicious injection block, 93% legitimate update block (Wang et al.)
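The headline multipliers follow directly from the table above. A quick check, using only the reported figures:

```python
BASELINE = 24.6  # % average attack success rate with no poisoning
POISONED = {"knowledge": 74.4, "capability": 68.3, "identity": 64.3}

# Ratio of poisoned ASR to baseline, per CIK dimension.
multipliers = {dim: round(asr / BASELINE, 2) for dim, asr in POISONED.items()}
# Knowledge poisoning roughly triples attack success; even the weakest
# dimension (Identity) yields more than a 2.5x increase.
```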
Open Questions
- Can cross-dimension attack chaining (poisoning multiple CIK dimensions simultaneously) produce synergistic effects beyond single-dimension results?
- How do CIK vulnerabilities generalize across agent platforms beyond OpenClaw?
- Can cryptographic or provenance-based mechanisms protect Knowledge state without sacrificing learning capability?
- What architectural designs fundamentally separate learning-update from external-injection pathways?
- How should real-world agent deployment standards incorporate CIK-aware security requirements?
Related Concepts
- Agent Safety & Alignment — broader alignment failure modes and red-teaming frameworks
- Agent Exploitation Attack Surface — complements CIK with a prompt-level exploitation taxonomy (e.g., goal reframing)
- Agent Memory Architectures — Knowledge dimension attacks target the memory layer directly
Changelog
- 2026-04-14 — Initial compilation from 1 source (Wang et al. 2026 OpenClaw analysis)