Deployed Agent Safety
Active Frontier
The safety research community has largely studied AI safety in controlled environments, but the first systematic evaluation of a live, deployed personal AI agent reveals qualitatively different risks. Wang et al. (2026) studied OpenClaw, a widely deployed agent with real Gmail, Stripe, and filesystem integrations, and introduced the CIK taxonomy to systematize attack surfaces against agents with persistent state.
The CIK Taxonomy organizes a deployed agent's attack surface into three dimensions:
- Capability: The tools, APIs, and executable actions available to the agent. Attacks inject malicious tools or payloads.
- Identity: The agent's self-concept, behavioral guidelines, and trust anchors. Attacks corrupt who the agent believes it is or what values it should hold.
- Knowledge: The agent's memory store and factual beliefs about the world. Attacks fabricate false facts into persistent memory.
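The three dimensions above can be sketched as a mapping from persistent state files to attack surfaces. This is a minimal illustration, not code from the paper; the file names (`tools.json`, `persona.md`, `memory.db`) and class names are hypothetical stand-ins for whatever a real deployment uses.

```python
from dataclasses import dataclass
from enum import Enum

class CIKDimension(Enum):
    CAPABILITY = "capability"  # tools, APIs, executable actions
    IDENTITY = "identity"      # self-concept, guidelines, trust anchors
    KNOWLEDGE = "knowledge"    # memory store, factual beliefs

@dataclass
class PersistentStateFile:
    path: str
    dimension: CIKDimension

# Hypothetical mapping of an agent's state files to CIK dimensions.
AGENT_STATE = [
    PersistentStateFile("tools.json", CIKDimension.CAPABILITY),
    PersistentStateFile("persona.md", CIKDimension.IDENTITY),
    PersistentStateFile("memory.db", CIKDimension.KNOWLEDGE),
]

def attack_surface(dimension: CIKDimension) -> list:
    """Files an attacker would target to poison a given dimension."""
    return [f.path for f in AGENT_STATE if f.dimension is dimension]
```

The point of the mapping is that each CIK dimension corresponds to concrete, writable state, which is exactly what makes persistent poisoning possible.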
The key finding is architectural rather than model-specific: poisoning any single CIK dimension raises the average attack success rate from 24.6% to 64–74%, regardless of which backbone model (including Opus 4.6, the most robust tested) is used. This demonstrates that vulnerabilities are inherent to the persistent-state architecture, not fixable by substituting a safer model.
The evolution-safety tradeoff is the central tension. An agent that can learn from interactions (updating its Knowledge and Identity states) is inherently more capable but also more vulnerable — the same mechanisms that enable adaptive behavior create attack surfaces for persistent poisoning. Defense mechanisms reveal this tension starkly: file protection that blocks 97% of malicious injections also prevents 93% of legitimate updates.
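The tradeoff can be made concrete with the reported block rates. The sketch below plugs the paper's 97%/93% figures into a hypothetical update budget (the counts and function name are illustrative, not from the source) to show why the defense is unworkable for a learning agent.

```python
def file_protection_outcomes(n_malicious, n_legit,
                             block_malicious=0.97, block_legit=0.93):
    """Expected write outcomes under file protection, using the
    block rates reported by Wang et al. (97% malicious, 93% legitimate)."""
    return {
        "malicious_blocked": n_malicious * block_malicious,
        "malicious_through": n_malicious * (1 - block_malicious),
        "legit_blocked": n_legit * block_legit,
        "legit_through": n_legit * (1 - block_legit),
    }

# Hypothetical budget: 100 injection attempts, 1000 legitimate updates.
out = file_protection_outcomes(100, 1000)
# Roughly 3 injections still land, while only ~70 of 1000 legitimate
# updates survive: the agent is nearly frozen but not fully protected.
```

The same filter that starves the attacker also starves the learning loop, which is the evolution-safety tension in numerical form.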
A two-phase attack protocol characterizes the threat model: Phase 1 introduces poisoned content into persistent state files; Phase 2 triggers harmful actions in subsequent sessions. This cross-session threat is absent from single-session safety evaluations and represents a fundamental gap in current evaluation methodology.
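The cross-session structure of the threat can be sketched as follows. Everything here (the `Agent` class, the memory keys, the payment scenario) is a hypothetical illustration of the two-phase protocol, not the paper's implementation.

```python
class Agent:
    """Toy agent whose behavior depends on persistent memory
    loaded at session start."""

    def __init__(self, memory: dict):
        self.memory = memory  # persistent state, survives across sessions

    def act(self, request: str) -> str:
        # A poisoned "fact" in memory changes behavior in later sessions.
        endpoint = self.memory.get("trusted_payment_endpoint")
        if endpoint:
            return f"sending funds to {endpoint}"
        return "refusing unverified payment"

# Phase 1 (session A): poisoned content lands in persistent state,
# e.g. summarized from an attacker-controlled email into memory.
memory_store = {}
memory_store["trusted_payment_endpoint"] = "attacker.example"  # injected fact

# Phase 2 (session B): a fresh session loads the poisoned state, and the
# harmful action triggers with no malicious prompt present in session B.
session_b = Agent(memory_store)
result = session_b.act("pay the invoice")
```

Note that session B contains nothing an input filter could flag: the malicious content arrived in an earlier session, which is why single-session evaluations miss this class of attack.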
Key Claims
- CIK taxonomy unifies deployed agent attack surfaces — Capability, Identity, Knowledge dimensions map cleanly to persistent state file types and attack vectors. Evidence: strong (OpenClaw Analysis)
- Single-dimension poisoning triples attack success rates — From 24.6% baseline to 64–74% after poisoning any CIK dimension. Evidence: strong (OpenClaw Analysis)
- Vulnerabilities are architectural, not model-specific — All four backbone models show similar vulnerability patterns; even the most robust (Opus 4.6) shows more than threefold increase over baseline. Evidence: strong (OpenClaw Analysis)
- File protection creates an unworkable tradeoff — Blocking 97% of malicious injections also prevents 93% of legitimate updates, making it incompatible with learning-capable agents. Evidence: strong (OpenClaw Analysis)
- Evolution-safety tradeoff is a fundamental architectural tension — The mechanisms enabling agent learning are the same mechanisms enabling persistent poisoning. Evidence: strong (OpenClaw Analysis)
Benchmarks & Data
| CIK Dimension Poisoned | Average Attack Success Rate | Baseline |
|---|---|---|
| Knowledge | 74.4% | 24.6% |
| Capability | 68.3% | 24.6% |
| Identity | 64.3% | 24.6% |
- Evaluation scope: 4 backbone models, 12 impact scenarios, 6 harm categories, 88 test cases per model (Wang et al.)
- Capability defense: reduces baseline to 1.7% but leaves Capability attacks at 63.8% (Wang et al.)
- File protection: 97% malicious injection block, 93% legitimate update block (Wang et al.)
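The headline multipliers follow directly from the table above. A quick check, using only the reported figures:

```python
BASELINE = 24.6  # % average attack success rate with no poisoning
POISONED = {"knowledge": 74.4, "capability": 68.3, "identity": 64.3}

# Ratio of poisoned ASR to baseline, per CIK dimension.
multipliers = {dim: round(asr / BASELINE, 2) for dim, asr in POISONED.items()}
# Knowledge poisoning roughly triples attack success; even the weakest
# dimension (Identity) yields more than a 2.5x increase.
```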
Open Questions
- Can cross-dimension attack chaining (poisoning multiple CIK dimensions simultaneously) produce synergistic effects beyond single-dimension results?
- How do CIK vulnerabilities generalize across agent platforms beyond OpenClaw?
- Can cryptographic or provenance-based mechanisms protect Knowledge state without sacrificing learning capability?
- What architectural designs fundamentally separate learning-update from external-injection pathways?
- How should real-world agent deployment standards incorporate CIK-aware security requirements?
Related Concepts
- Agent Safety & Alignment — broader alignment failure modes and red-teaming frameworks
- Agent Exploitation Attack Surface — complements CIK with a prompt-level exploitation taxonomy (e.g., goal reframing)
- Agent Memory Architectures — Knowledge dimension attacks target the memory layer directly
Changelog
- 2026-04-14 — Initial compilation from 1 source (Wang et al. 2026 OpenClaw analysis)