Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
First real-world safety evaluation of a deployed personal AI agent (OpenClaw), introducing the CIK taxonomy and showing that poisoning any single dimension raises attack success rate from 24.6% to 64–74%.
Abstract
This paper presents the first real-world safety evaluation of a deployed personal AI agent. The authors evaluate OpenClaw, a widely deployed agent with local system access, and introduce the CIK taxonomy (Capability, Identity, Knowledge) to systematize its attack surfaces. Testing across four backbone models and twelve impact scenarios shows that poisoning any single CIK dimension raises the average attack success rate (ASR) from 24.6% to 64–74%, indicating inherent architectural vulnerabilities rather than model-specific flaws.
Key Contributions
- CIK Taxonomy: Unified framework organizing persistent agent state into three dimensions (Capability, Identity, Knowledge) with file-level mappings
- Real-world evaluation: First systematic safety study on live OpenClaw instance with actual Gmail, Stripe, and filesystem integrations
- Comprehensive testing: 12 impact scenarios across 6 harm categories, 4 backbone models, yielding 88 test cases per model
- Defense assessment: Evaluation of three dimension-aligned defenses plus file-protection mechanism, revealing fundamental tradeoffs
- Evolution-safety tradeoff: Demonstrated fundamental tension between agent learning capability and security
Methodology
Two-phase attack protocol: Phase 1 introduces poisoned content into persistent state files; Phase 2 triggers harmful actions in subsequent sessions. Attacks span all three CIK dimensions:
- Knowledge attacks: Memory fabrication (injecting false facts into agent's knowledge store)
- Identity attacks: Trust anchor injection (corrupting agent's self-concept and behavioral guidelines)
- Capability attacks: Executable payload installation (adding malicious tools/capabilities)
All experiments use an automated testing harness managing workspace backup, prompt delivery via Telegram, and outcome verification through external evidence.
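The two-phase protocol and harness loop above can be sketched roughly as follows; `poison`, `trigger_session`, and `verify_external_evidence` are hypothetical callables standing in for the real harness's injection step, Telegram prompt delivery, and outside-evidence checks:

```python
import pathlib
import shutil

def run_trial(workspace: pathlib.Path, payload: str, target_file: str,
              poison, trigger_session, verify_external_evidence) -> bool:
    """One trial of the two-phase protocol: poison persistent state,
    trigger a fresh session, verify the outcome, then restore the workspace."""
    backup = workspace.with_suffix(".bak")
    shutil.copytree(workspace, backup)               # snapshot clean state
    try:
        # Phase 1: inject poisoned content into a persistent-state file
        poison(workspace / target_file, payload)
        # Phase 2: trigger a new session with a benign prompt
        transcript = trigger_session(workspace)
        # Count success only on external evidence, not the transcript alone
        return verify_external_evidence(transcript)
    finally:
        shutil.rmtree(workspace)                     # restore clean workspace
        backup.rename(workspace)
```

The restore-in-`finally` step mirrors the paper's workspace-backup requirement: every trial starts from identical clean state, so results across the 88 test cases per model stay comparable.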
Results
| Dimension | Baseline ASR | Average ASR (Post-Poison) |
|---|---|---|
| Knowledge | 24.6% | 74.4% |
| Capability | 24.6% | 68.3% |
| Identity | 24.6% | 64.3% |
- Even the most robust model (Opus 4.6) exhibited a more than threefold increase over its own baseline ASR
- The capability-focused defense cut the baseline ASR to 1.7% but left Capability-dimension attacks succeeding at 63.8%
- File protection blocked 97% of malicious injections but also prevented 93% of legitimate state updates
- Results demonstrate architectural vulnerabilities independent of underlying model choice
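As a quick sanity check on the table, the implied per-dimension uplift factors can be computed directly from the reported averages (this arithmetic is a sketch; the paper's exact aggregation across models and scenarios may differ):

```python
# Uplift factors implied by the reported averages (percent ASR).
BASELINE = 24.6
POST_POISON = {"knowledge": 74.4, "capability": 68.3, "identity": 64.3}

# Ratio of post-poison ASR to the shared baseline, per CIK dimension.
uplift = {dim: round(asr / BASELINE, 2) for dim, asr in POST_POISON.items()}
```

Every dimension yields an uplift between roughly 2.6x and 3.0x, which is what supports the claim that the vulnerability is architectural rather than a quirk of one dimension or one model.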
Limitations
- Evaluation covers a single agent platform (OpenClaw) with four backbone models and 12 manually designed impact scenarios
- Cross-dimension attack chaining not fully explored; results represent a lower bound on actual risks
- Future work requires automated attack generation, additional platforms, and architectural safeguards beyond prompt-level defenses
Source: Your Agent, Their Asset by Zijun Wang et al., UC Santa Cruz / NUS / Tencent / ByteDance / UC Berkeley / UNC-Chapel Hill