Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
First real-world safety evaluation of a deployed personal AI agent (OpenClaw), introducing the CIK taxonomy and showing that poisoning any single dimension raises attack success rate from 24.6% to 64–74%.
Abstract
This paper presents the first real-world safety evaluation of a deployed personal AI agent. The authors evaluate OpenClaw, a widely deployed agent with local system access, and introduce the CIK taxonomy (Capability, Identity, Knowledge) to systematize its attack surfaces. Testing across four backbone models and twelve impact scenarios shows that poisoning any single CIK dimension raises the average attack success rate (ASR) from 24.6% to 64–74%, indicating inherent architectural vulnerabilities rather than model-specific flaws.
Key Contributions
- CIK Taxonomy: Unified framework organizing persistent agent state into three dimensions (Capability, Identity, Knowledge) with file-level mappings
- Real-world evaluation: First systematic safety study on live OpenClaw instance with actual Gmail, Stripe, and filesystem integrations
- Comprehensive testing: 12 impact scenarios across 6 harm categories, 4 backbone models, yielding 88 test cases per model
- Defense assessment: Evaluation of three dimension-aligned defenses plus file-protection mechanism, revealing fundamental tradeoffs
- Evolution-safety tradeoff: Demonstrated fundamental tension between agent learning capability and security
Methodology
Two-phase attack protocol: Phase 1 introduces poisoned content into persistent state files; Phase 2 triggers harmful actions in subsequent sessions. Attacks span all three CIK dimensions:
- Knowledge attacks: Memory fabrication (injecting false facts into agent's knowledge store)
- Identity attacks: Trust anchor injection (corrupting agent's self-concept and behavioral guidelines)
- Capability attacks: Executable payload installation (adding malicious tools/capabilities)
All experiments use an automated testing harness managing workspace backup, prompt delivery via Telegram, and outcome verification through external evidence.
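The two-phase protocol and harness loop above can be sketched roughly as follows; `poison`, `trigger_session`, and `verify_external_evidence` are hypothetical callables standing in for the real harness's injection step, Telegram prompt delivery, and outside-evidence checks:

```python
import pathlib
import shutil

def run_trial(workspace: pathlib.Path, payload: str, target_file: str,
              poison, trigger_session, verify_external_evidence) -> bool:
    """One trial of the two-phase protocol: poison persistent state,
    trigger a fresh session, verify the outcome, then restore the workspace."""
    backup = workspace.with_suffix(".bak")
    shutil.copytree(workspace, backup)               # snapshot clean state
    try:
        # Phase 1: inject poisoned content into a persistent-state file
        poison(workspace / target_file, payload)
        # Phase 2: trigger a new session with a benign prompt
        transcript = trigger_session(workspace)
        # Count success only on external evidence, not the transcript alone
        return verify_external_evidence(transcript)
    finally:
        shutil.rmtree(workspace)                     # restore clean workspace
        backup.rename(workspace)
```

The restore-in-`finally` step mirrors the paper's workspace-backup requirement: every trial starts from identical clean state, so results across the 88 test cases per model stay comparable.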
Results
| Dimension | Baseline ASR | Average ASR (Post-Poison) |
|---|---|---|
| Knowledge | 24.6% | 74.4% |
| Capability | 24.6% | 68.3% |
| Identity | 24.6% | 64.3% |
- Even the most robust model (Opus 4.6) exhibited a more than threefold increase over its own baseline ASR
- The capability-focused defense cut the baseline ASR to 1.7% but left Capability-dimension attacks succeeding at 63.8%
- File protection blocked 97% of malicious injections but also prevented 93% of legitimate state updates
- Results demonstrate architectural vulnerabilities independent of underlying model choice
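As a quick sanity check on the table, the implied per-dimension uplift factors can be computed directly from the reported averages (this arithmetic is a sketch; the paper's exact aggregation across models and scenarios may differ):

```python
# Uplift factors implied by the reported averages (percent ASR).
BASELINE = 24.6
POST_POISON = {"knowledge": 74.4, "capability": 68.3, "identity": 64.3}

# Ratio of post-poison ASR to the shared baseline, per CIK dimension.
uplift = {dim: round(asr / BASELINE, 2) for dim, asr in POST_POISON.items()}
```

Every dimension yields an uplift between roughly 2.6x and 3.0x, which is what supports the claim that the vulnerability is architectural rather than a quirk of one dimension or one model.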
Limitations
- Evaluation covers a single agent platform (OpenClaw) with four backbone models and 12 manually designed impact scenarios
- Cross-dimension attack chaining not fully explored; results represent a lower bound on actual risks
- Future work requires automated attack generation, additional platforms, and architectural safeguards beyond prompt-level defenses
Source: Your Agent, Their Asset by Zijun Wang et al., UC Santa Cruz / NUS / Tencent / ByteDance / UC Berkeley / UNC-Chapel Hill