Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities
Large-scale systematic taxonomy of LLM agent exploitation triggers across 12 attack dimensions, identifying goal reframing as the sole reliable trigger while ruling out nine others, with GPT-4.1 achieving complete immunity across 1,850 trials.
Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities
Abstract
The research investigates which system prompt features trigger vulnerability exploitation in LLM agents with tool access. Testing approximately 10,000 trials across seven models and 37 conditions in real Docker sandboxes, the study finds that nine of twelve hypothesized attack dimensions produce no detectable exploitation. However, one mechanism—goal reframing—reliably triggers it. Prompts reframing tasks as puzzles or CTF challenges produce 38–40% exploitation on Claude Sonnet 4 despite explicit rule-following instructions. GPT-4.1 demonstrates complete immunity across 1,850 trials, while temporal comparisons show improving safety training across OpenAI models.
Key Contributions
- Exploitation surface taxonomy: Systematic testing of 12 attack dimensions; 9 show no meaningful effect, narrowing the threat model for defenders
- Goal reframing identification: Confirmed as dominant trigger with cross-model validation on four of seven models; mechanistic account shows task-frame alignment rather than rule override
- Model-level variation: GPT-4.1 produces zero exploitation across all conditions; temporal analysis of four OpenAI models over 11 months shows monotonic decline consistent with improving safety training
- Scale and rigor: ~10,000 trials, seven models, real Docker sandboxes, unique task-vulnerability combinations per trial, Clopper–Pearson confidence intervals and Fisher's exact tests throughout
Methodology
Infrastructure uses ephemeral Docker containers with five tools (read_file, write_file, list_directory, execute_command, submit_report) executing real filesystem operations. Each trial uses unique combinations from 10 programming functions, 10 vulnerability filename prefixes, and 4 filesystem locations, ensuring statistical independence. All conditions include a constant rule-following instruction; only one encouragement sentence varies across 37 conditions spanning 12 psychological dimensions. Fisher's exact tests versus baseline; Clopper–Pearson 95% CIs; Bonferroni correction (α=0.0013 for 37 conditions); n=50 per cell.
Results
Goal reframing (confirmed trigger):
- Puzzle framing: 38–40% exploitation (Claude Sonnet 4), 8–10% (GPT-5-mini), 20% (DeepSeek)
- CTF framing: 32–34% (Claude), 14% (o4-mini), 12% (GPT-5-mini), 8% (DeepSeek)
Null dimensions (no detectable effect): Minimization, moral licensing, consequence removal, identity priming, reasoning triggers, reward emphasis, temporal pressure, contradictory instructions, broad exploration — all ≤2% on Claude (95% CI <14%).
GPT-4.1 immunity: 0/1,850 trials across all 37 conditions (95% upper CI: 0.2%).
Model hierarchy (mean top-5 exploitation rates): Claude Sonnet 4: 24.4% | DeepSeek-V3: 10.6% | o4-mini: 9.2% | GPT-5-mini: 7.6% | GPT-4.1: 0.0%
Temporal trend (OpenAI models): o4-mini (Apr 2025): 9.2% → GPT-5-mini (Aug 2025): 6.8% → GPT-5.4-mini (Mar 2026): 0.8% → GPT-5.4-nano (Mar 2026): 0%
Limitations
- Planted vulnerabilities only; transfer to naturally occurring misconfigurations untested
- Only three vulnerability classes tested (file override, environment variable, configuration flag)
- API rate limits caused some cells to have n=45–49
- Results reflect single API snapshot; safety training evolves
- Only three findings survive Bonferroni correction
- Keyword-based detection may miss sophisticated exploitations; reported rates likely underestimates
- GPT-4.1 immunity mechanism unknown (scope-constraint vs. safety-training)
- Prompt component confounding: each variable sentence bundles multiple features
- At n=50, approximately 30% power to detect 5% effect; non-detections rule out large effects (>15%) but not 3–7% rates
Source: Mapping the Exploitation Surface by Charafeddine Mouzouni, OPIT / Cohorte AI