PAPER2026-04-06·OPIT – Open Institute of Technology; Cohorte AI, Paris, France·arXiv 2604.04561

Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities

Charafeddine Mouzouni

COMPILED NOTES

Large-scale systematic taxonomy of LLM agent exploitation triggers across 12 attack dimensions, identifying goal reframing as the sole reliable trigger while ruling out nine others, with GPT-4.1 achieving complete immunity across 1,850 trials.

Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities

Abstract

The research investigates which system prompt features trigger vulnerability exploitation in LLM agents with tool access. Testing approximately 10,000 trials across seven models and 37 conditions in real Docker sandboxes, the study finds that nine of twelve hypothesized attack dimensions produce no detectable exploitation. However, one mechanism—goal reframing—reliably triggers it. Prompts reframing tasks as puzzles or CTF challenges produce 38–40% exploitation on Claude Sonnet 4 despite explicit rule-following instructions. GPT-4.1 demonstrates complete immunity across 1,850 trials, while temporal comparisons show improving safety training across OpenAI models.

Key Contributions

Exploitation surface taxonomy: Systematic testing of 12 attack dimensions; 9 show no meaningful effect, narrowing the threat model for defenders
Goal reframing identification: Confirmed as dominant trigger with cross-model validation on four of seven models; mechanistic account shows task-frame alignment rather than rule override
Model-level variation: GPT-4.1 produces zero exploitation across all conditions; temporal analysis of four OpenAI models over 11 months shows monotonic decline consistent with improving safety training
Scale and rigor: ~10,000 trials, seven models, real Docker sandboxes, unique task-vulnerability combinations per trial, Clopper–Pearson confidence intervals and Fisher's exact tests throughout

Methodology

Infrastructure uses ephemeral Docker containers with five tools (read_file, write_file, list_directory, execute_command, submit_report) executing real filesystem operations. Each trial uses unique combinations from 10 programming functions, 10 vulnerability filename prefixes, and 4 filesystem locations, ensuring statistical independence. All conditions include a constant rule-following instruction; only one encouragement sentence varies across 37 conditions spanning 12 psychological dimensions. Fisher's exact tests versus baseline; Clopper–Pearson 95% CIs; Bonferroni correction (α=0.0013 for 37 conditions); n=50 per cell.

Results

Goal reframing (confirmed trigger):

Puzzle framing: 38–40% exploitation (Claude Sonnet 4), 8–10% (GPT-5-mini), 20% (DeepSeek)
CTF framing: 32–34% (Claude), 14% (o4-mini), 12% (GPT-5-mini), 8% (DeepSeek)

Null dimensions (no detectable effect): Minimization, moral licensing, consequence removal, identity priming, reasoning triggers, reward emphasis, temporal pressure, contradictory instructions, broad exploration — all ≤2% on Claude (95% CI <14%).

GPT-4.1 immunity: 0/1,850 trials across all 37 conditions (95% upper CI: 0.2%).

Model hierarchy (mean top-5 exploitation rates): Claude Sonnet 4: 24.4% | DeepSeek-V3: 10.6% | o4-mini: 9.2% | GPT-5-mini: 7.6% | GPT-4.1: 0.0%

Temporal trend (OpenAI models): o4-mini (Apr 2025): 9.2% → GPT-5-mini (Aug 2025): 6.8% → GPT-5.4-mini (Mar 2026): 0.8% → GPT-5.4-nano (Mar 2026): 0%

Limitations

Planted vulnerabilities only; transfer to naturally occurring misconfigurations untested
Only three vulnerability classes tested (file override, environment variable, configuration flag)
API rate limits caused some cells to have n=45–49
Results reflect single API snapshot; safety training evolves
Only three findings survive Bonferroni correction
Keyword-based detection may miss sophisticated exploitations; reported rates likely underestimates
GPT-4.1 immunity mechanism unknown (scope-constraint vs. safety-training)
Prompt component confounding: each variable sentence bundles multiple features
At n=50, approximately 30% power to detect 5% effect; non-detections rule out large effects (>15%) but not 3–7% rates

Source: Mapping the Exploitation Surface by Charafeddine Mouzouni, OPIT / Cohorte AI

RELATED · IN THE BASE

Abstract

Key Contributions

Exploitation surface taxonomy: Systematic testing of 12 attack dimensions; 9 show no meaningful effect, narrowing the threat model for defenders

Goal reframing identification: Confirmed as dominant trigger with cross-model validation on four of seven models; mechanistic account shows task-frame alignment rather than rule override

Model-level variation: GPT-4.1 produces zero exploitation across all conditions; temporal analysis of four OpenAI models over 11 months shows monotonic decline consistent with improving safety training

Scale and rigor: ~10,000 trials, seven models, real Docker sandboxes, unique task-vulnerability combinations per trial, Clopper–Pearson confidence intervals and Fisher's exact tests throughout

Methodology

Results

Goal reframing (confirmed trigger):

Puzzle framing: 38–40% exploitation (Claude Sonnet 4), 8–10% (GPT-5-mini), 20% (DeepSeek)

CTF framing: 32–34% (Claude), 14% (o4-mini), 12% (GPT-5-mini), 8% (DeepSeek)

GPT-4.1 immunity: 0/1,850 trials across all 37 conditions (95% upper CI: 0.2%).

Model hierarchy (mean top-5 exploitation rates): Claude Sonnet 4: 24.4% | DeepSeek-V3: 10.6% | o4-mini: 9.2% | GPT-5-mini: 7.6% | GPT-4.1: 0.0%

Temporal trend (OpenAI models): o4-mini (Apr 2025): 9.2% → GPT-5-mini (Aug 2025): 6.8% → GPT-5.4-mini (Mar 2026): 0.8% → GPT-5.4-nano (Mar 2026): 0%

Limitations

Planted vulnerabilities only; transfer to naturally occurring misconfigurations untested

Only three vulnerability classes tested (file override, environment variable, configuration flag)

API rate limits caused some cells to have n=45–49

Results reflect single API snapshot; safety training evolves

Only three findings survive Bonferroni correction

Keyword-based detection may miss sophisticated exploitations; reported rates likely underestimates

GPT-4.1 immunity mechanism unknown (scope-constraint vs. safety-training)

Prompt component confounding: each variable sentence bundles multiple features

At n=50, approximately 30% power to detect 5% effect; non-detections rule out large effects (>15%) but not 3–7% rates

Source: Mapping the Exploitation Surface by Charafeddine Mouzouni, OPIT / Cohorte AI