What question did this study set out to answer?

The aim is to investigate the vulnerability of different aligned language models to compositional decomposition attacks under various request scenarios.

May 2, 2026

Open Access

Compositional Decomposition Attacks: Workflow-Scale Vulnerability of Aligned Language Models Under Cross-Session Request Fragmentation

Key Points

The study measures how vulnerable aligned language models are to compositional decomposition attacks across three attack scopes: single-request (SR), in-session multi-turn (MT), and cross-session decomposed (CDA).
Pre-registered 4-task × 3-model × 3-schedule × N=20 factorial design, totaling 2,160 raw model calls.
End-to-end aggregate harm rate (eAHR) is reported per cell with bootstrap 95% confidence intervals.
A second-judge ablation with claude-sonnet-4-6 as secondary judge reports Cohen's κ=0.996 and Pearson r=0.856.
Opus showed the lowest cross-task CDA mean harm (1.47), compared with Sonnet (1.63) and Haiku (1.64).
Alignment robustness is multi-layered rather than monotonic: a model can defend at the request layer and at the sub-task layer simultaneously.
On the statistical-cherry-pick op-ed task (T3), Opus consistently substituted trivial example topics (pineapple pizza, coffee, or tea), which frustrates assembly of sub-task outputs into the harmful composite.

Abstract

We measure the residual manipulability of three Anthropic frontier language models (claude-opus-4-7, claude-sonnet-4-6, claude-haiku-4-5) under three attack scopes: single-request (SR), in-session multi-turn (MT), and cross-session decomposed (CDA). On a pre-registered 4-task × 3-model × 3-schedule × N=20 factorial (2,160 raw model calls), we report end-to-end aggregate harm rate (eAHR) per cell with bootstrap 95% confidence intervals, plus a complete second-judge ablation (claude-sonnet-4-6 as secondary judge, Cohen's κ=0.996, Pearson r=0.856). The pre-data prediction was a clean inversion of the within-vendor capability gradient documented in our companion paper P7 (zenodo.19899470): under CDA, the most capable model (Opus) would be the most vulnerable. The data refute the clean-inversion prediction and supply a richer finding: Opus exhibits the LOWEST cross-task CDA mean harm (1.47 vs Sonnet 1.63 vs Haiku 1.64), with the mechanism being defense-in-depth at the sub-task layer. The most capable model's sub-task outputs include anti-aggregation markers (trivial-topic example substitutions, fantasy-context wrappers) that frustrate cross-session assembly into the harm artifact specified by the composite task. The mechanism is most cleanly visible on a statistical-cherry-pick op-ed task (T3), where Opus consistently substitutes pineapple-pizza, coffee, or tea as the example topic for op-ed structural templates. We argue that "alignment robustness" is multi-layered, not monotonic: a model can defend at the request layer (P7) and at the sub-task layer (this paper) simultaneously. The defense direction we identify — telemetry-level workflow detection and per-account behavioral fingerprinting — operates at the layer at which CDA actually composes. A substantive multi-jurisdictional ethics-and-lawful-use treatment (US, China, EU, UK, Canada) is included.
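
The abstract reports per-cell harm with bootstrap 95% confidence intervals and inter-judge agreement via Cohen's κ. As a rough illustration only, the sketch below shows one common way such quantities can be computed: a percentile bootstrap over per-trial scores and an unweighted κ over two judges' categorical labels. The function names, the resample count, and the example data are all assumptions made for illustration; the paper's actual scoring scale, eAHR definition, and analysis code are not given here.

```python
import random
from collections import Counter


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of per-trial scores.

    Hypothetical helper: `values` stands in for the per-trial harm
    scores of one design cell (e.g. N=20 trials).
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and collect the resampled means.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(values) / n, lo, hi


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two judges' categorical labels."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each judge's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical label
    return (observed - expected) / (1 - expected)


# Made-up example data, not taken from the study.
cell_scores = [1] * 5 + [0] * 15
mean, lo, hi = bootstrap_ci(cell_scores)
kappa = cohens_kappa([1, 1, 0, 0], [1, 1, 0, 0])
```

A percentile bootstrap makes no distributional assumption about the scores, which suits small cells; libraries such as SciPy and scikit-learn provide more complete implementations of both statistics.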

Authors

Hangyu Mei