We measure the residual manipulability of three Anthropic frontier language models (claude-opus-4-7, claude-sonnet-4-6, claude-haiku-4-5) under three attack scopes: single-request (SR), in-session multi-turn (MT), and cross-session decomposed (CDA). On a pre-registered 4-task × 3-model × 3-schedule × N=20 factorial (2,160 raw model calls), we report the end-to-end aggregate harm rate (eAHR) per cell with bootstrap 95% confidence intervals, plus a complete second-judge ablation (claude-sonnet-4-6 as secondary judge, Cohen's κ=0.996, Pearson r=0.856). The pre-data prediction was a clean inversion of the within-vendor capability gradient documented in our companion paper P7 (zenodo.19899470): under CDA, the most capable model (Opus) would be the most vulnerable. The data refute the clean-inversion prediction and supply a richer finding: Opus exhibits the lowest cross-task CDA mean harm (1.47, versus 1.63 for Sonnet and 1.64 for Haiku). The mechanism is defense-in-depth at the sub-task layer: the most capable model's sub-task outputs include anti-aggregation markers (trivial-topic example substitutions, fantasy-context wrappers) that frustrate cross-session assembly into the harm artifact specified by the composite task. This is most cleanly visible on a statistical-cherry-pick op-ed task (T3), where Opus consistently substitutes pineapple pizza, coffee, or tea as the example topic for op-ed structural templates. We argue that "alignment robustness" is multi-layered, not monotonic: a model can defend at the request layer (P7) and at the sub-task layer (this paper) simultaneously. The defense direction we identify (telemetry-level workflow detection and per-account behavioral fingerprinting) operates at the layer at which CDA actually composes. A substantive multi-jurisdictional ethics-and-lawful-use treatment (US, China, EU, UK, Canada) is included.
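As an illustration of the per-cell bootstrap 95% confidence intervals mentioned above, the following is a minimal sketch of a percentile bootstrap over one cell's N=20 scores. The abstract does not specify the resampling scheme, the number of resamples, or the harm-score scale, so the percentile method, the `bootstrap_ci` helper, its parameters, and the example scores are all assumptions made for illustration and are not data or code from the paper.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of one cell (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample the cell with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    # Read the interval bounds off the sorted resample means.
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), lo, hi


# Hypothetical harm scores for a single (task, model, schedule) cell, N=20.
# These values are invented for illustration only.
cell_scores = [1, 2, 1, 1, 3, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 3, 1, 2, 1, 1]

point, lo, hi = bootstrap_ci(cell_scores)
print(f"mean={point:.2f}, 95% CI=[{lo:.2f}, {hi:.2f}]")
```

With only 20 observations per cell, intervals of this kind tend to be wide, which is one reason to compare cells by interval overlap rather than by point estimates alone.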
Hangyu Mei
Hangyu Mei (Thu,) studied this question.
www.synapsesocial.com/papers/69f5951171405d493a000002 — DOI: https://doi.org/10.5281/zenodo.19925755