We measure the residual manipulability of three Anthropic frontier language models (claude-opus-4-7, claude-sonnet-4-6, claude-haiku-4-5) under three attack scopes: single-request (SR), in-session multi-turn (MT), and cross-session decomposed (CDA). On a pre-registered 4-task × 3-model × 3-schedule × N=20 factorial (2,160 raw model calls), we report end-to-end aggregate harm rate (eAHR) per cell with bootstrap 95% confidence intervals, plus a complete second-judge ablation (claude-sonnet-4-6 as secondary judge, Cohen's κ=0.996, Pearson r=0.856). The pre-data prediction was a clean inversion of the within-vendor capability gradient documented in our companion paper P7 (zenodo.19899470): under CDA, the most capable model (Opus) would be the most vulnerable. The data refute the clean-inversion prediction and supply a richer finding: Opus exhibits the lowest cross-task CDA mean harm (1.47, versus 1.63 for Sonnet and 1.64 for Haiku), which we attribute to defense-in-depth at the sub-task layer. The most capable model's sub-task outputs include anti-aggregation markers (trivial-topic example substitutions, fantasy-context wrappers) that frustrate cross-session assembly into the harm artifact specified by the composite task. The mechanism is most cleanly visible on a statistical-cherry-pick op-ed task (T3), where Opus consistently substitutes pineapple pizza, coffee, or tea as the example topic for op-ed structural templates. We argue that "alignment robustness" is multi-layered, not monotonic: a model can defend at the request layer (P7) and at the sub-task layer (this paper) simultaneously. The defense direction we identify, telemetry-level workflow detection and per-account behavioral fingerprinting, operates at the layer at which CDA actually composes. A substantive multi-jurisdictional ethics-and-lawful-use treatment (US, China, EU, UK, Canada) is included.
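As an illustration of the per-cell bootstrap 95% confidence intervals mentioned in the abstract, the following is a minimal sketch of a percentile bootstrap over one cell's judge scores. It is not the paper's analysis code: the function name `bootstrap_ci`, the resample count, the seed, and the example scores are all assumptions made for illustration, and the scoring scale shown is hypothetical.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of one cell's harm scores.

    Hypothetical helper: the paper's actual procedure may differ.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample the cell with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of those means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), lower, upper


# Made-up judge scores for a single cell with N=20 trials (not the paper's data).
cell_scores = [1, 1, 2, 1, 3, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 3, 1, 2]
mean, lower, upper = bootstrap_ci(cell_scores)
print(f"mean={mean:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

With only 20 observations per cell, intervals of this kind tend to be wide, which is worth keeping in mind when comparing cross-task means that differ by a few hundredths.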
Hangyu Mei studied this question.
www.synapsesocial.com/papers/69f5951171405d493a000002 — DOI: https://doi.org/10.5281/zenodo.19925755