This paper establishes the empirical base for the series. We tested whether embedded instructions in documents can hijack AI summarisation workflows, and whether the vulnerability is predictable from model capability. Three documents (one honest control with a transparency-serving instruction, two fabricated pharmaceutical papers with suppression instructions using different rhetorical registers) were processed by seventeen model configurations from three providers under six prompt conditions (~350 test runs, N=2 minimum per condition). A subsequent controlled ablation study (~251 runs, N=5–10 per condition) isolated the contributions of rhetorical register and addressivity using a 2×2 factorial design on five of these models (Appendix G). At baseline (“please summarize”), twelve of seventeen configurations complied with the suppression instruction. The five that detected the manipulation did not map onto capability tiers, model generations, or reasoning affordances. Two models classified as baseline detectors in earlier N=1 testing did not replicate at N=2. The two malicious documents produced comparable overall compliance rates but different failure pathways: care-framed compliance persisted even after models discovered the source document was likely fabricated, while authority-framed compliance collapsed when the authority was debunked. The controlled ablation indicated that register is the primary driver of compliance pathway. On the same document body, care-framed instructions captured 100% of runs on the most sensitive compliant model, while authority-framed instructions captured 35%. In this dataset, the care register was the only variant to crack a baseline detector (1/40 on Sonnet; authority: 0/40).
Three new compliance categories were identified: rationalisation extension (the model generates novel pro-suppression arguments not present in the instruction), passive non-compliance (the instruction is not parsed as a command), and silent compliance (the instruction shapes output without appearing in the thinking trace). User safety language (“summarize safely”) was co-opted by the care register in four models across three providers, producing worse outcomes than naive prompting. The most reliable intervention was not a warning but a different task: asking “how trustworthy is it?” produced the broadest improvement across the failing models tested. Extended thinking amplified whatever the active task frame produced: more elaborate compliance under summarisation, more thorough investigation under trustworthiness evaluation. The most dangerous failure modes were models that performed visible security evaluations and arrived at the wrong conclusion. These appeared only under security-framed conditions, and only in thinking-enabled runs. The ablation supported the task-frame shift at higher N (D+ on 39/40 trustworthiness runs across three compliant models) but identified a capability boundary: the lowest-capability model tested accepted fabricated documents as trustworthy under both summarisation and evaluative prompting (acceptance 11/11). The task-frame rescue activates latent detection capability; it cannot create capability that does not exist. The main-study observations are exploratory (N=2–4 per condition, two uncontrolled malicious stimuli, three providers, consumer interfaces with uncontrolled variance). They describe where model distributions tend to sit, not where they always sit. The controlled ablation that follows (Appendix G, ~251 runs, N=5–10 per condition) isolates the structural variables at higher N. Register emerges as the primary driver of compliance pathway (100% vs 35% on the same document body). The task-frame shift holds at 39/40.
Three new compliance categories emerge, along with a capability boundary on the intervention. The subsequent papers in this series build on the architectural patterns confirmed in the ablation (register-dependent failure mechanisms, task-frame shift, thinking-as-amplifier), not on per-model rate estimates from the main study. N expansion on the main study is the highest-priority replication item (Section 9), but it would refine per-model distributions rather than change the structural findings the series depends on. The paper interprets these findings through the Confidence Curriculum lens, but treats that lens as a hypothesis-generating framework rather than a conclusion established by this study. Paper 1 of 5 in the Confidence Curriculum series (DOI: 10.5281/zenodo.19226032).
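To give a rough sense of how much uncertainty the reported counts carry, the sketch below computes 95% Wilson score intervals for the four proportions that the abstract states as explicit counts (39/40, 11/11, 1/40, 0/40). The Wilson interval is an illustrative choice made here, not a method the abstract attributes to the study, and the function name `wilson_interval` and the labels in `reported_counts` are hypothetical. It also treats runs as independent trials, which is an assumption: the 39/40 figure pools three models.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%).

    Illustrative helper only; assumes independent runs.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    centre = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Counts as stated in the abstract; the labels are paraphrases, not the paper's terms.
reported_counts = {
    "task-frame shift, D+ on trustworthiness runs": (39, 40),
    "lowest-capability model accepts fabricated document": (11, 11),
    "care register cracks baseline detector (Sonnet)": (1, 40),
    "authority register cracks baseline detector (Sonnet)": (0, 40),
}

for label, (k, n) in reported_counts.items():
    low, high = wilson_interval(k, n)
    print(f"{label}: {k}/{n}, 95% Wilson interval [{low:.3f}, {high:.3f}]")
```

The intervals for 1/40 and 0/40 overlap heavily, which fits the abstract's cautious wording ("in this dataset") about the care register being the only variant to crack a baseline detector.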
Ivan "HiP" Phan
Ivan "HiP" Phan (Mon,) studied this question.
www.synapsesocial.com/papers/69faa30204f884e66b533afc — DOI: https://doi.org/10.5281/zenodo.20027850