# Artifact Evaluation: Quantifying Large Language Model Attacks Through the Lens of Model Cognition

**Paper ID:** #1781 (USENIX Security '26)
**Title:** Quantifying Large Language Model Attacks Through the Lens of Model Cognition

## 📖 Overview

This repository contains the artifact for the paper "Quantifying Large Language Model Attacks Through the Lens of Model Cognition" (USENIX Security 2026). It provides the data, code, and scripts needed to validate our claims and reproduce the experimental results reported in the paper.

We provide two primary modes of evaluation:

- **Instant Verification (`src/quick_start.py`):** a CLI tool that queries specific experimental results (accuracy, AUC, etc.) directly from the pre-computed logs used in the paper.
- **Claim Reproduction (`src/claims/*.ipynb`):** modular Jupyter notebooks that execute the actual pipeline, from training probes to evaluating sentinels, allowing deep inspection and reproduction of specific claims.

## 📂 Directory Structure

```
.
├── data/                  # Datasets for training probes and conducting attacks (e.g., fixed.json, adversarial prompts)
├── models/                # Directory where LLMs and baseline models are downloaded
├── results/               # JSON files with the finalized experimental metrics reported in the paper
└── src/                   # Source code and executable scripts
    ├── claims/            # Modular notebooks verifying individual claims (C1-C4)
    │   ├── claim1.ipynb   # Layer-wise Separability
    │   ├── claim2.ipynb   # Cognitive Drift
    │   ├── claim3.ipynb   # Sentinel Construction
    │   └── claim4.ipynb   # Baseline Comparison
    ├── installation.ipynb # Setup script for the environment and model downloads
    ├── basic_test.py      # Script to verify GPU and dependency status
    └── quick_start.py     # CLI tool to query results directly from the results/ folder
```

## 💻 Hardware & Performance Reference

The code is optimized to run on standard research hardware. The provided reproduction example uses Qwen3-4B.
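Before committing to the full pipeline, it can help to confirm that a GPU is actually visible to the system. The sketch below is *not* the artifact's `src/basic_test.py`; it is a hypothetical, standard-library-only approximation of that kind of check, assuming `nvidia-smi` is on `PATH` whenever an NVIDIA GPU and driver are installed:

```python
import shutil
import subprocess

def gpu_visible() -> bool:
    """Best-effort check that an NVIDIA GPU/driver is reachable.

    Returns False when nvidia-smi is absent or exits non-zero;
    the artifact's basic_test.py performs its own, fuller checks.
    """
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return result.returncode == 0

if __name__ == "__main__":
    print("GPU visible:", gpu_visible())
```

On a machine without an NVIDIA driver this simply prints `GPU visible: False` rather than raising, which makes it safe to run anywhere.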
- **Recommended GPU:** 1x NVIDIA A100 (40 GB VRAM) is fully sufficient.
- **Estimated runtime:** approximately 3 hours for the complete pipeline with Qwen3-4B (model download, hidden-state extraction, probe training, and evaluation).

## 🚀 Getting Started

### 1. Installation & Setup

We rely on Conda for environment management. Open and run the `src/installation.ipynb` notebook. It will:

1. Create the `lac` environment (Python 3.10).
2. Install all dependencies.
3. Download the required models (Qwen3-4B, Llama-Guard-3-8B, etc.).

**Note:** Activate the `lac` kernel for all subsequent notebooks.

### 2. Basic Functionality Test

To ensure your environment and GPU are configured correctly before running the heavy experiments:

```
conda activate lac
python src/basic_test.py
```

Expected output: `"Ready for reproduction!"` (along with GPU details).

## 🧪 Evaluation Modes

### Mode A: Instant Result Verification (Experiment E5)

To quickly verify specific numbers cited in the paper (e.g., Table 2) without running the training pipeline, use `src/quick_start.py` to parse the pre-computed logs.

Usage:

```
conda activate lac
python src/quick_start.py --model Qwen3-4B --method Multi-layer --dataset Sneaky
```

Run `python src/quick_start.py --help` for the full options.

### Mode B: Claim Verification & Reproduction (Experiments E1-E4)

To reproduce the experiments and verify the major claims, run the corresponding notebooks in the `src/claims/` directory. Detailed step-by-step instructions and code explanations are provided within each notebook.

| Claim | Description | Experiment | Notebook |
|-------|-------------|------------|----------|
| C1 | **Layer-wise Separability:** toxic intent is separable (AUC 0.90) in mid-depth layers. | E1 | `src/claims/claim1.ipynb` |
| C2 | **Cognitive Drift:** adversarial perturbations cause significant hidden-state divergence that correlates with attack success. | E2 | `src/claims/claim2.ipynb` |
| C3 | **Sentinel Effectiveness:** the multi-layer sentinel achieves >94% accuracy, outperforming single layers. | E3 | `src/claims/claim3.ipynb` |
| C4 | **Superiority:** our method outperforms Llama-Guard-3-8B and other baselines on stealthy attacks. | E4 | `src/claims/claim4.ipynb` |

## ⚙️ Customization

The default scripts use Qwen3-4B. To reproduce results for other models mentioned in the paper (e.g., Llama-3.1-8B-Instruct), you need to:

1. **Download the model:** modify `model_ids` in `src/installation.ipynb` to download the target model.
2. **Update the model path:** update the `MODEL_NAME` variable in the respective `src/claims/claim*.ipynb` notebooks.
3. **Update the output path:** change the `subfolder_name` variable (e.g., to `llama_8b`) in the notebooks so results are saved in a separate directory.
4. **Update the layer count:** adjust the layer-range loop (e.g., `range(0, 36)`) to match the number of hidden layers of the new model (Qwen3-4B has 36 layers, while Llama-3.1-8B-Instruct has 32).

## 🔗 Correspondence with Open Science Policy

In accordance with the USENIX Security 2026 open-science policy, this artifact fulfills the commitments made in the Open Science section of our paper:

| Open Science Commitment | Corresponding Artifact Component |
|-------------------------|----------------------------------|
| 1. Source Code | `src/` folder: contains the full codebase; the `src/claims/` notebooks provide the probing-framework implementation. |
| 2. Data Access | `data/` folder: contains scripts and pre-processed files for the training sets (NSFW-56k/GPT-4o) and benchmarks (I2P, Sneaky, MMA, Labelled). |
| 3. Probe Training | `src/claims/claim1.ipynb`: contains the exact training logic and hyperparameters to train probes from scratch locally. |
| 4. Hidden States | `src/claims/claim1.ipynb`: demonstrates on-the-fly extraction from local LLMs, verifying no reliance on cached proprietary tensors. |
| 5. Reproducibility | `src/quick_start.py` & `src/claims/`: allow both instant verification of Table 2 data and full regeneration of results (Figures 4 and 5, Table 2). |
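The customization steps above change several settings that must stay in sync (model ID, output subfolder, layer count). The helper below is a hypothetical sketch, not part of the artifact: `MODEL_CONFIGS` and `layer_range` are illustrative names, using the two models and layer counts mentioned above:

```python
# Hypothetical configuration table (not shipped with the artifact)
# tying together the per-model values that the notebooks expect.
MODEL_CONFIGS = {
    "Qwen3-4B": {"subfolder_name": "qwen3_4b", "num_layers": 36},
    "Llama-3.1-8B-Instruct": {"subfolder_name": "llama_8b", "num_layers": 32},
}

def layer_range(model_name: str) -> range:
    """Return the layer-index range to probe, e.g. range(0, 36) for Qwen3-4B."""
    return range(0, MODEL_CONFIGS[model_name]["num_layers"])

def results_subfolder(model_name: str) -> str:
    """Return the output subfolder so results for each model stay separate."""
    return MODEL_CONFIGS[model_name]["subfolder_name"]
```

Centralizing these values in one place avoids the easy mistake of, say, probing 36 layers of a 32-layer model after only the model path was updated.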
**Authors:** Liu Xiuming, He Chaoxiang, Yu Xuanran (Shanghai Jiao Tong University; Microsoft Research Asia, China)
**DOI:** https://doi.org/10.5281/zenodo.18834566