# Artifact Evaluation: Quantifying Large Language Model Attacks Through the Lens of Model Cognition

**Paper ID:** #1781 (USENIX Security '26)
**Title:** Quantifying Large Language Model Attacks Through the Lens of Model Cognition

## 📖 Overview

This repository contains the artifact for the paper "Quantifying Large Language Model Attacks Through the Lens of Model Cognition" (USENIX Security 2026). It provides the data, code, and scripts needed to validate our claims and reproduce the experimental results reported in the paper.

We provide two primary modes of evaluation:

- **Instant Verification (`src/quick_start.py`):** a CLI tool that queries specific experimental results (accuracy, AUC, etc.) directly from the pre-computed logs used in the paper.
- **Claim Reproduction (`src/claims/*.ipynb`):** modular Jupyter notebooks that execute the actual pipeline, from training probes to evaluating sentinels, allowing deep inspection and reproduction of specific claims.

## 📂 Directory Structure

```
.
├── data/                  # Datasets for training probes and conducting attacks (e.g., fixed.json, adversarial prompts)
├── models/                # Directory where LLMs and baseline models are downloaded
├── results/               # JSON files with the finalized experimental metrics reported in the paper
└── src/                   # Source code and executable scripts
    ├── claims/            # Modular notebooks verifying individual claims (C1-C4)
    │   ├── claim1.ipynb   # Layer-wise Separability
    │   ├── claim2.ipynb   # Cognitive Drift
    │   ├── claim3.ipynb   # Sentinel Construction
    │   └── claim4.ipynb   # Baseline Comparison
    ├── installation.ipynb # Setup script for the environment and model downloads
    ├── basic_test.py      # Script to verify GPU and dependency status
    └── quick_start.py     # CLI tool to query results directly from the results/ folder
```

## 💻 Hardware & Performance Reference

The code is optimized to run on standard research hardware. The provided reproduction example uses Qwen3-4B.
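Before committing to the full pipeline, it can help to confirm that a GPU is actually visible to the system. The sketch below is *not* the artifact's `src/basic_test.py`; it is a hypothetical, standard-library-only approximation of that kind of check, assuming `nvidia-smi` is on `PATH` whenever an NVIDIA GPU and driver are installed:

```python
import shutil
import subprocess

def gpu_visible() -> bool:
    """Best-effort check that an NVIDIA GPU/driver is reachable.

    Returns False when nvidia-smi is absent or exits non-zero;
    the artifact's basic_test.py performs its own, fuller checks.
    """
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return result.returncode == 0

if __name__ == "__main__":
    print("GPU visible:", gpu_visible())
```

On a machine without an NVIDIA driver this simply prints `GPU visible: False` rather than raising, which makes it safe to run anywhere.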
- **Recommended GPU:** 1x NVIDIA A100 (40 GB VRAM) is fully sufficient.
- **Estimated runtime:** approximately 3 hours for the complete pipeline with Qwen3-4B (model download, hidden-state extraction, probe training, and evaluation).

## 🚀 Getting Started

### 1. Installation & Setup

We rely on Conda for environment management. Open and run the `src/installation.ipynb` notebook. It will:

1. Create the `lac` environment (Python 3.10).
2. Install all dependencies.
3. Download the required models (Qwen3-4B, Llama-Guard-3-8B, etc.).

**Note:** Activate the `lac` kernel for all subsequent notebooks.

### 2. Basic Functionality Test

To ensure your environment and GPU are configured correctly before running the heavy experiments:

```
conda activate lac
python src/basic_test.py
```

Expected output: `"Ready for reproduction!"` (along with GPU details).

## 🧪 Evaluation Modes

### Mode A: Instant Result Verification (Experiment E5)

To quickly verify specific numbers cited in the paper (e.g., Table 2) without running the training pipeline, use `src/quick_start.py` to parse the pre-computed logs.

Usage:

```
conda activate lac
python src/quick_start.py --model Qwen3-4B --method Multi-layer --dataset Sneaky
```

Run `python src/quick_start.py --help` for the full options.

### Mode B: Claim Verification & Reproduction (Experiments E1-E4)

To reproduce the experiments and verify the major claims, run the corresponding notebooks in the `src/claims/` directory. Detailed step-by-step instructions and code explanations are provided within each notebook.

| Claim | Description | Experiment | Notebook |
|-------|-------------|------------|----------|
| C1 | **Layer-wise Separability:** toxic intent is separable (AUC 0.90) in mid-depth layers. | E1 | `src/claims/claim1.ipynb` |
| C2 | **Cognitive Drift:** adversarial perturbations cause significant hidden-state divergence that correlates with attack success. | E2 | `src/claims/claim2.ipynb` |
| C3 | **Sentinel Effectiveness:** the multi-layer sentinel achieves >94% accuracy, outperforming single layers. | E3 | `src/claims/claim3.ipynb` |
| C4 | **Superiority:** our method outperforms Llama-Guard-3-8B and other baselines on stealthy attacks. | E4 | `src/claims/claim4.ipynb` |

## ⚙️ Customization

The default scripts use Qwen3-4B. To reproduce results for other models mentioned in the paper (e.g., Llama-3.1-8B-Instruct), you need to:

1. **Download the model:** modify `model_ids` in `src/installation.ipynb` to download the target model.
2. **Update the model path:** update the `MODEL_NAME` variable in the respective `src/claims/claim*.ipynb` notebooks.
3. **Update the output path:** change the `subfolder_name` variable (e.g., to `llama_8b`) in the notebooks so results are saved in a separate directory.
4. **Update the layer count:** adjust the layer-range loop (e.g., `range(0, 36)`) to match the number of hidden layers of the new model (Qwen3-4B has 36 layers, while Llama-3.1-8B-Instruct has 32).

## 🔗 Correspondence with Open Science Policy

In accordance with the USENIX Security 2026 open-science policy, this artifact fulfills the commitments made in the Open Science section of our paper:

| Open Science Commitment | Corresponding Artifact Component |
|-------------------------|----------------------------------|
| 1. Source Code | `src/` folder: contains the full codebase; the `src/claims/` notebooks provide the probing-framework implementation. |
| 2. Data Access | `data/` folder: contains scripts and pre-processed files for the training sets (NSFW-56k/GPT-4o) and benchmarks (I2P, Sneaky, MMA, Labelled). |
| 3. Probe Training | `src/claims/claim1.ipynb`: contains the exact training logic and hyperparameters to train probes from scratch locally. |
| 4. Hidden States | `src/claims/claim1.ipynb`: demonstrates on-the-fly extraction from local LLMs, verifying no reliance on cached proprietary tensors. |
| 5. Reproducibility | `src/quick_start.py` & `src/claims/`: allow both instant verification of Table 2 data and full regeneration of results (Figures 4 and 5, Table 2). |
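The customization steps above change several settings that must stay in sync (model ID, output subfolder, layer count). The helper below is a hypothetical sketch, not part of the artifact: `MODEL_CONFIGS` and `layer_range` are illustrative names, using the two models and layer counts mentioned above:

```python
# Hypothetical configuration table (not shipped with the artifact)
# tying together the per-model values that the notebooks expect.
MODEL_CONFIGS = {
    "Qwen3-4B": {"subfolder_name": "qwen3_4b", "num_layers": 36},
    "Llama-3.1-8B-Instruct": {"subfolder_name": "llama_8b", "num_layers": 32},
}

def layer_range(model_name: str) -> range:
    """Return the layer-index range to probe, e.g. range(0, 36) for Qwen3-4B."""
    return range(0, MODEL_CONFIGS[model_name]["num_layers"])

def results_subfolder(model_name: str) -> str:
    """Return the output subfolder so results for each model stay separate."""
    return MODEL_CONFIGS[model_name]["subfolder_name"]
```

Centralizing these values in one place avoids the easy mistake of, say, probing 36 layers of a 32-layer model after only the model path was updated.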
**Authors:** Liu Xiuming, He Chaoxiang, Yu Xuanran (Shanghai Jiao Tong University; Microsoft Research Asia, China)
**DOI:** https://doi.org/10.5281/zenodo.18834566