Large Language Models (LLMs) are increasingly deployed in decision-support systems across high-stakes domains, yet their susceptibility to cognitive biases—systematic deviations from rational judgment well-documented in human psychology—remains poorly understood in quantitative terms. Existing studies typically examine a narrow set of biases, test a single model family, and rely on qualitative assessments of bias presence. In this work, we present a rigorous experimental framework, inspired by the methodology of experimental physics, for the systematic quantitative measurement of cognitive biases in LLMs. We introduce the Bias Strength Index (BSI), a normalized metric with associated confidence intervals that quantifies the magnitude of bias on a continuous scale, and we decompose the total uncertainty into statistical and systematic components—the latter arising from prompt reformulation. We evaluate a comprehensive taxonomy of eleven cognitive biases (including anchoring, framing effect, confirmation bias, availability heuristic, sunk cost fallacy, bandwagon effect, status quo bias, and others) across eight state-of-the-art LLMs from seven families: GPT-4.1 Mini, Claude 3.5 Sonnet, Gemini 2.5 Flash, Llama 3.3 70B, Llama 3.1 8B, Mistral Large (mistral-large-2411), DeepSeek V3, and MiniMax M2.5. Each bias is probed through multiple semantically equivalent prompt variants, with N = 100 independent trials per configuration, yielding a dataset of over 70,000 model responses. Our results reveal that all tested models exhibit non-zero bias effects for multiple bias categories, though with markedly different profiles. 
A trial-level Generalized Linear Mixed-Effects Model (GLMM) analysis finds statistically significant bias effects in 27 of 43 testable bias–model combinations (62.8%) after multiple-comparison correction, while a more conservative variant-level test—which requires effects to generalize across prompt formulations—yields only one significant result, highlighting the dominant role of prompt-induced systematic uncertainty. Framing and primacy/recency effects are near-universal, while susceptibility to other biases varies substantially across model families. We further evaluate three debiasing strategies—zero-shot chain-of-thought, adversarial counter-prompting, and role-based prompting—applied at inference time without modifying model weights. Our findings provide a quantitative foundation for auditing cognitive biases in LLMs and highlight the bias-dependent effectiveness of prompt-based debiasing techniques.
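As a rough illustration of the statistical/systematic decomposition described above, the sketch below combines per-variant results for one bias–model pair. The abstract does not give the BSI formula, so everything here is an assumption made for illustration: the per-variant index is taken to be the difference between the biased-response rates under the bias-inducing and control conditions, the statistical component is a propagated binomial standard error, the systematic component is the spread across prompt variants, and the two are combined in quadrature. The function name `bsi_with_uncertainty` and the input layout are hypothetical.

```python
import math
from statistics import mean, stdev


def bsi_with_uncertainty(variants):
    """Illustrative bias index with statistical and systematic uncertainty.

    ASSUMPTION: this is not the paper's BSI definition, only a stand-in.
    `variants` holds one tuple per prompt variant:
    (biased_hits, biased_n, control_hits, control_n).
    """
    per_variant = []  # index value for each prompt variant
    variances = []    # binomial variance of each per-variant index
    for b_hits, b_n, c_hits, c_n in variants:
        p_b = b_hits / b_n  # biased-response rate, bias-inducing condition
        p_c = c_hits / c_n  # biased-response rate, control condition
        per_variant.append(p_b - p_c)
        variances.append(p_b * (1 - p_b) / b_n + p_c * (1 - p_c) / c_n)

    k = len(per_variant)
    bsi = mean(per_variant)
    # Statistical: standard error of the mean of k independent estimates.
    stat = math.sqrt(sum(variances)) / k
    # Systematic: sample spread across prompt reformulations
    # (a crude estimate; it also absorbs some statistical noise).
    syst = stdev(per_variant) if k > 1 else 0.0
    # Total: quadrature sum of the two components.
    total = math.hypot(stat, syst)
    return bsi, stat, syst, total


# Made-up counts for three prompt variants with 100 trials per condition.
example = [(70, 100, 50, 100), (60, 100, 50, 100), (80, 100, 50, 100)]
bsi, stat, syst, total = bsi_with_uncertainty(example)
print(f"BSI = {bsi:.3f} +/- {stat:.3f} (stat) +/- {syst:.3f} (syst)")
```

In this made-up example the spread across variants is larger than the sampling error, which is the kind of situation in which a variant-level test would be far more conservative than a trial-level one.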
A. Pagliaro studied this question.