Large Language Models (LLMs) inherit societal biases from their training data, which can lead to harmful outputs. Various techniques aim to mitigate these biases, but their effects are typically evaluated only along the targeted dimension, leaving cross-dimensional consequences unexplored. This work provides the first systematic quantification of cross-category spillover effects in LLM bias mitigation. We evaluate four bias mitigation techniques (Logit Steering, Activation Patching, BiasEdit, and Prompt Debiasing) across ten models from seven families, measuring their impact on race-, religion-, profession-, and gender-related biases using the StereoSet benchmark. Across 160 experiments yielding 640 evaluations, we find that targeted interventions cause collateral degradation in model coherence and in performance on debiasing objectives in 31.5% of evaluations on untargeted dimensions. These findings provide empirical evidence that debiasing improvements along one dimension can come at the cost of degradation along others. We introduce a multi-dimensional auditing framework and demonstrate that single-target evaluations mask potentially severe spillover effects. This underscores the need for robust, multi-dimensional evaluation tools when developing and assessing bias mitigation strategies, so that bias is not inadvertently shifted or worsened along untargeted axes.
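One way to read the experiment counts above is as a full grid: each technique applied to each model with each bias dimension as the target, and each resulting experiment then measured on all four dimensions. The abstract does not state this decomposition, so the sketch below is an assumption that happens to reproduce the reported totals. The model identifiers are placeholders, since the ten models are not named here.

```python
from itertools import product

# Technique and dimension names are taken from the abstract.
TECHNIQUES = ["Logit Steering", "Activation Patching", "BiasEdit", "Prompt Debiasing"]
DIMENSIONS = ["race", "religion", "profession", "gender"]
# Hypothetical placeholders: the abstract does not list the ten models.
MODELS = [f"model_{i}" for i in range(10)]


def build_grid():
    """Enumerate the assumed experiment grid and its evaluations."""
    # One experiment per (technique, model, targeted dimension).
    experiments = list(product(TECHNIQUES, MODELS, DIMENSIONS))
    # Each experiment is measured on every dimension, targeted or not.
    evaluations = [
        (tech, model, target, measured)
        for tech, model, target in experiments
        for measured in DIMENSIONS
    ]
    # Spillover can only be observed where the measured dimension
    # differs from the targeted one.
    untargeted = [e for e in evaluations if e[2] != e[3]]
    return experiments, evaluations, untargeted


experiments, evaluations, untargeted = build_grid()
print(len(experiments), len(evaluations), len(untargeted))
```

Under this assumed design, the untargeted subset is the denominator for the spillover rate, which is why a single-target evaluation (measured dimension equal to target) cannot reveal it.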
Shireen Chand
Faith Baca
Emilio Ferrara
AI
University of Southern California
Chand et al. studied this question.
www.synapsesocial.com/papers/69671985c0d1e3cfbfce8e9a — DOI: https://doi.org/10.3390/ai7010024