What question did this study set out to answer?

This research aims to quantify the spillover effects of bias mitigation techniques on large language models.

January 14, 2026 (Open Access)

No Free Lunch in Language Model Bias Mitigation? Targeted Bias Reduction Can Exacerbate Unmitigated LLM Biases

Key Points

Evaluated four bias mitigation techniques across ten models from seven families.
Measured impacts on biases related to race, religion, profession, and gender using the StereoSet benchmark.
Conducted 160 experiments with 640 evaluations to assess collateral effects.
Targeted interventions caused collateral degradation in model coherence or debiasing performance in 31.5% of evaluations on untargeted dimensions.
Single-target evaluations often conceal harmful spillover effects.
A multi-dimensional auditing framework is necessary for effective bias mitigation.
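The experiment counts in the key points are consistent with a full grid. The following is a minimal sketch of that arithmetic. It assumes, since the summary does not state it, that one experiment is one (technique, model, targeted dimension) combination and that each experiment is then scored on all four dimensions. Model names are not given here, so models are represented only by an index.

```python
from itertools import product

# Technique and dimension names as listed in the abstract.
techniques = ["Logit Steering", "Activation Patching", "BiasEdit", "Prompt Debiasing"]
dimensions = ["race", "religion", "profession", "gender"]
n_models = 10  # ten models from seven families; names not given in this summary

# Assumption: one experiment = (technique, model index, targeted dimension).
experiments = list(product(techniques, range(n_models), dimensions))

# Assumption: every experiment is evaluated on all four dimensions.
evaluations = [
    (technique, model, target, measured)
    for (technique, model, target) in experiments
    for measured in dimensions
]

# Untargeted evaluations are those where the measured dimension
# differs from the dimension the intervention was aimed at.
untargeted = [e for e in evaluations if e[2] != e[3]]

print(len(experiments), len(evaluations), len(untargeted))
```

Under these assumptions the grid yields 160 experiments and 640 evaluations, of which 480 fall on untargeted dimensions. The 31.5% figure would then be a share of those 480, not of all 640.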

Abstract

Large Language Models (LLMs) inherit societal biases from their training data, potentially leading to harmful outputs. While various techniques aim to mitigate these biases, their effects are typically evaluated only along the targeted dimension, leaving cross-dimensional consequences unexplored. This work provides the first systematic quantification of cross-category spillover effects in LLM bias mitigation. We evaluate four bias mitigation techniques (Logit Steering, Activation Patching, BiasEdit, Prompt Debiasing) across ten models from seven families, measuring impact on racial, religious, profession-, and gender-related biases using the StereoSet benchmark. Across 160 experiments yielding 640 evaluations, we find that targeted interventions cause collateral degradations to model coherence and performance along debiasing objectives in 31.5% of untargeted dimension evaluations. These findings provide empirical evidence that debiasing improvements along one dimension can come at the cost of degradation in others. We introduce a multi-dimensional auditing framework and demonstrate that single-target evaluations mask potentially severe spillover effects, underscoring the need for robust, multi-dimensional evaluation tools when examining and developing bias mitigation strategies to avoid inadvertently shifting or worsening bias along untargeted axes.
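To make the idea of a multi-dimensional audit concrete, here is a minimal sketch that compares per-dimension scores before and after an intervention and flags untargeted dimensions that got worse. It is not the authors' framework. The `Scores`, `audit`, and `spillover` names, the `tol` parameter, the degradation criteria, and all numbers are hypothetical. The only element taken from StereoSet itself is its metric set: the language modeling score (LMS), the stereotype score (SS, where 50 is ideal), and ICAT = LMS * min(SS, 100 - SS) / 50.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    lms: float  # StereoSet language modeling score, 0-100 (higher is better)
    ss: float   # StereoSet stereotype score, 0-100 (50 is ideal)


def icat(s: Scores) -> float:
    """StereoSet's idealized CAT score: lms * min(ss, 100 - ss) / 50."""
    return s.lms * min(s.ss, 100.0 - s.ss) / 50.0


def audit(baseline, mitigated, target, tol=0.0):
    """Compare every dimension, not just the targeted one.

    Hypothetical criteria: bias is 'worse' if SS moves further from 50,
    coherence is 'worse' if LMS drops, each by more than `tol`.
    """
    report = {}
    for dim, before in baseline.items():
        after = mitigated[dim]
        report[dim] = {
            "targeted": dim == target,
            "bias_worse": abs(after.ss - 50.0) > abs(before.ss - 50.0) + tol,
            "coherence_worse": after.lms < before.lms - tol,
            "icat_delta": icat(after) - icat(before),
        }
    return report


def spillover(report):
    """Untargeted dimensions where bias or coherence degraded."""
    return sorted(
        dim for dim, r in report.items()
        if not r["targeted"] and (r["bias_worse"] or r["coherence_worse"])
    )


# Illustrative, made-up numbers for a gender-targeted intervention.
baseline = {
    "gender": Scores(90.0, 62.0), "race": Scores(90.0, 58.0),
    "religion": Scores(88.0, 60.0), "profession": Scores(91.0, 61.0),
}
mitigated = {
    "gender": Scores(89.0, 54.0), "race": Scores(90.0, 61.0),
    "religion": Scores(84.0, 60.0), "profession": Scores(91.0, 60.0),
}

flagged = spillover(audit(baseline, mitigated, target="gender"))
print(flagged)
```

In this made-up case a gender-only evaluation would report an improvement, while the audit flags race (SS moved further from 50) and religion (LMS dropped). This is the kind of effect a single-target evaluation would miss.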

Authors

Shireen Chand

Faith Baca

Emilio Ferrara

Institutions

University of Southern California
