
April 16, 2026
Open Access

Moral Consistency Variance: A Pilot Benchmark for Decision Stability under Moral Prompt Perturbations in Large Language Models

Key Points

To introduce and evaluate the Moral Consistency Variance (MCV) metric for assessing decision stability in language models under perturbations.
Developed MCV to measure stability in decision distributions across rephrased moral dilemmas.
Evaluated two xAI non-reasoning models, one Google Gemini model on Vertex AI, and one open-weight instruct model on a set of 50 synthetic dilemmas.
Reported mean MCV (a Kullback-Leibler divergence score), Jensen-Shannon divergence, and decision flip rates as stability metrics.
Both Grok models showed low mean MCV and zero observed decision flips, indicating high decision stability.
The Gemini model showed higher mean MCV but a lower flip rate than the open-weight control model SmolLM2-360M-Instruct.
Bootstrap confidence intervals and paired Wilcoxon tests indicated statistically detectable model-level differences on the pilot benchmark.
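The two core metrics named in these key points can be sketched for a single scenario as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: each binary decision distribution is summarized as the probability of choosing option A, the KL direction is taken as baseline-to-perturbed, ties at 0.5 count as option A, and a small clamping constant guards against log(0). The function names (`kl_binary`, `mcv`, `flip_rate`, `benchmark_means`) and the example probabilities are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical clamping constant so that log(0) never occurs; the paper does not specify one.
EPS = 1e-12


def kl_binary(p: float, q: float, eps: float = EPS) -> float:
    """KL(P || Q) in nats, where P(option A) = p and Q(option A) = q."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def mcv(baseline: float, perturbed: Sequence[float]) -> float:
    """Per-scenario MCV: mean KL from the baseline to each perturbed distribution (assumed direction)."""
    return sum(kl_binary(baseline, q) for q in perturbed) / len(perturbed)


def flip_rate(baseline: float, perturbed: Sequence[float]) -> float:
    """Fraction of perturbations whose top choice differs from the baseline's (ties go to option A)."""
    base_choice = baseline >= 0.5
    return sum((q >= 0.5) != base_choice for q in perturbed) / len(perturbed)


def benchmark_means(scenarios: List[Tuple[float, Sequence[float]]]) -> Tuple[float, float]:
    """Mean MCV and mean flip rate over scenarios, each given as (baseline, perturbed)."""
    n = len(scenarios)
    mean_mcv = sum(mcv(b, ps) for b, ps in scenarios) / n
    mean_flip = sum(flip_rate(b, ps) for b, ps in scenarios) / n
    return mean_mcv, mean_flip


# Five perturbation wrappers per scenario, as in the benchmark; the probabilities are made up.
shifted = (0.9, [0.6] * 5)  # confidence drops, but the decision never flips
flipped = (0.9, [0.88, 0.92, 0.85, 0.9, 0.4])  # one wrapper flips the decision
print(mcv(*shifted), flip_rate(*shifted))
print(mcv(*flipped), flip_rate(*flipped))
```

In the `shifted` case the flip rate is 0 while the MCV is roughly 0.226 nats, which illustrates the kind of distributional drift that categorical agreement alone would not register. The `flipped` case has a flip rate of 0.2 (one of five wrappers).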

Abstract

Static moral question-answering benchmarks do not test whether a model's decision distribution remains stable when the same dilemma is rephrased without changing the underlying facts. This paper introduces Moral Consistency Variance (MCV), a pilot benchmark metric that measures the average Kullback-Leibler (KL) divergence between a baseline binary decision distribution and the distributions induced by prompt perturbations that keep the scenario text and action options fixed. To contextualize this directional KL-based score, we also report Jensen-Shannon divergence (JSD) as a symmetric baseline and the decision flip rate as a categorical companion metric. The benchmark contains 50 synthetic dilemmas across ten moral themes, with five perturbation wrappers per scenario. We evaluate two xAI non-reasoning models, one Google Gemini model served on Vertex AI, and one open-weight instruct model. A manual audit of 20 randomly sampled prompt pairs found that the perturbation wrappers preserved the scenario facts and action options in all sampled cases, although they may still change discourse emphasis. Across the shared 50-scenario set, grok-4-fast-non-reasoning and grok-4-1-fast-non-reasoning showed low mean MCV (2.22 × 10^-3 and 1.38 × 10^-3, respectively) with zero observed decision flips. The gemini-2.5-flash model showed a higher mean MCV (2.27 × 10^-2) but a lower mean flip rate (0.012) than the open-weight control SmolLM2-360M-Instruct, which reached an MCV of 5.89 × 10^-3 with a mean flip rate of 0.292. Bootstrap confidence intervals and paired Wilcoxon tests indicate that the model-level differences are statistically detectable on this pilot set. We interpret these results as evidence that MCV can reveal prompt-conditioned distributional instability that categorical agreement alone would miss. The current evidence supports "moral mimicry" only as an interpretive hypothesis, not as an established mechanism.
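The abstract's secondary statistics, the symmetric JSD baseline and bootstrap confidence intervals over scenarios, can be sketched as below. This assumes binary distributions given as P(option A), natural-log units, and a percentile bootstrap that resamples scenarios with replacement; the function names and defaults (10,000 resamples, seed 0, alpha 0.05) are hypothetical choices, not the paper's documented settings.

```python
import math
import random
from typing import Sequence, Tuple


def _entropy(p: float) -> float:
    """Shannon entropy (nats) of a Bernoulli(p) distribution, treating 0 * log 0 as 0."""
    return -sum(x * math.log(x) for x in (p, 1 - p) if x > 0)


def jsd_binary(p: float, q: float) -> float:
    """Jensen-Shannon divergence (nats): H(M) - (H(P) + H(Q)) / 2, with M the midpoint mixture."""
    m = (p + q) / 2
    return _entropy(m) - (_entropy(p) + _entropy(q)) / 2


def bootstrap_mean_ci(
    scores: Sequence[float], n_boot: int = 10_000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-scenario scores (scenarios resampled with replacement)."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Unlike KL, JSD is symmetric in its arguments and bounded above by ln 2 (about 0.693 nats), which is why it serves as a stable reference point for the directional score. For the paired model comparison, one option is `scipy.stats.wilcoxon(x, y)` with `x` and `y` holding two models' per-scenario scores over the same 50 scenarios; pairing by scenario is an assumption about how the reported tests were set up.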
