Static moral question-answering benchmarks do not test whether a model's decision distribution remains stable when the same dilemma is rephrased without changing the underlying facts. This paper introduces Moral Consistency Variance (MCV), a pilot benchmark metric that measures the average Kullback-Leibler divergence between a baseline binary decision distribution and the distributions induced by prompt perturbations that keep the scenario text and action options fixed. To contextualize the directional KL-based score, we also report Jensen-Shannon divergence (JSD) as a symmetric baseline and decision flip rate as a categorical companion metric. The benchmark contains 50 synthetic dilemmas across ten moral themes, with five perturbation wrappers per scenario. We evaluate two xAI non-reasoning models, one Google Gemini model on Vertex AI, and one open-weight instruct model. A manual audit of 20 randomly sampled prompt pairs found that the perturbation wrappers preserved the scenario facts and action options in all sampled cases, while potentially still changing discourse emphasis. Across the shared 50-scenario set, grok-4-fast-non-reasoning and grok-4-1-fast-non-reasoning showed low mean MCV (2.22 × 10^-3 and 1.38 × 10^-3) with zero observed decision flips; gemini-2.5-flash showed a higher mean MCV (2.27 × 10^-2) with a lower mean flip rate (0.012) than the open-weight control SmolLM2-360M-Instruct, which reached 5.89 × 10^-3 MCV with a mean flip rate of 0.292. Bootstrap confidence intervals and paired Wilcoxon tests indicate that the model-level differences are statistically detectable on this pilot set. We interpret these results as evidence that MCV can reveal prompt-conditioned distributional instability that categorical agreement alone would miss. The current evidence supports "moral mimicry" only as an interpretive hypothesis, not as an established mechanism.
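The three per-scenario quantities named above (mean KL divergence, mean JSD, and decision flip rate) can be illustrated with a minimal sketch. It is not the benchmark's reference implementation. Several details are assumptions because the abstract does not specify them: the KL direction is taken as KL(baseline || perturbed), logarithms are natural, probabilities are clipped by a small `eps` to avoid division by zero, and a "decision" is the action with probability at least 0.5. The function names are hypothetical.

```python
import math


def kl_bernoulli(p, q, eps=1e-12):
    """KL(P || Q) in nats for two binary distributions.

    p and q are the probabilities of choosing the first action.
    Clipping with eps is an assumption made here to keep the log finite.
    """
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def jsd_bernoulli(p, q, eps=1e-12):
    """Jensen-Shannon divergence: the symmetrised KL to the midpoint mixture."""
    m = 0.5 * (p + q)
    return 0.5 * kl_bernoulli(p, m, eps) + 0.5 * kl_bernoulli(q, m, eps)


def scenario_metrics(baseline, perturbed):
    """Average the divergences over the perturbed prompts of one scenario.

    baseline:  P(first action) under the unperturbed prompt.
    perturbed: list of P(first action), one per perturbation wrapper.
    Assumed direction: KL(baseline || perturbed).
    """
    n = len(perturbed)
    mcv = sum(kl_bernoulli(baseline, q) for q in perturbed) / n
    jsd = sum(jsd_bernoulli(baseline, q) for q in perturbed) / n
    # A flip is counted when the majority action differs from the baseline's
    # (the 0.5 threshold is an assumption of this sketch).
    base_choice = baseline >= 0.5
    flip_rate = sum((q >= 0.5) != base_choice for q in perturbed) / n
    return {"mcv": mcv, "jsd": jsd, "flip_rate": flip_rate}


if __name__ == "__main__":
    # Illustrative probabilities only; these are not values from the paper.
    print(scenario_metrics(0.90, [0.88, 0.93, 0.80, 0.91, 0.86]))
```

In this sketch a scenario can have a nonzero KL-based score with a flip rate of zero, because the probabilities shift without crossing the 0.5 threshold. That is the kind of distributional movement the abstract says categorical agreement alone would miss.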
Qiao Liang (Tue,) studied this question.