April 10, 2026 | Open Access

Measurement Matters: Embedding Model Choice Determines Encoding Fidelity Assessment in Multilingual Clinical AI

Key Points

This research examines whether the degradation in encoding fidelity for certain languages is due to the choice of embedding models.
The study applied the same 15-sentence clinical battery across 6 embedding models.
The models ranged from monolingual-distilled (all-MiniLM-L6-v2) to language-agnostic (LaBSE).
Both encoding fidelity (EFI) and variance amplification were assessed under each model.
EFI depends strongly on the embedding model used: Kannada EFI is 0.081 under MiniLM and 0.853 under LaBSE.
The Dravidian-specific EFI gap nearly disappears under LaBSE, indicating near script-invariance with the appropriate measurement tool.
Indic-specific masked language models (MuRIL, IndicBERTv2) are degenerate as sentence encoders and cannot be used to measure EFI.
Variance amplification persists across all five non-degenerate embedding models, confirming it is intrinsic to the LLM rather than a measurement artefact.
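
As a quick sanity check on the headline figures, the short Python sketch below recomputes the fold difference in Kannada EFI and the relative reduction in the Indic–European gap from the values quoted in the key points and abstract. The helper names (`fold_change`, `relative_reduction`) are illustrative only and do not come from the paper.

```python
# Values quoted in the key points and abstract of this page.
kannada_efi = {"MiniLM": 0.081, "LaBSE": 0.853}
indic_european_gap = {"MiniLM": 0.33, "LaBSE": 0.035}


def fold_change(low: float, high: float) -> float:
    """How many times larger `high` is than `low`."""
    return high / low


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a quantity shrinks going from `before` to `after`."""
    return (before - after) / before


efi_fold = fold_change(kannada_efi["MiniLM"], kannada_efi["LaBSE"])
gap_reduction = relative_reduction(
    indic_european_gap["MiniLM"], indic_european_gap["LaBSE"]
)

print(f"Kannada EFI, LaBSE vs MiniLM: {efi_fold:.1f}x")   # 10.5x
print(f"Indic-European gap reduction: {gap_reduction:.0%}")  # 89%
```

Both results agree with the rounded figures reported here (roughly 10× and roughly 90%).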

Abstract

Paper 8 of the MCH Research Program demonstrated ~90% Encoding Fidelity Index (EFI) degradation for Kannada, Tamil, and Hindi clinical input relative to English, measured using standard sentence embedding models (MiniLM, MPNet). This paper asks whether that finding is an artefact of measurement instrument choice. Using the same 15-sentence clinical battery across 6 embedding models ranging from monolingual-distilled (all-MiniLM-L6-v2) to language-agnostic (LaBSE), we find: (1) EFI is dramatically embedding-dependent — Kannada EFI ranges from 0.081 (MiniLM) to 0.853 (LaBSE), a 10× difference for identical input; (2) the Dravidian-specific EFI gap from Paper 8 nearly disappears under LaBSE (Indic–European gap: MiniLM = 0.33, LaBSE = 0.035; 90% reduction), indicating near script-invariance with the appropriate measurement tool; (3) Indic-specific masked language models (MuRIL, IndicBERTv2) are degenerate as sentence encoders — effective rank 1.42/10 versus 9.88/10 for MiniLM — and cannot measure EFI; any claim that "Indic models fix encoding" based on raw MLM embeddings is methodologically invalid; (4) critically, variance amplification persists across ALL five non-degenerate embedding models (Kannada VR > 1.0, p < 0.05), confirming it is LLM-intrinsic rather than a measurement artefact; (5) EFI and variance ratio are statistically independent under LaBSE (r = -0.18, p = 0.73), confirming they measure distinct phenomena with different causes. The central conclusion: LaBSE fixes the measured EFI gap (tokenizer/embedding problem, partially addressable); LaBSE does not fix variance amplification (LLM behaviour problem, requiring different intervention). These findings have direct implications for methodology in multilingual clinical AI evaluation.
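
The abstract does not spell out how effective rank is computed. One common definition, assumed here purely for illustration, is the exponential of the Shannon entropy of the normalised singular values of the embedding matrix. The sketch below applies that definition to synthetic data: a well-spread set of 10 vectors and a "collapsed" set in which every vector is nearly the same, which is the kind of degeneracy the abstract describes for raw MLM embeddings. The function name, the synthetic matrices, and the choice not to mean-centre are assumptions, not the paper's procedure.

```python
import numpy as np


def effective_rank(embeddings) -> float:
    """Entropy-based effective rank of a (n_sentences, dim) matrix.

    Assumed definition: exp(H(p)), where p is the vector of singular
    values normalised to sum to 1. The matrix is deliberately NOT
    mean-centred, so a shared dominant direction counts as collapse.
    """
    X = np.asarray(embeddings, dtype=float)
    s = np.linalg.svd(X, compute_uv=False)
    # Drop numerically zero singular values to avoid log(0).
    s = s[s > s.max() * 1e-12]
    p = s / s.sum()
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))


rng = np.random.default_rng(0)

# Synthetic stand-in for a healthy encoder: 10 roughly independent vectors.
healthy = rng.normal(size=(10, 384))

# Synthetic stand-in for a degenerate encoder: one shared direction plus
# a small amount of noise, so all 10 vectors are almost identical.
collapsed = np.ones((10, 384)) + 0.01 * rng.normal(size=(10, 384))

print(f"healthy:   {effective_rank(healthy):.2f} / 10")
print(f"collapsed: {effective_rank(collapsed):.2f} / 10")
```

With 10 input vectors the maximum attainable value is 10, which may be what the "/10" in the reported figures refers to, although the source does not say so. An encoder whose effective rank sits near 1 maps all sentences to almost the same point, so similarity-based metrics computed from it carry little information.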

Authors

M M LAXMAN

Institutions

Government Dental College & Research Institute
