March 3, 2026 · Open Access

A large-scale benchmark for evaluating large language models on medical question answering in Romanian

Key points

Fine-tuned models show superior performance in medical question answering compared to zero-shot models, highlighting the need for specialized adaptation.
Our benchmark includes 105,880 QA pairs focused on cancer patients, allowing for a thorough evaluation of model capabilities.
The analysis evaluates multiple language models in both zero-shot and fine-tuning scenarios, testing their responses to medical queries.
Findings indicate that domain-specific fine-tuning is crucial for enhancing the reliability of clinical QA in Romanian.

Abstract

We introduce MedQARo, the first large-scale medical QA benchmark in Romanian, alongside a comprehensive evaluation of state-of-the-art large language models (LLMs). We construct a high-quality, large-scale dataset comprising 105,880 QA pairs about cancer patients from two medical centers. The questions concern medical case summaries of 1,242 patients, requiring both keyword extraction and reasoning. Our benchmark contains both in-domain and cross-domain (cross-center and cross-cancer) test collections, enabling a precise assessment of generalization capabilities. We experiment with four open-source LLMs from distinct families of models on MedQARo. Each model is employed in two scenarios: zero-shot prompting and supervised fine-tuning. We also evaluate two state-of-the-art LLMs exposed only through APIs, namely GPT-5.2 and Gemini 3 Flash. Our results show that fine-tuned models significantly outperform zero-shot models, indicating that pretrained models fail to generalize on MedQARo. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian.

Cite this study

Rogoz et al. (2026) studied this question.

www.synapsesocial.com/papers/69a76175c6e9836116a2f779 — DOI: https://doi.org/10.1038/s41746-026-02465-0

Authors

Ana-Cristina Rogoz

Radu Tudor Ionescu

Alexandra-Valentina Anghel

Journals

npj Digital Medicine

Institutions

University of Bucharest

Carol Davila University of Medicine and Pharmacy

Clinical Emergency Hospital Bucharest
