May 8, 2026 | Open Access

Performance of Large Language Models on the Brazilian National Medical Education Examination: Comparative Benchmark Study

Abstract

Background: Large language models (LLMs) are being rapidly incorporated into medical education and examination preparation, yet most benchmarking evidence is derived from English-language material. Whether frontier commercial models and Brazilian Portuguese domain-specialized systems perform equivalently on high-stakes Brazilian medical examinations remains unclear.

Objective: This study aims to quantify and compare the performance of 9 frontier commercial LLMs and 1 Brazilian Portuguese domain-specialized system (Charcot, Voa Health) on the 2026 Brazilian National Medical Education Examination (Exame Nacional de Avaliação da Formação Médica [ENAMED] 2026) and to describe the patterns of systematic between-model error as a complementary quality signal.

Methods: All 100 items of ENAMED 2026 (99 valid after annulment) were administered to 10 frontier-panel models across 5 independent runs under identical Portuguese prompts (temperature=0; top-p=.95). Commercial models were accessed through a unified OpenRouter client layer (DeepSeek provider-pinned). The primary outcome was mean accuracy against the preliminary key; the secondary outcomes were convergence error (CE), normalized mean response time (NMRT), and intermodel agreement. Accuracy was analyzed with Shapiro-Wilk, Levene, and Kruskal-Wallis (ε²) tests, Dunn-Holm post hoc comparisons, and a binomial generalized linear mixed model with random intercepts for question and run. NMRT analyses excluded Charcot (different stack) and Grok 4 (latency outlier). A total of 7 open-weight and small language models were assessed as a small language model (SLM) substudy.

Results: Frontier-panel accuracy ranged from 73.74% (365/495) for GPT-4o-mini to 96.97% (480/495) for Charcot. Accuracy was nonnormal (Shapiro-Wilk, W=0.82; P<.001) with homogeneous variance (Levene P=.26). The Kruskal-Wallis test showed large between-model differences (H(9)=47.65; P<.001; ε²=0.97). Dunn-Holm comparisons flagged 8 of 45 pairs: Charcot was separable from GPT-4o-mini, DeepSeek v3.2-exp, and Grok 4, but not from the top frontier cluster. The generalized linear mixed model preserved the ranking (odds ratio<1 vs Charcot for all comparators; upper CI<1 except for GPT-5). A total of 9 items met the default CE criterion; item 77 showed 10-of-10 convergence, later confirmed by a rectification from the Instituto Nacional de Estudos e Pesquisas Educacionais Anísio Teixeira, and sensitivity analysis preserved the CE set (5-20 items). Intermodel agreement was high (Fleiss κ=0.852; Krippendorff α=.852). Among the 8 retained commercial models, NMRT correlated positively with accuracy (Spearman ρ=0.74; P=.04); these results are interpreted as descriptive of an orchestration-level latency–accuracy law. SLM accuracy ranged from 47.47% (Gemma 3 4B) to 82.22% (GPT-OSS 120B), with lower agreement (Fleiss κ=0.508) and 17 CE items. The best SLM-panel model lagged every frontier-panel model except GPT-4o-mini, with a gap of 12-15 percentage points against the top cluster.

Conclusions: On ENAMED 2026, a Brazilian Portuguese domain-specialized system ranked first, indistinguishable from the frontier commercial cluster and above the subfrontier and open-weight systems. Charcot's architecture is not publicly disclosed; these findings should be interpreted as comparative black-box evidence of performance and not as mechanistic evidence of specialization. CE was stable and prospectively flagged 1 item that was later rectified, supporting its use as a quality assurance screen.
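
The headline figures in the Results can be checked with simple arithmetic. The sketch below is illustrative only and is not the authors' analysis code. It assumes that each model's denominator is 99 valid items × 5 runs, and that the Kruskal-Wallis test was run on one accuracy value per model per run (10 models × 5 runs = 50 observations), which is consistent with the reported 9 degrees of freedom but is not stated in the abstract. The function names are hypothetical, and the effect size uses the common formula ε² = H(n+1)/(n²−1), which simplifies to H/(n−1).

```python
# Illustrative arithmetic check of figures reported in the abstract.
# Function names are hypothetical; the observation count is an assumption.

def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage of all scored responses."""
    return 100.0 * correct / total


def kruskal_epsilon_squared(h: float, n_obs: int) -> float:
    """Epsilon-squared effect size for a Kruskal-Wallis H statistic.

    Uses H * (n + 1) / (n**2 - 1), which equals H / (n - 1).
    """
    return h * (n_obs + 1) / (n_obs**2 - 1)


VALID_ITEMS = 99  # 100 items minus 1 annulled (from the abstract)
RUNS = 5          # independent runs per model (from the abstract)
MODELS = 10       # frontier-panel models (from the abstract)

total_responses = VALID_ITEMS * RUNS  # scored responses per model
n_obs = MODELS * RUNS                 # ASSUMED: one accuracy value per model per run

print(total_responses)
print(round(accuracy_pct(365, total_responses), 2))     # GPT-4o-mini
print(round(accuracy_pct(480, total_responses), 2))     # Charcot
print(round(kruskal_epsilon_squared(47.65, n_obs), 2))  # Kruskal-Wallis effect size
```

Under these assumptions, the computed values match the reported 73.74%, 96.97%, and ε²=0.97.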
