Background: Large language models (LLMs) are being rapidly incorporated into medical education and examination preparation; yet, most benchmarking evidence is derived from English-language material. Whether frontier commercial models and Brazilian Portuguese domain-specialized systems perform equivalently on high-stakes Brazilian medical examinations remains unclear.

Objective: This study aims to quantify and compare the performance of 9 frontier commercial LLMs and 1 Brazilian Portuguese domain-specialized system (Charcot, Voa Health) on the 2026 Brazilian National Medical Education Examination (Exame Nacional de Avaliação da Formação Médica; ENAMED 2026) and to describe the patterns of systematic between-model error as a complementary quality signal.

Methods: All 100 items of ENAMED 2026 (99 valid after annulment) were administered to 10 frontier-panel models across 5 independent runs under identical Portuguese prompts (temperature=0; top-p=.95). Commercial models were accessed through a unified OpenRouter client layer (DeepSeek provider-pinned). The primary outcome was mean accuracy against the preliminary key; the secondary outcomes were convergence error (CE), normalized mean response time (NMRT), and intermodel agreement. Accuracy was analyzed with Shapiro-Wilk, Levene, Kruskal-Wallis (ε²), Dunn-Holm post hoc, and a binomial generalized linear mixed model with question and run random intercepts. NMRT excluded Charcot (different stack) and Grok 4 (latency outlier). A total of 7 open-weight and small language models were assessed as a small language model (SLM) substudy.

Results: Frontier-panel accuracy ranged from 73.74% (365/495) for GPT-4o-mini to 96.97% (480/495) for Charcot. Accuracy was nonnormal (Shapiro-Wilk, W=0.82; P<.001) with homogeneous variance (Levene P=.26). Kruskal-Wallis showed large between-model differences (H(9)=47.65; P<.001; ε²=0.97).
Dunn-Holm flagged 8 of 45 pairs: Charcot was separable from GPT-4o-mini, DeepSeek v3.2-exp, and Grok 4, but not from the top frontier cluster. The generalized linear mixed model preserved the ranking (all comparators had an odds ratio <1 vs Charcot, with the upper CI bound <1 for all except GPT-5). A total of 9 items met the default CE criterion; item 77 showed 10-of-10 convergence, later confirmed by Instituto Nacional de Estudos e Pesquisas Educacionais Anísio Teixeira rectification, and sensitivity analysis preserved the CE set (5-20 items). Intermodel agreement was high (Fleiss κ=0.852; Krippendorff α=.852). Among 8 retained commercial models, NMRT correlated positively with accuracy (Spearman ρ=0.74; P=.04); these results are interpreted as descriptive of an orchestration-level latency–accuracy law. SLM accuracy ranged from 47.47% (Gemma 3 4B) to 82.22% (GPT-OSS 120B), with lower agreement (Fleiss κ=0.508) and 17 CE items. The best SLM-panel model lagged every frontier-panel model except GPT-4o-mini, with a 12-15 percentage-point gap against the top cluster.

Conclusions: On ENAMED 2026, a Brazilian Portuguese domain-specialized system ranked first, indistinguishable from the frontier commercial cluster and above subfrontier and open-weight systems. Charcot’s architecture is not publicly disclosed; these findings should be interpreted as comparative black-box evidence of performance and not as mechanistic evidence of specialization. CE was stable, and it prospectively flagged 1 rectified item, supporting its use as a quality assurance screen.
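
The headline figures in the abstract can be cross-checked with simple arithmetic. The Python sketch below is a minimal illustration and not the authors' analysis code. It assumes that the denominator of 495 corresponds to 99 valid items × 5 runs, that the Kruskal-Wallis test used one accuracy value per model per run (n=50), and that ε² follows the common definition H/(n−1). All function and constant names are hypothetical.

```python
from math import comb

# Design constants taken from the abstract.
VALID_ITEMS = 99   # 100 items minus 1 annulled
RUNS = 5           # independent runs per model
MODELS = 10        # frontier-panel models


def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to 2 decimals."""
    return round(100 * correct / total, 2)


def epsilon_squared(h_stat: float, n_obs: int) -> float:
    """Epsilon-squared effect size for Kruskal-Wallis.

    Assumption: the common definition H / (n - 1), where n is the
    total number of observations across all groups.
    """
    return h_stat / (n_obs - 1)


# Responses scored per model: items x runs.
responses_per_model = VALID_ITEMS * RUNS

# Assumption: one accuracy observation per model per run.
n_observations = MODELS * RUNS

# Number of pairwise comparisons in the Dunn-Holm post hoc step.
n_pairs = comb(MODELS, 2)

if __name__ == "__main__":
    print("responses per model:", responses_per_model)
    print("GPT-4o-mini accuracy:", accuracy_pct(365, responses_per_model))
    print("Charcot accuracy:", accuracy_pct(480, responses_per_model))
    print("epsilon squared:", round(epsilon_squared(47.65, n_observations), 2))
    print("pairwise comparisons:", n_pairs)
```

Under these assumptions, the recomputed values agree with the reported 73.74%, 96.97%, ε²=0.97, and 45 pairs, which suggests that the effect size was computed on model-by-run accuracies rather than on individual item responses.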
Silva et al. (Fri,) studied this question.