What question did this study set out to answer?

This research aims to assess the performance of large language models compared to traditional methods in clinical prediction tasks.

April 10, 2026 · Open Access

ClinicRealm: Re-evaluating large language models with conventional machine learning for non-generative clinical prediction tasks

Key Points

Benchmark analysis of 15 GPT-style and 5 BERT-style models
Evaluation of 11 conventional machine learning methods
Use of unstructured clinical notes and structured Electronic Health Records for testing
Focus on predictive performance, reasoning, and fairness metrics
Leading zero-shot LLMs outperform fine-tuned BERT models on clinical notes
Advanced LLMs show strong zero-shot performance in data-scarce environments
Open-source LLMs perform comparably to proprietary models in clinical tasks

Abstract

Large Language Models (LLMs) are increasingly deployed in medicine. However, their utility for non-generative clinical prediction is under-evaluated, and they are often assumed to be inferior to specialized models, creating potential for misuse and misunderstanding. To address this, our ClinicRealm benchmark systematically evaluates 15 GPT-style LLMs, 5 BERT-style models, and 11 traditional methods on unstructured clinical notes and structured Electronic Health Records (EHR) across predictive performance, reasoning, fairness, etc. Our findings reveal a significant shift: on clinical notes, leading zero-shot LLMs (e.g., DeepSeek-V3.1-Think, GPT-5) now decisively outperform fine-tuned BERT models. On structured EHRs, while specialized models excel with ample data, advanced LLMs demonstrate potent zero-shot capabilities, often surpassing conventional models in data-scarce settings. Notably, leading open-source LLMs match or exceed their proprietary counterparts. This provides compelling evidence that modern LLMs are competitive tools for clinical prediction, necessitating a re-evaluation of model selection strategies by health data scientists and developers.

Cite this study

Zhu et al. studied this question.

www.synapsesocial.com/papers/69d895d86c1944d70ce06f67 — DOI: https://doi.org/10.1038/s41746-026-02539-z

Also consider

Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context:

Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study · 2024 · 105 citations
Fine-Tuned 'Small' LLMs (Still) Significantly Outperform Zero-Shot Generative AI Models in Text Classification · 2024 · 33 citations
A comprehensive benchmark for COVID-19 predictive modeling using electronic health records in intensive care · 2024 · 17 citations
DeepSeek-V3 Technical Report · 2024 · 222 citations

Authors

Yinghao Zhu

Junyi Gao

Zixiang Wang

Journals

npj Digital Medicine