March 1, 2024 | Open Access

Can large language models reason about medical questions?

Abstract

Although large language models often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama 2, etc.) can be applied to answer and reason about difficult real-world-based questions. We focus on three popular medical benchmarks (MedQA-USMLE, based on the US Medical Licensing Examination; MedMCQA; and PubMedQA) and multiple prompting scenarios: chain of thought (CoT; think step by step), few shot, and retrieval augmentation. Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason, and recall expert knowledge. Last, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions but also reaches the passing score on three datasets: MedQA-USMLE (60.2%), MedMCQA (62.7%), and PubMedQA (78.2%). Open-source models are closing the gap: Llama 2 70B also passed the MedQA-USMLE with 62.5% accuracy.
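
The ensemble idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes that several chain-of-thought completions have already been sampled for one multiple-choice question and that each has been reduced to an answer label; the function names `answer_distribution` and `predict` are hypothetical and are not taken from the paper. Vote frequencies are used here as a simple stand-in for a predictive distribution, which may differ from the authors' exact procedure.

```python
from collections import Counter


def answer_distribution(sampled_answers, options):
    """Convert sampled answer labels into vote frequencies over the options.

    Labels that are not valid options (e.g. unparsable completions) are ignored.
    """
    counts = Counter(a for a in sampled_answers if a in options)
    total = sum(counts.values())
    if total == 0:
        # No usable sample: fall back to a uniform distribution (an assumption).
        return {o: 1.0 / len(options) for o in options}
    return {o: counts[o] / total for o in options}


def predict(sampled_answers, options):
    """Return the most frequent option together with the full distribution."""
    dist = answer_distribution(sampled_answers, options)
    best = max(options, key=lambda o: dist[o])
    return best, dist


# Example: five sampled chains of thought for a four-option question.
label, dist = predict(["A", "B", "A", "C", "A"], ["A", "B", "C", "D"])
print(label, dist)
```

The vote share of the chosen option can then be compared against observed accuracy across many questions to check how well calibrated the ensemble is.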

Authors

Valentin Liévin

Christoffer Hother

Andreas Geert Motzfeldt

Journals

Patterns

Institutions

University of Copenhagen

Technical University of Denmark

Rigshospitalet