What question did this study set out to answer?

To evaluate if feedback from a large language model can effectively improve clinical documentation quality compared to traditional physician feedback.

April 17, 2026

Leveraging a Large Language Model to Generate Quality Improvement Feedback for Clinical Notes

Key Points

Objective: to evaluate whether feedback generated by a large language model on clinical documentation quality is non-inferior to feedback from physicians.
Design: a cross-sectional study comparing GPT-4 feedback and attending-physician feedback on inpatient progress notes.
Sample: 64 randomly sampled inpatient progress notes that the validated AI Audit Tool identified as having a low likelihood of containing medical contingency and discharge planning (MCDP).
Evaluation: A/B testing of understandability, usefulness, acceptability, and impartiality, rated on 5-point Likert scales that were converted to 10-point bidirectional interval scales (-10 to +10).
GPT-4 feedback was non-inferior to physician feedback on all evaluated measures (positive scores favor GPT-4; non-inferiority threshold of -1):
Understandability: mean 1.27 (95% CI 0.73 to 1.80, P < 0.001)
Usefulness: mean 2.09 (95% CI 1.27 to 2.91, P < 0.001)
Acceptability: mean 2.07 (95% CI 1.33 to 2.81, P < 0.001)
Impartiality: mean -0.20 (95% CI -0.52 to 0.12, P < 0.001)

Abstract

Background: Poor documentation quality can significantly affect healthcare operations, but the feedback process that helps clinicians improve their clinical notes is time-consuming and often insufficient. Large language models (LLMs) such as Generative Pre-trained Transformer 4 (GPT-4) have the potential to streamline this process.

Objectives: To determine whether an LLM can generate feedback for improving the medical contingency and discharge planning (MCDP) component of clinical documentation that is non-inferior to feedback from physicians.

Methods: A cross-sectional study of GPT-4 feedback and physician feedback on inpatient progress notes was conducted. A random sample of 64 inpatient progress notes, identified by the validated AI Audit Tool as having a low likelihood of containing MCDP, was drawn from adult general medicine patients hospitalized at New York University Langone Health (NYULH) in December 2023. Both the GPT-4 model and attending physicians generated feedback on these notes. A/B testing was then conducted on the measures of understandability, usefulness, acceptability, and impartiality. Evaluations used 5-point Likert scales that were converted to 10-point bidirectional interval scales for interpretability, ranging from -10 (human suggestions significantly better) to +10 (GPT-4 suggestions significantly better), with a non-inferiority threshold of -1 for the primary endpoint.

Results: A total of 64 inpatient progress notes were included, representing 55% female patients with a median age of 73. GPT-4 feedback was non-inferior to physician feedback on all measures: understandability (mean 1.27, 95% CI 0.73 to 1.80, P < 0.001), usefulness (mean 2.09, 95% CI 1.27 to 2.91, P < 0.001), acceptability (mean 2.07, 95% CI 1.33 to 2.81, P < 0.001), and impartiality (mean -0.20, 95% CI -0.52 to 0.12, P < 0.001).

Conclusions: This study shows that an LLM can be leveraged to generate note quality feedback that is non-inferior to expert clinician feedback.
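The non-inferiority reading of these results can be illustrated with a short sketch. It is not the authors' analysis code: it simply takes the means and confidence intervals reported above and checks whether the lower bound of each 95% CI lies above the margin of -1. The abstract states that margin for the primary endpoint; applying the same margin to every measure is an assumption made here for illustration, and the names `Measure`, `REPORTED`, and `is_non_inferior` are hypothetical.

```python
from dataclasses import dataclass

# Non-inferiority threshold from the abstract (stated for the primary endpoint;
# using it for all four measures is an assumption of this sketch).
NON_INFERIORITY_MARGIN = -1.0


@dataclass(frozen=True)
class Measure:
    """One reported outcome on the -10..+10 bidirectional scale."""
    name: str
    mean: float
    ci_low: float   # lower bound of the reported 95% CI
    ci_high: float  # upper bound of the reported 95% CI


# Values copied from the Results section of the abstract.
REPORTED = [
    Measure("understandability", 1.27, 0.73, 1.80),
    Measure("usefulness", 2.09, 1.27, 2.91),
    Measure("acceptability", 2.07, 1.33, 2.81),
    Measure("impartiality", -0.20, -0.52, 0.12),
]


def is_non_inferior(m: Measure, margin: float = NON_INFERIORITY_MARGIN) -> bool:
    """GPT-4 feedback is non-inferior if the whole CI lies above the margin."""
    return m.ci_low > margin


results = {m.name: is_non_inferior(m) for m in REPORTED}

for name, ok in results.items():
    print(f"{name}: {'non-inferior' if ok else 'inconclusive'}")
```

Under this reading, impartiality has a slightly negative mean, but its lower confidence bound (-0.52) still sits above -1, which is consistent with the reported non-inferiority on all four measures.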

Authors

Chris Kim

Joseph Gelfinbein

Nihan Gencerliler

Journals

Applied Clinical Informatics

Institutions

New York University

NYU Langone Health

Winthrop-University Hospital
