What question did this study set out to answer?

This research aims to evaluate how well ChatGPT agrees with burn surgeons in assessing the severity of burn injuries.

April 10, 2026 | Open Access

Evaluating ChatGPT’s Potential as a Burn Evaluation Tool in Acute Care Settings

Key Points

The study used a dataset of 20 burn images that a burn surgeon had previously classified by depth.
Burn surgeons and ChatGPT were surveyed on burn thickness, need for referral, and treatment approach.
Inter-rater reliability was measured with Fleiss’s and Cohen’s kappa, first among the surgeons and then between the surgeon consensus and ChatGPT.
Agreement among the three respondents was 80% (Fleiss’s kappa = 0.766 ± 0.046, p < .001), indicating substantial agreement.
ChatGPT matched the surgeon consensus in 80% of cases, with moderate agreement (Cohen’s kappa = 0.413 ± 0.065, p < .001).
Alignment was highest for deep partial and full-thickness burns and lowest for superficial burns.

Abstract

Introduction: ChatGPT is a widely accessible and versatile large language model. It has the potential to aid burn diagnosis by helping assess severity and guide initial care, particularly for first responders or in underserved areas with limited access to specialists. Treatment depends on burn depth as a measure of injury severity. This study evaluates how well ChatGPT agrees with burn surgeons in identifying and managing burns, to assess its potential in acute care.

Methods: A dataset of 20 burn images, previously classified by a burn surgeon by depth, was used to survey other burn surgeons and ChatGPT on burn thickness, need for referral, and treatment approach. Inter-rater reliability was measured using Fleiss’s and Cohen’s kappa and percent agreement, comparing the surgeons’ responses with one another and then the surgeon consensus with ChatGPT’s responses.

Results: There was 80% agreement among the three respondents (Fleiss’s kappa = 0.766 ± 0.046, p < .001), indicating substantial agreement. ChatGPT matched the consensus in 80% of cases (n = 48) with moderate but significant agreement (Cohen’s kappa = 0.413 ± 0.065, p < .001). Alignment was highest for deep partial (100%; n = 1) and full-thickness burns (86%; n = 6), and lowest for superficial burns (0%). For burn center referrals, agreement was 88% (n = 14) when referral was indicated and 75% (n = 3) when it was not. ChatGPT and respondents fully agreed on conservative treatment (100%; n = 9) and showed 73% agreement (n = 8) for operative cases.

Conclusions: ChatGPT aligned best with burn surgeons on clearly defined burn depths such as deep partial and full thickness but struggled with superficial classifications. It performed well in identifying burns needing conservative treatment.

Applicability of Research to Practice: ChatGPT’s strong agreement with burn surgeons supports its potential as a low-cost, accessible tool for burn evaluation in prehospital and acute care settings.

Funding for the study: N/A.
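
The agreement statistics named in the Methods (percent agreement, Cohen’s kappa for two raters, Fleiss’s kappa for several raters) can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function names and the depth labels in the example are hypothetical, the ratings are made up for illustration and are not the study's data, and the sketch computes point estimates only, without the uncertainty or p-values reported above.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two raters gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(a)
    p_o = percent_agreement(a, b)  # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if each rater labelled independently
    # according to their own label frequencies.
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    if p_e == 1:
        raise ValueError("kappa is undefined when both raters use one label")
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(ratings):
    """Fleiss's kappa for a list of items, each rated by the same number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    totals = Counter()   # how often each label was used overall
    agreement_sum = 0.0  # sum of per-item agreement P_i
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        agreement_sum += (sum(v * v for v in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar = agreement_sum / n_items
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1:
        raise ValueError("kappa is undefined when only one label is used")
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Made-up ratings for five images (illustrative only, not study data).
    consensus = ["full", "full", "deep_partial", "superficial", "full"]
    model = ["full", "full", "deep_partial", "deep_partial", "full"]
    print("percent agreement:", percent_agreement(consensus, model))
    print("Cohen's kappa:", cohens_kappa(consensus, model))

    # Three hypothetical raters per image.
    panel = [
        ["full", "full", "full"],
        ["full", "full", "deep_partial"],
        ["deep_partial", "deep_partial", "deep_partial"],
        ["superficial", "deep_partial", "superficial"],
    ]
    print("Fleiss's kappa:", fleiss_kappa(panel))
```

Because kappa subtracts the agreement expected by chance, a high percent agreement can coexist with a lower kappa when one category dominates the ratings, which is one way a result such as 80% agreement alongside a kappa of 0.413 can arise.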

Authors

Francesco M. Egro

Alexis Henderson

Hilary Liu

Journals

Journal of Burn Care & Research

Institutions

University of Pittsburgh Medical Center

Mercy Medical Center
