What question did this study set out to answer?

To examine the quality of ChatGPT-3.5 and ChatGPT-4.0 responses to common questions about scoliosis.

April 10, 2026 · Open Access

An Assessment of GPT-3.5 and GPT-4.0 Responses to Scoliosis FAQs

Key Points

To examine the quality of ChatGPT-3.5 and ChatGPT-4.0 responses to common questions about scoliosis.
Selected 10 frequently asked scoliosis questions from a pool of over 250.
Conducted expert reviews to harmonize and finalize questions based on themes.
Rated chatbot responses using a scale for clarity and comprehension.
Median ratings for both models ranged from satisfactory to needing moderate clarification.
No statistically detectable difference in performance was observed between the two models.
Findings indicate that both models frequently required clarification from clinicians.

Abstract

Background: ChatGPT is a large language model (LLM) chatbot developed by OpenAI and launched online in November 2022. Early adoption studies have shown high readiness to use this technology for health-related questions and self-diagnosis. However, the quality and clinical adequacy of its health-related responses remain incompletely characterized. This study aimed to explore responses generated by ChatGPT-3.5 and ChatGPT-4.0 to common patient questions regarding scoliosis.

Methods: Ten scoliosis-related frequently asked questions (FAQs) were selected from a pool of over 250 patient-facing questions compiled from 17 publicly available FAQ webpages and informed by a Google Trends analysis. Questions were harmonized, grouped by theme, and then reduced by rule-based expert review to a final set intended to represent common patient concerns.

Results: The median ratings of ChatGPT-3.5 and ChatGPT-4.0 responses fell in the satisfactory range, from requiring minimal clarification (2) to requiring moderate clarification (3). Across the ten matched questions, no statistically detectable difference was found between models in this study setting (W = 8.0, p = 0.59; Cliff’s δ = −0.12, 95% CI −0.58 to 0.40). However, given the small question set, unblinded rating process, and poor inter-rater reliability, this should not be interpreted as evidence of equivalence, non-inferiority, or comparable model performance. The results apply only to the 10–15 April 2024 online snapshots of ChatGPT-3.5 and ChatGPT-4.0 and should not be generalized to later model iterations.

Conclusions: This study should be interpreted as a clinically oriented observational report, intended to inform physician awareness and patient-physician communication rather than to validate chatbot accuracy or safety. In this 10–15 April 2024 sample, outputs from both models frequently required clinician clarification. Given the small FAQ set, low inter-rater reliability, unblinded design, and single-sample outputs, the findings do not establish equivalence or superiority and apply only to the specific 10–15 April 2024 model snapshots and evaluated questions.
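For readers unfamiliar with the statistics quoted in the Results, the following is a minimal sketch of how a paired rank statistic and Cliff's δ can be computed for ten matched ratings. The abstract does not name the test behind W; a Wilcoxon signed-rank statistic is assumed here because the questions are matched. The ratings in `gpt35` and `gpt40` are invented for illustration and are not the study's data, so the outputs do not reproduce the reported values.

```python
from itertools import product

def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross pairs of ratings."""
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    less = sum(1 for x, y in product(xs, ys) if x < y)
    return (greater - less) / (len(xs) * len(ys))

def wilcoxon_w(xs, ys):
    """Signed-rank statistic W = min(W+, W-) for paired samples.

    Zero differences are dropped and tied absolute differences
    receive the average of the ranks they span.
    """
    diffs = sorted((x - y for x, y in zip(xs, ys) if x != y), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        while j < len(diffs) and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)

# Hypothetical median ratings per question (2 = minimal, 3 = moderate clarification).
gpt35 = [2, 3, 2, 3, 2, 3, 3, 2, 2, 3]
gpt40 = [2, 2, 3, 3, 2, 3, 3, 2, 3, 3]

print("W =", wilcoxon_w(gpt35, gpt40))
print("Cliff's delta =", cliffs_delta(gpt35, gpt40))
```

With ordinal ratings on a narrow scale, most paired differences are zero or tied, which leaves very few informative pairs. This is one reason a non-significant result on ten questions says little about equivalence.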

Cite this study

Vu-Han et al. studied this question.

www.synapsesocial.com/papers/69d895206c1944d70ce062ae — DOI: https://doi.org/10.3390/jpm16040206

Authors

Tu-Lan Vu-Han

Enikő Regényi

Vikram Sunkara

Journals

Journal of Personalized Medicine

Institutions

Cornell University

Charité - Universitätsmedizin Berlin

Hospital for Special Surgery
