Background: Large language models, such as OpenAI's ChatGPT, have the potential to revolutionize patient education. These platforms allow information to be collected, analyzed, and organized at a scale largely unavailable to online users. Within medicine, these tools could complement physicians in educating patients on both complex and routine medical information. Currently, limited literature exists on the reliability of such tools in providing high-quality information to patients inquiring about gender-affirming top surgery. Therefore, this study aimed to evaluate ChatGPT's performance when generating patient-level information on gender-affirming top surgery compared with the current online information provided by the American Society of Plastic Surgeons (ASPS), using the Modified Ensuring Quality Information for Patients (mEQIP) tool.

Methods: ChatGPT-4-generated patient-level education on transmasculine and transfeminine gender-affirming top surgery was compared against current online content provided by the ASPS. ChatGPT-4 patient content was generated by individually formatting standardized mEQIP content items to incorporate the topic of gender-affirming top surgery and entering them into ChatGPT-4, with responses recorded for each item. Four experts in gender-affirming top surgery independently rated both sources using a 36-item mEQIP tool. Paired t-tests were then performed to compare the overall and content-specific mEQIP scores of the ChatGPT-4 and ASPS material. The effect size between the two groups was evaluated using Cohen's d. Lastly, Cronbach's alpha and the intraclass correlation coefficient (ICC) were calculated to measure internal consistency among raters and interrater agreement, respectively.

Results: When comparing ChatGPT-4 and ASPS patient material, paired t-tests showed statistically significantly higher overall mEQIP scores for ChatGPT, with a mean difference of 7.50 (CI 6.75-8.25; p<0.001). For the mEQIP content-specific scores, a paired t-test revealed similarly significantly higher ChatGPT scores, with a mean difference of 9.75 (CI 9.26-10.24; p<0.001). The paired Cohen's d was 13.00, indicating a large difference in magnitude between the two groups. An ICC and Cronbach's alpha were calculated for both the ASPS and ChatGPT ratings. The ASPS-mEQIP showed good internal consistency and excellent interrater reliability (ICC=0.89, Cronbach's α=0.84), while the ChatGPT-mEQIP showed excellent internal consistency and excellent interrater reliability (ICC=0.96, Cronbach's α=0.96).

Conclusion: These results demonstrate that ChatGPT-4-generated patient education on gender-affirming top surgery exceeded the current ASPS online content in both overall and content-specific scores, as measured by the mEQIP tool. ChatGPT achieved significantly higher scores across both domains with large effect sizes, and raters demonstrated good-to-excellent internal consistency and excellent interrater reliability. Going forward, commonly accessible artificial intelligence (AI) tools, such as ChatGPT, may serve as a valuable complement to patient education and shared decision-making within plastic and reconstructive surgery, though future studies are warranted to evaluate freely generated responses to better reflect current AI use.
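As a rough illustration of the statistics named in the Methods, the sketch below computes a paired mean difference, paired t statistic, paired Cohen's d, and Cronbach's alpha using only the Python standard library. All numbers are hypothetical and are not the study's data. The abstract does not say which paired variant of Cohen's d was used, so the sketch assumes the common form (mean of the differences divided by their standard deviation). Treating raters as the "items" in Cronbach's alpha is likewise an assumption. The ICC is omitted because the abstract does not state which ICC model was applied. The function names `paired_summary` and `cronbach_alpha` are illustrative only.

```python
import math
import statistics


def paired_summary(a, b):
    """Return (mean difference, paired t statistic, paired Cohen's d) for b - a."""
    diffs = [y - x for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD of the paired differences
    n = len(diffs)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))  # df = n - 1
    cohen_d = mean_diff / sd_diff  # assumed variant: d_z
    return mean_diff, t_stat, cohen_d


def cronbach_alpha(rows):
    """Cronbach's alpha for a matrix: rows = scored items, columns = raters."""
    k = len(rows[0])  # number of raters (treated as the scale "items")
    columns = list(zip(*rows))
    sum_rater_vars = sum(statistics.variance(col) for col in columns)
    total_var = statistics.variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum_rater_vars / total_var)


# Hypothetical total scores from four raters for two sources (not study data).
source_a = [20, 22, 19, 23]
source_b = [27, 30, 26, 29]
mean_diff, t_stat, cohen_d = paired_summary(source_a, source_b)

# Hypothetical yes/no ratings: six scored items by four raters (not study data).
ratings = [
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(ratings)

print(f"mean diff={mean_diff:.2f}, t={t_stat:.2f}, d={cohen_d:.2f}, alpha={alpha:.2f}")
```

Under this definition the paired t statistic equals Cohen's d multiplied by the square root of the number of pairs, so with only four raters a small spread in the paired differences yields a very large d.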
Campos et al. studied this question.
www.synapsesocial.com/papers/69d893406c1944d70ce04503 — DOI: https://doi.org/10.7759/cureus.106607
Adrian O Campos
Mohammed Almeflehi
Sean Kim
Cureus