What question did this study set out to answer?

This study evaluates the quality of patient education generated by ChatGPT-4 for gender-affirming top surgery compared to ASPS online information.

April 10, 2026 · Open Access

Assessing the Quality of Artificial Intelligence (AI)-Generated Patient Education for Gender-Affirming Top Surgery Using the Modified Ensuring Quality Information for Patients (mEQIP) Tool

Key Points

The study compared ChatGPT-4-generated patient education on gender-affirming top surgery with ASPS online materials using the mEQIP tool.
Four experts independently rated both sources with the 36-item mEQIP tool.
Paired t-tests compared overall and content-specific mEQIP scores.
Effect size was assessed with Cohen's d; internal consistency and interrater reliability were assessed with Cronbach's alpha and the intraclass correlation coefficient (ICC).
ChatGPT-4 scored significantly higher on overall mEQIP, with a mean difference of 7.50 (CI 6.75-8.25; p<0.001).
ChatGPT-4 also scored significantly higher on content-specific items, with a mean difference of 9.75 (CI 9.26-10.24; p<0.001).
A paired Cohen's d of 13.00 indicated a large effect size between the two sources.
Interrater reliability was excellent for both sources; internal consistency was good for ASPS (ICC=0.89, Cronbach's α=0.84) and excellent for ChatGPT (ICC=0.96, Cronbach's α=0.96).
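
For readers unfamiliar with the statistics named in the key points, the following is a minimal sketch of how a paired mean difference, a paired t statistic, and a paired Cohen's d can be computed. The function name `paired_summary` and the score lists are hypothetical and are not the study's data or code; the effect size shown is the paired variant (mean difference divided by the standard deviation of the differences), which is an assumption about the variant the authors used.

```python
import math
import statistics


def paired_summary(a, b):
    """Return (mean difference, paired t statistic, paired Cohen's d) for b - a."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    # Work on the per-rater differences, as a paired design requires.
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))  # n - 1 degrees of freedom
    d_z = mean_diff / sd_diff  # paired Cohen's d
    return mean_diff, t_stat, d_z


# Hypothetical total scores from four raters; NOT the values reported in the study.
asps_scores = [20, 22, 19, 21]
chatgpt_scores = [26, 29, 27, 27]

mean_diff, t_stat, d_z = paired_summary(asps_scores, chatgpt_scores)
print(f"mean difference = {mean_diff:.2f}, t = {t_stat:.2f}, d = {d_z:.2f}")
```

With only four raters, a small spread in the differences produces a very large d, so effect sizes of this kind are best read alongside the confidence interval.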

Abstract

Background: Large language models, such as OpenAI's ChatGPT, have the potential to revolutionize patient education. These platforms can collect, analyze, and organize large volumes of information that would otherwise be largely inaccessible to online users. Within medicine, these tools could complement physicians in educating patients on both complex and routine medical information. Currently, limited literature exists on the reliability of such tools in providing high-quality information to patients inquiring about gender-affirming top surgery. Therefore, this study aimed to evaluate ChatGPT's performance when generating patient-level information on gender-affirming top surgery compared with the current online information provided by the American Society of Plastic Surgeons (ASPS), using the Modified Ensuring Quality Information for Patients (mEQIP) tool.

Methods: ChatGPT-4-generated patient-level education on transmasculine and transfeminine gender-affirming top surgery was compared against current online content provided by the ASPS. ChatGPT-4 patient content was generated by reformatting each standardized mEQIP content item as a prompt about gender-affirming top surgery, entering it into ChatGPT-4, and recording the response for each item. Four experts in gender-affirming top surgery independently rated both sources using a 36-item mEQIP tool. Paired t-tests were then used to compare the overall and content-specific mEQIP scores of the ChatGPT-4 and ASPS material. The effect size between the two groups was evaluated using Cohen's d. Lastly, Cronbach's alpha and the intraclass correlation coefficient (ICC) were calculated to measure internal consistency among raters and interrater agreement.

Results: Paired t-tests showed significantly higher overall mEQIP scores for ChatGPT than for ASPS, with a mean difference of 7.50 (CI 6.75-8.25; p<0.001). For the mEQIP content-specific scores, a paired t-test revealed a similarly significant increase in ChatGPT scores, with a mean difference of 9.75 (CI 9.26-10.24; p<0.001). The paired Cohen's d was 13.00, indicating a large effect size between the two groups. The ASPS-mEQIP showed good internal consistency and excellent interrater reliability (ICC=0.89, Cronbach's α=0.84), while the ChatGPT-mEQIP showed excellent internal consistency and excellent interrater reliability (ICC=0.96, Cronbach's α=0.96).

Conclusion: These results demonstrate that ChatGPT-4-generated patient education on gender-affirming top surgery exceeded the current ASPS online content in both overall and content-specific scores, as measured by the mEQIP tool. ChatGPT achieved significantly higher scores across both domains with large effect sizes, and raters demonstrated excellent interrater reliability and good-to-excellent internal consistency. Going forward, commonly accessible artificial intelligence (AI) tools, such as ChatGPT, may serve as a valuable complement to patient education and shared decision-making within plastic and reconstructive surgery, though future studies are warranted to evaluate freely generated responses, which better reflect how patients currently use AI.
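
The internal-consistency measure mentioned in the abstract can be illustrated with the standard Cronbach's alpha formula, α = k/(k−1) · (1 − Σ item variances / variance of totals). The sketch below is a hedged illustration only: it assumes each rater is treated as one "item" (column) and each mEQIP question as one observation scored 1 (yes) or 0 (no); the function name `cronbach_alpha` and the ratings are hypothetical and do not come from the study.

```python
import statistics


def cronbach_alpha(columns):
    """Cronbach's alpha for a list of equal-length score columns."""
    k = len(columns)
    if k < 2:
        raise ValueError("need at least 2 columns")
    n = len(columns[0])
    if n < 2 or any(len(col) != n for col in columns):
        raise ValueError("columns must have equal length of at least 2")
    # Variance of each column on its own (sample variance, n - 1 denominator).
    item_vars = [statistics.variance(col) for col in columns]
    # Variance of the row totals across all columns.
    totals = [sum(row) for row in zip(*columns)]
    total_var = statistics.variance(totals)
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical yes/no (1/0) ratings: one list per rater, one entry per question.
# These are NOT the study's ratings.
ratings = [
    [1, 1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 1, 0, 1, 0, 1],
]

alpha = cronbach_alpha(ratings)
print(f"Cronbach's alpha = {alpha:.2f}")
```

The ICC is not sketched here because it has several forms (one-way or two-way, single or average measures) and the abstract does not state which one was used.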

Cite this study

Campos et al. studied this question.

www.synapsesocial.com/papers/69d893406c1944d70ce04503 — DOI: https://doi.org/10.7759/cureus.106607

Authors

Adrian O Campos

Mohammed Almeflehi

Sean Kim

Journals

Cureus
