February 8, 2024 · Open Access

Investigating the Impact of Prompt Engineering on the Performance of Large Language Models for Standardizing Obstetric Diagnosis Text: Comparative Study

Abstract

Background: The accumulation of vast electronic medical records (EMRs) through medical informatization creates significant research value, particularly in obstetrics. Diagnostic standardization across different health care institutions and regions is vital for medical data analysis. Large language models (LLMs) have been extensively used for various medical tasks, and prompt engineering is key to using LLMs effectively.

Objective: This study aims to evaluate and compare the performance of LLMs with various prompt engineering techniques on the task of standardizing obstetric diagnostic terminology using real-world obstetric data.

Methods: The paper describes a 4-step approach for mapping diagnoses in electronic medical records to the International Classification of Diseases, 10th revision, observation domain. First, similarity measures were used for mapping the diagnoses. Second, candidate mapping terms were collected based on similarity scores above a threshold, to be used as the training data set. Third, 2 LLMs (ChatGLM2 and Qwen-14B-Chat [QWEN]) were used with zero-shot learning to generate optimal mapping terms. Fourth, a performance comparison was conducted using 3 pretrained bidirectional encoder representations from transformers (BERT) models, namely BERT, whole word masking BERT, and momentum contrastive learning with BERT (MC-BERT), for unsupervised optimal mapping term generation.

Results: LLMs and BERT demonstrated comparable performance at their respective optimal levels. LLMs showed clear advantages in terms of performance and efficiency in unsupervised settings. The performance of the LLMs varied significantly across different prompt engineering setups. For instance, when applying the self-consistency approach in QWEN, the F1-score improved by 5%, with precision increasing by 7.9%, outperforming the zero-shot method. Likewise, ChatGLM2 delivered similar rates of accurately generated responses. During the analysis, the BERT series served as a comparative model with comparable results. Among the 3 BERT models, MC-BERT demonstrated the highest level of performance. However, the differences among the versions of BERT in this study were relatively insignificant.

Conclusions: After applying LLMs to standardize diagnoses and designing 4 different prompts, we compared the results to those generated by the BERT model. Our findings indicate that QWEN prompts largely outperformed the other prompts, with precision comparable to that of the BERT model. These results demonstrate the potential of unsupervised approaches in improving the efficiency of aligning diagnostic terms in daily research and uncovering hidden information values in patient data.
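As a rough illustration of two ideas named in the abstract, the sketch below shows threshold-based candidate selection and self-consistency voting. It is not the authors' implementation: the abstract does not say which similarity measure or threshold was used, so `difflib.SequenceMatcher` and the value 0.5 are stand-ins, and `candidate_terms`, `self_consistent_answer`, and `sample_fn` are hypothetical names introduced here. `sample_fn` represents any callable that sends a prompt to an LLM and returns one sampled answer.

```python
from collections import Counter
from difflib import SequenceMatcher


def candidate_terms(diagnosis, vocabulary, threshold=0.5):
    """Return standard terms whose string similarity to the raw
    diagnosis meets the threshold, most similar first.

    SequenceMatcher is an assumed stand-in for the paper's
    (unspecified) similarity measure.
    """
    scored = [
        (SequenceMatcher(None, diagnosis.lower(), term.lower()).ratio(), term)
        for term in vocabulary
    ]
    return [term for score, term in sorted(scored, reverse=True) if score >= threshold]


def self_consistent_answer(sample_fn, prompt, n_samples=5):
    """Self-consistency: sample the model several times on the same
    prompt and keep the most frequent answer (majority vote).

    sample_fn is a hypothetical callable wrapping an LLM call.
    """
    answers = [sample_fn(prompt).strip() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

In use, the candidates returned by `candidate_terms` would be placed in the prompt, and `self_consistent_answer` would pick the term the model names most often across samples. A plain zero-shot setup corresponds to `n_samples=1`.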

Cite this study

Wang et al. (2024) studied this question.

www.synapsesocial.com/papers/68e7b298b6db64358770d9ee — DOI: https://doi.org/10.2196/53216

Also consider

Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context:

Physicians’ Perceptions of Chatbots in Health Care: Cross-Sectional Web-Based Survey · 2019 · 406 citations
Automated clinical coding using off-the-shelf large language models · 2023 · 4 citations
Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation · 2020 · 1,554 citations
GLM: General Language Model Pretraining with Autoregressive Blank Infilling · 2021 · 21 citations

Authors

Lei Wang

Wenshuai Bi

Suling Zhao

Journals

JMIR Formative Research

Institutions

BGI Group (China)

The People's Hospital of Guangxi Zhuang Autonomous Region

BGI Research
