What question did this study set out to answer?

This research aims to improve the analysis of unstructured medical transcription data using semantic similarity models.

April 30, 2026 | Open Access

A Semantic Text Similarity Model for Predicting Medical Transcription Data

Key Points

This research aims to improve the analysis of unstructured medical transcription data using semantic similarity models.
Evaluated two models for semantic text similarity classification: a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) and an Artificial Neural Network (ANN).
Utilized an annotated dataset of 3,048 medical question pairs from Hugging Face.
Conducted experiments in TensorFlow for 100 epochs with an 80:20 train-validation split.
LSTM-RNN with data augmentation achieved a validation accuracy of 95.31%, compared with 58.03% without augmentation.
ANN validation accuracy was 82.87% with augmentation, compared to 50.49% without it.
Performance metrics included accuracy, binary cross-entropy loss, AUC, precision, recall, and F1-score; the augmented LSTM-RNN scored highest among the tested configurations.
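To make the 80:20 train-validation split concrete, here is a minimal sketch in plain Python. The source does not say whether the split was random or stratified, or whether it was taken before or after augmentation, so the shuffling, the seed, the `split_pairs` helper, and the placeholder records are all assumptions for illustration rather than the authors' procedure.

```python
import random


def split_pairs(pairs, train_fraction=0.8, seed=0):
    """Shuffle a copy of `pairs` and split it into train and validation lists.

    Hypothetical helper: the random shuffle and the seed are assumptions,
    not details taken from the study.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder records standing in for the 3,048 annotated question pairs:
# (question_a, question_b, similarity_label). Not the real dataset.
pairs = [(f"question_a_{i}", f"question_b_{i}", i % 2) for i in range(3048)]

train, val = split_pairs(pairs)
print(len(train), len(val))  # 2438 610
```

Under these assumptions, the non-augmented dataset yields 2,438 training pairs and 610 validation pairs. The sizes after augmentation are not reported in the source.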

Abstract

Background: The health sector faces challenges in analysing unstructured medical transcription data, particularly in identifying semantic similarities between clinical question pairs for information retrieval. A major challenge is that it is not feasible to obtain sufficiently large and representative data for specialised machine learning models due to privacy policies. Data augmentation could help alleviate these challenges and, therefore, requires investigation.

Methods: This study investigated two models, Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) and Artificial Neural Network (ANN), for semantic text similarity classification. An annotated dataset of 3,048 medical question pairs from Hugging Face was employed. The LSTM-RNN model was developed using an Embedding layer and an LSTM layer, whereas the ANN model was developed without these layers. Experiments were conducted in TensorFlow for 100 epochs with an 80:20 train-validation split.

Results: Six performance metrics were recorded: Accuracy, Binary Cross-Entropy Loss, Area Under the Curve (AUC), Precision, Recall, and F1-Score. The augmented LSTM-RNN significantly outperformed other configurations, achieving a validation accuracy of 95.31%, a loss of 0.2931, an AUC of 97.39%, a precision of 94.69%, a recall of 96.21%, and an F1-score of 95.44%. Without augmentation, the LSTM-RNN validation accuracy dropped to 58.03%. The augmented ANN achieved a validation accuracy of 82.87%, while the non-augmented ANN struggled with 50.49% accuracy.

Conclusions: The inclusion of LSTM and Embedding layers allowed the LSTM-RNN to capture contextual dependencies that the ANN could not. The results demonstrate that data augmentation is important for achieving high-performance metrics in clinical text analysis, where data is limited.
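The reported F1-score can be cross-checked against the reported precision and recall, since F1 is the harmonic mean of the two. The short sketch below does that arithmetic. The `f1_score` helper is a hypothetical name written for this illustration (not a library call), and the inputs are the augmented LSTM-RNN validation figures quoted in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Augmented LSTM-RNN validation figures as reported in the abstract.
reported_precision = 0.9469
reported_recall = 0.9621

f1 = f1_score(reported_precision, reported_recall)
print(f"{f1:.4f}")  # 0.9544
```

The result rounds to 95.44%, which agrees with the F1-score stated in the abstract.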

Authors

Daniel A Folorunso

Adio T Akinwale

Alaba O Adejimi

Journals

Cureus Journal of Computer Science.
