Transformer-based models have shown strong potential for clinical prediction using electronic health record data, yet their performance can vary depending on modelling decisions and data characteristics. In this study, we trained a BEHRT model on hospital-based UK Biobank data and evaluated its performance across four clinical prediction tasks, including next-visit diagnosis and longer-term diagnosis prediction up to five years. We exhaustively assessed the impact of model size, medical terminology (CALIBER vs ICD-10), and data split strategies. The large model consistently outperformed the smaller one in long-term prediction tasks (AUROC = 0.874 vs 0.858 at 5 years), while differences were marginal in the 6-month prediction tasks. Performance was also sensitive to vocabulary size, with the CALIBER model yielding higher average precision scores than the ICD-10 model (average precision = 0.773 vs 0.678). Our results show that transformer models can achieve high predictive performance across diverse clinical scenarios, but outcomes vary considerably depending on modelling choices, particularly in long-term prediction tasks.
Yildiz et al. studied this question.