March 3, 2026 | Open Access

Semi-Supervised Learning to Improve Generalizability of Cancer Associated-Venous Thromboembolism Risk Prediction Models

Key Points

Models refined with a semi-supervised learning (SSL) algorithm performed better on external validation than the same models without it, and all outperformed the Khorana score.
On the external validation set, the eight post-imputation machine learning models achieved AUC values of 0.816 to 0.868.
Data from 2100 cancer patients were collected retrospectively for model development, and a further 321 patients were collected prospectively for external validation.
The findings support the use of SSL-based models for cancer-associated venous thromboembolism (CA-VTE) risk assessment and stratified preventive care.

Abstract

Objective: The purpose of this study is to develop and validate an improved cancer-associated venous thromboembolism (CA-VTE) risk prediction model based on a semi-supervised learning (SSL) algorithm.

Methods: This study used a combined retrospective and prospective cohort design. First, data from 2100 cancer patients in a tertiary hospital in Beijing were retrospectively collected, including a "labeled cohort" with CA-VTE outcomes (N = 1036) and an "unlabeled cohort" without outcomes (N = 1064). Then, another dataset was prospectively collected as an external validation set (N = 321). Eight supervised machine learning (ML) algorithms were used to develop CA-VTE risk prediction models, and one SSL algorithm was used to improve the generalizability of the models (pre- and post-imputation ML models). Model performance was evaluated in the prospective cohort using the area under the curve (AUC) and the Brier score, and compared with that of the Khorana score.

Results: The eight post-imputation ML models (AUC: 0.816-0.868; Brier score: 0.118-0.160) performed better on the external validation set than the pre-imputation models (AUC: 0.798-0.841; Brier score: 0.133-0.171). In contrast, the AUC of the Khorana score remained unchanged (AUC: 0.693), while its Brier score increased (Brier score: 0.172 vs 0.178).

Conclusion: Based on a retrospective and prospective cohort study design, this study developed eight ML models that outperformed the Khorana score. Using an SSL algorithm improved the external validation performance of the models and enhanced prediction accuracy. This study can provide an important reference for the early identification of high-risk factors and stratified preventive care for CA-VTE.
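For readers less familiar with the two evaluation metrics named in the abstract, the following is a minimal sketch in plain Python of how AUC and the Brier score can be computed from predicted risks and observed outcomes. The function names and the toy data are hypothetical illustrations, not the study's code or data. A higher AUC indicates better discrimination, while a lower Brier score indicates predicted probabilities that sit closer to the observed outcomes.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true, y_prob):
    """Fraction of (event, non-event) pairs in which the event has the
    higher predicted risk; tied predictions count as half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = event observed, 0 = no event.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

print(f"AUC:         {auc(y_true, y_prob):.3f}")
print(f"Brier score: {brier_score(y_true, y_prob):.3f}")
```

The pairwise AUC shown here is quadratic in the number of patients, which is fine for a cohort of a few hundred; a rank-based formulation would be the usual choice for larger datasets.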

Authors

Shuai Jin

Chong Wang

Dan Qin

Journals

Clinical and Applied Thrombosis/Hemostasis

Institutions

Peking University

Capital Medical University

Hebei University
