What question did this study set out to answer?

The aim is to develop an interpretable machine learning model to predict healthcare data breach risks.

April 15, 2026

Research and Analysis of Healthcare Data Breach Risk Prediction in the US Based on Interpretable Machine Learning

Key Points

Aimed to develop an interpretable machine learning model to predict healthcare data breach risk
Used OCR-HHS breach reports from 2019 to 2025
Built models using logistic regression and random forest
Conducted calibration and feature importance analyses
Compared model performance against random baselines
Found that both models outperform random baselines in predicting breaches
Identified attack vector and information location as critical factors
Found low overall discriminative ability in the predictions
Provided actionable early-warning insights for healthcare cybersecurity
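
As a rough illustration of how breach reports could be turned into model inputs, the sketch below derives a high-impact label from the 100,000-person threshold given in the abstract and one-hot encodes three categorical fields. The field names, the `BreachRecord` type, and the three sample rows are hypothetical and made up for illustration; they are not taken from the study or from real breach reports.

```python
from dataclasses import dataclass

HIGH_IMPACT_THRESHOLD = 100_000  # threshold stated in the abstract


@dataclass(frozen=True)
class BreachRecord:
    # Hypothetical field names; the study's actual schema is not given.
    entity_type: str
    breach_type: str
    location: str
    individuals_affected: int


CATEGORICAL_FIELDS = ("entity_type", "breach_type", "location")


def high_impact_label(record: BreachRecord) -> int:
    """Return 1 if the breach affected at least 100,000 people, else 0."""
    return int(record.individuals_affected >= HIGH_IMPACT_THRESHOLD)


def build_vocabulary(records):
    """Collect every (field, value) pair seen in the data, in sorted order."""
    return sorted({(f, getattr(r, f)) for r in records for f in CATEGORICAL_FIELDS})


def one_hot(record: BreachRecord, vocabulary):
    """Encode a record as a 0/1 vector with one slot per (field, value) pair."""
    return [int(getattr(record, f) == v) for f, v in vocabulary]


# Made-up rows for illustration only (not real breach data).
records = [
    BreachRecord("Healthcare Provider", "Hacking/IT Incident", "Network Server", 250_000),
    BreachRecord("Health Plan", "Unauthorized Access/Disclosure", "Email", 1_200),
    BreachRecord("Business Associate", "Hacking/IT Incident", "Network Server", 100_000),
]

vocabulary = build_vocabulary(records)
features = [one_hot(r, vocabulary) for r in records]
labels = [high_impact_label(r) for r in records]
```

Each row of `features` has exactly one active slot per categorical field, which keeps logistic regression coefficients readable as per-category effects.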

Abstract

This study uses breach reports from the Office for Civil Rights at the Department of Health and Human Services (OCR-HHS) for 2019-2025 to build an interpretable machine learning model that forecasts high-impact incidents (≥ 100,000 people affected) and the likelihood of ransomware. The input features include entity type, breach method, and the location of the compromised information. Logistic regression and random forest models were compared to provide both transparency and accuracy, supported by calibration and feature importance analyses. The anticipated benefits are actionable risk scores for tiered controls, improved incident preparedness, and governance decision support for healthcare cybersecurity. The research relies only on publicly available aggregated statistics and not on patient-related data, in keeping with professional and regulatory ethics. The findings indicate that both models outperform random baselines and can provide useful early-warning information; however, their overall discriminative ability is low. Critical elements, namely attack vectors and information location, yield results that can be used directly in operational security planning. Overall, the findings show that interpretable, data-driven predictions are feasible for reinforcing proactive cybersecurity governance within the healthcare industry.
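
The abstract does not say which metrics were used for the baseline comparison and calibration analysis. The following is a minimal sketch of one common choice, assuming ROC AUC for discrimination and the Brier score plus binned reliability for calibration. The function names, the toy labels, and the toy probabilities are illustrative assumptions, not the study's data or code.

```python
import random


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(labels, probs):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def calibration_bins(labels, probs, n_bins=5):
    """Return (mean predicted, observed rate, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    return [
        (sum(p for _, p in b) / len(b), sum(y for y, _ in b) / len(b), len(b))
        for b in bins
        if b
    ]


# Toy values for illustration only (not results from the study).
labels = [0, 0, 0, 1, 0, 1, 0, 1]
model_probs = [0.05, 0.10, 0.30, 0.35, 0.20, 0.60, 0.40, 0.25]

# Random baseline: the same scores, shuffled so they carry no signal by design.
baseline_probs = model_probs[:]
random.Random(0).shuffle(baseline_probs)

model_auc = roc_auc(labels, model_probs)
baseline_auc = roc_auc(labels, baseline_probs)
model_brier = brier_score(labels, model_probs)
reliability = calibration_bins(labels, model_probs)
```

An AUC near 0.5 means the ranking is no better than chance, so a model can beat a random baseline and still have low discriminative ability if its AUC sits only modestly above that level.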

Authors

Mingyang Sun

Yuxin Wu

Rongtian Ye
