What question did this study set out to answer?

This research aims to analyze the incidence of diagnosed diabetes at the county level and the impact of sociodemographic factors using machine learning.

March 30, 2026 | Open Access

Machine Learning Models to Evaluate County-level Incidence of Diagnosed Diabetes and Sociodemographic Factors

Key Points

This research aims to analyze the incidence of diagnosed diabetes at the county level and the impact of sociodemographic factors using machine learning.
The study used US CDC data from 2004–2019 on the incidence of diagnosed diabetes across counties.

Structured PICO

Population

3,114 US counties with data on diagnosed diabetes incidence (2004-2019) and 34 sociodemographic factors

Intervention

Machine learning models (elastic net regression, extreme gradient boosting [XGBoost], support vector machine [SVM])

Outcome

Model performance for estimating diabetes incidence and classifying higher-burden counties (incidence >12.6 per 1000 persons)

Machine learning models demonstrated high discrimination in identifying US counties with a high burden of diabetes using sociodemographic factors, highlighting variables like children living with grandparent householders as key predictors.

Abstract

Objective: To evaluate county-level incidence of diagnosed diabetes and key sociodemographic factors in a high-dimensional, nonlinear setting.

Methods: This temporally aggregated observational study used US CDC data on county-level incidence of diagnosed diabetes from 2004–2019 and 34 sociodemographic factors from public databases. We defined counties as higher-burden if diabetes incidence was >12.6 per 1000 persons (1 standard deviation [SD] above the sample mean). As relationships between sociodemographic factors and diabetes incidence may be nonlinear and involve complex interactions, we trained three machine learning models to estimate incidence (elastic net regression), classify counties as higher-burden (extreme gradient boosting [XGBoost], support vector machine [SVM]), and identify feature importance. Model performance was evaluated using 5-fold cross-validation, with stratified folds for the XGBoost and SVM models.

Results: Overall, 500 of 3114 counties (16.1%) were higher-burden. Elastic net regression showed good predictive performance for estimating diabetes incidence (R² 0.78 [95% CI, 0.75–0.80]). For classification of higher-burden counties, SVM and XGBoost showed high discrimination, with AUROC of 0.962 (95% CI, 0.948–0.974) and 0.957 (95% CI, 0.941–0.971), respectively. Sensitivity analyses using alternative definitions of higher-burden counties (mean + 0.75×SD; mean + 1.25×SD) yielded comparable results. Across all three models, key county-level features contributing to model predictions were the percentages of children living with grandparent householders and of people with Limited English.

Conclusions: Machine learning models demonstrated consistent performance in estimating and classifying county-level diabetes incidence, with high discrimination for identifying higher-burden counties. Sociodemographic factors, including children living with grandparent householders, may inform tailored public health interventions.
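The labeling and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the incidence values are synthetic, the helper names (`label_higher_burden`, `stratified_folds`, `auroc`) are hypothetical, and the use of the sample SD is an assumption, since the abstract does not say which SD estimator was used. The sketch flags values above mean + k×SD, assigns stratified folds so each fold keeps a similar share of higher-burden cases, and computes AUROC as the probability that a positive case outscores a negative one.

```python
import random
import statistics


def label_higher_burden(incidence, k=1.0):
    """Flag values above mean + k * SD of the sample."""
    mean = statistics.fmean(incidence)
    sd = statistics.stdev(incidence)  # sample SD: an assumption, not stated in the source
    threshold = mean + k * sd
    return threshold, [int(x > threshold) for x in incidence]


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each index to a fold so class proportions are roughly preserved."""
    rng = random.Random(seed)
    folds = [0] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled members of this class round-robin across folds.
        for pos, i in enumerate(idx):
            folds[i] = pos % n_folds
    return folds


def auroc(labels, scores):
    """AUROC as P(positive score > negative score); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic incidence values per 1000 persons (illustrative only, not study data).
rng = random.Random(42)
incidence = [rng.gauss(9.0, 3.0) for _ in range(1000)]

threshold, labels = label_higher_burden(incidence, k=1.0)
folds = stratified_folds(labels, n_folds=5)

# A noisy stand-in for a model score, used only to exercise auroc().
noisy_scores = [x + rng.gauss(0.0, 2.0) for x in incidence]
print(f"AUROC of noisy score: {auroc(labels, noisy_scores):.3f}")

# Sensitivity analysis over alternative threshold definitions.
for k in (0.75, 1.0, 1.25):
    t, y = label_higher_burden(incidence, k)
    print(f"k={k}: threshold={t:.2f}, higher-burden share={sum(y) / len(y):.3f}")
```

In practice a library routine such as a stratified k-fold splitter would replace the hand-written fold assignment. The round-robin version is shown only to make the stratification idea explicit.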

Authors

Alexander S. Keigley

Shant Ayanian

Sagar Dugani

Journals

American Journal of Medicine Open

Institutions

Mayo Clinic in Arizona

Mayo Clinic in Florida
