June 3, 2026 · Open Access

Benchmarking of Ensembles and Meta-Ensembles in the Multiclass Classification of Obesity-Status Classification: Predictive Performance, Calibration and Interpretability

Abstract

Obesity is a major public health concern because of its high prevalence and association with cardiometabolic comorbidities. This study compared nine ensemble and meta-ensemble learning models for multiclass obesity-status classification using the Obesity Dataset, comprising 1610 records, 14 predictors, and four body-weight status classes. To ensure a leakage-aware evaluation, all preprocessing and resampling steps were embedded within the validation workflow. Standardization, one-hot encoding, and RandomOverSampler were applied only within the training folds; SMOTE and no-resampling configurations were retained as configurable alternatives but were not used to generate the reported results. Model performance was assessed using complementary classification, discrimination, agreement, and calibration metrics, including accuracy, balanced accuracy, weighted F1-score, macro F1-score, weighted ROC-AUC, Matthews correlation coefficient, Brier score, and multiclass expected calibration error. Overall, the ensemble models achieved strong discriminative performance, with eight of nine classifiers exceeding 82% accuracy and obtaining weighted ROC-AUC values close to or above 94%. LightGBM showed the strongest mean metric-based profile, with an accuracy of 85.41 ± 2.85%, weighted F1-score of 85.25 ± 2.88%, weighted ROC-AUC of 95.58 ± 1.52%, and MCC of 0.779 ± 0.042. Random Forest and Stacking achieved comparable classification performance, although Stacking presented poorer calibration. The Friedman test detected significant global differences among classifiers, χ² = 38.7733, p = 0.000005. However, the Nemenyi post hoc test indicated that Stacking, Random Forest, LightGBM, Voting, Gradient Boosting, and Extra Trees belonged to the same high-performance statistical group.
Therefore, LightGBM was selected as the final model based on its practical balance of predictive performance, calibration behavior, stability, and implementation feasibility, rather than on unequivocal statistical superiority. On the independent holdout set, LightGBM maintained strong generalization, achieving accuracy = 0.8447, weighted F1-score = 0.8435, MCC = 0.7653, and weighted ROC-AUC = 0.9464. Calibration was moderate, with Brier score = 0.2575 and multiclass ECE = 0.1070, indicating that predicted probabilities should be interpreted cautiously when used to support threshold-based decisions.
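
As an illustration of the two calibration metrics reported above, the following minimal Python sketch computes a multiclass Brier score and a top-label expected calibration error (ECE). The abstract does not state the exact formulations, so the summed-over-classes Brier definition, the top-label ECE variant, the 10 equal-width confidence bins, and the function names `multiclass_brier` and `top_label_ece` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Sequence


def multiclass_brier(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean over samples of the squared error summed over classes.

    Assumed formulation (range 0 to 2); the paper may use a different scaling.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # Compare each predicted probability with the one-hot target.
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(labels)


def top_label_ece(probs: Sequence[Sequence[float]], labels: Sequence[int],
                  n_bins: int = 10) -> float:
    """Top-label ECE with equal-width confidence bins (assumed variant)."""
    # Per bin: [sum of confidences, number of correct predictions].
    bins = [[0.0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = max(range(len(p)), key=lambda k: p[k])  # predicted class
        conf = p[pred]                                  # its probability
        idx = min(int(conf * n_bins), n_bins - 1)       # conf == 1.0 goes to last bin
        bins[idx][0] += conf
        bins[idx][1] += int(pred == y)
    n = len(labels)
    # Weighted gap between accuracy and mean confidence, summed over bins.
    return sum(abs(correct - conf_sum) for conf_sum, correct in bins) / n


# Hypothetical predicted probabilities for four samples and four classes.
example_probs = [
    [0.75, 0.10, 0.10, 0.05],
    [0.10, 0.65, 0.15, 0.10],
    [0.25, 0.20, 0.45, 0.10],
    [0.05, 0.10, 0.10, 0.75],
]
example_labels = [0, 1, 3, 3]

print("Brier:", multiclass_brier(example_probs, example_labels))
print("ECE:  ", top_label_ece(example_probs, example_labels))
```

Under these assumed definitions, lower values indicate better calibration for both metrics, and a nonzero ECE means that the predicted confidence differs from the observed accuracy in at least one bin, which is why probability thresholds warrant caution.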

Authors

Daniel Andrade-Girón

Universidad Nacional José Faustino Sánchez Carrión

William Marin-Rodriguez

Universidad Nacional José Faustino Sánchez Carrión

Américo Peña

Universidad Nacional José Faustino Sánchez Carrión

Journals

Informatics

Institutions

Universidad Nacional José Faustino Sánchez Carrión
