In the context of employee attrition prediction using imbalanced tabular data, we propose a reproducible, leakage-aware evaluation protocol and validate it on the IBM HR Attrition dataset. We perform the train/test split prior to any rebalancing; SMOTE (Synthetic Minority Over-sampling Technique) is applied exclusively within the training portion of each fold in stratified 5-fold cross-validation, while the test set remains untouched. One-hot encoding is performed consistently using pd.get_dummies. We benchmark Logistic Regression, Random Forest, ExtraTrees, LightGBM, and XGBoost using imbalance-aware metrics: F1 for the minority class, PR-AUC reported as Average Precision (AP), and ROC-AUC, each reported both in cross-validation and on the held-out test set. XGBoost attains the best mean AP in cross-validation (0.556 ± 0.056), Logistic Regression achieves the highest mean F1 (0.439 ± 0.048), and LightGBM yields the best mean ROC-AUC (0.791 ± 0.026). On the test set, XGBoost achieves a precision of 0.65 and a recall of 0.45 at a fixed threshold of 0.5. Overall, the results highlight a trade-off between stable minority-class detection (Logistic Regression) and stronger risk-ranking performance (boosting models) under class imbalance.
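The fold-level protocol described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' code: the helper names (stratified_folds, smote, resampled_folds), the toy data, the neighbour count k=5, and the convention that label 1 is the minority class are all assumptions made for illustration. In practice the same effect is usually obtained by placing imbalanced-learn's SMOTE inside an imblearn Pipeline evaluated with scikit-learn's StratifiedKFold. The point shown here is only that synthetic samples are generated from, and added to, the training part of each fold, so validation rows are always original data.

```python
import random

def stratified_folds(y, n_splits=5, seed=0):
    """Assign indices to folds so class proportions are roughly preserved."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    return folds

def smote(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE: interpolate between a minority sample and one of
    its k nearest minority neighbours (squared Euclidean distance)."""
    if len(X_min) < 2 or n_new <= 0:
        return []  # assumption: nothing to interpolate with fewer than 2 points
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(X_min)
        neighbours = sorted(
            (p for p in X_min if p is not a),
            key=lambda p: sum((u - v) ** 2 for u, v in zip(a, p)),
        )[:k]
        b = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from a to b
        synthetic.append([u + gap * (v - u) for u, v in zip(a, b)])
    return synthetic

def resampled_folds(X, y, n_splits=5, seed=0):
    """Yield (X_train, y_train, X_val, y_val); SMOTE touches training data only."""
    folds = stratified_folds(y, n_splits, seed)
    for f, val_idx in enumerate(folds):
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        X_min = [x for x, v in zip(X_tr, y_tr) if v == 1]
        n_new = (len(y_tr) - len(X_min)) - len(X_min)  # majority minus minority
        X_syn = smote(X_min, n_new, seed=seed + f)
        yield (X_tr + X_syn, y_tr + [1] * len(X_syn),
               [X[i] for i in val_idx], [y[i] for i in val_idx])

# Toy data (illustrative only): 100 rows, 16 of them in the minority class.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
y = [1 if i < 16 else 0 for i in range(100)]

for X_tr, y_tr, X_val, y_val in resampled_folds(X, y):
    # Training counts are balanced; validation keeps the original imbalance.
    print(sum(y_tr), len(y_tr) - sum(y_tr), sum(y_val), len(y_val))
```

Resampling before the split would instead let synthetic points interpolated from validation rows leak into training, which is the failure mode the protocol is designed to avoid.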
Ana Maria Căvescu
Alina Nirvana Popescu
Information
Universitatea Națională de Știință și Tehnologie Politehnica București
Căvescu et al. studied this question.
www.synapsesocial.com/papers/69c37be2b34aaaeb1a67ebef — DOI: https://doi.org/10.3390/info17030308