Predictive maintenance plays a key role in digitalization initiatives. In real settings, however, failure instances are rare compared with normal instances, and the resulting class imbalance complicates failure prediction. In this study, we systematically compare five machine learning (ML) models for detecting rare multiclass failures: random forest, XGBoost, support vector machine, k-nearest neighbors, and multinomial logistic regression (MLR). Each model is paired with one of four imbalance-handling approaches (no handling, manual oversampling, selective manual oversampling, and class weighting), forming 20 configurations. On the AI4I 2020 predictive maintenance dataset, which contains five failure types, XGBoost with no handling achieved the highest macro-averaged F1 (macro-F1) score (0.842) but obtained 0% recall for tool wear failure (TWF). MLR with selective manual oversampling achieved approximately 50% TWF recall, but its overall performance (0.636 macro-F1) was lower than that of top-performing models such as XGBoost. Very rare classes remain difficult to detect: even high-performing models fail to consistently detect all five failure types. Overall, no single strategy achieves strong detection across all performance measures.
Alnahhal et al. studied this question.
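To illustrate why per-class recall is worth reporting alongside macro-F1, the following minimal sketch computes both from scratch on a toy label set. The labels `normal` and `other_failure`, the sample counts, and the helper names `per_class_metrics` and `macro_f1` are hypothetical and chosen only for illustration; they are not taken from the AI4I 2020 dataset or from the study's code, and the numbers do not reproduce the reported results.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that appears."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    metrics = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def macro_f1(metrics):
    """Unweighted mean of per-class F1, so every class counts equally."""
    return sum(f1 for _, _, f1 in metrics.values()) / len(metrics)


# Toy data (hypothetical): the classifier never predicts the rare TWF class.
y_true = ["normal"] * 8 + ["other_failure"] * 2 + ["TWF"] * 2
y_pred = ["normal"] * 8 + ["other_failure"] * 2 + ["normal"] * 2

metrics = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1(metrics):.3f}")
```

In this toy case, accuracy stays above 0.8 while TWF recall is zero. Macro-F1 does register the missed class because it averages per-class F1 without weighting by frequency, but a single aggregate score still cannot show which class was missed, which is why the per-class breakdown is printed separately.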