What question did this study set out to answer?

This research aims to evaluate different machine learning models and imbalance-handling methods for detecting rare failures in predictive maintenance.

April 10, 2026 | Open Access

A Comparative Study of Imbalance-Handling Methods in Multiclass Predictive Maintenance

Key Points

The study compared five machine learning models: random forest, XGBoost, support vector machine, k-nearest neighbors, and multinomial logistic regression.
It evaluated four imbalance-handling approaches: no handling, manual oversampling, selective manual oversampling, and class weighting.
Model performance was tested on the AI4I 2020 predictive maintenance dataset, which contains five failure types.
Pairing each of the five models with each of the four approaches yielded 20 configurations.
XGBoost with no handling achieved the highest macro-averaged F1 score of 0.842 but had 0% recall for tool wear failure.
Multinomial logistic regression (MLR) with selective manual oversampling achieved approximately 50% recall for tool wear failure, but with a lower macro-F1 score of 0.636.
Very rare classes remain difficult to detect: even high-performing models fail to consistently detect all five failure types.
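
The tension between a high macro-averaged F1 score and 0% recall on one class can be illustrated with a minimal sketch. The labels and predictions below are invented toy data, not results from the study; the function names are hypothetical, and the convention of scoring an undefined precision or recall as 0 is an assumption.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return precision, recall, and F1 for every class seen in the data."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fn[truth] += 1  # the true class was missed
            fp[pred] += 1   # the predicted class was a false alarm
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        # Assumption: an undefined ratio (zero denominator) is scored as 0.
        precision = tp[cls] / (tp[cls] + fp[cls]) if tp[cls] + fp[cls] else 0.0
        recall = tp[cls] / (tp[cls] + fn[cls]) if tp[cls] + fn[cls] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_f1(scores):
    """Unweighted mean of per-class F1, so every class counts equally."""
    return sum(s["f1"] for s in scores.values()) / len(scores)


# Toy data (hypothetical): the classifier never predicts the rarest class.
y_true = ["none"] * 8 + ["other_failure"] * 4 + ["TWF"] * 2
y_pred = ["none"] * 8 + ["other_failure"] * 4 + ["none"] * 2

scores = per_class_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy   = {accuracy:.3f}")
print(f"macro-F1   = {macro_f1(scores):.3f}")
print(f"TWF recall = {scores['TWF']['recall']:.3f}")
```

In this toy case the aggregate scores stay moderate to high even though every TWF instance is missed, which is why per-class recall is worth reporting alongside macro-F1.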

Abstract

Predictive maintenance plays a key role in digitalization initiatives. In real settings, however, failure instances are rare compared with normal instances, and the resulting class imbalance makes failure prediction difficult. In this study, we systematically compare five machine learning (ML) models (random forest, XGBoost, support vector machine, k-nearest neighbors, and multinomial logistic regression (MLR)) for detecting multiclass rare failures under four imbalance-handling approaches (no handling, manual oversampling, selective manual oversampling, and class weighting), forming 20 configurations. Using the AI4I 2020 predictive maintenance dataset, which contains five failure types, we found that XGBoost with no handling achieved the highest macro-averaged F1 (macro-F1) score (0.842) but obtained 0% recall for tool wear failure (TWF). MLR with selective manual oversampling achieved approximately 50% TWF recall, but with lower overall performance (0.636 macro-F1) than top-performing models such as XGBoost. We also found that very rare classes remain difficult to detect: even high-performing models fail to consistently detect all five failure types. Overall, no single strategy achieves strong results across all performance measures.
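
The experimental grid and two of the imbalance-handling ideas can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not describe the exact procedures, so the sketch assumes that manual oversampling means duplicating minority-class rows at random, that "selective" means doing so only for chosen classes, and that class weighting uses the common inverse-frequency heuristic. All identifiers and the toy class counts are hypothetical.

```python
import random
from collections import Counter
from itertools import product

MODELS = ["random_forest", "xgboost", "svm", "knn", "mlr"]
STRATEGIES = ["none", "manual_oversampling",
              "selective_manual_oversampling", "class_weighting"]

# Every model paired with every strategy: 5 x 4 = 20 configurations.
CONFIGS = list(product(MODELS, STRATEGIES))


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def oversample(rows, labels, target_classes=None, seed=0):
    """Duplicate rows of the targeted classes until they match the majority.

    target_classes=None targets every class; passing a subset is our
    assumed reading of "selective" oversampling.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    majority = max(counts.values())
    targets = set(counts) if target_classes is None else set(target_classes)
    out_rows, out_labels = list(rows), list(labels)
    for cls in sorted(targets):
        pool = [r for r, y in zip(rows, labels) if y == cls]
        if not pool:
            continue  # nothing to duplicate for a class absent from the data
        for _ in range(majority - len(pool)):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy training set (hypothetical counts): row ids stand in for feature vectors.
labels = ["none"] * 90 + ["other_failure"] * 8 + ["TWF"] * 2
rows = list(range(len(labels)))

weights = balanced_class_weights(labels)
all_rows, all_labels = oversample(rows, labels)
sel_rows, sel_labels = oversample(rows, labels, target_classes=["TWF"])

print(len(CONFIGS), "configurations")
print("class weights:", {c: round(w, 3) for c, w in weights.items()})
print("oversample all:", dict(Counter(all_labels)))
print("oversample TWF only:", dict(Counter(sel_labels)))
```

In a real experiment, resampling of this kind would normally be applied to the training split only, so that duplicated rows do not leak into the evaluation data.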
