Accurate estimation of battery State of Health (SOH) is critical for ensuring the reliability, safety, and sustainability of electric vehicles and second-life energy storage systems. Developing machine learning (ML) models for pack-level SOH prediction remains challenging due to the limited availability and high heterogeneity of real End-of-Life (EoL) battery data. This study analyzes how dataset size and statistical balance influence the generalization capability and data efficiency of ensemble ML models trained on industrial pack-level measurements. Using controlled discharge and recovery tests from reconditioned battery packs, three ensemble regressors (Random Forest, Gradient Boosting, and CatBoost) were evaluated through learning-curve analysis and power-law modeling. CatBoost achieved the highest accuracy and robustness (RMSE = 0.0145), while Gradient Boosting exhibited stronger scalability with increasing data volume. Dataset balancing improved representational fairness across degradation modes but yielded limited gains in absolute accuracy, confirming that predictive performance is primarily constrained by data quantity. Extrapolated scaling trends indicate that doubling the dataset would yield only a marginal 0.1% improvement in SOH accuracy, making large-scale data expansion economically inefficient. The proposed framework provides a quantitative foundation for assessing data efficiency and guiding cost-effective ML model validation in industrial SOH estimation pipelines.
Calì et al. studied this question.
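The power-law extrapolation mentioned in the abstract can be illustrated with a minimal sketch. It assumes a pure power law of the form RMSE(n) ≈ a · n^(−b) with no irreducible-error offset; the exact functional form used in the study is not stated here. The function names `fit_power_law` and `predict_error` are hypothetical, and the learning-curve points are synthetic values chosen for illustration, not measurements from the study.

```python
import numpy as np


def fit_power_law(n_samples, errors):
    """Fit errors ~ a * n**(-b) by ordinary least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(n_samples), np.log(errors), 1)
    return float(np.exp(intercept)), float(-slope)


def predict_error(n, a, b):
    """Evaluate the fitted power law at training-set size n."""
    return a * np.asarray(n, dtype=float) ** (-b)


# Synthetic, illustrative learning-curve points (NOT data from the study).
n_train = np.array([50, 100, 200, 400, 800], dtype=float)
rmse = 0.05 * n_train ** (-0.2)

a, b = fit_power_law(n_train, rmse)

# Relative RMSE reduction expected from doubling the largest training set.
n_max = n_train[-1]
gain = 1.0 - float(predict_error(2 * n_max, a, b) / predict_error(n_max, a, b))
print(f"a={a:.4f}, b={b:.3f}, relative RMSE reduction from doubling: {gain:.1%}")
```

Under this assumed form, the relative gain from doubling the dataset depends only on the exponent, since RMSE(2n)/RMSE(n) = 2^(−b). A small fitted exponent therefore implies a small return on additional data, which is the kind of argument the abstract makes about the cost of large-scale data expansion.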