Accurate prediction of the Langelier Saturation Index (LSI), an indicator of water's scaling and corrosive potential, is vital for water treatment and infrastructure maintenance. In this study, five machine learning models (Ridge Regression, Support Vector Machine, Random Forest, Deep Neural Network, and XGBoost) were applied to predict the LSI from physicochemical characteristics of groundwater in the Morava River basin (Serbia). Rigorous data preprocessing (outlier removal, missing data handling, z-score normalization) and feature selection were performed to ensure robust model training. The data were split 70/30 into training and test sets, and the models were optimized via 10-fold cross-validation. All models achieved high predictive accuracy, with the ensemble methods outperforming the others. XGBoost yielded the best performance (R² = 0.98; RMSE = 0.06), followed closely by Random Forest (R² = 0.95). The linear Ridge model showed the lowest, yet still strong, performance (R² = 0.90) and larger errors at extreme LSI values. Feature importance analysis consistently identified pH as the most influential predictor of the LSI, followed by alkalinity and calcium. Partial dependence plots confirmed that the models captured the established nonlinear behavior of the LSI, which rises steeply with increasing pH and moderately with mineral content. Overall, this comparative study demonstrates that modern machine learning models can predict the LSI accurately while providing interpretable insights through feature importance and partial dependence plots. These results underscore the potential of data-driven approaches to complement traditional water stability indices for proactive water quality management.
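For reference, the traditional index that the models are trained to reproduce is defined as LSI = pH - pHs, where pHs is the pH at which the water would be saturated with calcium carbonate. The sketch below uses a widely quoted empirical approximation for pHs. It is not taken from the study: the function name `langelier_index`, its parameter names, and the sample water analysis are illustrative assumptions.

```python
import math


def langelier_index(ph, tds_mg_l, temp_c, calcium_caco3_mg_l, alkalinity_caco3_mg_l):
    """Approximate LSI = pH - pHs (hypothetical helper, not the authors' code).

    Calcium hardness and alkalinity are expected in mg/L as CaCO3,
    total dissolved solids in mg/L, and temperature in degrees Celsius.
    """
    a = (math.log10(tds_mg_l) - 1.0) / 10.0            # dissolved-solids term
    b = -13.12 * math.log10(temp_c + 273.0) + 34.55    # temperature term
    c = math.log10(calcium_caco3_mg_l) - 0.4           # calcium hardness term
    d = math.log10(alkalinity_caco3_mg_l)              # alkalinity term
    ph_s = (9.3 + a + b) - (c + d)                     # saturation pH
    return ph - ph_s


# Illustrative (made-up) water analysis: negative LSI suggests corrosive
# water, positive LSI suggests a tendency to form scale.
lsi = langelier_index(ph=7.5, tds_mg_l=320.0, temp_c=25.0,
                      calcium_caco3_mg_l=150.0, alkalinity_caco3_mg_l=34.0)
print(f"LSI = {lsi:.2f}")
```

In this approximation the index is linear in pH but logarithmic in calcium and alkalinity, which is consistent with the steep pH response and the more moderate mineral-content response reported for the fitted models.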
Vesković et al. studied this question.
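The evaluation protocol named in the abstract (a 70/30 split, 10-fold cross-validation, z-score normalization, R² and RMSE) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' pipeline: all function names are hypothetical, and running the folds on the training portion only and fitting the normalization statistics on the training data are assumptions that the abstract does not spell out.

```python
import math
import random


def train_test_split_indices(n, test_fraction=0.3, seed=0):
    """Shuffle row indices and return (train_idx, test_idx)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold_indices(train_idx, k=10):
    """Yield (fit_idx, val_idx) pairs; each fold serves once as validation."""
    folds = [train_idx[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, folds[i]


def zscore_fit(values):
    """Return the mean and population standard deviation of one feature."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std


def zscore_apply(values, mean, std):
    """Standardize values with statistics estimated elsewhere (e.g. on training data)."""
    return [(v - mean) / std for v in values]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

Estimating the mean and standard deviation on the training rows alone, and then reusing them for validation and test rows, keeps information from held-out samples out of the fitted model.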