What question did this study set out to answer?

This research aims to evaluate the effectiveness of various machine learning models in predicting the Langelier Saturation Index (LSI) from groundwater characteristics.

January 22, 2026 · Open Access

Comparative Analysis of Machine Learning Models for Prediction of Langelier Saturation Index in Groundwater of a River Basin

Key points

Applied five machine learning models: Ridge Regression, Support Vector Machine, Random Forest, Deep Neural Network, and XGBoost.
Performed data preprocessing including outlier removal and z-score normalization.
Optimized models using 10-fold cross-validation with a 70/30 train-test split.
Conducted feature importance analysis to identify key predictors of LSI.
XGBoost demonstrated the highest predictive accuracy (R² = 0.98; RMSE = 0.06).
Random Forest closely followed with strong performance (R² = 0.95).
Ridge Regression, while having the lowest performance (R² = 0.90), still showed strong prediction capabilities.
pH was consistently identified as the most important predictor of LSI, followed by alkalinity and calcium.
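
The preprocessing and evaluation steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's code: the function names, the random seed, and the way folds are assigned are all hypothetical, and the source does not say which libraries were used.

```python
import math
import random
from statistics import mean, pstdev


def zscore(values):
    """Rescale one feature column to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - mu) / sigma for v in values]


def train_test_split(n_samples, test_fraction=0.3, seed=0):
    """Return shuffled (train, test) index lists; 0.3 gives a 70/30 split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # seed is an arbitrary assumption
    n_test = round(n_samples * test_fraction)
    return idx[n_test:], idx[:n_test]


def kfold(indices, k=10):
    """Yield (fit, validate) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, held_out


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (LSI)."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A common design choice, which the source neither confirms nor rules out, is to estimate the z-score mean and standard deviation on the training portion only, so that the held-out 30% does not influence the scaling.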

Abstract

Accurate prediction of the Langelier Saturation Index (LSI), an indicator of water’s scaling and corrosive potential, is vital for water treatment and infrastructure maintenance. In this study, five machine learning models (Ridge Regression, Support Vector Machine, Random Forest, Deep Neural Network, and XGBoost) were applied to predict the LSI from physicochemical characteristics of groundwater in the Morava River basin (Serbia). Rigorous data preprocessing (outlier removal, missing data handling, z-score normalization) and feature selection were performed to ensure robust model training. Models were optimized via 10-fold cross-validation on a 70/30 train–test split. All models achieved high predictive accuracy, with ensemble methods outperforming others. XGBoost yielded the best performance (R² = 0.98; RMSE = 0.06), followed closely by Random Forest (R² = 0.95). The linear Ridge model showed the lowest (yet still strong) performance (R² = 0.90) and larger errors at extreme LSI values. Feature importance analysis consistently identified pH as the most influential predictor of the LSI, followed by alkalinity and calcium. Partial dependence plots confirmed that the models captured established nonlinear LSI behavior: the LSI rises steeply with increasing pH and moderately with mineral content. Overall, this comparative study demonstrates that modern machine learning models can predict the LSI accurately while providing interpretable insights through feature importance and dependence plots. These results underscore the potential of data-driven approaches to complement traditional water stability indices for proactive water quality management.
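
For context on the target variable, the LSI is defined as the measured pH minus the saturation pH (pHs) of calcium carbonate. The sketch below uses a widely cited simplified approximation of pHs. It is an assumption that this form is relevant here, since the source does not state which LSI formulation the authors used, and the function name, parameter names, and sample values are hypothetical.

```python
import math


def langelier_saturation_index(ph, tds_mg_l, calcium_mg_l_caco3,
                               alkalinity_mg_l_caco3, temp_c):
    """Return LSI = pH - pHs using a common simplified estimate of pHs.

    Calcium hardness and alkalinity are expressed in mg/L as CaCO3,
    total dissolved solids (TDS) in mg/L, temperature in degrees Celsius.
    """
    a = (math.log10(tds_mg_l) - 1.0) / 10.0            # TDS term
    b = -13.12 * math.log10(temp_c + 273.0) + 34.55    # temperature term
    c = math.log10(calcium_mg_l_caco3) - 0.4           # calcium term
    d = math.log10(alkalinity_mg_l_caco3)              # alkalinity term
    ph_s = (9.3 + a + b) - (c + d)                     # saturation pH
    return ph - ph_s


# Illustrative inputs only; these values are not taken from the study.
sample = langelier_saturation_index(
    ph=7.5, tds_mg_l=320.0, calcium_mg_l_caco3=150.0,
    alkalinity_mg_l_caco3=34.0, temp_c=25.0,
)
print(f"LSI = {sample:.2f}")
```

With these inputs the result is negative (roughly -0.73), which indicates water undersaturated with calcium carbonate and therefore tending toward corrosion; positive values indicate a tendency to form scale. In this approximation pH, alkalinity, and calcium all enter the index directly, which matches the three predictors the feature importance analysis ranked highest.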
