What question did this study set out to answer?

This research aims to enhance the prediction accuracy of soil organic carbon content by optimizing data preprocessing and hyperparameter tuning using large language models.

April 1, 2026 · Open Access

Optimizing Data Preprocessing and Hyperparameter Tuning for Soil Organic Carbon Content Prediction Using Large Language Models: A Case Study of the Black Soil and Windblown Sandy Soil Regions in Northeast China

Key Points

This research aims to enhance the prediction accuracy of soil organic carbon content by optimizing data preprocessing and hyperparameter tuning using large language models.
Developed an intelligent modeling workflow driven by large language models.
Optimized data preprocessing and hyperparameter tuning for random forest modeling.
Evaluated the model in two regions: black soil and windblown sandy soil.
Achieved sample retention rates of 55.33% and 61.90% in the two regions, versus 56.2% and 59.3% under traditional hard-coded rules.
Reduced mean soil organic carbon content deviations to 30.27% and 20.05%.
Obtained R² values of 0.394 and 0.694 in the black soil and windblown sandy soil regions, respectively, while improving computational efficiency by over 95%.
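
The key points report three kinds of figures: sample retention rate, R², and (in the abstract) root mean square error. As a minimal sketch of how such figures are conventionally computed, the following uses the standard textbook definitions. The function names and the sample values are illustrative assumptions and are not taken from the study, which does not state its exact formulas.

```python
import math


def retention_rate(n_retained, n_total):
    """Percentage of samples kept after preprocessing (standard definition)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_retained / n_total


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up SOC values in g/kg, for illustration only (not study data).
    observed = [10.0, 20.0, 30.0, 40.0]
    predicted = [12.0, 18.0, 33.0, 37.0]
    print(f"RMSE: {rmse(observed, predicted):.2f} g/kg")
    print(f"R^2:  {r_squared(observed, predicted):.3f}")
```

Under these definitions, an R² of 0.694 means the model explains roughly 69% of the variance in observed SOC content on the test set, while an RMSE of 6.07 g/kg is expressed in the same units as the SOC measurements.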

Abstract

Soil organic carbon (SOC) content prediction currently faces two problems. Data preprocessing relies on expert experience to formulate fixed rules, which leads to a lack of uniform standards and insufficient consideration of regional soil heterogeneity, and hyperparameter tuning suffers from high computational costs and excessively long runtimes. To address these problems, this study proposes an intelligent modeling workflow driven by large language models (LLMs). The workflow optimizes two key aspects of SOC Random Forest modeling: data preprocessing and hyperparameter tuning. The LLM-defined rules achieved sample retention rates of 55.33% and 61.90% in the two regions, respectively, a larger difference between regions than under traditional hard-coded rules (56.2% and 59.3%), and the mean SOC content deviations (30.27% and 20.05%) were both lower than those of traditional hard-coding. At the same time, the mean SOC content values in both regions closely matched those obtained with other methods, indicating that the LLM effectively captured regional soil differences. With only a single hyperparameter evaluation, the adaptive model achieved test set R² values of 0.394 and 0.694 in the black soil region and the windblown sandy soil region, respectively, with root mean square error (RMSE) values of 8.76 g/kg and 6.07 g/kg. This performance is comparable to that of Grid Search and Random Search, while computational efficiency improved by over 95%. The LLM-optimized Random Forest was also compared with eXtreme Gradient Boosting (XGBoost) and Partial Least Squares Regression (PLSR); its results (R² = 0.394 and RMSE = 8.76 g/kg in the black soil region, R² = 0.694 and RMSE = 6.07 g/kg in the windblown sandy soil region) demonstrate practical application value.
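
The abstract does not state how the efficiency gain of over 95% was measured. As a rough sketch under one possible reading, assume cost is counted in the number of hyperparameter configurations evaluated: a grid search evaluates every combination in the grid, whereas the LLM-driven workflow evaluates a single suggested configuration. The grid values below are hypothetical (only the parameter names correspond to real scikit-learn `RandomForestRegressor` arguments), and the helper functions are illustrative, not the authors' method.

```python
def count_configurations(param_grid):
    """Number of hyperparameter combinations an exhaustive grid search evaluates."""
    n_configs = 1
    for values in param_grid.values():
        n_configs *= len(values)
    return n_configs


def evaluation_reduction(baseline_evals, proposed_evals):
    """Percentage reduction in evaluations relative to the baseline."""
    if baseline_evals <= 0:
        raise ValueError("baseline_evals must be positive")
    return 100.0 * (1.0 - proposed_evals / baseline_evals)


# Hypothetical Random Forest grid; these values are NOT from the study.
HYPOTHETICAL_GRID = {
    "n_estimators": [100, 300, 500],
    "max_depth": [5, 10, 20, None],
    "min_samples_leaf": [1, 2, 4],
}

if __name__ == "__main__":
    grid_evals = count_configurations(HYPOTHETICAL_GRID)  # 3 * 4 * 3 = 36
    reduction = evaluation_reduction(grid_evals, proposed_evals=1)
    print(f"Grid search configurations: {grid_evals}")
    print(f"Reduction with a single evaluation: {reduction:.1f}%")
```

Under this assumption, a single evaluation saves more than 95% of the work whenever the baseline search would evaluate more than 20 configurations, since 1 - 1/20 = 0.95. Wall-clock savings in practice would also depend on the cost of the LLM call and on cross-validation settings, neither of which is specified here.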

Cui et al. studied this question.

synapsesocial.com/papers/69ccb62016edfba7beb87c5b https://doi.org/10.3390/app16073349
