Soil organic carbon (SOC) content prediction currently faces two problems. Data preprocessing relies on fixed rules formulated from expert experience, which leads to a lack of uniform standards and insufficient consideration of regional soil heterogeneity, and hyperparameter tuning incurs high computational costs and excessively long runtimes. To address these problems, this study proposes an intelligent modeling workflow driven by a large language model (LLM). The workflow optimizes two key stages of Random Forest modeling of SOC: data preprocessing and hyperparameter tuning. Results: The LLM-defined rules achieved sample retention rates of 55.33% and 61.90% in the two regions, respectively, a larger inter-regional difference than that of the traditional hard-coded rules (56.2% and 59.3%), and the deviations in mean SOC content (30.27% and 20.05%) were both lower than those of traditional hard-coding. The mean SOC content in both regions was also close to that obtained with the other methods, indicating that the LLM effectively captured regional soil differences. With only a single hyperparameter evaluation, the adaptive model achieved test-set R² values of 0.394 and 0.694 in the black soil region and the aeolian sandy soil region, respectively, with root mean square error (RMSE) values of 8.76 g/kg and 6.07 g/kg. This performance is comparable to that of Grid Search and Random Search, while computational efficiency improved by over 95%. In performance comparisons with eXtreme Gradient Boosting (XGBoost) and Partial Least Squares Regression (PLSR), the LLM-optimized Random Forest, with the same test-set results (R² = 0.394 and RMSE = 8.76 g/kg in the black soil region; R² = 0.694 and RMSE = 6.07 g/kg in the aeolian sandy soil region), demonstrated practical application value.
Cui et al. studied this question.
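For readers unfamiliar with the quantities reported above, the following is a minimal sketch of how they are conventionally computed. It uses the standard definitions of sample retention rate, RMSE, and R²; it is not the authors' code. All function names are hypothetical, the toy values are illustrative rather than taken from the study, and `evaluation_savings` assumes that cost scales with the number of hyperparameter configurations evaluated, which the abstract does not state.

```python
import math


def retention_rate(n_retained, n_total):
    """Percentage of samples kept after preprocessing."""
    return 100.0 * n_retained / n_total


def rmse(y_true, y_pred):
    """Root mean square error, in the units of y (e.g. g/kg)."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def evaluation_savings(n_baseline_evals, n_proposed_evals=1):
    """Percentage reduction in evaluations relative to a baseline search.

    Assumption (not from the source): cost is proportional to the
    number of hyperparameter configurations evaluated.
    """
    return 100.0 * (1.0 - n_proposed_evals / n_baseline_evals)


if __name__ == "__main__":
    # Toy SOC values in g/kg, made up for illustration only.
    observed = [10.0, 20.0, 30.0]
    predicted = [12.0, 18.0, 33.0]
    print(f"RMSE = {rmse(observed, predicted):.3f} g/kg")
    print(f"R2 = {r2(observed, predicted):.3f}")
    print(f"Retention = {retention_rate(50, 80):.2f}%")
    print(f"Savings vs. 25 evaluations = {evaluation_savings(25):.1f}%")
```

Under the proportional-cost assumption, a single evaluation saves more than 95% only when the baseline search evaluates more than 20 configurations, which gives a rough sense of the search size implied by the reported efficiency gain.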