This study develops a machine learning (ML) predictive model for identifying high-level clouds (HLCs). The model uses meteorological parameters as input features and is trained against human-recorded meteorological observations. The relationship between two independent methods of registering HLCs, lidar and meteorological observations, is analyzed statistically, and the optimal thresholds of total cloud cover at which the meteorological data are consistent with the lidar data are determined. The results show promising performance of ML models in identifying the links between weather conditions and the probability of HLC detection: ROC AUC (area under the receiver operating characteristic curve) values are 0.87–0.88 for the presence and 0.77–0.78 for the absence of clouds, with balanced Precision, Recall, and F1 scores. The XGBoost (eXtreme Gradient Boosting) model proved the most robust, effectively integrating data of various types for reliable prediction under varied conditions.
Penzin et al. (Mon,) studied this question.
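To make the ROC AUC figures quoted in the abstract concrete, the following minimal sketch computes the metric from its pairwise definition: the probability that a randomly chosen positive case (HLC recorded) receives a higher model score than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are hypothetical and chosen only for illustration; they are not data or code from the study.

```python
def roc_auc(labels, scores):
    """ROC AUC via the pairwise (Mann-Whitney) definition.

    labels: iterable of 0/1 values (assumed here: 1 = HLC recorded, 0 = no HLC).
    scores: model scores or probabilities, higher meaning "more likely HLC".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the study.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"ROC AUC = {auc:.3f}")
```

The double loop is quadratic in the number of samples, which is acceptable for illustration; for large datasets a rank-based implementation, such as the one provided by a standard ML library, would be the usual choice.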