The solubility of a chemical compound depends strongly on both its molecular structure and the experimental conditions under which it is measured, which makes accurate prediction across aqueous and organic media challenging. In this study, an interpretable machine learning framework was developed for predicting the solubility of drug-like compounds using large-scale curated data sets, including AqSolDB, AqSolDBc, BigSolDB, and BigSolDB 2.0. CatBoost models were trained and validated using repeated 5-fold cross-validation, with scaffold-based splitting for aqueous solubility and cold solute–solvent pair splitting for organic solubility. Feature selection and hyperparameter tuning were applied systematically, and hyperparameter optimization emerged as the primary contributor to performance improvement. The optimized models showed a strong and statistically significant improvement in performance over baseline configurations. Similarity-based domain-of-applicability analysis on the external data set revealed that prediction error increases with structural dissimilarity from the training set. SHAP analysis provided mechanistic insights, showing that aqueous solubility is governed primarily by polarity and hydrogen-bonding descriptors, whereas organic solubility is influenced by solvent characteristics, temperature, and molecular topology. The consistency of results across data sets demonstrates the robustness and transferability of the learned structure–property relationships and supports the applicability of the proposed approach to real-world solubility prediction tasks.
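The abstract does not describe how the splits were implemented, so the following is only a minimal sketch of what "cold" pair splitting in repeated 5-fold cross-validation could look like: every measurement that shares a solute–solvent pair is kept in the same fold, so no pair seen in testing was seen in training. The function name `repeated_group_kfold`, the greedy fold balancing, and the example records are all assumptions made for illustration, not the authors' code or data.

```python
import random
from collections import defaultdict


def repeated_group_kfold(groups, n_splits=5, n_repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx), keeping each group in one fold.

    `groups` holds one hashable key per sample, e.g. a (solute, solvent) tuple.
    Hypothetical helper written for this sketch.
    """
    members = defaultdict(list)
    for idx, key in enumerate(groups):
        members[key].append(idx)
    if len(members) < n_splits:
        raise ValueError("need at least n_splits distinct groups")

    rng = random.Random(seed)
    for repeat in range(n_repeats):
        keys = sorted(members, key=repr)  # fixed order before shuffling
        rng.shuffle(keys)                 # a new random assignment per repeat
        # Stable sort: largest groups first, ties keep the shuffled order.
        keys.sort(key=lambda k: len(members[k]), reverse=True)
        folds = [[] for _ in range(n_splits)]
        for key in keys:
            # Greedy balancing: put the group into the currently smallest fold.
            min(folds, key=len).extend(members[key])
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(groups)) if i not in held_out]
            yield repeat, fold, train_idx, sorted(test_idx)


# Made-up records for illustration: (solute, solvent, temperature in K).
records = [
    ("aspirin", "ethanol", 298.15),
    ("aspirin", "ethanol", 308.15),
    ("aspirin", "acetone", 298.15),
    ("caffeine", "ethanol", 298.15),
    ("caffeine", "water", 298.15),
    ("caffeine", "water", 318.15),
    ("ibuprofen", "acetone", 298.15),
    ("ibuprofen", "methanol", 298.15),
    ("paracetamol", "methanol", 298.15),
    ("paracetamol", "methanol", 313.15),
]
pair_keys = [(solute, solvent) for solute, solvent, _ in records]
splits = list(repeated_group_kfold(pair_keys, n_splits=5, n_repeats=2, seed=42))
```

The same helper would cover a scaffold-based split if the group key were a per-molecule scaffold identifier (for example a Bemis-Murcko scaffold string computed with a cheminformatics toolkit) instead of a solute–solvent tuple. Computing scaffolds is outside the scope of this sketch.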
Boinapalli Gopichand
Amrita Vishwa Vidyapeetham
GM Nair
Amrita Vishwa Vidyapeetham
B Amba Nair
Amrita Vishwa Vidyapeetham
ACS Omega
Amrita Vishwa Vidyapeetham
Gopichand et al. (Wed,) studied this question.
synapsesocial.com/papers/6a08093ca487c87a6a40b217 — DOI: https://doi.org/10.1021/acsomega.5c13630