Abstract

Structure-based drug discovery (SBDD) aims to identify novel molecules that bind to therapeutic protein targets. The vast chemical space and the limitations of traditional approaches make this task challenging. Recent generative AI models, such as those based on flow matching, can produce novel, pocket-conditioned molecular structures directly in three-dimensional space. However, most pocket-conditioned models in the literature are trained on structures derived from the Protein Data Bank (PDB), which contains structures of varying quality and with inconsistent annotation. Moreover, the PDB is enriched in cofactors and natural products and therefore poorly represents real-world SBDD scenarios. The relatively limited number of ligand series within the same pocket also hinders the model’s ability to learn protein-ligand interactions effectively. Here, for the first time, we report the results of training pocket-conditioned generative models on internal crystallography data from a large pharmaceutical company. We also investigate other key determinants of model performance, such as the inclusion of hydrogens and pretraining on unconditional data, and evaluate how each factor affects the quality of the generated ligands across diverse training settings. Our results provide practical guidelines for developing more effective 3D generative models for SBDD and highlight key directions for future research toward reliable, pocket-aware molecular design.
Wang et al. (Sat,) studied this question.