Accurately predicting protein-protein interactions (PPIs) in dimeric complexes remains a fundamental challenge in computational biology. Although existing PPI prediction models, such as AlphaFold-Multimer (AF-Multimer) and AlphaFold3 (AF3), have achieved impressive performance, their accuracy is still limited by the scarcity of protein dimer structures, whose collection is both expensive and labor-intensive. Here, we introduce a simple yet effective pre-training method, termed split and merge proxy (SMP), that for the first time leverages abundant monomeric proteins to simulate various PPI tasks. Specifically, SMP constructs pseudo-dimers by splitting each monomer into two subunits, referred to as the pseudo-receptor and the pseudo-ligand, and trains models to merge them back by predicting their pseudo-interactions (e.g., contacts or docking poses). This proxy task enables large-scale pre-training without additional data-collection cost. Models pre-trained with SMP and subsequently fine-tuned on real protein dimer datasets show consistently improved accuracy and generalization across multiple benchmarks, surpassing strong baselines. Notably, SMP delivers more accurate structure predictions than both AF-Multimer and AF3 on several CASP15 dimer targets. Our findings highlight SMP as a scalable strategy for harnessing monomeric data to advance protein complex modeling and offer insight into the relationship between monomers and multimers.

Accurate prediction of protein-protein interactions is limited by the scarcity of high-quality complex structures. Here, the authors introduce SMP, a strategy that leverages pseudo-dimers derived from monomers to improve accuracy and generalization across diverse protein interaction applications.
Du et al. (Sat,) studied this question.
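To make the split-and-merge idea concrete, the following is a minimal sketch of how a pseudo-dimer and its contact labels might be derived from a single monomer. It is an illustration under stated assumptions, not the authors' implementation: the function names `split_monomer` and `pseudo_contact_map` are hypothetical, the split is assumed to be a single contiguous cut along the sequence, residues are represented by one coordinate each (for example the C-alpha atom), and the 8 Å contact cutoff is an assumed, commonly used value rather than one stated in the abstract.

```python
import numpy as np


def split_monomer(coords, split_idx):
    """Split an (N, 3) array of residue coordinates into two contiguous
    pseudo-chains: a pseudo-receptor and a pseudo-ligand.

    Assumption: one sequence-contiguous cut; the actual SMP splitting
    rule may differ.
    """
    if not 0 < split_idx < len(coords):
        raise ValueError("split_idx must leave at least one residue on each side")
    return coords[:split_idx], coords[split_idx:]


def pseudo_contact_map(receptor, ligand, cutoff=8.0):
    """Return a binary (n_receptor, n_ligand) map that is 1 where a
    receptor residue lies within `cutoff` angstroms of a ligand residue.

    Assumption: the 8 A cutoff is illustrative.
    """
    # Broadcast to pairwise differences of shape (n_receptor, n_ligand, 3).
    diff = receptor[:, None, :] - ligand[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return (dist < cutoff).astype(np.int8)


# Toy monomer: six residues on a straight line, 3.8 A apart.
coords = np.zeros((6, 3))
coords[:, 0] = 3.8 * np.arange(6)

receptor, ligand = split_monomer(coords, split_idx=3)
labels = pseudo_contact_map(receptor, ligand)
print(labels)
```

In a pre-training setup along these lines, `labels` would serve as the target for a contact-prediction model that receives the two pseudo-chains as separate inputs; because the labels come directly from the monomer structure, no additional experimental dimer data is needed.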