Foundation models require massive amounts of training data, which are often costly to obtain in materials science. Meanwhile, long-established physical knowledge, such as molecular force fields and geometric analysis, provides direct guidance on material behavior but remains insufficiently leveraged. Here we demonstrate that expert knowledge can directly supervise pre-training and substantially reduce data requirements. A set of potential energy surface (PES) basis functions, which encode guest-host interaction energetics, is developed as a unified descriptor for different guest molecules. A multi-modal architecture is designed to fuse information from both the material structure and the PES. Pre-training is achieved by learning comprehensive geometric features spanning different spatial scales. The result is SpbNet, a foundation model for porous materials developed in a limited-data regime. SpbNet is evaluated on over 50 downstream tasks covering adsorption, separation, and intrinsic properties. It consistently outperforms models pre-trained on datasets nearly 20 times larger, reducing relative errors by over 20%. In addition, SpbNet generalizes well to both in-distribution and out-of-distribution materials, including metal-organic frameworks, covalent organic frameworks, and zeolites.
Zou et al. studied this question.
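To make the idea of guest-independent PES basis functions concrete, the following is a minimal sketch and not the authors' implementation. It assumes a Lennard-Jones 12-6 force field, an orthorhombic unit cell, and a single (epsilon, sigma) pair shared by all host atoms; the abstract specifies none of these. The function names `pes_basis_grids` and `lj_energy` and the parameters `n_grid` and `r_min` are hypothetical. Under these assumptions, the two grids depend only on the host structure, and the energy surface for any particular guest is a linear combination of them.

```python
import numpy as np


def pes_basis_grids(host_frac, cell_lengths, n_grid=8, r_min=0.8):
    """Sum r^-12 and r^-6 over host atoms at every point of a regular grid.

    host_frac    : (N, 3) fractional coordinates of host atoms
    cell_lengths : (3,) edge lengths of an orthorhombic cell (assumption)
    r_min        : distances are clamped to this value to avoid division by zero
    Returns two arrays of shape (n_grid, n_grid, n_grid).
    """
    host_frac = np.asarray(host_frac, dtype=float)
    cell = np.asarray(cell_lengths, dtype=float)

    # Regular grid in fractional coordinates, shape (n, n, n, 3).
    ticks = np.arange(n_grid) / n_grid
    grid_frac = np.stack(np.meshgrid(ticks, ticks, ticks, indexing="ij"), axis=-1)

    # Grid-to-atom displacement with the minimum-image convention.
    # This is a simplification: only the nearest periodic image is counted.
    delta = grid_frac[..., None, :] - host_frac      # (n, n, n, N, 3)
    delta -= np.round(delta)
    r = np.linalg.norm(delta * cell, axis=-1)        # (n, n, n, N)
    r = np.maximum(r, r_min)

    repulsive = (r ** -12).sum(axis=-1)
    attractive = (r ** -6).sum(axis=-1)
    return repulsive, attractive


def lj_energy(repulsive, attractive, epsilon, sigma):
    """Combine the two basis grids into a guest-specific LJ energy grid."""
    return 4.0 * epsilon * (sigma ** 12 * repulsive - sigma ** 6 * attractive)


# Toy example: one host atom at the centre of a 10 x 10 x 10 cell.
rep, att = pes_basis_grids([[0.5, 0.5, 0.5]], [10.0, 10.0, 10.0], n_grid=4)
energy_guest_a = lj_energy(rep, att, epsilon=1.0, sigma=3.0)
energy_guest_b = lj_energy(rep, att, epsilon=2.0, sigma=3.0)
```

In this sketch the basis grids are computed once per material and reused for every guest, which is one way a single descriptor could serve different guest molecules. Per-element parameters with mixing rules would require one pair of grids per host atom type.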