Proteins, such as the specialized catalysts called enzymes, offer transformative potential for sustainable chemical synthesis, environmental remediation, and advanced therapeutics. However, engineering proteins for specific industrial or clinical functions remains a formidable challenge because the sequence design space is vast and high-dimensional. While directed evolution has enabled significant breakthroughs, it is often constrained by slow iteration cycles and by the requirement for a functional starting point. This thesis introduces machine learning frameworks, integrated with experimental workflows, that move beyond traditional enzyme engineering approaches. We first present Active Learning-Assisted Directed Evolution (ALDE), which employs Bayesian optimization to optimize protein properties more efficiently. Next, we introduce Contrastive Reaction-Enzyme Pretraining (CREEP) for the annotation and discovery of enzymes with desired "new-to-nature" functionalities. Finally, we demonstrate Steering Generation for Protein Optimization (SGPO), a new paradigm that unifies these two perspectives into a holistic generative framework for efficient protein engineering. Collectively, these innovations advance the transition toward automated, artificial intelligence-driven biomolecular design, unlocking sustainable synthesis, novel therapeutics, and programmable biology at the molecular level.
Jason Yang studied this question.