Despite transformative progress in protein structure prediction and de novo design, accurate prediction of enzyme function remains elusive. The central barrier is not a lack of sequences or structures, but a lack of learnable, experimentally grounded labels: in enzyme engineering, function is defined by assay context and deployment constraints, yet most available sequence–function datasets are sparse, biased toward successes, incompletely genotyped, and inconsistently annotated. As a result, models that appear strong on curated benchmarks often fail to guide real engineering decisions, especially for new-to-nature transformations and out-of-distribution substrates. This thesis advances a data-centered strategy for functional prediction by building a practical pipeline that turns routine protein engineering experiments into prediction-ready sequence–function records. Chapter 2 establishes the importance of assay-defined, application-relevant labels through a stereodivergent biocatalytic platform for azaspirocycle synthesis, where directed evolution of carbene transferases enables access to azaspiro[2.y]alkane scaffolds with tunable stereochemical outcomes, quantified by deployment-aligned measurements including turnover and enantio- and diastereoselectivity. Chapter 3 develops LevSeq, a high-throughput and cost-effective genotyping and analysis workflow that assigns full-length variant identities at screening scale using dual barcoding and nanopore long reads, and reports practical quality-control metrics to prevent silent failure modes and preserve reliable sequence assignment across both hits and non-hits. Chapter 4 introduces EnzEngDB, a curated database and data model that standardizes sequences, reaction context, assay conditions, units, quality metrics, and provenance so datasets can be queried, filtered, and aggregated across campaigns, while retaining negative and neutral outcomes as informative constraints.
Chapter 5 closes the loop by using the resulting data substrate to support decision-making in de novo enzyme validation. It emphasizes that predictive ranking is currently most reliable within reactions that have dense, standardized coverage, and hypothesizes that generalization to unseen reactions will require systematic expansion of assay-defined labels across diverse transformations. Together, these chapters provide the tooling and framework needed to make sequence–function data compounding rather than ephemeral, enabling reliable functional prediction and iterative, model-guided enzyme design.
Yueming Long
California Institute of Technology
Yueming Long (Wed,) studied this question.
www.synapsesocial.com/papers/69d8967d6c1944d70ce07fef — DOI: https://doi.org/10.7907/rqgx-xv16