What question did this study set out to answer?

The aim is to create a reliable framework for predicting enzyme functions based on experimental data.

April 10, 2026 | Open Access

Data Foundations for Functional Prediction in Enzyme Engineering

Key Points

Developed a pipeline for transforming protein engineering experiments into sequence-function records.
Established assay-relevant labels using a stereodivergent biocatalytic platform for azaspirocycle synthesis.
Created LevSeq, a cost-effective genotyping workflow for high-throughput analysis.
Introduced EnzEngDB, a standardized database for sequence and assay information.
Demonstrated the effectiveness of application-relevant labels in predicting enzyme performance.
LevSeq reports quality-control metrics that keep sequence assignment reliable across both hits and non-hits.
EnzEngDB facilitates better data aggregation and querying across different enzyme engineering campaigns.
The study hypothesizes that generalizing predictions to unseen reactions will require systematically expanding assay-defined labels across diverse transformations.

Abstract

Despite transformative progress in protein structure prediction and de novo design, accurate prediction of enzyme function remains elusive. The central barrier is not a lack of sequences or structures, but a lack of learnable, experimentally grounded labels: in enzyme engineering, function is defined by assay context and deployment constraints, yet most available sequence–function datasets are sparse, biased toward successes, incompletely genotyped, and inconsistently annotated. As a result, models that appear strong on curated benchmarks often fail to guide real engineering decisions, especially for new-to-nature transformations and out-of-distribution substrates. This thesis advances a data-centered strategy for functional prediction by building a practical pipeline that turns routine protein engineering experiments into prediction-ready sequence–function records. Chapter 2 establishes the importance of assay-defined, application-relevant labels through a stereodivergent biocatalytic platform for azaspirocycle synthesis, where directed evolution of carbene transferases enables access to azaspiro[2.y]alkane scaffolds with tunable stereochemical outcomes, quantified by deployment-aligned measurements including turnover, enantioselectivity, and diastereoselectivity. Chapter 3 develops LevSeq, a high-throughput and cost-effective genotyping and analysis workflow that assigns full-length variant identities at screening scale using dual barcoding and nanopore long reads, and that reports practical quality-control metrics to prevent silent failure modes and preserve reliable sequence assignment across both hits and non-hits. Chapter 4 introduces EnzEngDB, a curated database and data model that standardizes sequences, reaction context, assay conditions, units, quality metrics, and provenance so datasets can be queried, filtered, and aggregated across campaigns, while retaining negative and neutral outcomes as informative constraints.
Chapter 5 closes the loop by using the resulting data substrate to support decision-making in de novo enzyme validation. It emphasizes that predictive ranking is currently most reliable within reactions with dense standardized coverage, and hypothesizes that generalization to unseen reactions will require systematic expansion of assay-defined labels across diverse transformations. Together, these chapters provide the tooling and framework needed to make sequence–function data compounding rather than ephemeral, enabling reliable functional prediction and iterative, model-guided enzyme design.
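
To make the idea of a standardized sequence–function record more concrete, here is a minimal Python sketch of what such a record and a cross-campaign query might look like. It is not the EnzEngDB schema: the class name `AssayRecord`, every field name, the `TTN` unit label, and the toy sequences and values are all hypothetical and chosen only for illustration. The point it tries to show is that inactive variants stay in the dataset as measurements, while records that fail quality control are filtered out.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class AssayRecord:
    """One hypothetical sequence-function record (all field names are assumptions)."""
    variant_id: str
    sequence: str            # full-length amino acid sequence (toy strings below)
    reaction: str            # reaction context label
    campaign: str            # provenance: which engineering campaign produced it
    measurement: str         # what was measured, e.g. "turnover"
    value: Optional[float]   # None when no reading was obtained
    unit: str                # unit label, assumed here to be "TTN"
    passed_qc: bool          # genotyping / assay quality flag


def usable(records, reaction, measurement):
    """Keep QC-passing records with a reading for one reaction and measurement.

    A value of 0.0 is kept: an inactive variant is data, not a gap.
    """
    return [
        r for r in records
        if r.reaction == reaction
        and r.measurement == measurement
        and r.passed_qc
        and r.value is not None
    ]


def mean_by_campaign(records):
    """Aggregate measurement values per campaign."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r.campaign].append(r.value)
    return {campaign: mean(values) for campaign, values in grouped.items()}


# Toy data, invented for this sketch.
records = [
    AssayRecord("v1", "MKT", "rxn_A", "c1", "turnover", 120.0, "TTN", True),
    AssayRecord("v2", "MKS", "rxn_A", "c1", "turnover", 0.0, "TTN", True),    # inactive, kept
    AssayRecord("v3", "MRT", "rxn_A", "c2", "turnover", 80.0, "TTN", True),
    AssayRecord("v4", "MKA", "rxn_A", "c2", "turnover", 300.0, "TTN", False),  # failed QC
    AssayRecord("v5", "MKG", "rxn_B", "c2", "turnover", 50.0, "TTN", True),    # other reaction
]

selected = usable(records, "rxn_A", "turnover")
summary = mean_by_campaign(selected)
print([r.variant_id for r in selected])
print(summary)
```

In this toy example the inactive variant `v2` lowers the mean for campaign `c1`, which is the intended behavior: dropping it would bias the aggregate toward successes.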

Authors

Yueming Long

Institutions

California Institute of Technology