What question did this study set out to answer?

The study aims to develop an effective framework for automating enzyme classification using dimensionality-reduced sequence features.

April 10, 2026

EC-Design: A Robust Framework for Enzyme Function Prediction Using Dimensionality-Reduced Sequence Features

Key Points

Utilized principal component analysis and Fisher Score-based feature selection
Analyzed 134,153 validated enzyme sequences
Compared k-nearest neighbors with six other machine learning methods
Evaluated model performance through accuracy, macro-F1, and AUC metrics
k-NN achieved a top accuracy of 74.59% and a macro-F1 score of 0.6859
Model demonstrated robust generalization with 74.37 ± 0.49% accuracy
Dipeptide patterns containing asparagine and glycine were identified as key features
High precision (>0.83) for majority classes (EC1-EC3), high recall (>0.83) for minority classes (EC4-EC7)
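
The gap between the reported accuracy and macro-F1 is easier to read with the definitions in hand. The sketch below is a minimal, generic illustration of the two metrics on a made-up, imbalanced label set; it is not the authors' evaluation code, and the labels and function names are assumptions chosen for illustration only.

```python
from collections import Counter

def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for c in labels:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f1)
    return scores

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so small classes count as much as large ones."""
    scores = per_class_prf(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)

# Toy imbalanced labels (hypothetical, not data from the study).
y_true = ["EC1"] * 8 + ["EC7"] * 2
y_pred = ["EC1"] * 8 + ["EC1", "EC7"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Because macro-F1 averages over classes rather than samples, one missed minority-class sample lowers it far more than it lowers accuracy, which is why both figures are worth reporting on imbalanced data.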

Abstract

Automated enzyme classification is hindered by high-dimensional feature spaces and extreme class imbalance. In this study, we introduce EC-Design, a machine learning framework that utilizes principal component analysis (PCA) and Fisher Score-based feature selection on 134,153 validated sequences. Challenging the perceived superiority of ensemble methods, k-nearest neighbors (k-NN) achieved a top accuracy of 74.59% and a macro-F1 of 0.6859, significantly outperforming six comparative machine learning methods, including random forest, ensemble bagging, linear discriminant analysis, logistic regression, support vector machine, and multilayer perceptron. The model demonstrated robust generalization (74.37 ± 0.49%) and excellent discriminative power (mean AUC = 0.937). Performance analysis revealed a distinct trade-off: majority classes (EC1-EC3) exhibited high precision (>0.83), while minority classes (EC4-EC7) achieved high recall (>0.83). Feature importance analysis revealed that dipeptide patterns containing asparagine and glycine serve as key discriminative features, which aligns with established catalytic motifs and provides biologically interpretable insights into enzyme function. The EC-Design framework establishes instance-based learning as an efficient and accurate approach for large-scale enzyme annotation, offering a transparent alternative to complex ensemble models.
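
The abstract names dipeptide patterns and Fisher Score-based feature selection but does not spell out how either is computed. The sketch below shows one common textbook formulation of both, assuming dipeptide features are normalized frequencies over the 20 standard amino acids and that the Fisher score is the ratio of between-class to within-class scatter per feature. The function names and toy inputs are hypothetical, and the authors' actual pipeline may differ.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 features

def dipeptide_composition(seq):
    """Frequency of each of the 400 dipeptides in a protein sequence."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]

def fisher_scores(X, y):
    """Per-feature Fisher score: between-class scatter / within-class scatter."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall_mean = sum(column) / len(column)
        between = within = 0.0
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            mean_c = sum(values) / len(values)
            var_c = sum((v - mean_c) ** 2 for v in values) / len(values)
            between += len(values) * (mean_c - overall_mean) ** 2
            within += len(values) * var_c
        # A constant feature has zero within-class scatter; score it 0.
        scores.append(between / within if within > 0 else 0.0)
    return scores

# Toy usage with made-up sequences and labels (not data from the study).
sequences = ["MNGNGA", "ANGNGK", "MKLLVA", "AKLLVV"]
labels = ["EC1", "EC1", "EC2", "EC2"]
X = [dipeptide_composition(s) for s in sequences]
ranked = sorted(zip(fisher_scores(X, labels), DIPEPTIDES), reverse=True)
print(ranked[:5])  # highest-scoring dipeptides on this toy set
```

Features with high scores separate the classes well relative to their spread within each class, so keeping only the top-ranked dipeptides is one way to shrink the 400-dimensional input before a distance-based classifier such as k-NN.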

Authors

Huanghui Xia

Hua Xia

Feng Qi

Journals

Journal of Chemical Information and Modeling

Institutions

Ministry of Education of the People's Republic of China

Fujian Normal University
