What question did this study set out to answer?

The study aims to develop an efficient and interpretable framework for protein classification based on hierarchical feature extraction.

February 13, 2026

Rule-Based Protein Classification through Multi-Phase Feature Extraction Technique

Key Points

Employs a multi-phase rule-based classification framework.
Integrates features across four phases: chemical properties, hydropathy, sequence patterns, and functional attributes.
Utilizes a min-max rule evaluation for efficient candidate pruning.
Applies 5-fold, 25-fold, and 45-fold cross-validation to assess performance.
Achieves an average classification accuracy of 97.66% on 8,555 human protein sequences.
Confirms performance stability under class-imbalanced conditions through cross-validation.
Demonstrates robust generalization with 96.11% accuracy on zebrafish and 94.89% on mouse datasets.
Outperforms machine-learning-based and transformer-based methods in accuracy and efficiency.
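
The min-max rule evaluation and phase-wise pruning named in the points above can be illustrated with a minimal sketch. It is not the authors' implementation. The data layout (one list of feature vectors per sample, one vector per phase), the function names build_knowledge and classify, and the behavior when zero or several families survive are all assumptions made for illustration; the study summary does not specify them.

```python
from typing import Dict, List, Tuple

Bounds = List[Tuple[float, float]]  # one (min, max) pair per feature


def build_knowledge(training: Dict[str, List[List[List[float]]]]) -> Dict[str, List[Bounds]]:
    """Derive per-family, per-phase (min, max) bounds from training samples only.

    training[family] is a list of samples; each sample is a list of
    per-phase feature vectors (hypothetical layout).
    """
    knowledge: Dict[str, List[Bounds]] = {}
    for family, samples in training.items():
        phases: List[Bounds] = []
        for p in range(len(samples[0])):
            # Transpose so that each column holds one feature across all samples.
            columns = zip(*(sample[p] for sample in samples))
            phases.append([(min(col), max(col)) for col in columns])
        knowledge[family] = phases
    return knowledge


def classify(sample: List[List[float]],
             knowledge: Dict[str, List[Bounds]]) -> Tuple[List[str], int]:
    """Prune candidate families phase by phase and stop early when possible.

    Returns the surviving families and the number of phases evaluated.
    """
    candidates = list(knowledge)
    for p, features in enumerate(sample):
        # A family survives a phase only if every feature lies within its bounds.
        candidates = [
            family for family in candidates
            if all(lo <= x <= hi for x, (lo, hi) in zip(features, knowledge[family][p]))
        ]
        if len(candidates) <= 1:  # early exit: decided, or nothing left to test
            return candidates, p + 1
    return candidates, len(sample)


# Toy data with two phases; all values are invented for illustration.
training = {
    "kinase": [[[0.2, 0.5], [1.0]], [[0.4, 0.7], [1.5]]],
    "globin": [[[0.3, 0.6], [3.0]], [[0.5, 0.9], [3.5]]],
    "protease": [[[0.8, 0.1], [2.0]], [[0.9, 0.2], [2.2]]],
}
knowledge = build_knowledge(training)
print(classify([[0.85, 0.15], [9.9]], knowledge))  # decided after the first phase
print(classify([[0.35, 0.65], [3.2]], knowledge))  # needs the second phase
```

In this sketch the first query matches only one family's bounds in phase one, so the later phase is never evaluated; that is the efficiency argument behind early termination. The second query is ambiguous after phase one and is resolved in phase two.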

Abstract

Protein sequence classification is a fundamental step toward functional annotation and biological analysis; however, most existing approaches rely on computationally expensive models or on flat feature integration with limited interpretability. This paper proposes a multi-phase rule-based protein classification framework that hierarchically integrates biologically meaningful features and employs an early-exit decision strategy. Feature extraction is performed across four progressive phases comprising chemical properties, hydropathy characteristics, behavioral sequence patterns through n-gram and six-letter exchange encoding, and functional attributes via normalized distance-based encoding. Family-specific knowledge matrices constructed exclusively from training data enable deterministic min-max rule evaluation at each phase for efficient candidate pruning and early termination. Experiments conducted on 8,555 human protein sequences spanning ten families achieve an average classification accuracy of 97.66%. 5-fold, 25-fold, and 45-fold cross-validation statistically confirm performance stability under class-imbalanced conditions. Further cross-species validation on zebrafish and mouse protein datasets yields accuracies of 96.11% and 94.89%, respectively, demonstrating robust generalization across diverse biological distributions. Comparative evaluation against machine-learning-based and transformer-based baselines shows that the proposed framework achieves a superior accuracy-efficiency trade-off while maintaining an overall time complexity of O (n n).
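
As a rough illustration of the sequence-pattern phase, the sketch below combines six-letter exchange encoding with n-gram frequencies. The abstract does not list the exchange groups, so the grouping used here is an assumption: it is the one commonly used in the protein classification literature. The group labels "1" to "6", the "0" placeholder for non-standard residues, and the function names are likewise hypothetical.

```python
from collections import Counter
from typing import Dict

# Assumed six-letter exchange groups (commonly used grouping, not confirmed by the source).
EXCHANGE_GROUPS = {
    "1": "HRK",
    "2": "DENQ",
    "3": "C",
    "4": "STPAG",
    "5": "MILV",
    "6": "FYW",
}
RESIDUE_TO_GROUP = {
    residue: group
    for group, members in EXCHANGE_GROUPS.items()
    for residue in members
}


def exchange_encode(sequence: str) -> str:
    """Replace each residue with its exchange-group label ("0" if unrecognized)."""
    return "".join(RESIDUE_TO_GROUP.get(residue, "0") for residue in sequence.upper())


def ngram_frequencies(encoded: str, n: int = 2) -> Dict[str, float]:
    """Return the relative frequency of every overlapping n-gram in the string."""
    grams = [encoded[i:i + n] for i in range(len(encoded) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


encoded = exchange_encode("MKVLA")
print(encoded)                        # 51554
print(ngram_frequencies(encoded, 2))
```

Collapsing 20 amino acids into six groups shrinks the n-gram vocabulary from 20^n to 6^n entries, which keeps the resulting feature vectors small enough for simple range-based rules.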
