Protein sequence classification is a fundamental step toward functional annotation and biological analysis; however, most existing approaches rely on computationally expensive models or on flat feature integration with limited interpretability. This paper proposes a multi-phase, rule-based protein classification framework that hierarchically integrates biologically meaningful features and employs an early-exit decision strategy. Feature extraction proceeds across four progressive phases: chemical properties, hydropathy characteristics, behavioral sequence patterns captured through n-gram and six-letter exchange encoding, and functional attributes captured via normalized distance-based encoding. Family-specific knowledge matrices, constructed exclusively from training data, enable deterministic min-max rule evaluation at each phase for efficient candidate pruning and early termination. Experiments on 8,555 human protein sequences spanning ten families achieve an average classification accuracy of 97.66%. Cross-validation with 5, 25, and 45 folds statistically confirms performance stability under class-imbalanced conditions. Further cross-species validation on zebrafish and mouse protein datasets yields accuracies of 96.11% and 94.89%, respectively, demonstrating robust generalization across diverse biological distributions. Comparative evaluation against machine-learning-based and transformer-based baselines shows that the proposed framework achieves a superior accuracy-efficiency trade-off while maintaining an overall time complexity of O (n n).
Saha et al. (Thu,) studied this question.
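To make the phase-wise min-max pruning with early exit concrete, the following is a minimal Python sketch under stated assumptions, not the authors' implementation. The two phase features (`charged_fraction`, `hydrophobic_fraction`), the residue groupings, the toy family names, and the fallback used when no family satisfies a phase are all hypothetical stand-ins; the abstract describes four phases with richer encodings that are not reproduced here.

```python
from typing import Callable, Dict, List, Tuple

Feature = Callable[[str], List[float]]   # maps a sequence to a feature vector
Bounds = List[Tuple[float, float]]       # per-feature (min, max) for one phase


# Hypothetical stand-ins for the chemical and hydropathy phases.
# Both assume a non-empty sequence of one-letter amino acid codes.
def charged_fraction(seq: str) -> List[float]:
    return [sum(c in "DEKR" for c in seq) / len(seq)]


def hydrophobic_fraction(seq: str) -> List[float]:
    return [sum(c in "AILMFVW" for c in seq) / len(seq)]


PHASES: List[Feature] = [charged_fraction, hydrophobic_fraction]


def build_knowledge(train: Dict[str, List[str]]) -> Dict[str, List[Bounds]]:
    """Record, per family and phase, the min and max of each feature
    observed in the training sequences only."""
    knowledge: Dict[str, List[Bounds]] = {}
    for family, seqs in train.items():
        per_phase: List[Bounds] = []
        for phase in PHASES:
            vectors = [phase(s) for s in seqs]
            per_phase.append([(min(col), max(col)) for col in zip(*vectors)])
        knowledge[family] = per_phase
    return knowledge


def classify(seq: str, knowledge: Dict[str, List[Bounds]]) -> Tuple[List[str], int]:
    """Prune candidate families phase by phase; stop as soon as one remains.
    Returns the surviving candidates and the number of phases evaluated."""
    candidates = list(knowledge)
    for i, phase in enumerate(PHASES):
        values = phase(seq)
        survivors = [
            fam for fam in candidates
            if all(lo <= v <= hi for v, (lo, hi) in zip(values, knowledge[fam][i]))
        ]
        if not survivors:
            # Assumed fallback: keep the candidates from the previous phase.
            return candidates, i + 1
        candidates = survivors
        if len(candidates) == 1:
            return candidates, i + 1   # early exit
    return candidates, len(PHASES)


# Toy training data (invented for illustration, not from the paper).
train = {
    "polar": ["DEGGSS", "DEKGSS"],
    "membrane": ["DEILVG", "DEKLVG"],
    "acidic": ["DDEEDG", "DEDEDE"],
}
knowledge = build_knowledge(train)

print(classify("DDDDEG", knowledge))  # (['acidic'], 1): resolved in phase 1
print(classify("DEKAIL", knowledge))  # (['membrane'], 2): needs phase 2
```

In this sketch the first query exits after a single phase because only one family's range contains its charged fraction, while the second remains ambiguous until the second phase is evaluated. This illustrates how cheap early phases can spare the cost of later ones.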