In this paper, a mathematical, alignment-free framework is proposed for quantifying similarity and dissimilarity among protein sequences using their physicochemical characteristics. The primary structure, i.e. the sequence of amino acids, is modelled numerically by mapping each amino acid to a vector in a real-valued feature space determined by three physicochemical properties: the hydropathy index (hI), the first dissociation constant (pKa1), and the second dissociation constant (pKa2). These properties play a key part in protein folding, stability, and function, and hence form a meaningful basis for comparative analysis. Using this numerical representation, the protein sequences are represented as discrete distributions in a multidimensional parameter space. A Relative Distance Entropy measure is employed to compare the proteins independently of sequence alignment and length, thereby overcoming the limitations inherent in conventional homology-based methods. This enables similarity measurement even in cases of low sequence identity while preserving functional characteristics. The proposed approach provides a computationally efficient and mathematically sound alternative for large-scale protein similarity analysis and functional classification. The method produces similarity values ranging from 47% to 100%, following a trend comparable to BLAST sequence identity percentages for the tested protein pairs. Its emphasis on mathematical modelling and computational efficiency makes it suitable for large-scale data.
Vijayalakshmi et al. (Thu,) studied this question.
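The encoding step described in the abstract can be illustrated with a minimal sketch. It maps each residue to a (hI, pKa1, pKa2) vector and then summarises a sequence as a discrete distribution over the resulting points. The names `FEATURES`, `encode`, and `empirical_distribution` are hypothetical and do not come from the paper. The table covers only six residues and uses commonly tabulated Kyte-Doolittle hydropathy and pKa values, which may differ from those used by the authors. Treating the distribution as the relative frequency of each feature point is also an assumption, and the Relative Distance Entropy measure itself is not defined in the abstract, so it is not implemented here.

```python
from collections import Counter

# Hypothetical partial lookup table: residue -> (hI, pKa1, pKa2).
# Values are commonly tabulated textbook figures (assumption); a full
# implementation would cover all 20 standard amino acids.
FEATURES = {
    "A": (1.8, 2.34, 9.69),
    "G": (-0.4, 2.34, 9.60),
    "L": (3.8, 2.36, 9.60),
    "K": (-3.9, 2.18, 8.95),
    "E": (-3.5, 2.19, 9.67),
    "V": (4.2, 2.32, 9.62),
}


def encode(sequence):
    """Map a one-letter amino acid sequence to a list of 3-D feature vectors."""
    return [FEATURES[residue] for residue in sequence]


def empirical_distribution(sequence):
    """Relative frequency of each distinct feature point in the sequence.

    The probabilities sum to 1 regardless of sequence length, so sequences
    of different lengths can be compared without alignment.
    """
    points = encode(sequence)
    counts = Counter(points)
    total = len(points)
    return {point: count / total for point, count in counts.items()}


if __name__ == "__main__":
    for seq in ("GALKA", "AGAG"):
        print(seq, empirical_distribution(seq))
```

Because the distribution is normalised by sequence length, `"AG"` and `"AGAG"` yield the same distribution in this sketch, which illustrates the length independence the abstract refers to. Any distribution-level comparison, such as the paper's entropy measure, would operate on these outputs.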