What question did this study set out to answer?

To develop a novel alignment-free method for quantifying protein sequence similarities based on physicochemical properties.

April 19, 2026

An Entropy-Based Alignment-Free Mathematical Framework for Protein Sequence Similarity Analysis

Key Points

Proposed an alignment-free mathematical framework for sequence analysis.
Mapped amino acids to vectors in a real-valued feature space based on hydropathy index and dissociation constants.
Employed relative distance entropy to compare proteins without sequence alignment.
Represented protein sequences as discrete distributions in multidimensional parameter space.
Achieved similarity values ranging from 47% to 100%.
Showed trends comparable to traditional BLAST sequence identity percentages.
Demonstrated efficiency for large-scale protein similarity analysis.
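
The mapping step in the points above can be sketched roughly as follows. This is only an illustration under stated assumptions: the hydropathy values are taken from the Kyte-Doolittle scale, and the pKa1 (alpha-carboxyl) and pKa2 (alpha-amino) values are commonly tabulated textbook figures that differ slightly between sources. The table the authors actually used is not given here, and the names `AA_PROPERTIES`, `encode`, and `point_distribution` are hypothetical.

```python
import numpy as np

# Assumed property table: (hydropathy index, pKa1, pKa2) per amino acid.
# Hydropathy: Kyte-Doolittle scale. pKa values: common textbook figures.
# These are illustrative assumptions, not the values used in the paper.
AA_PROPERTIES = {
    "A": (1.8, 2.35, 9.87),   "R": (-4.5, 1.82, 8.99),
    "N": (-3.5, 2.14, 8.72),  "D": (-3.5, 1.99, 9.90),
    "C": (2.5, 1.92, 10.70),  "Q": (-3.5, 2.17, 9.13),
    "E": (-3.5, 2.10, 9.47),  "G": (-0.4, 2.35, 9.78),
    "H": (-3.2, 1.80, 9.33),  "I": (4.5, 2.32, 9.76),
    "L": (3.8, 2.33, 9.74),   "K": (-3.9, 2.16, 9.06),
    "M": (1.9, 2.13, 9.28),   "F": (2.8, 2.20, 9.31),
    "P": (-1.6, 1.95, 10.64), "S": (-0.8, 2.19, 9.21),
    "T": (-0.7, 2.09, 9.10),  "W": (-0.9, 2.46, 9.41),
    "Y": (-1.3, 2.20, 9.21),  "V": (4.2, 2.39, 9.74),
}


def encode(sequence):
    """Return an (n, 3) array with one property vector per residue."""
    return np.array([AA_PROPERTIES[aa] for aa in sequence.upper()], dtype=float)


def point_distribution(sequence):
    """Return the distinct feature-space points and their relative frequencies."""
    points, counts = np.unique(encode(sequence), axis=0, return_counts=True)
    return points, counts / counts.sum()
```

Calling `point_distribution("AAG")` returns two distinct points (one for glycine, one for alanine) with probabilities that sum to 1, which is one simple way to read "a discrete distribution in a multidimensional parameter space".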

Abstract

In this paper, a mathematical, alignment-free framework is proposed for quantifying similarity and dissimilarity among protein sequences using their physicochemical characteristics. The primary structure, that is, the sequence of amino acids, is modelled numerically by mapping each amino acid to a vector in a real-valued feature space determined by three physicochemical properties: the hydropathy index (hI), the first dissociation constant (pka1), and the second dissociation constant (pka2). These properties play a key part in protein folding, stability, and function, and hence form a meaningful basis for comparative analysis. Using this numerical representation, the protein sequences are represented as discrete distributions in a multidimensional parameter space. A Relative Distance Entropy measure is employed to compare the proteins independently of sequence alignment and length, thereby overcoming limitations inherent in conventional homology-based methods. This enables similarity measurement even in cases of low sequence identity while preserving functional characteristics. The proposed approach provides a computationally efficient and mathematically sound alternative for large-scale protein similarity analysis and functional classification. For the tested protein pairs, the method produces similarity values from 47% to 100%, following a trend comparable to BLAST sequence identity percentages. The proposed method emphasizes mathematical modelling and computational efficiency, making it suitable for large-scale data.
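
The abstract does not define the Relative Distance Entropy measure, so the sketch below does not attempt to reproduce it. As a stand-in, it uses the Jensen-Shannon divergence (an entropy-based, bounded, symmetric measure) to show how two proteins of different lengths might be compared without alignment. It assumes each protein is already given as an (n, 3) array of per-residue property vectors; the function names, the bin count, and the conversion to a percentage are all hypothetical choices.

```python
import numpy as np


def shared_histograms(x, y, bins=4):
    """Bin two (n, d) point clouds on one common grid; return probability vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    both = np.vstack([x, y])
    lo, hi = both.min(axis=0), both.max(axis=0)
    hi = np.where(hi > lo, hi, lo + 1.0)  # guard against zero-width ranges
    ranges = list(zip(lo, hi))
    hx, _ = np.histogramdd(x, bins=bins, range=ranges)
    hy, _ = np.histogramdd(y, bins=bins, range=ranges)
    return hx.ravel() / hx.sum(), hy.ravel() / hy.sum()


def shannon_entropy(p):
    """Shannon entropy in bits, ignoring empty bins."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def js_similarity(x, y, bins=4):
    """Percent similarity, 100 * (1 - Jensen-Shannon divergence in bits).

    Stand-in for the paper's measure, which is not specified in the abstract.
    """
    p, q = shared_histograms(x, y, bins)
    m = 0.5 * (p + q)
    jsd = shannon_entropy(m) - 0.5 * (shannon_entropy(p) + shannon_entropy(q))
    return 100.0 * (1.0 - max(jsd, 0.0))
```

Because the histograms are normalised to probabilities, a sequence and a doubled copy of itself score 100, which illustrates the length independence the abstract describes. Two point clouds that fall in entirely different bins score 0.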
