Peptide self-assembly has both biological and technological applications, including the creation of biomaterials and nanostructures. Predicting computationally whether a given amino acid sequence will self-assemble remains difficult because the relationship between peptide sequence and structure is non-linear. In this study, a supervised machine learning model was developed to classify peptides as either self-assembling (positive, 1) or non-self-assembling (negative, 0) based on features derived from their amino acid sequences. The dataset contained 42,532 peptide sequences, each labeled positive or negative for self-assembly. A k-mer-based representation was used to extract features from the peptide sequences, and multiple classification algorithms were trained and evaluated in the Orange3 visual programming environment. Both linear and non-linear classifiers were employed, including logistic regression, random forest, support vector machine, and neural networks. All models were validated with k-fold cross-validation and assessed with standard performance measures. The results indicate that non-linear models outperform linear models in this context. This finding supports the view that the behavior of self-assembling peptides arises from complex, non-linear sequence patterns.
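As an illustration of the k-mer representation mentioned above, the following is a minimal sketch in plain Python. It is not the authors' workflow, which was built visually in Orange3. The choice of k = 2, the normalization to relative frequencies, the 20-letter alphabet, the function names (`kmer_counts`, `kmer_vocabulary`, `kmer_vector`), and the example sequence are all assumptions made here for illustration.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_counts(sequence, k=2):
    """Count overlapping k-mers in a peptide sequence."""
    seq = sequence.upper()
    # A sequence shorter than k yields an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_vocabulary(k=2, alphabet=AMINO_ACIDS):
    """List every possible k-mer over the alphabet in a fixed order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def kmer_vector(sequence, k=2, vocabulary=None):
    """Map a sequence to a fixed-length vector of k-mer relative frequencies."""
    vocab = vocabulary if vocabulary is not None else kmer_vocabulary(k)
    counts = kmer_counts(sequence, k)
    # Only k-mers present in the vocabulary contribute to the total.
    total = sum(counts[kmer] for kmer in vocab)
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[kmer] / total for kmer in vocab]


if __name__ == "__main__":
    # Hypothetical example sequence, chosen only to show the encoding.
    example = "FFKLVFF"
    vector = kmer_vector(example, k=2)
    print(len(vector))                     # number of features per peptide
    print(kmer_counts(example, 2)["FF"])   # occurrences of the dipeptide "FF"
```

Each peptide becomes a vector with one entry per possible k-mer (20^k entries), which can then be passed to any of the classifiers named above. Larger k captures longer sequence motifs but makes the feature space grow exponentially and more sparse.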
Angelina Wang studied this question.