Variant effect predictors (VEPs) are important computational tools for understanding the function of protein variants and can aid disease prevention and treatment. A myriad of VEPs exist, and even though their performance has improved with the use of machine learning, they do not yet perform well enough to be used reliably. One of the best-performing model types is the protein language model (PLM), especially those with a large number of parameters (100M). However, PLM embeddings lack interpretability compared to VEPs based on variational autoencoders (VAEs) with a structured latent space. This thesis investigates a novel setup that combines a PLM and a matrix VAE (matVAE) to improve the prediction performance of the latter. The suggested approach uses the embeddings from an Evolutionary Scale Modeling (ESM) model as input to the matVAE. Two variants of the setup are investigated: the first reconstructs the protein sequences, while the second reconstructs the embeddings. Each variant is experimentally evaluated for various embedding dimensions, and variations of the matVAE+ESM model are trained on the set of pharmacogene-related proteins used for the development of the original matVAE. Finally, the best novel model is compared to other state-of-the-art models using the datasets and performance benchmarks provided by ProteinGym. The best setup is found to be the one that reconstructs the protein sequences and uses 100 embedding features with a latent space of dimension D = 5. We report an average Spearman's rank correlation coefficient of 0.38 ± 0.189, which is higher than that of the original matVAE (0.35 ± 0.174) and than that obtained by using the embeddings directly (0.172 ± 0.111). The embedding reconstruction setup, however, decreases performance substantially and is not found to be a usable solution. Additionally, our best approach still falls short of the best state-of-the-art benchmark models.
This work thus shows that using embeddings from a PLM improves the performance of the matVAE. However, further research on a larger set of proteins and with the best available ESM model is needed to evaluate the potential of the setup exhaustively.
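The reported scores are averages of Spearman's rank correlation coefficient, stated as mean ± spread. As a minimal sketch of how such a summary can be computed, the following pure-Python example ranks the predicted and measured variant effects for each protein, correlates the ranks, and then summarizes across proteins. The function names and the input layout are hypothetical, and the use of the sample standard deviation for the spread is an assumption; the abstract does not state which estimator the thesis uses.

```python
from statistics import mean, stdev


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(predicted, measured):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(measured))


def summarize(per_protein):
    """Mean and sample standard deviation of rho across proteins.

    per_protein maps a protein name to (predicted scores, measured scores).
    This layout is hypothetical, and at least two proteins are required.
    """
    rhos = [spearman(pred, meas) for pred, meas in per_protein.values()]
    return mean(rhos), stdev(rhos)
```

Because only ranks enter the computation, the metric rewards a model for ordering variants correctly even when its raw scores are on a different scale from the experimental measurements.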
David Oxelmark