March 3, 2026 · Open Access

Protein Variant Effect Prediction with Protein Language Model Embeddings and a Matrix Variational Autoencoder

Key Points

The proposed model, a matrix variational autoencoder (matVAE) fed with protein language model embeddings, achieves an average Spearman's rank correlation of 0.38, up from 0.35 for the original matVAE.
Of the two setups compared, reconstructing the protein sequences performs best; reconstructing the embeddings lowers performance substantially and is not a usable solution.
The approach combines a matrix variational autoencoder with protein language model embeddings to address limitations of existing variant effect predictors.
Further research is needed on a larger set of proteins and with the best available Evolutionary Scale Modeling (ESM) model to evaluate the setup exhaustively.

Abstract

Variant effect predictors (VEPs) are important computational tools for understanding the function of protein variants and can aid disease prevention and treatment. A myriad of VEPs exist, and even though their performance has improved with the use of machine learning, they do not perform well enough to be used reliably. Protein language models (PLMs) are among the best-performing model types, especially those with a large number of parameters (100M). However, their embeddings lack interpretability compared to VEPs based on variational autoencoders (VAEs) with a structured latent space. This thesis investigates a novel setup that combines a PLM and a matrix VAE (matVAE) to improve the prediction performance of the latter. The suggested approach uses the embeddings from an Evolutionary Scale Modeling (ESM) model as input to the matVAE. Two variants of the setup are investigated: the first reconstructs the protein sequences, while the second reconstructs the embeddings. Each setup is experimentally evaluated for various embedding dimensions, and variations of the matVAE+ESM model are trained on the set of pharmacogene-related proteins used for the development of the original matVAE. Finally, the best novel model is compared to other state-of-the-art models using the datasets and performance benchmarks provided by ProteinGym. The best setup reconstructs the protein sequences and uses 100 embedding features with a latent space of dimension D = 5. We report an average Spearman’s rank correlation coefficient of 0.38 ± 0.189, which is higher than that of the original matVAE (0.35 ± 0.174) and that of using the embeddings directly (0.172 ± 0.111). However, the embedding reconstruction setup decreases performance substantially and is not found to be a usable solution. Additionally, compared to the best state-of-the-art benchmark models, our best approach still falls short in performance.
This work thus shows that using embeddings from a PLM improves the performance of the matVAE. However, further research is needed on a larger set of proteins and with the best available ESM model to exhaustively evaluate the potential of the setup.
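The reported metric, an average Spearman's rank correlation with a spread, can be illustrated with a minimal sketch. It assumes that one correlation is computed per protein between predicted and measured variant scores, and that the value after ± is a sample standard deviation across proteins; the abstract does not state either detail. The function names and the toy scores are hypothetical and are not taken from the thesis.

```python
from statistics import mean, stdev


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def summarize(per_protein):
    """Mean and sample standard deviation of the per-protein correlations.

    Assumption: the spread is a sample standard deviation across proteins.
    """
    rhos = [spearman(pred, meas) for pred, meas in per_protein.values()]
    return mean(rhos), stdev(rhos)


# Made-up (predicted, measured) scores for two hypothetical proteins.
toy = {
    "protein_A": ([0.1, 0.4, 0.35, 0.8], [1.0, 2.0, 3.0, 4.0]),
    "protein_B": ([0.9, 0.2, 0.5, 0.7], [4.0, 1.0, 2.0, 3.0]),
}
avg_rho, sd_rho = summarize(toy)
print(f"average Spearman: {avg_rho:.3f} +/- {sd_rho:.3f}")
```

Because Spearman's coefficient depends only on rank order, it rewards a predictor for ordering variants correctly even when its raw scores are on a different scale from the experimental measurements.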

Authors

David Oxelmark
