Can LLMs predict the effects of all potential missense variants in the human genome?

Predicting the effects of genetic variants on human proteins is challenging. Existing methods struggle to distinguish harmful variants from benign ones, especially for missense variants, which substitute one amino acid for another. Here, the authors considered two approaches: experimental methods like deep mutational scans (DMS), and computational methods like unsupervised homology-based techniques and protein language models (PLMs). While DMS can capture molecular and cellular phenotypes, they are hard to scale and are imperfect proxies for clinical outcomes. Computational methods instead leverage protein properties and evolutionary constraints, but most are trained on labeled data, limiting their coverage. One such computational approach is EVE, an unsupervised deep-learning method based on generative variational autoencoders, but its predictions are constrained to well-aligned proteins.

This study focused on ESM1b, a neural-network-based protein language model trained on millions of protein sequences. ESM1b's advantage lies in its ability to predict variant effects without relying on explicit homology, which lets it cover a broader range of variants. The researchers developed a workflow that uses ESM1b to predict the effects of all possible missense variants in known human proteins. They evaluated their approach on various benchmarks and compared it with other variant effect prediction methods. The results showed that ESM1b outperformed other methods in classifying variant pathogenicity. The most impressive of these results was ESM1b's ability to predict variant effects across different protein isoforms.
The authors state that ESM1b was able to “distinguish between pathogenic and benign variants [and] yield a true-positive rate of 81% and a true-negative rate of 82%.” However, ESM1b struggled with variants that led to nonsense-mediated decay (NMD), and the study used a sliding window approach for lengthy proteins, which could miss interactions between distant residues. Validation against more experimental data will be crucial before applying ESM1b in real-world scenarios.

The emergence of large protein language models (LPLMs) like ESM1b offers a promising avenue for predicting variant effects. These models could improve diagnostic accuracy, aid genetic association studies, inform protein engineering, and uncover new insights into protein function. As LPLMs continue to advance, they hold promise for enhancing our understanding of genetic variants and their impacts on human health.

### Brandes N, Goldman G, Wang CH et al. 2023. Genome-wide prediction of disease variant effects with a deep protein language model. Nat Genet. DOI: 10.1038/s41588-023-01465-0

🤖 This post was mostly written by ChatGPT. It only seemed right to let the LLM write about LPLMs.

🤖 I fixed all of the weird things it got wrong - like the most important result in the paper. 😬
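
A quick aside on the sliding window limitation mentioned above, for anyone who wants to see why it can miss distant interactions. The sketch below is not the authors' code. The function names are made up for illustration, and the window size (1022 residues) and step (511 residues) are assumptions chosen for the example rather than values taken from this post. It tiles a long protein into overlapping windows and then checks whether two positions ever land in the same window.

```python
from typing import List, Tuple


def sliding_windows(seq_len: int, window: int = 1022, step: int = 511) -> List[Tuple[int, int]]:
    """Return half-open (start, end) windows that together cover range(seq_len).

    `window` and `step` are illustrative assumptions, not the paper's settings.
    """
    if window <= 0 or step <= 0 or step > window:
        raise ValueError("need 0 < step <= window")
    if seq_len <= window:
        # Short proteins fit in a single window.
        return [(0, seq_len)]
    # Regularly spaced windows, plus a final one flush with the end of the protein.
    starts = list(range(0, seq_len - window, step))
    starts.append(seq_len - window)
    return [(s, s + window) for s in starts]


def share_a_window(i: int, j: int, windows: List[Tuple[int, int]]) -> bool:
    """True if residues i and j (0-based) are ever seen together by the model."""
    return any(s <= i < e and s <= j < e for s, e in windows)


if __name__ == "__main__":
    windows = sliding_windows(2000)
    print(windows)
    print(share_a_window(10, 900, windows))   # nearby residues
    print(share_a_window(10, 1500, windows))  # distant residues
```

With these assumed settings, a 2000-residue protein is covered by three windows, and residues 10 and 1500 never appear in the same one. Any dependency between them is invisible to a model that scores each window separately, which is the limitation the summary above points to.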