Catalogue Search | MBRL

Distance-based protein folding powered by deep learning

by Xu, Jinbo in Algorithms , Biological Sciences , Biophysics and Computational Biology

2019

Direct coupling analysis (DCA) for protein folding has made very good progress, but it is not effective for proteins that lack many sequence homologs, even coupled with time-consuming conformation sampling with fragments. We show that we can accurately predict interresidue distance distribution of a protein by deep learning, even for proteins with ∼60 sequence homologs. Using only the geometric constraints given by the resulting distance matrix we may construct 3D models without involving extensive conformation sampling. Our method successfully folded 21 of the 37 CASP12 hard targets with a median family size of 58 effective sequence homologs within 4 h on a Linux computer of 20 central processing units. In contrast, DCA-predicted contacts cannot be used to fold any of these hard targets in the absence of extensive conformation sampling, and the best CASP12 group folded only 11 of them by integrating DCA-predicted contacts into fragment-based conformation sampling. Rigorous experimental validation in CASP13 shows that our distance-based folding server successfully folded 17 of 32 hard targets (with a median family size of 36 sequence homologs) and obtained 70% precision on the top L/5 long-range predicted contacts. The latest experimental validation in CAMEO shows that our server predicted correct folds for 2 membrane proteins while all of the other servers failed. These results demonstrate that it is now feasible to predict correct fold for many more proteins lack of similar structures in the Protein Data Bank even on a personal computer.

Journal Article

Share this book

Add to My Shelf

Inferring interaction partners from protein sequences

by Colwell, Lucy J. , Dwyer, Robert S. , Wingreen, Ned S. in ABC transporters , Algorithms , Biological Physics

2016

Specific protein–protein interactions are crucial in the cell, both to ensure the formation and stability of multiprotein complexes and to enable signal transduction in various pathways. Functional interactions between proteins result in coevolution between the interaction partners, causing their sequences to be correlated. Here we exploit these correlations to accurately identify, from sequence data alone, which proteins are specific interaction partners. Our general approach, which employs a pairwise maximum entropy model to infer couplings between residues, has been successfully used to predict the 3D structures of proteins from sequences. Thus inspired, we introduce an iterative algorithm to predict specific interaction partners from two protein families whose members are known to interact. We first assess the algorithm’s performance on histidine kinases and response regulators from bacterial two-component signaling systems. We obtain a striking 0.93 true positive fraction on our complete dataset without any a priori knowledge of interaction partners, and we uncover the origin of this success. We then apply the algorithm to proteins from ATP-binding cassette (ABC) transporter complexes, and obtain accurate predictions in these systems as well. Finally, we present two metrics that accurately distinguish interacting protein families from noninteracting ones, using only sequence data.

Journal Article

Share this book

Add to My Shelf

Simultaneous identification of specifically interacting paralogs and interprotein contacts by direct coupling analysis

by Weigt, Martin , Gueudré, Thomas , Baldassi, Carlo in Biological Sciences , Biology , Biophysics and Computational Biology

2016

Understanding protein–protein interactions is central to our understanding of almost all complex biological processes. Computational tools exploiting rapidly growing genomic databases to characterize protein–protein interactions are urgently needed. Such methods should connect multiple scales from evolutionary conserved interactions between families of homologous proteins, over the identification of specifically interacting proteins in the case of multiple paralogs inside a species, down to the prediction of residues being in physical contact across interaction interfaces. Statistical inference methods detecting residue–residue coevolution have recently triggered considerable progress in using sequence data for quaternary protein structure prediction; they require, however, large joint alignments of homologous protein pairs known to interact. The generation of such alignments is a complex computational task on its own; application of coevolutionary modeling has, in turn, been restricted to proteins without paralogs, or to bacterial systems with the corresponding coding genes being colocalized in operons. Here we show that the direct coupling analysis of residue coevolution can be extended to connect the different scales, and simultaneously to match interacting paralogs, to identify interprotein residue–residue contacts and to discriminate interacting from noninteracting families in a multiprotein system. Our results extend the potential applications of coevolutionary analysis far beyond cases treatable so far.

Journal Article

Share this book

Add to My Shelf

How Pairwise Coevolutionary Models Capture the Collective Residue Variability in Proteins?

by Weigt, Martin , Figliuzzi, Matteo , Barrat-Charlaix, Pierre in Amino acid sequence , Amino acids , Correlation

2018

Global coevolutionary models of homologous protein families, as constructed by direct coupling analysis (DCA), have recently gained popularity in particular due to their capacity to accurately predict residue–residue contacts from sequence information alone, and thereby to facilitate tertiary and quaternary protein structure prediction. More recently, they have also been used to predict fitness effects of amino-acid substitutions in proteins, and to predict evolutionary conserved protein–protein interactions. These models are based on two currently unjustified hypotheses: 1) correlations in the amino-acid usage of different positions are resulting collectively from networks of direct couplings; and 2) pairwise couplings are sufficient to capture the amino-acid variability. Here, we propose a highly precise inference scheme based on Boltzmann-machine learning, which allows us to systematically address these hypotheses. We show how correlations are built up in a highly collective way by a large number of coupling paths, which are based on the proteins three-dimensional structure. We further find that pairwise coevolutionary models capture the collective residue variability across homologous proteins even for quantities which are not imposed by the inference procedure, like three-residue correlations, the clustered structure of protein families in sequence space or the sequence distances between homologs. These findings strongly suggest that pairwise coevolutionary models are actually sufficient to accurately capture the residue variability in homologous protein families.

Journal Article

Share this book

Add to My Shelf

Epistatic models predict mutable sites in SARS-CoV-2 proteins and epitopes

by Muscat, Maureen , Weigt, Martin , Rodriguez-Rivas, Juan in Algorithms , Amino acids , Area Under Curve

2022

The emergence of new variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a major concern given their potential impact on the transmissibility and pathogenicity of the virus as well as the efficacy of therapeutic interventions. Here, we predict the mutability of all positions in SARS-CoV-2 protein domains to forecast the appearance of unseen variants. Using sequence data from other coronaviruses, preexisting to SARS-CoV-2, we build statistical models that not only capture amino acid conservation but also more complex patterns resulting from epistasis. We show that these models are notably superior to conservation profiles in estimating the already observable SARS-CoV-2 variability. In the receptor binding domain of the spike protein, we observe that the predicted mutability correlates well with experimental measures of protein stability and that both are reliable mutability predictors (receiver operating characteristic areas under the curve ∼0.8). Most interestingly, we observe an increasing agreement between our model and the observed variability as more data become available over time, proving the anticipatory capacity of our model. When combined with data concerning the immune response, our approach identifies positions where current variants of concern are highly overrepresented. These results could assist studies on viral evolution and future viral outbreaks and, in particular, guide the exploration and anticipation of potentially harmful future SARS-CoV-2 variants.

Journal Article

Share this book

Add to My Shelf

Large-scale identification of coevolution signals across homo-oligomeric protein interfaces by direct coupling analysis

by Szurmant, Hendrik , Weigt, Martin , Oteri, Francesco in Biological Sciences , Biophysics and Computational Biology , Catalysts

2017

Proteins have evolved to perform diverse cellular functions, from serving as reaction catalysts to coordinating cellular propagation and development. Frequently, proteins do not exert their full potential as monomers but rather undergo concerted interactions as either homo-oligomers or with other proteins as hetero-oligomers. The experimental study of such protein complexes and interactions has been arduous. Theoretical structure prediction methods are an attractive alternative. Here, we investigate homo-oligomeric interfaces by tracing residue coevolution via the global statistical direct coupling analysis (DCA). DCA can accurately infer spatial adjacencies between residues. These adjacencies can be included as constraints in structure prediction techniques to predict high-resolution models. By taking advantage of the ongoing exponential growth of sequence databases, we go significantly beyond anecdotal cases of a few protein families and apply DCA to a systematic large-scale study of nearly 2,000 Pfam protein families with sufficient sequence information and structurally resolved homo-oligomeric interfaces. We find that large interfaces are commonly identified by DCA. We further demonstrate that DCA can differentiate between subfamilies with different binding modes within one large Pfam family. Sequence-derived contact information for the subfamilies proves sufficient to assemble accurate structural models of the diverse protein-oligomers. Thus, we provide an approach to investigate oligomerization for arbitrary protein families leading to structural models complementary to often-difficult experimental methods. Combined with ever more abundant sequential data, we anticipate that this study will be instrumental to allow the structural description of many heteroprotein complexes in the future.

Journal Article

Share this book

Add to My Shelf

Global analysis of more than 50,000 SARS-CoV-2 genomes reveals epistasis between eight viral genes

by Zeng, Hong-Li , Dichio, Vito , Aurell, Erik in Annan fysik , Applied Physical Sciences , Biological Sciences

2020

Genome-wide epistasis analysis is a powerful tool to infer gene interactions, which can guide drug and vaccine development and lead to deeper understanding of microbial pathogenesis. We have considered all complete severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes deposited in the Global Initiative on Sharing All Influenza Data (GISAID) repository until four different cutoff dates, and used direct coupling analysis together with an assumption of quasi-linkage equilibrium to infer epistatic contributions to fitness from polymorphic loci. We find eight interactions, of which three are between pairs where one locus lies in gene ORF3a, both loci holding nonsynonymous mutations. We also find interactions between two loci in gene nsp13, both holding nonsynonymous mutations, and four interactions involving one locus holding a synonymous mutation. Altogether, we infer interactions between loci in viral genes ORF3a and nsp2, nsp12, and nsp6, between ORF8 and nsp4, and between loci in genes nsp2, nsp13, and nsp14. The paper opens the prospect to use prominent epistatically linked pairs as a starting point to search for combinatorial weaknesses of recombinant viral pathogens.

Journal Article

Share this book

Add to My Shelf

Direct coupling analysis and the attention mechanism

by Pagnani, Andrea , Caredda, Francesco in Accuracy , Algorithms , Amino acid sequence

2025

Proteins are involved in nearly all cellular functions, encompassing roles in transport, signaling, enzymatic activity, and more. Their functionalities crucially depend on their complex three-dimensional arrangement. For this reason, being able to predict their structure from the amino acid sequence has been and still is a phenomenal computational challenge that the introduction of AlphaFold solved with unprecedented accuracy. However, the inherent complexity of AlphaFold's architectures makes it challenging to understand the rules that ultimately shape the protein's predicted structure. This study investigates a single-layer unsupervised model based on the attention mechanism. More precisely, we explore a Direct Coupling Analysis (DCA) method that mimics the attention mechanism of several popular Transformer architectures, such as AlphaFold itself. The model's parameters, notably fewer than those in standard DCA-based algorithms, can be directly used for extracting structural determinants such as the contact map of the protein family under study. Additionally, the functional form of the energy function of the model enables us to deploy a multi-family learning strategy, allowing us to effectively integrate information across multiple protein families, whereas standard DCA algorithms are typically limited to single protein families. Finally, we implemented a generative version of the model using an autoregressive architecture, capable of efficiently generating new proteins in silico.

Journal Article

Share this book

Add to My Shelf

Unsupervised Inference of Protein Fitness Landscape from Deep Mutational Scan

by Uguzzoni, Guido , Pagnani, Andrea , Fernandez-de-Cossio-Diaz, Jorge in Biology , Combinatorial analysis , Combinatorial libraries

2021

The recent technological advances underlying the screening of large combinatorial libraries in high-throughput mutational scans deepen our understanding of adaptive protein evolution and boost its applications in protein design. Nevertheless, the large number of possible genotypes requires suitable computational methods for data analysis, the prediction of mutational effects, and the generation of optimized sequences. We describe a computational method that, trained on sequencing samples from multiple rounds of a screening experiment, provides a model of the genotype–fitness relationship. We tested the method on five large-scale mutational scans, yielding accurate predictions of the mutational effects on fitness. The inferred fitness landscape is robust to experimental and sampling noise and exhibits high generalization power in terms of broader sequence space exploration and higher fitness variant predictions. We investigate the role of epistasis and show that the inferred model provides structural information about the 3D contacts in the molecular fold.

Journal Article

Share this book

Add to My Shelf

High-throughput droplet-based analysis of influenza A virus genetic reassortment by single-virus RNA sequencing

by Ibanez, Pablo , van der Werf, Sylvie , Isel, Catherine in [SDV]Life Sciences [q-bio] , Applied Physical Sciences , Coinfection

2023

The segmented RNA genome of influenza A viruses (IAVs) enables viral evolution through genetic reassortment after multiple IAVs coinfect the same cell, leading to viruses harboring combinations of eight genomic segments from distinct parental viruses. Existing data indicate that reassortant genotypes are not equiprobable; however, the low throughput of available virology techniques does not allow quantitative analysis. Here, we have developed a high-throughput single-cell droplet microfluidic system allowing encapsulation of IAV-infected cells, each cell being infected by a single progeny virion resulting from a coinfection process. Customized barcoded primers for targeted viral RNA sequencing enabled the analysis of 18,422 viral genotypes resulting from coinfection with two circulating human H1N1pdm09 and H3N2 IAVs. Results were highly reproducible, confirmed that genetic reassortment is far from random, and allowed accurate quantification of reassortants including rare events. In total, 159 out of the 254 possible reassortant genotypes were observed but with widely varied prevalence (from 0.038 to 8.45%). In cells where eight segments were detected, all 112 possible pairwise combinations of segments were observed. The inclusion of data from single cells where less than eight segments were detected allowed analysis of pairwise cosegregation between segments with very high confidence. Direct coupling analysis accurately predicted the fraction of pairwise segments and full genotypes. Overall, our results indicate that a large proportion of reassortant genotypes can emerge upon coinfection and be detected over a wide range of frequencies, highlighting the power of our tool for systematic and exhaustive monitoring of the reassortment potential of IAVs.

Journal Article

Share this book

Add to My Shelf

Language Selector

MBRLGlobalSearch

Language Selector

Catalogue Search | MBRL

Search Results Heading

Explore the vast range of titles available.

MBRLSearchResults

MBRLHappinessMeter