Catalogue Search | MBRL

Accurate secondary structure prediction and fold recognition for circular dichroism spectroscopy

by Yuji Goto , Micsonai, András , Young-Ho Lee in Algorithms , amyloid , Amyloid beta-Peptides - chemistry

2015

Circular dichroism (CD) spectroscopy is a widely used technique for the study of protein structure. Numerous algorithms have been developed for the estimation of the secondary structure composition from the CD spectra. These methods often fail to provide acceptable results on α/β-mixed or β-structure–rich proteins. The problem arises from the spectral diversity of β-structures, which has hitherto been considered as an intrinsic limitation of the technique. The predictions are less reliable for proteins of unusual β-structures such as membrane proteins, protein aggregates, and amyloid fibrils. Here, we show that the parallel/antiparallel orientation and the twisting of the β-sheets account for the observed spectral diversity. We have developed a method called β-structure selection (BeStSel) for the secondary structure estimation that takes into account the twist of β-structures. This method can reliably distinguish parallel and antiparallel β-sheets and accurately estimates the secondary structure for a broad range of proteins. Moreover, the secondary structure components applied by the method are characteristic to the protein fold, and thus the fold can be predicted to the level of topology in the CATH classification from a single CD spectrum. By constructing a web server, we offer a general tool for a quick and reliable structure analysis using conventional CD or synchrotron radiation CD (SRCD) spectroscopy for the protein science research community. The method is especially useful when X-ray or NMR techniques fail. Using BeStSel on data collected by SRCD spectroscopy, we investigated the structure of amyloid fibrils of various disease-related proteins and peptides.
Significance: Circular dichroism (CD) spectroscopy is widely used for protein secondary structure analysis. However, quantitative estimation for β-sheet–containing proteins is problematic due to the huge morphological and spectral diversity of β-structures. We show that parallel/antiparallel orientation and twisting of β-sheets account for the observed spectral diversity. Taking into account the twist of β-structures, our method accurately estimates the secondary structure for a broad range of protein folds, particularly for β-sheet–rich proteins and amyloid fibrils. Moreover, the method can predict the protein fold down to the topology level following the CATH classification. We provide a general tool for a quick and reliable structure analysis using conventional or synchrotron radiation CD spectroscopy, which is especially useful when X-ray or NMR techniques fail.

Journal Article

Developing a molecular dynamics force field for both folded and disordered protein states

by Shaw, David E. , Piana, Stefano , Robustelli, Paul in Benchmarks , Biophysics and Computational Biology , Computer simulation

2018

Molecular dynamics (MD) simulation is a valuable tool for characterizing the structural dynamics of folded proteins and should be similarly applicable to disordered proteins and proteins with both folded and disordered regions. It has been unclear, however, whether any physical model (force field) used in MD simulations accurately describes both folded and disordered proteins. Here, we select a benchmark set of 21 systems, including folded and disordered proteins, simulate these systems with six state-of-the-art force fields, and compare the results to over 9,000 available experimental data points. We find that none of the tested force fields simultaneously provided accurate descriptions of folded proteins, of the dimensions of disordered proteins, and of the secondary structure propensities of disordered proteins. Guided by simulation results on a subset of our benchmark, however, we modified parameters of one force field, achieving excellent agreement with experiment for disordered proteins, while maintaining state-of-the-art accuracy for folded proteins. The resulting force field, a99SB-disp, should thus greatly expand the range of biological systems amenable to MD simulation. A similar approach could be taken to improve other force fields.

Journal Article

OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization

by Zhang, Minjia , Nowaczynski, Arkadiusz , Bonneau, Richard in 631/114/1305 , 631/114/470 , Accuracy

2024

AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (1) tackle new tasks, like protein–ligand complex structure prediction, (2) investigate the process by which the model learns and (3) assess the model’s capacity to generalize to unseen regions of fold space. Here we report OpenFold, a fast, memory efficient and trainable implementation of AlphaFold2. We train OpenFold from scratch, matching the accuracy of AlphaFold2. Having established parity, we find that OpenFold is remarkably robust at generalizing even when the size and diversity of its training set is deliberately limited, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced during training, we also gain insights into the hierarchical manner in which OpenFold learns to fold. In sum, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial resource for the protein modeling community. OpenFold is a trainable open-source implementation of AlphaFold2. It is fast and memory efficient, and the code and training data are available under a permissive license.

Journal Article

Modeling aspects of the language of life through transfer-learning protein sequences

by Rost, Burkhard , Elnaggar, Ahmed , Nechaev, Dmitrii in Algorithms , Amino Acid Sequence , Amino acids

2019

Background: Predicting protein function and structure from sequence is one important challenge for computational biology. For 26 years, most state-of-the-art approaches combined machine learning and evolutionary information. However, for some applications retrieving related proteins is becoming too time-consuming. Additionally, evolutionary information is less powerful for small families, e.g. for proteins from the Dark Proteome. Both these problems are addressed by the new methodology introduced here.
Results: We introduced a novel way to represent protein sequences as continuous vectors (embeddings) by using the language model ELMo taken from natural language processing. By modeling protein sequences, ELMo effectively captured the biophysical properties of the language of life from unlabeled big data (UniRef50). We refer to these new embeddings as SeqVec (Sequence-to-Vector) and demonstrate their effectiveness by training simple neural networks for two different tasks. At the per-residue level, secondary structure (Q3 = 79% ± 1, Q8 = 68% ± 1) and regions with intrinsic disorder (MCC = 0.59 ± 0.03) were predicted significantly better than through one-hot encoding or through Word2vec-like approaches. At the per-protein level, subcellular localization was predicted in ten classes (Q10 = 68% ± 1) and membrane-bound were distinguished from water-soluble proteins (Q2 = 87% ± 1). Although SeqVec embeddings generated the best predictions from single sequences, no solution improved over the best existing method using evolutionary information. Nevertheless, our approach improved over some popular methods using evolutionary information and for some proteins even did beat the best. Thus, they prove to condense the underlying principles of protein sequences. Overall, the important novelty is speed: where the lightning-fast HHblits needed on average about two minutes to generate the evolutionary information for a target protein, SeqVec created embeddings on average in 0.03 s. As this speed-up is independent of the size of growing sequence databases, SeqVec provides a highly scalable approach for the analysis of big data in proteomics, i.e. microbiome or metaproteome analysis.
Conclusion: Transfer-learning succeeded to extract information from unlabeled sequence databases relevant for various protein prediction tasks. SeqVec modeled the language of life, namely the principles underlying protein sequences better than any features suggested by textbooks and prediction methods. The exception is evolutionary information, however, that information is not available on the level of a single sequence.
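
The disorder result in this abstract is reported as a Matthews correlation coefficient (MCC). As a reading aid, the sketch below shows how that score is conventionally computed for a binary per-residue labelling. It is not code from the SeqVec authors; the function name `matthews_corrcoef_binary` and the 0/1 label encoding (1 = disordered) are assumptions made for illustration.

```python
import math


def matthews_corrcoef_binary(pred, true):
    """MCC for two equal-length sequences of 0/1 labels.

    Assumption for this sketch: 1 marks a disordered residue, 0 an ordered one.
    """
    if len(pred) != len(true):
        raise ValueError("pred and true must have the same length")

    tp = fp = tn = fn = 0
    for p, t in zip(pred, true):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 0:
            tn += 1
        else:
            fn += 1

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when a row or column of the confusion
    # matrix is empty and the coefficient is otherwise undefined.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

MCC ranges from -1 (every residue mislabelled) through 0 (no better than chance) to 1 (perfect agreement), which makes it more informative than plain accuracy when disordered residues are a small minority.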

Journal Article

REDfold: accurate RNA secondary structure prediction using residual encoder-decoder network

by Chen, Chun-Chi , Chan, Yi-Ming in Accuracy , Algorithms , Base Sequence

2023

Background: As the RNA secondary structure is highly related to its stability and functions, the structure prediction is of great value to biological research. The traditional computational approach to RNA secondary structure prediction is mainly based on the thermodynamic model with dynamic programming to find the optimal structure. However, the prediction performance based on the traditional approach is unsatisfactory for further research. Besides, the computational complexity of the structure prediction using dynamic programming is O(N^3); it becomes O(N^6) for RNA structure with pseudoknots, which is computationally impractical for large-scale analysis.
Results: In this paper, we propose REDfold, a novel deep learning-based method for RNA secondary structure prediction. REDfold utilizes an encoder-decoder network based on CNN to learn the short and long range dependencies among the RNA sequence, and the network is further integrated with symmetric skip connections to efficiently propagate activation information across layers. Moreover, the network output is post-processed with constrained optimization to yield favorable predictions even for RNAs with pseudoknots. Experimental results based on the ncRNA database demonstrate that REDfold achieves better performance in terms of efficiency and accuracy, outperforming the contemporary state-of-the-art methods.
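
To show where the cubic cost of the traditional dynamic-programming approach comes from, here is a minimal sketch of a Nussinov-style recursion that maximizes the number of nested base pairs. It is a deliberately simplified stand-in: it counts pairs instead of minimizing a thermodynamic free energy, it cannot represent pseudoknots, and it is unrelated to REDfold's network. The function name `max_base_pairs` and the `min_loop` default of 3 unpaired bases are assumptions for illustration.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style DP).

    min_loop is the minimum number of unpaired bases enclosed by a pair
    (an assumed hairpin constraint for this sketch).
    """
    n = len(seq)
    if n == 0:
        return 0

    # dp[i][j] = best pair count for the subsequence seq[i..j] (inclusive).
    dp = [[0] * n for _ in range(n)]

    for span in range(min_loop + 1, n):          # O(N) subsequence lengths
        for i in range(n - span):                # O(N) start positions
            j = i + span
            best = dp[i][j - 1]                  # case 1: base j is unpaired
            for k in range(i, j - min_loop):     # O(N) partners for base j
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    inner = dp[k + 1][j - 1]
                    best = max(best, left + 1 + inner)
            dp[i][j] = best

    return dp[0][n - 1]
```

The three nested loops over span, start position, and pairing partner give the O(N^3) running time mentioned in the abstract; the table itself needs O(N^2) memory.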

Journal Article

Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields

by Ma, Jianzhu , Xu, Jinbo , Wang, Sheng in 631/114/1305 , 631/114/2397 , 631/114/2411

2016

Protein secondary structure (SS) prediction is important for studying protein structure and function. When only the sequence (profile) information is used as input feature, currently the best predictors can obtain ~80% Q3 accuracy, which has not been improved in the past decade. Here we present DeepCNF (Deep Convolutional Neural Fields) for protein SS prediction. DeepCNF is a Deep Learning extension of Conditional Neural Fields (CNF), which is an integration of Conditional Random Fields (CRF) and shallow neural networks. DeepCNF can model not only complex sequence-structure relationship by a deep hierarchical architecture, but also interdependency between adjacent SS labels, so it is much more powerful than CNF. Experimental results show that DeepCNF can obtain ~84% Q3 accuracy, ~85% SOV score and ~72% Q8 accuracy, respectively, on the CASP and CAMEO test proteins, greatly outperforming currently popular predictors. As a general framework, DeepCNF can be used to predict other protein structure properties such as contact number, disorder regions and solvent accessibility.
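
The Q3 and Q8 figures quoted in this abstract are per-residue accuracies over three and eight secondary-structure states. The sketch below shows one common way to compute them. It is not taken from DeepCNF; the function names are hypothetical, and the eight-to-three reduction used here (H, G, I to helix; E, B to strand; everything else to coil) is one widely used convention that individual papers may vary.

```python
# Assumed DSSP 8-state to 3-state reduction; any other symbol maps to coil "C".
DSSP8_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_three_states(ss8):
    """Collapse an 8-state DSSP string to the 3 states H, E, C."""
    return "".join(DSSP8_TO_3.get(state, "C") for state in ss8)


def q_accuracy(pred, true):
    """Percentage of residues whose predicted state equals the true state.

    Gives Q8 on 8-state strings and Q3 on 3-state strings.
    """
    if len(pred) != len(true):
        raise ValueError("pred and true must have the same length")
    if not true:
        raise ValueError("sequences must not be empty")
    matches = sum(p == t for p, t in zip(pred, true))
    return 100.0 * matches / len(true)
```

For example, `q_accuracy(reduce_to_three_states(pred8), reduce_to_three_states(true8))` yields Q3 from two 8-state strings, while `q_accuracy(pred8, true8)` yields Q8 directly.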

Journal Article

RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning

by Paliwal, Kuldip , Zhou, Yaoqi , Singh, Jaswinder in 631/114/1305 , 631/114/2397 , Algorithms

2019

The majority of our human genome transcribes into noncoding RNAs with unknown structures and functions. Obtaining functional clues for noncoding RNAs requires accurate base-pairing or secondary-structure prediction. However, the performance of such predictions by current folding-based algorithms has been stagnated for more than a decade. Here, we propose the use of deep contextual learning for base-pair prediction including those noncanonical and non-nested (pseudoknot) base pairs stabilized by tertiary interactions. Since only < 250 nonredundant, high-resolution RNA structures are available for model training, we utilize transfer learning from a model initially trained with a recent high-quality bpRNA dataset of > 10,000 nonredundant RNAs made available through comparative analysis. The resulting method achieves large, statistically significant improvement in predicting all base pairs, noncanonical and non-nested base pairs in particular. The proposed method (SPOT-RNA), with a freely available server and standalone software, should be useful for improving RNA structure modeling, sequence alignment, and functional annotations. The limited availability of high-resolution 3D RNA structures for model training limits RNA secondary structure prediction. Here, the authors overcome this challenge by pre-training a DNN on a large set of predicted RNA structures and using transfer learning with high-resolution structures.

Journal Article

Molecular interactions underlying liquid−liquid phase separation of the FUS low-complexity domain

by Dignon, Gregory L , Parekh, Sapun H , Fawzi, Nicolas L in Complexity , Computer simulation , Engineering

2019

The low-complexity domain of the RNA-binding protein FUS (FUS LC) mediates liquid−liquid phase separation (LLPS), but the interactions between the repetitive SYGQ-rich sequence of FUS LC that stabilize the liquid phase are not known in detail. By combining NMR and Raman spectroscopy, mutagenesis, and molecular simulation, we demonstrate that heterogeneous interactions involving all residue types underlie LLPS of human FUS LC. We find no evidence that FUS LC adopts conformations with traditional secondary structure elements in the condensed phase; rather, it maintains conformational heterogeneity. We show that hydrogen bonding, π/sp2, and hydrophobic interactions all contribute to stabilizing LLPS of FUS LC. In addition to contributions from tyrosine residues, we find that glutamine residues also participate in contacts leading to LLPS of FUS LC. These results support a model in which FUS LC forms dynamic, multivalent interactions via multiple residue types and remains disordered in the densely packed liquid phase.

Journal Article

TruMPET: A New Method for Protein Secondary Structure Prediction Using Neural Networks Trained on Multiple Pre-Selected Physicochemical and Structural Features

by Milchevskiy, Yury V. , Kravatsky, Yury V. , Kravatskaya, Galina I. in Accuracy , Amino acids , Computational Biology - methods

2025

Protein structure prediction continues to pose multiple challenges, despite the progress made by ML. While recent deep learning models have achieved a strong performance using embeddings from protein language models, they often ignore non-canonical amino acids and rely heavily on sequence alignments or evolutionary profiles. Here, we present an improvement to this approach for predicting the secondary protein structure of DSSP classes solely from amino acid sequences. We suggest that ML feature sets should be generated from statistically significant mutually uncorrelated descriptors. The selection of statistically assessed descriptors, including predicting the physicochemical parameters of non-canonical amino acids, is a key component of the proposed method. The statistical significance and influence of each of the suggested features were assessed using a two-step Linear Discriminant Analysis, which permitted the evaluation of the statistical significance of each descriptor and their impact on model accuracy. We applied the set of 109 most influential statistically significant descriptors as a learning model for the two-layer Bi-LSTM network combined with ESMFold2 embeddings. Our method, TruMPET (Training upon Multiple Pre-selected Elements Technique), outperformed all other methods reported in the literature for the non-redundant datasets (CB513: DSSP Q3 = 91.36% and Q8 = 85.41%, TEST2018: DSSP Q3 = 90.64% and Q8 = 84.17%).

Journal Article

Template-based protein structure modeling using the RaptorX web server

by Lu, Hui , Wang, Haipeng , Wang, Zhiyong in 631/114/2411 , 631/1647/48 , 82/81

2012

A key challenge of modern biology is to uncover the functional role of the protein entities that compose cellular proteomes. To this end, the availability of reliable three-dimensional atomic models of proteins is often crucial. This protocol presents a community-wide web-based method using RaptorX (http://raptorx.uchicago.edu/) for protein secondary structure prediction, template-based tertiary structure modeling, alignment quality assessment and sophisticated probabilistic alignment sampling. RaptorX distinguishes itself from other servers by the quality of the alignment between a target sequence and one or multiple distantly related template proteins (especially those with sparse sequence profiles) and by a novel nonlinear scoring function and a probabilistic-consistency algorithm. Consequently, RaptorX delivers high-quality structural models for many targets with only remote templates. At present, it takes RaptorX ∼35 min to finish processing a sequence of 200 amino acids. Since its official release in August 2011, RaptorX has processed ∼6,000 sequences submitted by ∼1,600 users from around the world.

Journal Article

Share this book

Add to My Shelf

Language Selector

MBRLGlobalSearch

Language Selector

Catalogue Search | MBRL

Search Results Heading

Explore the vast range of titles available.

MBRLSearchResults

MBRLHappinessMeter