50 result(s) for "Vucetic, Slobodan"
The generative capacity of probabilistic protein sequence models
Potts models and variational autoencoders (VAEs) have recently gained popularity as generative protein sequence models (GPSMs) to explore fitness landscapes and predict mutation effects. Despite encouraging results, current model evaluation metrics leave unclear whether GPSMs faithfully reproduce the complex multi-residue mutational patterns observed in natural sequences due to epistasis. Here, we develop a set of sequence statistics to assess the “generative capacity” of three current GPSMs: the pairwise Potts Hamiltonian, the VAE, and the site-independent model. We show that the Potts model’s generative capacity is largest, as the higher-order mutational statistics generated by the model agree with those observed for natural sequences, while the VAE’s generative capacity lies between those of the Potts and site-independent models. Importantly, our work provides a new framework for evaluating and interpreting GPSM accuracy which emphasizes the role of higher-order covariation and epistasis, with broader implications for probabilistic sequence models in general. Generative models have become increasingly popular in protein design, yet rigorous metrics that allow the comparison of these models are lacking. Here, the authors propose a set of such metrics and use them to compare three popular models.
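The abstract above rests on comparing higher-order mutational statistics between natural and model-generated alignments. A minimal sketch of that idea follows; the function names and the simple mean-absolute-difference agreement measure are illustrative assumptions, not the paper's actual statistics:

```python
from collections import Counter
from itertools import combinations

def higher_order_freqs(msa, positions):
    """Frequency of each residue combination at a fixed set of alignment columns."""
    counts = Counter(tuple(seq[p] for p in positions) for seq in msa)
    total = len(msa)
    return {combo: n / total for combo, n in counts.items()}

def freq_agreement(natural_msa, generated_msa, order=3):
    """Mean absolute difference between natural and generated combination
    frequencies, averaged over all column subsets of the given order.
    0.0 means the generated alignment reproduces the natural higher-order
    statistics exactly at this order (a toy stand-in for generative capacity)."""
    ncol = len(natural_msa[0])
    diffs = []
    for positions in combinations(range(ncol), order):
        f_nat = higher_order_freqs(natural_msa, positions)
        f_gen = higher_order_freqs(generated_msa, positions)
        keys = set(f_nat) | set(f_gen)
        diffs.append(sum(abs(f_nat.get(k, 0.0) - f_gen.get(k, 0.0))
                         for k in keys) / len(keys))
    return sum(diffs) / len(diffs)
```

A site-independent model can match single-column frequencies while failing this check at order 2 and above, which is the distinction the paper's framework is designed to expose.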
Leveraging shortest dependency paths in low-resource biomedical relation extraction
Background Biomedical Relation Extraction (RE) is essential for uncovering complex relationships between biomedical entities within text. However, training RE classifiers is challenging in low-resource biomedical applications with few labeled examples. Methods We explore the potential of Shortest Dependency Paths (SDPs) to aid biomedical RE, especially in situations with limited labeled examples. In this study, we suggest various approaches to employ SDPs when creating word and sentence representations under supervised, semi-supervised, and in-context-learning settings. Results Through experiments on three benchmark biomedical text datasets, we find that incorporating SDP-based representations enhances the performance of RE classifiers. The improvement is especially notable when working with small amounts of labeled data. Conclusion SDPs offer valuable insights into the complex sentence structure found in many biomedical text passages. Our study introduces several straightforward techniques that, as demonstrated experimentally, effectively enhance the accuracy of RE classifiers.
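The shortest dependency path (SDP) at the core of the abstract above is just the shortest path between two entity tokens in a sentence's dependency graph. A self-contained sketch using breadth-first search is shown below; in a real pipeline the edges would come from a syntactic parser, and the edge list here is a made-up example:

```python
from collections import deque

def shortest_dependency_path(edges, source, target):
    """Return the SDP between two tokens as a list of tokens, computed by BFS
    over the undirected dependency graph given as (head, dependent) pairs.
    Returns None if the tokens are not connected."""
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, set()).add(dep)
        graph.setdefault(dep, set()).add(head)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

The appeal for low-resource RE is that the SDP typically retains the trigger word connecting the two entities while discarding most of the clause, shrinking the input a classifier must learn from.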
A new clustering and nomenclature for beta turns derived from high-resolution protein structures
Protein loops connect regular secondary structures and contain 4-residue beta turns which represent 63% of the residues in loops. The commonly used classification of beta turns (Type I, I', II, II', VIa1, VIa2, VIb, and VIII) was developed in the 1970s and 1980s from analysis of a small number of proteins of average resolution, and represents only two thirds of beta turns observed in proteins (with a generic class Type IV representing the rest). We present a new clustering of beta-turn conformations from a set of 13,030 turns from 1074 ultra-high resolution protein structures (≤1.2 Å). Our clustering is derived from applying the DBSCAN and k-medoids algorithms to this data set with a metric commonly used in directional statistics applied to the set of dihedral angles from the second and third residues of each turn. We define 18 turn types compared to the 8 classical turn types in common use. We propose a new 2-letter nomenclature for all 18 beta-turn types using Ramachandran region names for the two central residues (e.g., 'A' and 'D' for alpha regions on the left side of the Ramachandran map and 'a' and 'd' for equivalent regions on the right-hand side; classical Type I turns are 'AD' turns and Type I' turns are 'ad'). We identify 11 new types of beta turn, 5 of which are sub-types of classical beta-turn types. Up-to-date statistics, probability densities of conformations, and sequence profiles of beta turns in loops were collected and analyzed. A library of turn types, BetaTurnLib18, and cross-platform software, BetaTurnTool18, which identifies turns in an input protein structure, are freely available and redistributable from dunbrack.fccc.edu/betaturn and github.com/sh-maxim/BetaTurn18. Given the ubiquitous nature of beta turns, this comprehensive study updates understanding of beta turns and should also provide useful tools for protein structure determination, refinement, and prediction programs.
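The clustering above relies on a distance between turns defined over the dihedral angles of the two central residues. A plausible sketch of such a metric, using the angular distance standard in directional statistics (the exact form used by the paper may differ), is:

```python
import math

def dihedral_distance(turn_a, turn_b):
    """Distance between two beta turns, each given as the (phi, psi) dihedral
    angles in degrees of its two central residues: (phi2, psi2, phi3, psi3).
    Uses d^2 = sum_i 2*(1 - cos(a_i - b_i)), which wraps correctly at
    +/-180 degrees, unlike a naive Euclidean distance on raw angles."""
    sq = sum(2.0 * (1.0 - math.cos(math.radians(a - b)))
             for a, b in zip(turn_a, turn_b))
    return math.sqrt(sq)
```

Wrapping matters here: 179° and -179° are only 2° apart on the circle, and a metric fed to DBSCAN or k-medoids must reflect that, or conformations straddling the ±180° boundary would be split across clusters.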
MS-kNN: protein function prediction by integrating multiple data sources
Protein function determination is a key challenge in the post-genomic era. Experimental determination of protein functions is accurate, but time-consuming and resource-intensive. A cost-effective alternative is to use the known information about sequence, structure, and functional properties of genes and proteins to predict functions using statistical methods. In this paper, we describe the Multi-Source k-Nearest Neighbor (MS-kNN) algorithm for function prediction, which finds k-nearest neighbors of a query protein based on different types of similarity measures and predicts its function by weighted averaging of its neighbors' functions. Specifically, we used 3 data sources to calculate the similarity scores: sequence similarity, protein-protein interactions, and gene expressions. We report the results in the context of the 2011 Critical Assessment of Function Annotation (CAFA). Prior to the CAFA submission deadline, we evaluated our algorithm on 1,302 human test proteins that were represented in all 3 data sources. Using only the sequence similarity information, MS-kNN had term-based Area Under the Curve (AUC) accuracy of Gene Ontology (GO) molecular function predictions of 0.728 when 7,412 human training proteins were used, and 0.819 when 35,622 training proteins from multiple eukaryotic and prokaryotic organisms were used. By aggregating predictions from all three sources, the AUC was further improved to 0.848. A similar result was observed for the prediction of GO biological processes. Testing on 595 proteins that were annotated after the CAFA submission deadline showed that overall MS-kNN accuracy was higher than that of the baseline algorithms Gotcha and BLAST, which were based solely on sequence similarity information. Since only 10 of the 595 proteins were represented by all 3 data sources, and 66 by two data sources, the difference between 3-source and one-source MS-kNN was rather small. Based on our results, we have several useful insights: (1) the k-nearest neighbor algorithm is an efficient and effective model for protein function prediction; (2) it is beneficial to transfer functions across a wide range of organisms; (3) it is helpful to integrate multiple sources of protein information.
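The MS-kNN scheme described above is simple enough to sketch directly. The data layout and the uniform averaging of sources below are illustrative assumptions; the paper's actual implementation may weight sources and normalize similarities differently:

```python
def ms_knn_predict(similarities, labels, k=2):
    """Sketch of multi-source kNN function prediction.  `similarities` maps
    each data source name (e.g. sequence, PPI, expression) to a dict of
    {neighbor: similarity-to-query}; `labels` maps each neighbor protein to
    its set of annotated function terms.  Per source, the score of a term is
    the similarity-weighted fraction of the k nearest neighbors carrying it;
    per-source scores are then averaged with uniform source weights."""
    combined = {}
    for source, sims in similarities.items():
        top = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:k]
        norm = sum(s for _, s in top) or 1.0
        per_source = {}
        for neighbor, s in top:
            for term in labels.get(neighbor, ()):
                per_source[term] = per_source.get(term, 0.0) + s / norm
        for term, score in per_source.items():
            combined[term] = combined.get(term, 0.0) + score / len(similarities)
    return combined
```

Because each source contributes an independent neighbor ranking, a protein missing from one source (a common situation, as the abstract notes for the post-deadline test set) simply contributes nothing from that source rather than breaking the prediction.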
Improving medical term embeddings using UMLS Metathesaurus
Background Health providers create Electronic Health Records (EHRs) to describe the conditions and procedures used to treat their patients. Medical notes entered by medical staff in the form of free text are a particularly insightful component of EHRs. There is a great interest in applying machine learning tools on medical notes in numerous medical informatics applications. Learning vector representations, or embeddings, of terms in the notes, is an important pre-processing step in such applications. However, learning good embeddings is challenging because medical notes are rich in specialized terminology, and the number of available EHRs in practical applications is often very small. Methods In this paper, we propose a novel algorithm to learn embeddings of medical terms from a limited set of medical notes. The algorithm, called definition2vec, exploits external information in the form of medical term definitions. It is an extension of a skip-gram algorithm that incorporates textual definitions of medical terms provided by the Unified Medical Language System (UMLS) Metathesaurus. Results To evaluate the proposed approach, we used a publicly available Medical Information Mart for Intensive Care (MIMIC-III) EHR data set. We performed quantitative and qualitative experiments to measure the usefulness of the learned embeddings. The experimental results show that definition2vec keeps the semantically similar medical terms together in the embedding vector space even when they are rare or unobserved in the corpus. We also demonstrate that learned vector embeddings are helpful in downstream medical informatics applications. Conclusion This paper shows that medical term definitions can be helpful when learning embeddings of rare or previously unseen medical terms from a small corpus of specialized documents such as medical notes.
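One way to picture a definition-augmented skip-gram is at the level of training pairs: besides the usual sliding-window (center, context) pairs, a term with an external definition also emits pairs with its definition words. The pairing scheme below is an illustrative assumption in the spirit of definition2vec, not the paper's exact objective:

```python
def skipgram_pairs(tokens, definitions, window=2):
    """Generate (center, context) training pairs for a skip-gram model.
    `definitions` maps a medical term to a list of words from its external
    definition (e.g. from the UMLS Metathesaurus).  Terms with definitions
    receive extra (term, definition-word) pairs, so rare or unseen terms
    still get training signal beyond their few corpus occurrences."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
        for def_word in definitions.get(center, ()):
            pairs.append((center, def_word))
    return pairs
```

Since the definition words share embeddings with ordinary corpus words, two rare terms with overlapping definitions are pulled toward each other even if they never co-occur in the notes, which matches the clustering behavior the abstract reports.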
Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction
Protein secondary structure prediction remains a vital topic with broad applications. Due to lack of a widely accepted standard in secondary structure predictor evaluation, a fair comparison of predictors is challenging. A detailed examination of factors that contribute to higher accuracy is also lacking. In this paper, we present: (1) new test sets, Test2018, Test2019, and Test2018-2019, consisting of proteins from structures released in 2018 and 2019 with less than 25% identity to any protein published before 2018; (2) a 4-layer convolutional neural network, SecNet, with an input window of ±14 amino acids which was trained on proteins ≤25% identical to proteins in Test2018 and the commonly used CB513 test set; (3) an additional test set that shares no homologous domains with the training set proteins, according to the Evolutionary Classification of Proteins (ECOD) database; (4) a detailed ablation study where we reverse one algorithmic choice at a time in SecNet and evaluate the effect on the prediction accuracy; (5) new 4- and 5-label prediction alphabets that may be more practical for tertiary structure prediction methods. The 3-label accuracy (helix, sheet, coil) of the leading predictors on both Test2018 and CB513 is 81-82%, while SecNet's accuracy is 84% for both sets. Accuracy on the non-homologous ECOD set is only 0.6 points (83.9%) lower than the results on the Test2018-2019 set (84.5%). The ablation study of features, neural network architecture, and training hyper-parameters suggests the best accuracy results are achieved with good choices for each of them while the neural network architecture is not as critical as long as it is not too simple. Protocols for generating and using unbiased test, validation, and training sets are provided. Our data sets, including input features and assigned labels, and SecNet software including third-party dependencies and databases, are downloadable from dunbrack.fccc.edu/ss and github.com/sh-maxim/ss.
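The ±14-residue input window mentioned above is a standard sliding-window construction: each residue is classified from itself plus 14 neighbors on each side, with out-of-range positions padded. A minimal sketch (the padding symbol and string representation are assumptions; SecNet's real inputs are feature vectors, not raw letters):

```python
def sliding_windows(sequence, half_width=14, pad="X"):
    """Return one fixed-length window per residue of `sequence`, each of
    length 2*half_width + 1, padding past the chain termini with `pad`.
    Window i is the model input for predicting the label of residue i."""
    padded = pad * half_width + sequence + pad * half_width
    width = 2 * half_width + 1
    return [padded[i:i + width] for i in range(len(sequence))]
```

With half_width=14 this yields the 29-residue windows implied by the abstract; the ablation study's point is that choices like this window size matter more than fine details of the network architecture.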
Evolutionary sparse learning reveals the shared genetic basis of convergent traits
Cases abound in which nearly identical traits have appeared in distant species facing similar environments. These unmistakable examples of adaptive evolution offer opportunities to gain insight into their genetic origins and mechanisms through comparative analyses. Here, we present an approach to build genetic models that underlie the independent origins of convergent traits using evolutionary sparse learning with paired species contrast (ESL-PSC). We tested the hypothesis that common genes and sites are involved in the convergent evolution of two key traits: C4 photosynthesis in grasses and echolocation in mammals. Genetic models were highly predictive of independent cases of convergent evolution of C4 photosynthesis. Genes contributing to genetic models for echolocation were highly enriched for functional categories related to hearing, sound perception, and deafness, a pattern that has eluded previous efforts applying standard molecular evolutionary approaches. These results support the involvement of sequence substitutions at common genetic loci in the evolution of convergent traits. Benchmarking on empirical and simulated datasets showed that ESL-PSC could be more sensitive in proteome-scale analyses to detect genes with convergent molecular evolution associated with the acquisition of convergent traits. We conclude that phylogeny-informed machine learning naturally excludes apparent molecular convergences due to shared species history, enhances the signal-to-noise ratio for detecting molecular convergence, and empowers the discovery of common genetic bases of trait convergences. In convergent trait evolution, similar genetic changes may occur independently, which can help to pinpoint biological mechanisms. Here, the authors present a machine learning technique that links genes to convergent phenotypes, revealing strong enrichments of hearing-related proteins underlying echolocation in mammals.
Geographic clustering of cutaneous T-cell lymphoma in New Jersey
Purpose Cutaneous T-cell lymphoma (CTCL) is a rare type of non-Hodgkin lymphoma. Previous studies have reported geographic clustering of CTCL based on the residence at the time of diagnosis. We explore geographic clustering of CTCL using both the residence at the time of diagnosis and past residences, using data from the New Jersey State Cancer Registry. Methods CTCL cases (n = 1,163) diagnosed between 2006 and 2014 were matched to colon cancer controls (n = 17,049) on sex, age, race/ethnicity, and birth year. Jacquez's Q-statistic was used to identify temporal clustering of cases compared to controls. Geographic clustering was assessed using the Bernoulli-based scan-statistic to compare cases to controls, and the Poisson-based scan-statistic to compare the observed number of cases to the number expected based on the general population. Significant clusters (p < 0.05) were mapped, and standardized incidence ratios (SIR) reported. We adjusted for diagnosis year, sex, and age. Results The Q-statistic identified significant temporal clustering of cases based on past residences in the study area from 1992 to 2002. A cluster was detected in 1992 in Bergen County in northern New Jersey based on the Bernoulli (1992 SIR 1.84) and Poisson (1992 SIR 1.86) scan-statistics. Using the Poisson scan-statistic with the diagnosis location, we found evidence of an elevated risk in this same area, but the results were not statistically significant. Conclusion There is evidence of geographic clustering of CTCL cases in New Jersey based on past residences. Additional studies are necessary to understand the possible reasons for the excess of CTCL cases living in this specific area some 8–14 years prior to diagnosis.
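The two quantities reported above are easy to make concrete. The SIR is observed over expected cases, and the Poisson-based scan statistic scores each candidate window by a likelihood ratio in the style of Kulldorff's spatial scan statistic; the sketch below follows that standard formulation, which the study's software presumably implements, and is not taken verbatim from the paper:

```python
import math

def standardized_incidence_ratio(observed, expected):
    """SIR for a cluster: observed case count divided by the count expected
    under the null (e.g. population rates adjusted for age, sex, year)."""
    return observed / expected

def poisson_scan_llr(observed, expected, total_cases):
    """Log-likelihood ratio for one candidate window under a Poisson-based
    scan statistic (Kulldorff-style): `observed`/`expected` cases inside the
    window, `total_cases` observed overall.  Only an excess of cases counts;
    the window maximizing this value is the most likely cluster, and its
    p-value is then obtained by Monte Carlo replication."""
    if observed <= expected:
        return 0.0
    rest_obs = total_cases - observed
    rest_exp = total_cases - expected
    llr = observed * math.log(observed / expected)
    if rest_obs > 0:
        llr += rest_obs * math.log(rest_obs / rest_exp)
    return llr
```

Scanning many overlapping windows inflates the maximum statistic by construction, which is why significance comes from Monte Carlo replicates rather than a single chi-squared cutoff.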
EHR phenotyping via jointly embedding medical concepts and words into a unified vector space
Background There has been an increasing interest in learning low-dimensional vector representations of medical concepts from Electronic Health Records (EHRs). Vector representations of medical concepts facilitate exploratory analysis and predictive modeling of EHR data to gain insights about the patterns of care and health outcomes. EHRs contain structured data such as diagnostic codes and laboratory tests, as well as unstructured free text data in the form of clinical notes, which provide more detail about the condition and treatment of patients. Methods In this work, we propose a method that jointly learns vector representations of medical concepts and words. This is achieved by a novel learning scheme based on the word2vec model. Our model learns those relationships by integrating clinical notes and sets of accompanying medical codes and by defining joint contexts for each observed word and medical code. Results In our experiments, we learned joint representations using MIMIC-III data. Using the learned representations of words and medical codes, we evaluated phenotypes for 6 diseases discovered by our method and a baseline method. The experimental results show that for each of the 6 diseases our method finds highly relevant words. We also show that our representations can be very useful when predicting the reason for the next visit. Conclusions The jointly learned representations of medical concepts and words capture not only similarity between codes or words themselves, but also similarity between codes and words. They can be used to extract phenotypes of different diseases. The representations learned by the joint model are also useful for the construction of patient features.
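The "joint contexts" idea above can be sketched as training-pair generation: words keep their usual sliding-window word contexts, and word/code pairs from the same encounter are added in both directions so codes and words land in one vector space. The exact context definition below is an illustrative assumption, not the paper's precise scheme:

```python
def joint_contexts(note_tokens, codes, window=2):
    """Generate (target, context) pairs for a word2vec-style model over one
    clinical encounter: `note_tokens` are the words of the note, `codes` the
    medical codes assigned to the same encounter.  Words get sliding-window
    word contexts; every co-occurring (word, code) pair is also emitted in
    both directions, tying the two vocabularies into a shared space."""
    pairs = []
    for i, word in enumerate(note_tokens):
        lo, hi = max(0, i - window), min(len(note_tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((word, note_tokens[j]))
        for code in codes:
            pairs.append((word, code))   # word predicts the encounter's codes
            pairs.append((code, word))   # each code predicts the note's words
    return pairs
```

Because a code's context is the full note rather than a local window, its embedding ends up near the words clinicians actually use when that code applies, which is what makes the phenotype extraction described in the abstract possible.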
Unfoldomics of human diseases: linking protein intrinsic disorder with diseases
Background Intrinsically disordered proteins (IDPs) and intrinsically disordered regions (IDRs) lack stable tertiary and/or secondary structure yet fulfill key biological functions. The recent recognition of IDPs and IDRs is leading to an entire field aimed at their systematic structural characterization and at determination of their mechanisms of action. Bioinformatics studies showed that IDPs and IDRs are highly abundant in different proteomes and carry out mostly regulatory functions related to molecular recognition and signal transduction. These activities complement the functions of structured proteins. IDPs and IDRs were shown to participate in both one-to-many and many-to-one signaling. Alternative splicing and posttranslational modifications are frequently used to tune the IDP functionality. Several individual IDPs were shown to be associated with human diseases, such as cancer, cardiovascular disease, amyloidoses, diabetes, neurodegenerative diseases, and others. This raises questions regarding the involvement of IDPs and IDRs in various diseases. Results IDPs and IDRs were shown to be highly abundant in proteins associated with various human maladies. As the number of IDPs related to various diseases was found to be very large, the concepts of the disease-related unfoldome and unfoldomics were introduced. Novel bioinformatics tools were proposed to populate and characterize the disease-associated unfoldome. Structural characterization of the members of the disease-related unfoldome requires specialized experimental approaches. IDPs possess a number of unique structural and functional features that determine their broad involvement in the pathogenesis of various diseases. Conclusion Proteins associated with various human diseases are enriched in intrinsic disorder. These disease-associated IDPs and IDRs are real, abundant, diversified, vital, and dynamic. These proteins and regions comprise the disease-related unfoldome, which covers a significant part of the human proteome. The profound association between intrinsic disorder and various human diseases is determined by a set of unique structural and functional characteristics of IDPs and IDRs. Unfoldomics of human diseases utilizes unrivaled bioinformatics and experimental techniques, paves the road for better understanding of human diseases, their pathogenesis and molecular mechanisms, and helps develop new strategies for the analysis of disease-related proteins.