Catalogue Search | MBRL

Using deep learning to annotate the protein universe

by Bateman, Alex , Bileschi, Maxwell L. , Carter, Brandon in 631/114/1305 , 631/114/2410 , 631/1647/48

2022

Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools. A deep learning model predicts protein functional annotations for unaligned amino acid sequences.

Journal Article

A universal SNP and small-indel variant caller using deep neural networks

by Schwartz, Scott , Newburger, Dan , McLean, Cory Y in 45/23 , 631/114/1305 , 631/114/2785

2018

DeepVariant uses convolutional neural networks to improve the accuracy of variant calling. Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships between images of read pileups around putative variant and true genotype calls. The approach, called DeepVariant, outperforms existing state-of-the-art tools. The learned model generalizes across genome builds and mammalian species, allowing nonhuman sequencing projects to benefit from the wealth of human ground-truth data. We further show that DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, including deep whole genomes from 10X Genomics and Ion Ampliseq exomes, highlighting the benefits of using more automated and generalizable techniques for variant calling.

Journal Article

A framework for variation discovery and genotyping using next-generation DNA sequencing data

by Rivas, Manuel A , Philippakis, Anthony A , Banks, Eric in 631/208/2489/144 , 631/208/514/2254 , Agriculture

2011

Mark DePristo and colleagues report an analytical framework to discover and genotype variation using whole exome and genome resequencing data from next-generation sequencing technologies. They apply these methods to low-pass population sequencing data from the 1000 Genomes Project. Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (∼4×) 1000 Genomes Project datasets.

Journal Article

RNA profiles reveal signatures of future health and disease in pregnancy

by McElrath, Thomas F. , Roberts, James M. , Litch, James A. in 38/39 , 45/90 , 631/208/199

2022

Maternal morbidity and mortality continue to rise, and pre-eclampsia is a major driver of this burden¹. Yet the ability to assess underlying pathophysiology before clinical presentation to enable identification of pregnancies at risk remains elusive. Here we demonstrate the ability of plasma cell-free RNA (cfRNA) to reveal patterns of normal pregnancy progression and determine the risk of developing pre-eclampsia months before clinical presentation. Our results centre on comprehensive transcriptome data from eight independent prospectively collected cohorts comprising 1,840 racially diverse pregnancies and retrospective analysis of 2,539 banked plasma samples. The pre-eclampsia data include 524 samples (72 cases and 452 non-cases) from two diverse independent cohorts collected 14.5 weeks (s.d., 4.5 weeks) before delivery. We show that cfRNA signatures from a single blood draw can track pregnancy progression at the placental, maternal and fetal levels and can robustly predict pre-eclampsia, with a sensitivity of 75% and a positive predictive value of 32.3% (s.d., 3%), which is superior to the state-of-the-art method². cfRNA signatures of normal pregnancy progression and pre-eclampsia are independent of clinical factors, such as maternal age, body mass index and race, which cumulatively account for less than 1% of model variance. Further, the cfRNA signature for pre-eclampsia contains gene features linked to biological processes implicated in the underlying pathophysiology of pre-eclampsia. Expression signatures from cell-free RNA of pregnant women can be used to reveal normal biology of pregnancy and predict development of pre-eclampsia.

Journal Article

Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome

by Ebler, Jana , Schatz, Michael C , Rank, David R in Assembly , Deoxyribonucleic acid , DNA sequencing

2019

The DNA sequencing technologies in use today produce either highly accurate short reads or less-accurate long reads. We report the optimization of circular consensus sequencing (CCS) to improve the accuracy of single-molecule real-time (SMRT) sequencing (PacBio) and generate highly accurate (99.8%) long high-fidelity (HiFi) reads with an average length of 13.5 kilobases (kb). We applied our approach to sequence the well-characterized human HG002/NA24385 genome and obtained precision and recall rates of at least 99.91% for single-nucleotide variants (SNVs), 95.98% for insertions and deletions <50 bp (indels) and 95.99% for structural variants. Our CCS method matches or exceeds the ability of short-read sequencing to detect small variants and structural variants. We estimate that 2,434 discordances are correctable mistakes in the ‘genome in a bottle’ (GIAB) benchmark set. Nearly all (99.64%) variants can be phased into haplotypes, further improving variant detection. De novo genome assembly using CCS reads alone produced a contiguous and accurate genome with a contig N50 of >15 megabases (Mb) and concordance of 99.997%, substantially outperforming assembly with less-accurate long reads.
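The contig N50 quoted in this abstract is a standard contiguity statistic: the contig length L such that contigs of length L or longer together cover at least half of the assembly. The following is a minimal sketch of that definition only; the function name `n50` and the toy lengths are illustrative assumptions and are not taken from the article.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L
    together cover at least half of the total assembled length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy contig lengths (hypothetical, not data from the article).
toy_contigs = [2, 3, 4, 5, 6, 10]
toy_n50 = n50(toy_contigs)
```

For the toy list the total is 30, so the two longest contigs (10 and 6) are the first to reach half the total, and the N50 is 6.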

Journal Article

Pacific biosciences sequencing technology for genotyping and variation discovery in human data

by Russ, Carsten , Nusbaum, Chad , DePristo, Mark A in Analysis , Animal Genetics and Genomics , Biomedical and Life Sciences

2012

Background: Pacific Biosciences technology provides a fundamentally new data type that provides the potential to overcome some limitations of current next generation sequencing platforms by providing significantly longer reads, single molecule sequencing, low composition bias and an error profile that is orthogonal to other platforms. With these potential advantages in mind, we here evaluate the utility of the Pacific Biosciences RS platform for human medical amplicon resequencing projects. Results: We evaluated the Pacific Biosciences technology for SNP discovery in medical resequencing projects using the Genome Analysis Toolkit, observing high sensitivity and specificity for calling differences in amplicons containing known true or false SNPs. We assessed data quality: most errors were indels (~14%) with few apparent miscalls (~1%). In this work, we define a custom data processing pipeline for Pacific Biosciences data for human data analysis. Conclusion: Critically, the error properties were largely free of the context-specific effects that affect other sequencing technologies. These data show excellent utility for follow-up validation and extension studies in human data and medical genetics projects, but can be extended to other organisms with a reference genome.

Journal Article

Darwinian Evolution Can Follow Only Very Few Mutational Paths to Fitter Proteins

by Hartl, Daniel L , DePristo, Mark A , Weinreich, Daniel M in Alleles , Anti-Bacterial Agents - pharmacology , Antibiotics

2006

Five point mutations in a particular β-lactamase allele jointly increase bacterial resistance to a clinically important antibiotic by a factor of approximately 100,000. In principle, evolution to this high-resistance β-lactamase might follow any of the 120 mutational trajectories linking these alleles. However, we demonstrate that 102 trajectories are inaccessible to Darwinian selection and that many of the remaining trajectories have negligible probabilities of realization, because four of these five mutations fail to increase drug resistance in some combinations. Pervasive biophysical pleiotropy within the β-lactamase seems to be responsible, and because such pleiotropy appears to be a general property of missense mutations, we conclude that much protein evolution will be similarly constrained. This implies that the protein tape of life may be largely reproducible and even predictable.
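The figure of 120 trajectories in this abstract is the number of orders in which five mutations can be acquired one at a time (5! = 120). The sketch below only checks that arithmetic, along with the count left once the 102 inaccessible trajectories are removed; the mutation labels are hypothetical placeholders, not the names used in the article.

```python
from itertools import permutations
from math import factorial

# Placeholder labels for the five point mutations (hypothetical names).
mutations = ["m1", "m2", "m3", "m4", "m5"]

# A trajectory is one order in which all five mutations are acquired.
trajectories = list(permutations(mutations))
total = len(trajectories)

# The abstract states that 102 trajectories are inaccessible to selection.
inaccessible = 102
remaining = total - inaccessible
```

Under these counts, 18 trajectories remain, and the abstract notes that many of those have negligible probabilities of realization.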

Journal Article

A framework for the interpretation of de novo mutation in human disease

by Purcell, Shaun M , Schellenberg, Gerard D , Buxbaum, Joseph D in 45/23 , 631/208/1516 , 631/208/212

2014

Mark Daly and colleagues present a statistical framework to evaluate the role of de novo mutations in human disease by calibrating a model of de novo mutation rates at the individual gene level. The mutation probabilities defined by their model and list of constrained genes can be used to help identify genetic variants that have a significant role in disease. Spontaneously arising (de novo) mutations have an important role in medical genetics. For diseases with extensive locus heterogeneity, such as autism spectrum disorders (ASDs), the signal from de novo mutations is distributed across many genes, making it difficult to distinguish disease-relevant mutations from background variation. Here we provide a statistical framework for the analysis of excesses in de novo mutation per gene and gene set by calibrating a model of de novo mutation. We applied this framework to de novo mutations collected from 1,078 ASD family trios, and, whereas we affirmed a significant role for loss-of-function mutations, we found no excess of de novo loss-of-function mutations in cases with IQ above 100, suggesting that the role of de novo mutations in ASDs might reside in fundamental neurodevelopmental processes. We also used our model to identify ∼1,000 genes that are significantly lacking in functional coding variation in non-ASD samples and are enriched for de novo loss-of-function mutations identified in ASD cases.

Journal Article

Molecular subtyping of hypertensive disorders of pregnancy

by McElrath, Thomas F. , Biggio, Joseph R. , Gyamfi-Bannerman, Cynthia in 38/91 , 631/337/2019 , 692/4017

2025

Hypertensive disorders of pregnancy (HDP), including preeclampsia, affect 1 in 6 pregnancies, are major contributors to maternal morbidity and mortality, yet lack precision medicine strategies. Analyzing transcriptomic data from a prospectively-collected diverse cohort (n = 9102), this study reveals distinct RNA subtypes in maternal blood, reclassifying clinical HDP phenotypes like early/late-onset preeclampsia. The placental gene PAPPA2 strongly predicts the most severe forms of preeclampsia in individuals without pre-existing high risk factors, months before symptoms, and its overexpression correlates with earlier delivery in a dose-dependent manner. Further, molecular subtypes characterized by immune genes are upregulated in less severe forms of HDP. These results reclassify HDP clinical phenotypes into two distinct molecular subtypes, placental-associated or immune-associated. Validation performance for placental-associated HDP yields an AUC of 0.88 in the advanced maternal age population without pre-existing high risk factors. Molecular subtypes create new opportunities to apply precision-based medicine in maternal health. The molecular etiology of hypertensive disorders of pregnancy is largely unknown. Here the authors show from a prospective study of diverse pregnancies that the disease can be split into molecular subtypes based on RNA data and validated a classifier for individuals with no preexisting high risk factors.

Journal Article

Exome Sequencing, ANGPTL3 Mutations, and Familial Combined Hypolipidemia

by Hobbs, Helen H , Engert, James C , Sougnez, Carrie in Angiopoietin-Like Protein 3 , Angiopoietin-like Proteins , Angiopoietins - genetics

2010

Two family members with combined hypolipidemia (low HDL and LDL cholesterol and low triglycerides) were evaluated and found to be compound heterozygotes, each for a different nonsense mutation in ANGPTL3, the gene encoding the angiopoietin-like 3 protein. Familial hypobetalipoproteinemia is an inherited disorder of lipid metabolism defined by very low levels (<5th percentile of age- and sex-specific values) of plasma apolipoprotein B and LDL cholesterol. Familial hypobetalipoproteinemia is genetically heterogeneous.¹,² The best-characterized cases have been linked to mutations in the gene encoding apolipoprotein B (APOB) that lead to less apolipoprotein B synthesis and reduced secretion of very-low-density lipoprotein (VLDL) from the liver. As a consequence of impaired hepatic export of VLDL, persons with familial hypobetalipoproteinemia due to a deficiency of apolipoprotein B are prone to hepatic steatosis.³,⁴ Persons with hypobetalipoproteinemia may also have . . .

Journal Article
