Catalogue Search | MBRL
Search Results
Explore the vast range of titles available.
1,839 result(s) for "631/1647/48"
A guide to machine learning for biologists
by Moffat, Lewis; Jones, David T; Greener, Joe G
in Artificial neural networks; Best practice; Biological activity
2022
The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed.

Machine learning is becoming a widely used tool for the analysis of biological data. However, for experimentalists, proper use of machine learning methods can be challenging. This Review provides an overview of machine learning techniques and provides guidance on their applications in biology.
Journal Article
Genome-wide prediction of disease variant effects with a deep protein language model
by Ntranos, Vasilis; Brandes, Nadav; Wang, Charlotte H.
in 631/1647/48; 631/208/191; Agriculture
2023
Predicting the effects of coding variants is a major challenge. While recent deep-learning models have improved variant effect prediction accuracy, they cannot analyze all coding variants due to dependency on close homologs or software limitations. Here we developed a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and made all predictions available on a web portal. ESM1b outperformed existing methods in classifying ~150,000 ClinVar/HGMD missense variants as pathogenic or benign and predicting measurements across 28 deep mutational scan datasets. We further annotated ~2 million variants as damaging only in specific protein isoforms, demonstrating the importance of considering all isoforms when predicting variant effects. Our approach also generalizes to more complex coding variants such as in-frame indels and stop-gains. Together, these results establish protein language models as an effective, accurate and general approach to predicting variant effects.
A modified framework leveraging a protein language model (ESM1b) is used to predict all ~450 million possible missense variant effects in the human genome and shows potential for generalizing to more complex genetic variations such as indels and stop-gains.
Journal Article
Using MetaboAnalyst 5.0 for LC–HRMS spectra processing, multi-omics integration and covariate adjustment of global metabolomics data
2022
Liquid chromatography coupled with high-resolution mass spectrometry (LC–HRMS) has become a workhorse in global metabolomics studies with growing applications across biomedical and environmental sciences. However, outstanding bioinformatics challenges in terms of data processing, statistical analysis and functional interpretation remain critical barriers to the wider adoption of this technology. To help the user community overcome these barriers, we have made major updates to the well-established MetaboAnalyst platform (www.metaboanalyst.ca). This protocol extends the previous 2011 Nature Protocol by providing stepwise instructions on how to use MetaboAnalyst 5.0 to: optimize parameters for LC–HRMS spectra processing; obtain functional insights from peak list data; integrate metabolomics data with transcriptomics data or combine multiple metabolomics datasets; and conduct exploratory statistical analysis with complex metadata. Parameter optimization may take ~2 h to complete depending on the server load, and the remaining three stages may be executed in ~60 min.

LC–HRMS is used for metabolomics studies in the biomedical and environmental sciences. MetaboAnalyst (metaboanalyst.ca) can be used to address challenges in data processing, statistical analysis, functional interpretation and multi-omics integration.
Journal Article
Using deep learning to annotate the protein universe
by Bateman, Alex; Bileschi, Maxwell L.; Carter, Brandon
in 631/114/1305; 631/114/2410; 631/1647/48
2022
Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools.
A deep learning model predicts protein functional annotations for unaligned amino acid sequences.
Journal Article
Identifying phenotype-associated subpopulations by integrating bulk and single-cell sequencing data
2022
Bulk and single cell measurements are integrated to identify phenotype-associated subpopulations of cells.
Single-cell RNA sequencing (scRNA-seq) distinguishes cell types, states and lineages within the context of heterogeneous tissues. However, current single-cell data cannot directly link cell clusters with specific phenotypes. Here we present Scissor, a method that identifies cell subpopulations from single-cell data that are associated with a given phenotype. Scissor integrates phenotype-associated bulk expression data and single-cell data by first quantifying the similarity between each single cell and each bulk sample. It then optimizes a regression model on the correlation matrix with the sample phenotype to identify relevant subpopulations. Applied to a lung cancer scRNA-seq dataset, Scissor identified subsets of cells associated with worse survival and with TP53 mutations. In melanoma, Scissor discerned a T cell subpopulation with low PDCD1/CTLA4 and high TCF7 expression associated with an immunotherapy response. Beyond cancer, Scissor was effective in interpreting facioscapulohumeral muscular dystrophy and Alzheimer's disease datasets. Scissor identifies biologically and clinically relevant cell subpopulations from single-cell assays by leveraging phenotype and bulk-omics datasets.
Journal Article
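The Scissor abstract above describes first quantifying the similarity between each single cell and each bulk sample, then fitting a regression on the resulting correlation matrix. As a rough illustration of that first step only — not the published Scissor implementation, and using toy expression vectors — a plain-Python Pearson similarity matrix might look like:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def similarity_matrix(cells, bulks):
    """Correlate every single cell with every bulk sample.

    cells: list of per-cell expression vectors
    bulks: list of per-sample bulk expression vectors
    Returns a (n_cells x n_samples) nested list of correlations.
    """
    return [[pearson(c, b) for b in bulks] for c in cells]

# toy data: two cells, one bulk sample over three genes
cells = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
bulks = [[2.0, 4.0, 6.0]]
R = similarity_matrix(cells, bulks)  # first cell ≈ +1, second ≈ -1
```

The downstream step in the paper — the phenotype-guided regression over this matrix — involves regularization choices that this sketch does not attempt.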
Comprehensive analysis of single cell ATAC-seq data with SnapATAC
2021
Identification of the cis-regulatory elements controlling cell-type-specific gene expression patterns is essential for understanding the origin of cellular diversity. Conventional assays that map regulatory elements via open chromatin analysis of primary tissues are hindered by sample heterogeneity. Single-cell analysis of accessible chromatin (scATAC-seq) can overcome this limitation. However, the high-level noise of each single-cell profile and the large volume of data pose unique computational challenges. Here, we introduce SnapATAC, a software package for analyzing scATAC-seq datasets. SnapATAC dissects cellular heterogeneity in an unbiased manner and maps the trajectories of cellular states. Using the Nyström method, SnapATAC can process data from up to a million cells. Furthermore, SnapATAC incorporates existing tools into a comprehensive package for analyzing single-cell ATAC-seq datasets. As a demonstration of its utility, SnapATAC is applied to 55,592 single-nucleus ATAC-seq profiles from the mouse secondary motor cortex. The analysis reveals ~370,000 candidate regulatory elements in 31 distinct cell populations in this brain region and infers candidate cell-type-specific transcriptional regulators.
Single-cell analysis of transposase-accessible chromatin is deepening our understanding of the origins of cellular diversity, yet methods are limited by data sparsity. Here, the authors introduce SnapATAC, a pipeline to resolve cellular heterogeneity and reveal candidate regulatory elements across different cell populations.
Journal Article
A computational framework to explore large-scale biosynthetic diversity
2020
Genome mining has become a key technology to exploit natural product diversity. Although initially performed on a single-genome basis, the process is now being scaled up to mine entire genera, strain collections and microbiomes. However, no bioinformatic framework is currently available for effectively analyzing datasets of this size and complexity. In the present study, a streamlined computational workflow is provided, consisting of two new software tools: the ‘biosynthetic gene similarity clustering and prospecting engine’ (BiG-SCAPE), which facilitates fast and interactive sequence similarity network analysis of biosynthetic gene clusters and gene cluster families; and the ‘core analysis of syntenic orthologues to prioritize natural product gene clusters’ (CORASON), which elucidates phylogenetic relationships within and across these families. BiG-SCAPE is validated by correlating its output to metabolomic data across 363 actinobacterial strains and the discovery potential of CORASON is demonstrated by comprehensively mapping biosynthetic diversity across a range of detoxin/rimosamide-related gene cluster families, culminating in the characterization of seven detoxin analogues.
Two bioinformatic tools, BiG-SCAPE and CORASON, enable sequence similarity network and phylogenetic analysis of gene clusters and their families across hundreds of strains and in large datasets, leading to the discovery of new natural products.
Journal Article
Metagenome analysis using the Kraken software suite
by Lu, Jennifer; Langmead, Ben; Steinegger, Martin
in 631/1647/48; 631/1647/794; 631/208/212/2142
2022
Metagenomic experiments expose the wide range of microscopic organisms in any microbial environment through high-throughput DNA sequencing. The computational analysis of the sequencing data is critical for the accurate and complete characterization of the microbial community. To facilitate efficient and reproducible metagenomic analysis, we introduce a step-by-step protocol for the Kraken suite, an end-to-end pipeline for the classification, quantification and visualization of metagenomic datasets. Our protocol describes the execution of the Kraken programs, via a sequence of easy-to-use scripts, in two scenarios: (1) quantification of the species in a given metagenomics sample; and (2) detection of a pathogenic agent from a clinical sample taken from a human patient. The protocol, which is executed within 1–2 h, is targeted to biologists and clinicians working in microbiome or metagenomics analysis who are familiar with the Unix command-line environment.
The authors provide a guide to using the Kraken suite for metagenomics analysis, including classification, quantification and visualization, illustrated by quantification of species in the microbiome and identification of pathogens in a clinical sample.
Journal Article
Principles and methods for transferring polygenic risk scores across global populations
2024
Polygenic risk scores (PRSs) summarize the genetic predisposition of a complex human trait or disease and may become a valuable tool for advancing precision medicine. However, PRSs that are developed in populations of predominantly European genetic ancestries can increase health disparities due to poor predictive performance in individuals of diverse and complex genetic ancestries. We describe genetic and modifiable risk factors that limit the transferability of PRSs across populations and review the strengths and weaknesses of existing PRS construction methods for diverse ancestries. Developing PRSs that benefit global populations in research and clinical settings provides an opportunity for innovation and is essential for health equity.

This Review summarizes the genetic and non-genetic factors that impact the transferability of polygenic risk scores (PRSs) across populations, highlighting the technical challenges of existing PRS construction methods for diverse ancestries and the emerging resources for more widespread use of PRSs.
Journal Article
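The PRS review above concerns scores that summarize genetic predisposition, conventionally as a weighted sum of risk-allele dosages. A minimal additive-PRS sketch — with hypothetical variant IDs and effect sizes, not values drawn from the review — could be:

```python
def polygenic_risk_score(dosages, weights):
    """Additive PRS: weighted sum of risk-allele dosages.

    dosages: dict of variant_id -> allele count (0, 1 or 2)
    weights: dict of variant_id -> per-allele effect size (e.g. a GWAS beta)
    Variants without a trained weight are skipped.
    """
    return sum(d * weights[v] for v, d in dosages.items() if v in weights)

# hypothetical effect sizes and one individual's genotype dosages
weights = {"rs1": 0.2, "rs2": -0.1, "rs3": 0.05}
person = {"rs1": 2, "rs2": 1, "rs3": 0}
score = polygenic_risk_score(person, weights)  # 0.2*2 - 0.1*1 + 0.05*0 ≈ 0.3
```

The review's subject — why such weights transfer poorly across ancestries — lies in how the `weights` are estimated, which this toy calculation does not capture.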
The normative modeling framework for computational psychiatry
2022
Normative modeling is an emerging and innovative framework for mapping individual differences at the level of a single subject or observation in relation to a reference model. It involves charting centiles of variation across a population in terms of mappings between biology and behavior, which can then be used to make statistical inferences at the level of the individual. The fields of computational psychiatry and clinical neuroscience have been slow to transition away from patient versus ‘healthy’ control analytic approaches, probably owing to a lack of tools designed to properly model biological heterogeneity of mental disorders. Normative modeling provides a solution to address this issue and moves analysis away from case–control comparisons that rely on potentially noisy clinical labels. Here we define a standardized protocol to guide users through, from start to finish, normative modeling analysis using the Predictive Clinical Neuroscience toolkit (PCNtoolkit). We describe the input data selection process, provide intuition behind the various modeling choices and conclude by demonstrating several examples of downstream analyses that the normative model may facilitate, such as stratification of high-risk individuals, subtyping and behavioral predictive modeling. The protocol takes ~1–3 h to complete.
This protocol guides the user through normative modeling analysis using the Predictive Clinical Neuroscience toolkit (PCNtoolkit), enabling individual differences to be mapped at the level of a single subject or observation in relation to a reference model.
Journal Article