57 result(s) for "Egan, Rob"
MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies
We previously reported on MetaBAT, an automated metagenome binning software tool that reconstructs single genomes from microbial communities for subsequent analyses of uncultivated microbial species. MetaBAT has become one of the most popular binning tools, largely due to its computational efficiency and ease of use, especially in binning experiments with a large number of samples and a large assembly. MetaBAT requires users to choose parameters to fine-tune its sensitivity and specificity. If those parameters are not chosen properly, binning accuracy can suffer, especially on assemblies of poor quality. Here, we developed MetaBAT 2 to overcome this problem. MetaBAT 2 uses a new adaptive binning algorithm to eliminate manual parameter tuning. We also performed extensive software engineering optimization to increase both computational and memory efficiency. Comparing MetaBAT 2 to alternative software tools on over 100 real-world metagenome assemblies shows superior accuracy and computing speed. Binning a typical metagenome assembly takes only a few minutes on a single commodity workstation. We therefore recommend that the community adopt MetaBAT 2 for their metagenome binning experiments. MetaBAT 2 is open source software and available at https://bitbucket.org/berkeleylab/metabat.
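The abstract's point is that MetaBAT 2 no longer needs manual sensitivity/specificity parameters. As a rough sketch of what a run looks like, the snippet below builds a typical command line; the file names are placeholders, and the flags shown (`-i`, `-a`, `-o`, `-t`) should be verified against the MetaBAT 2 documentation for the installed version.

```python
# Hypothetical MetaBAT 2 invocation, assembled as an argument list.
def metabat2_command(assembly, depth, out_prefix, threads=8):
    """Build the argument list for a MetaBAT 2 run. Note that no manual
    sensitivity/specificity parameters are passed: MetaBAT 2 chooses
    them adaptively."""
    return [
        "metabat2",
        "-i", assembly,      # assembled contigs (FASTA)
        "-a", depth,         # per-contig coverage table
        "-o", out_prefix,    # output prefix for genome bins
        "-t", str(threads),  # worker threads
    ]

cmd = metabat2_command("assembly.fa", "depth.txt", "bins/bin")
print(" ".join(cmd))
```

The coverage table is conventionally produced from read alignments with the `jgi_summarize_bam_contig_depths` utility distributed with MetaBAT.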
MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities
Grouping large genomic fragments assembled from shotgun metagenomic sequences to deconvolute complex microbial communities, or metagenome binning, enables the study of individual organisms and their interactions. Because of the complex nature of these communities, existing metagenome binning methods often miss a large number of microbial species. In addition, most of the tools do not scale to large datasets. Here we introduce automated software called MetaBAT that integrates empirical probabilistic distances of genome abundance and tetranucleotide frequency for accurate metagenome binning. MetaBAT outperforms alternative methods in accuracy and computational efficiency on both synthetic and real metagenome datasets. It automatically forms hundreds of high-quality genome bins from a very large assembly consisting of millions of contigs in a matter of hours on a single node. MetaBAT is open source software and available at https://bitbucket.org/berkeleylab/metabat.
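MetaBAT's composition signal comes from tetranucleotide frequencies. The sketch below is a simplification for illustration only (it ignores reverse-complement collapsing and the empirical probabilistic distance model the paper describes); it shows how a 256-dimensional 4-mer frequency vector can be computed for one contig:

```python
from collections import Counter
from itertools import product

def tetranucleotide_frequency(seq):
    """Return the normalized 256-dimensional 4-mer frequency vector of a
    contig sequence. Simplified sketch: the real MetaBAT signal also
    collapses reverse complements and feeds the vector into an empirical
    probabilistic distance rather than comparing vectors directly."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    # Normalize over valid ACGT 4-mers only (skips windows containing N).
    total = sum(counts[k] for k in kmers) or 1
    return [counts[k] / total for k in kmers]

freqs = tetranucleotide_frequency("ACGTACGTACGT")
```

Contigs from the same genome tend to have similar vectors, which is why composition can complement abundance when clustering.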
De novo Nanopore read quality improvement using deep learning
Background Long read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently, Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. The limited sensitivity of existing read-based error correction methods can cause large-scale mis-assemblies in the assembled genomes, motivating further innovation in this area. Results Here we developed a Convolutional Neural Network (CNN) based method, called MiniScrub, for identification and subsequent “scrubbing” (removal) of low-quality Nanopore read segments to minimize their interference in the downstream assembly process. MiniScrub first generates read-to-read overlaps via MiniMap2, then encodes the overlaps into images, and finally builds CNN models to predict low-quality segments. Applying MiniScrub to real-world control datasets under several different parameters, we show that it robustly improves read quality and improves read error correction in the metagenome setting. Compared to raw reads, de novo genome assembly with scrubbed reads produces many fewer mis-assemblies and large indel errors. Conclusions MiniScrub is able to robustly improve the read quality of Oxford Nanopore reads, especially in the metagenome setting, making it useful for downstream applications such as de novo assembly. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub.
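Once per-segment quality probabilities have been predicted, the "scrubbing" itself reduces to splitting each read at its low-quality stretches. The sketch below illustrates only that final step, with hypothetical probabilities and thresholds; in MiniScrub the probabilities come from CNN models over images of MiniMap2 read-to-read overlaps, not from this rule alone.

```python
def scrub_read(seq, probs, threshold=0.5, min_keep=3):
    """Split a read into the segments whose per-position predicted quality
    stays at or above `threshold`, dropping low-quality stretches and any
    surviving fragment shorter than `min_keep`. Sketch of the scrubbing
    step only; `threshold` and `min_keep` are illustrative values."""
    segments, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                      # entering a high-quality run
        elif p < threshold and start is not None:
            if i - start >= min_keep:
                segments.append(seq[start:i])
            start = None                   # entering a low-quality run
    if start is not None and len(seq) - start >= min_keep:
        segments.append(seq[start:])       # flush the trailing segment
    return segments
```

The scrubbed segments then replace the original read in downstream error correction and assembly.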
Ecogenomics of virophages and their giant virus hosts assessed through time series metagenomics
Virophages are small viruses that co-infect eukaryotic cells alongside giant viruses (Mimiviridae) and hijack their machinery to replicate. While two types of virophages have been isolated, their genomic diversity and ecology remain largely unknown. Here we use time series metagenomics to identify and study the dynamics of 25 uncultivated virophage populations, 17 of which are represented by complete or near-complete genomes, in two North American freshwater lakes. Taxonomic analysis suggests that these freshwater virophages represent at least three new candidate genera. Ecologically, virophage populations are repeatedly detected over years and are evolutionarily stable, yet their distinct abundance profiles and gene content suggest that virophage genera occupy different ecological niches. Co-occurrence analyses reveal 11 virophages strongly associated with uncultivated Mimiviridae, and three associated with eukaryotes among the Dinophyceae, Rhizaria, Alveolata, and Cryptophyceae groups. Together, these findings significantly augment virophage databases, help refine virophage taxonomy, and establish baseline ecological hypotheses and tools to study virophages in nature. Virophages are recently identified small viruses that infect larger viruses, yet their diversity and ecological roles are poorly understood. Here, Roux and colleagues present time series metagenomics data revealing new virophage genera and their putative ecological interactions in two freshwater lakes.
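The virophage-host associations rest on co-occurrence of abundance profiles across the time series. A minimal sketch of such an analysis, assuming plain Pearson correlation with a fixed cutoff (the study's actual statistical procedure may differ; population names below are placeholders):

```python
import math

def pearson(x, y):
    """Pearson correlation between two abundance time series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def cooccurring_pairs(profiles, min_r=0.8):
    """Report population pairs whose abundance profiles correlate strongly
    across the samples. `profiles` maps population name -> per-sample
    abundances; `min_r` is an illustrative cutoff."""
    names = sorted(profiles)
    return [(a, b, pearson(profiles[a], profiles[b]))
            for i, a in enumerate(names) for b in names[i + 1:]
            if pearson(profiles[a], profiles[b]) >= min_r]
```

Populations that rise and fall together across sampling dates, such as a virophage and a candidate giant-virus host, surface as high-correlation pairs.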
Metagenomic Discovery of Biomass-Degrading Genes and Genomes from Cow Rumen
The paucity of enzymes that efficiently deconstruct plant polysaccharides represents a major bottleneck for industrial-scale conversion of cellulosic biomass into biofuels. Cow rumen microbes specialize in degradation of cellulosic plant material, but most members of this complex community resist cultivation. To characterize biomass-degrading genes and genomes, we sequenced and analyzed 268 gigabases of metagenomic DNA from microbes adherent to plant fiber incubated in cow rumen. From these data, we identified 27,755 putative carbohydrate-active genes and expressed 90 candidate proteins, of which 57% were enzymatically active against cellulosic substrates. We also assembled 15 uncultured microbial genomes, which were validated by complementary methods including single-cell genome sequencing. These data sets provide a substantially expanded catalog of genes and genomes participating in the deconstruction of cellulosic biomass.
Persistent memory as an effective alternative to random access memory in metagenome assembly
Background The assembly of metagenomes resolves the members of complex microbial communities and allows these genomes to be characterized without laborious cultivation or single-cell metagenomics. Metagenome assembly is memory intensive and time consuming. Multi-terabyte sequence datasets can become too large to be assembled on a single compute node, and there is no reliable method to predict the memory requirement due to data-specific memory consumption patterns. Currently, running out of memory (OOM) is one of the most prevalent causes of metagenome assembly failures. Results In this study, we explored the possibility of using Persistent Memory (PMem) as a less expensive substitute for dynamic random access memory (DRAM) to reduce OOM failures and increase the scalability of metagenome assemblers. We evaluated the execution time and memory usage of three popular metagenome assemblers (MetaSPAdes, MEGAHIT, and MetaHipMer2) on datasets of up to one terabase. We found that PMem can enable metagenome assemblers to handle terabyte-sized datasets by partially or fully substituting for DRAM. Depending on the configured DRAM/PMem ratio, metagenome assemblies run with PMem can achieve speeds similar to those with DRAM, while in the worst case showing a roughly two-fold slowdown. In addition, different assemblers displayed distinct memory/speed trade-offs in the same hardware/software environment. Conclusions We demonstrated that PMem is capable of expanding the capacity of DRAM to allow larger metagenome assemblies, with a potential trade-off in speed. Because PMem can be used directly without any application-specific code modification, these findings are likely to generalize to other memory-intensive bioinformatics applications.
Genomic Features Predict Bacterial Life History Strategies in Soil, as Identified by Metagenomic Stable Isotope Probing
Bacteria catalyze the formation and destruction of soil organic matter, but the bacterial dynamics in soil that govern carbon (C) cycling are not well understood. Life history strategies explain the complex dynamics of bacterial populations and activities based on trade-offs in energy allocation to growth, resource acquisition, and survival. Such trade-offs influence the fate of soil C, but their genomic basis remains poorly characterized. We used multisubstrate metagenomic DNA stable isotope probing to link genomic features of bacteria to their C acquisition and growth dynamics. We identify several genomic features associated with patterns of bacterial C acquisition and growth, notably genomic investment in resource acquisition and regulatory flexibility. Moreover, we identify genomic trade-offs defined by numbers of transcription factors, membrane transporters, and secreted products, which match predictions from life history theory. We further show that genomic investment in resource acquisition and regulatory flexibility can predict bacterial ecological strategies in soil. IMPORTANCE Soil microbes are major players in the global carbon cycle, yet we still have little understanding of how the carbon cycle operates in soil communities. A major limitation is that carbon metabolism lacks discrete functional genes that define carbon transformations. Instead, carbon transformations are governed by anabolic processes associated with growth, resource acquisition, and survival. We use metagenomic stable isotope probing to link genome information to microbial growth and carbon assimilation dynamics as they occur in soil. From these data, we identify genomic traits that can predict bacterial ecological strategies, which define bacterial interactions with soil carbon.
Integrating chromatin conformation information in a self-supervised learning model improves metagenome binning
Metagenome binning is a key step, downstream of metagenome assembly, that groups scaffolds by their genome of origin. Although accurate binning has been achieved on datasets containing multiple samples from the same community, binning completeness is often low in datasets with a small number of samples due to a lack of robust species co-abundance information. In this study, we exploited the chromatin conformation information obtained from Hi-C sequencing and developed a new reference-independent algorithm, Metagenome Binning with Abundance and Tetra-nucleotide frequencies—Long Range (metaBAT-LR), to improve the binning completeness of these datasets. This self-supervised algorithm builds a model from a set of high-quality genome bins to predict scaffold pairs that are likely to be derived from the same genome. It then applies these predictions to merge incomplete genome bins and to recruit unbinned scaffolds. We validated metaBAT-LR's ability to merge bins and recruit scaffolds on both synthetic and real-world metagenome datasets of varying complexity. Benchmarking against similar software tools suggests that metaBAT-LR uncovers unique bins that were missed by all other methods. MetaBAT-LR is open source and available at https://bitbucket.org/project-metabat/metabat-lr.
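The merge-and-recruit step described above can be viewed as finding connected components over predicted same-genome scaffold pairs. A minimal union-find sketch, assuming the pair predictions are already given (in metaBAT-LR they come from the self-supervised model trained on high-quality bins; scaffold and bin names below are placeholders):

```python
class DisjointSet:
    """Union-find over arbitrary hashable ids (scaffolds and bins)."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def merge_bins(bin_of, predicted_pairs):
    """Merge bins that share a predicted same-genome scaffold pair, and
    recruit unbinned scaffolds linked to a binned one. Simplified sketch
    of the merge/recruit idea; `bin_of` maps scaffold -> bin label, and
    `predicted_pairs` lists scaffold pairs predicted to share a genome."""
    ds = DisjointSet()
    for scaffold, b in bin_of.items():
        ds.union(scaffold, ("bin", b))      # tie each scaffold to its bin
    for a, b in predicted_pairs:
        ds.union(a, b)                      # tie predicted same-genome pairs
    merged = {}
    for scaffold in set(bin_of) | {s for p in predicted_pairs for s in p}:
        merged.setdefault(ds.find(scaffold), set()).add(scaffold)
    return sorted(sorted(group) for group in merged.values())
```

A pair linking scaffolds from two bins fuses those bins, and a pair linking an unbinned scaffold to a binned one recruits it.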
Terabase-scale metagenome coassembly with MetaHipMer
Metagenome sequence datasets can contain terabytes of reads, too many to be coassembled on a single shared-memory computer; consequently, they have only been assembled sample by sample (multiassembly), and combining the results is challenging. We can now perform coassembly of the largest datasets using MetaHipMer, a metagenome assembler designed to run on supercomputers and large clusters of compute nodes. We have reported on the implementation of MetaHipMer previously; in this paper we focus on analyzing the impact of very large coassembly. In particular, we show that coassembly recovers a larger genome fraction than multiassembly and enables the discovery of more complete genomes, with lower error rates, whereas multiassembly recovers more dominant strain variation. Being able to coassemble a large dataset does not preclude multiassembly; rather, having a fast, scalable metagenome assembler enables a user to more easily perform both, and to assemble both abundant, high-strain-variation genomes and low-abundance, rare genomes. We present several assemblies of terabyte datasets that could never be coassembled before, demonstrating MetaHipMer's scaling power. MetaHipMer is available for public use under an open source license, and all datasets used in the paper are available for public download.
The parallelism motifs of genomic data analysis
Genomic datasets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or ‘motifs’ that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
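One of the patterns the article argues is missing from established motif lists is hashing. Below is a single-node sketch of the k-mer counting kernel that underlies assembly and profiling; in the distributed setting the article considers, the same table would be sharded across nodes and updated with the asynchronous, irregular communication described above.

```python
from collections import defaultdict

def kmer_histogram(reads, k=21):
    """Count k-mer occurrences across a set of reads with a hash table:
    the 'hashing' motif of genomic data analysis. Single-node sketch;
    a distributed assembler would hash each k-mer to an owner node and
    send asynchronous increment updates instead."""
    table = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            table[read[i:i + k]] += 1      # hash-table update per window
    return dict(table)

hist = kmer_histogram(["ACGTACGT", "CGTACGTA"], k=4)
```

The irregular, data-dependent access pattern of `table` is exactly what distinguishes this workload from the regular stencils of scientific simulation.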