277 result(s) for "Salzberg, Steven L."
Next-generation genome annotation: we still struggle to get it right
While the genome sequencing revolution has led to the sequencing and assembly of many thousands of new genomes, genome annotation still uses very nearly the same technology that we have used for the past two decades. The sheer number of genomes necessitates the use of fully automated procedures for annotation, but errors in annotation are just as prevalent as they were in the past, if not more so. How are we to solve this growing problem?
Open questions: How many genes do we have?
Seventeen years after the initial publication of the human genome, we still haven’t found all of our genes. The answer turns out to be more complex than anyone had imagined when the Human Genome Project began.
Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank
Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the GenBank, RefSeq, and NR databases, respectively, spanning the whole range from draft to “complete” model organism genomes. Our method scales linearly with input size and can process 3.3 TB in 12 days on a 32-core computer. Conterminator can help ensure the quality of reference databases. Source code (GPLv3): https://github.com/martin-steinegger/conterminator
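The core signal described above, a strong alignment between entries labeled with taxa from different kingdoms, can be sketched in a few lines. This is purely an illustration of the idea, not Conterminator's actual code; the function name and data shapes are invented:

```python
# Toy illustration (not Conterminator itself): flag database entries whose
# sequences align strongly across kingdom boundaries. In the real tool, the
# candidate pairs come from an exhaustive all-against-all comparison.

def flag_cross_kingdom(alignments, kingdom_of):
    """alignments: iterable of (seq_a, seq_b) ID pairs that aligned strongly.
    kingdom_of: dict mapping sequence ID -> kingdom label.
    Returns the set of sequence IDs involved in a cross-kingdom hit."""
    flagged = set()
    for a, b in alignments:
        if kingdom_of[a] != kingdom_of[b]:
            # A strong alignment between, say, a bacterial and a human entry
            # suggests one of them is mislabeled or contaminated.
            flagged.update((a, b))
    return flagged

hits = [("entryA", "entryB"), ("entryC", "entryD")]
kingdoms = {"entryA": "Bacteria", "entryB": "Eukaryota",
            "entryC": "Bacteria", "entryD": "Bacteria"}
print(sorted(flag_cross_kingdom(hits, kingdoms)))  # ['entryA', 'entryB']
```

The real method must also decide *which* entry of a flagged pair is the contaminant, which requires the surrounding taxonomic context rather than a single pairwise hit.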
The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies
The introduction of third-generation DNA sequencing technologies in recent years has allowed scientists to generate dramatically longer sequence reads, which when used in whole-genome sequencing projects have yielded better repeat resolution and far more contiguous genome assemblies. While the promise of better contiguity has held true, the relatively high error rate of long reads, averaging 8-15%, has made it challenging to generate a highly accurate final sequence. Current long-read sequencing technologies display a tendency toward systematic errors, in particular in homopolymer regions, which present additional challenges. A cost-effective strategy to generate highly contiguous assemblies with a very low overall error rate is to combine long reads with low-cost short-read data, which currently have an error rate below 0.5%. This hybrid strategy can be pursued either by incorporating the short-read data into the early phase of assembly, during the read correction step, or by using short reads to "polish" the consensus built from long reads. In this report, we present the assembly polishing tool POLCA (POLishing by Calling Alternatives) and compare its performance with two other popular polishing programs, Pilon and Racon. We show that on simulated data POLCA is more accurate than Pilon, and comparable in accuracy to Racon. On real data, all three programs show similar performance, but POLCA is consistently much faster than either of the other polishing programs.
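The polishing idea, correcting a long-read consensus with accurate short reads, reduces in its simplest form to a per-base majority vote. The sketch below is illustrative only (not POLCA's pipeline): real polishers work from BAM pileups and handle indels, while this toy assumes gap-free alignments at known offsets:

```python
# Illustrative sketch of consensus polishing by majority vote. The alignment
# representation (start offset + gap-free read) is a simplification invented
# for this example.
from collections import Counter

def polish(draft, alignments):
    """alignments: list of (start, read) pairs, each read aligned gap-free
    at position `start` of the draft. Returns the draft with each covered
    base replaced by the majority base among the reads."""
    votes = [Counter() for _ in draft]
    for start, read in alignments:
        for i, base in enumerate(read):
            votes[start + i][base] += 1
    return "".join(
        v.most_common(1)[0][0] if v else orig   # keep draft base if uncovered
        for orig, v in zip(draft, votes)
    )

draft = "ACGTAACG"                      # position 4 carries a draft error
reads = [(0, "ACGT"), (2, "GTTA"), (3, "TTAC"), (4, "TACG")]
print(polish(draft, reads))             # 'ACGTTACG': the A at position 4
                                        # is outvoted by three reads
```

Because short reads have sub-0.5% error rates, a modest coverage depth makes each such vote overwhelmingly likely to recover the true base.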
Ultrafast and accurate 16S rRNA microbial community analysis using Kraken 2
Background: For decades, 16S ribosomal RNA sequencing has been the primary means for identifying the bacterial species present in a sample with unknown composition. One of the most widely used tools for this purpose today is the QIIME (Quantitative Insights Into Microbial Ecology) package. Recent results have shown that the newest release, QIIME 2, has higher accuracy than QIIME, MAPseq, and mothur when classifying bacterial genera from simulated human gut, ocean, and soil metagenomes, although QIIME 2 also proved to be the most computationally expensive. Kraken, first released in 2014, has been shown to provide exceptionally fast and accurate classification for shotgun metagenomics sequencing projects. Bracken, released in 2016, then provided users with the ability to accurately estimate species or genus relative abundances using Kraken classification results. Kraken 2, which matches the accuracy and speed of Kraken 1, now supports 16S rRNA databases, allowing for direct comparisons to QIIME and similar systems.
Methods: For a comprehensive assessment of each tool, we compare the computational resources and speed of QIIME 2’s q2-feature-classifier, Kraken 2, and Bracken in generating the three main 16S rRNA databases: Greengenes, SILVA, and RDP. For an evaluation of accuracy, we evaluated each tool using the same simulated 16S rRNA reads from human gut, ocean, and soil metagenomes that were previously used to compare QIIME, MAPseq, mothur, and QIIME 2. We evaluated accuracy based on the final genus-level read counts assigned by each tool. Finally, as Kraken 2 is the only tool providing per-read taxonomic assignments, we evaluated the sensitivity and precision of Kraken 2’s per-read classifications.
Results: For both the Greengenes and SILVA databases, Kraken 2 and Bracken are up to 100 times faster at database generation. For classification, using the same data as previous studies, Kraken 2 and Bracken are up to 300 times faster, use 100x less RAM, and generate results that are more accurate at 16S rRNA profiling than QIIME 2’s q2-feature-classifier.
Conclusion: Kraken 2 and Bracken provide a very fast, efficient, and accurate solution for 16S rRNA metataxonomic data analysis.
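Kraken's speed comes from matching k-mers against a prebuilt index rather than aligning reads. The toy below shows only that k-mer idea, not the real algorithm (Kraken maps each k-mer to the lowest common ancestor in a taxonomy tree and walks classification paths); the references, read, and tie-breaking here are invented for the sketch:

```python
# Minimal sketch of k-mer-based read classification: assign each read to the
# taxon whose reference genome contributes the most of the read's k-mers.
from collections import Counter

K = 5  # arbitrary toy k-mer length; real tools use much larger k

def kmers(seq):
    return [seq[i:i + K] for i in range(len(seq) - K + 1)]

def build_index(references):
    """references: dict taxon -> genome string. Returns k-mer -> set of taxa."""
    index = {}
    for taxon, genome in references.items():
        for km in kmers(genome):
            index.setdefault(km, set()).add(taxon)
    return index

def classify(read, index):
    hits = Counter()
    for km in kmers(read):
        for taxon in index.get(km, ()):
            hits[taxon] += 1
    return hits.most_common(1)[0][0] if hits else "unclassified"

refs = {"E. coli": "ACGTACGTTGCA", "B. subtilis": "TTGCAGGCCTTA"}
idx = build_index(refs)
print(classify("ACGTACGT", idx))  # matches only E. coli k-mers
```

Index lookups are constant-time per k-mer, which is why this style of classifier can be orders of magnitude faster than alignment-based profiling.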
Next-generation sequencing: insights to advance clinical investigations of the microbiome
Next-generation sequencing (NGS) technology has advanced our understanding of the human microbiome by allowing for the discovery and characterization of unculturable microbes with prediction of their function. Key NGS methods include 16S rRNA gene sequencing, shotgun metagenomic sequencing, and RNA sequencing. The choice of which NGS methodology to pursue for a given purpose is often unclear for clinicians and researchers. In this Review, we describe the fundamentals of NGS, with a focus on 16S rRNA and shotgun metagenomic sequencing. We also discuss pros and cons of each methodology as well as important concepts in data variability, study design, and clinical metadata collection. We further present examples of how NGS studies of the human microbiome have advanced our understanding of human disease pathophysiology across diverse clinical contexts, including the development of diagnostics and therapeutics. Finally, we share insights as to how NGS might further be integrated into and advance microbiome research and clinical care in the coming years.
Transcriptome assembly from long-read RNA-seq alignments with StringTie2
RNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new methods to handle the high error rate of long reads and offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of short-read assemblies. StringTie2 is more accurate and faster and uses less memory than all comparable short-read and long-read analysis tools.
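A first step shared by reference-guided assemblers such as StringTie is bundling read alignments that overlap on the genome into loci, which are then assembled independently. The sketch below shows only that bundling step on simplified (start, end) intervals; it is an invented illustration, not StringTie2's code:

```python
# Illustrative only: merge overlapping read alignments on one chromosome into
# loci. Real assemblers track strand, splice junctions, and coverage as well.

def bundle_loci(alignments):
    """alignments: list of (start, end) read alignments on one chromosome.
    Returns a list of merged loci, each covering a run of overlapping reads."""
    loci = []
    for start, end in sorted(alignments):
        if loci and start <= loci[-1][1]:        # overlaps the open locus
            loci[-1] = (loci[-1][0], max(loci[-1][1], end))
        else:                                    # gap on the genome: new locus
            loci.append((start, end))
    return loci

reads = [(100, 250), (200, 400), (900, 1000), (950, 1100)]
print(bundle_loci(reads))  # [(100, 400), (900, 1100)]
```

Within each locus, the assembler then builds a splice graph from the reads, which is where StringTie2's new error-tolerant methods for long reads come into play.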
Removing contaminants from databases of draft genomes
Metagenomic sequencing of patient samples is a very promising method for the diagnosis of human infections. Sequencing has the ability to capture all the DNA or RNA from pathogenic organisms in a human sample. However, complete and accurate characterization of the sequence, including identification of any pathogens, depends on the availability and quality of genomes for comparison. Thousands of genomes are now available, and as these numbers grow, the power of metagenomic sequencing for diagnosis should increase. However, recent studies have exposed the presence of contamination in published genomes, which when used for diagnosis increases the risk of falsely identifying the wrong pathogen. To address this problem, we have developed a bioinformatics system for eliminating contamination as well as low-complexity genomic sequences in the draft genomes of eukaryotic pathogens. We applied this software to identify and remove human, bacterial, archaeal, and viral sequences present in a comprehensive database of all sequenced eukaryotic pathogen genomes. We also removed low-complexity genomic sequences, another source of false positives. Using this pipeline, we have produced a database of "clean" eukaryotic pathogen genomes for use with bioinformatics classification and analysis tools. We demonstrate that when attempting to find eukaryotic pathogens in metagenomic samples, the new database provides better sensitivity than one using the original genomes while offering a dramatic reduction in false positives.
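Low-complexity sequence (e.g., long microsatellite runs) causes false positives because it matches across unrelated genomes by chance. A common way to detect it is to score the diversity of k-mers in a window; the sketch below is a toy version of that idea, with k, the scoring rule, and the 0.5 cutoff all chosen arbitrarily for illustration (the pipeline's actual filter may differ):

```python
# Toy low-complexity detector: a sequence dominated by a few repeated k-mers
# has a low ratio of distinct k-mers to total k-mers.

def is_low_complexity(seq, k=3, cutoff=0.5):
    """Return True if fewer than `cutoff` of the sequence's k-mers are
    distinct, i.e. the sequence is repetitive/low-complexity."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return False            # too short to judge
    return len(set(kmers)) / len(kmers) < cutoff

print(is_low_complexity("ATATATATATATATAT"))  # True: only 2 distinct 3-mers
print(is_low_complexity("ACGTTGCAGGCTTACG"))  # False: nearly all distinct
```

Masking or removing such regions before building the reference database keeps chance matches from being reported as pathogen detections.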
Repetitive DNA and next-generation sequencing: computational challenges and solutions
Key Points
• New high-throughput sequencing technologies have spurred explosive growth in the use of sequencing to discover mutations and structural variants in the human genome and in the number of projects to sequence and assemble new genomes.
• Highly efficient algorithms have been developed to align next-generation sequences to genomes, and these algorithms use a variety of strategies to place repetitive reads.
• Ambiguous mapping of sequences that are derived from repetitive regions makes it difficult to identify true polymorphisms and to reconstruct transcripts.
• Short read lengths combined with mapping ambiguities lead to false reports of single-nucleotide polymorphisms, insertions, deletions and other sequence variants.
• When assembling a genome de novo, repetitive sequences can lead to erroneous rearrangements, deletions, collapsed repeats and other assembly errors.
• Long-range linking information from paired-end reads can overcome some of the difficulties in short-read assembly.
Repeat sequences in DNA remain one of the most challenging aspects of next-generation sequencing data analysis and interpretation. This Review explains the problems and current strategies for handling repeats; ignoring repeats risks missing important biological information.
Repetitive DNA sequences are abundant in a broad range of species, from bacteria to mammals, and they cover nearly half of the human genome. Repeats have always presented technical challenges for sequence alignment and assembly programs. Next-generation sequencing projects, with their short read lengths and high data volumes, have made these challenges more difficult. From a computational perspective, repeats create ambiguities in alignment and assembly, which, in turn, can produce biases and errors when interpreting results. Simply ignoring repeats is not an option, as this creates problems of its own and may mean that important biological phenomena are missed. We discuss the computational problems surrounding repeats and describe strategies used by current bioinformatics systems to solve them.
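The mapping ambiguity the Review describes is easy to demonstrate: a read drawn from inside a repeat matches every copy equally well, so no aligner can recover its true origin from the read alone. The genome and reads below are invented toy data:

```python
# Toy demonstration of why repeats break read mapping: exact-match a read
# against a genome containing two identical repeat copies.

def map_read(read, genome):
    """Return every position where the read matches the genome exactly."""
    return [i for i in range(len(genome) - len(read) + 1)
            if genome[i:i + len(read)] == read]

#        unique        repeat copy 1   unique    repeat copy 2   unique
genome = "ACCTGA" +    "GGGTTTCCC" +   "TTAGC" + "GGGTTTCCC" +   "ATTGCA"
print(map_read("GGTTTCC", genome))  # two equally good hits: ambiguous
print(map_read("ACCTGAG", genome))  # one hit: unambiguous placement
```

Longer reads and paired-end linking resolve the ambiguity exactly when they span out of the repeat into flanking unique sequence, which is the strategy the Review's Key Points describe.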
MUMmer4: A fast and versatile genome alignment system
The MUMmer system and the genome sequence aligner nucmer included within it are among the most widely used alignment packages in genomics. Since the last major release of MUMmer version 3 in 2004, it has been applied to many types of problems including aligning whole genome sequences, aligning reads to a reference genome, and comparing different assemblies of the same genome. Despite its broad utility, MUMmer3 has limitations that can make it difficult to use for large genomes and for the very large sequence data sets that are common today. In this paper we describe MUMmer4, a substantially improved version of MUMmer that addresses genome size constraints by changing the 32-bit suffix tree data structure at the core of MUMmer to a 48-bit suffix array, and that offers improved speed through parallel processing of input query sequences. With a theoretical limit on the input size of 141 Tbp, MUMmer4 can now work with input sequences of any biologically realistic length. We show that as a result of these enhancements, the nucmer program in MUMmer4 is easily able to handle alignments of large genomes; we illustrate this with an alignment of the human and chimpanzee genomes, which allows us to compute that the two species are 98% identical across 96% of their length. With the enhancements described here, MUMmer4 can also be used to efficiently align reads to reference genomes, although it is less sensitive and accurate than the dedicated read aligners. The nucmer aligner in MUMmer4 can now be called from scripting languages such as Perl, Python and Ruby. These improvements make MUMmer4 one of the most versatile genome alignment packages available.
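MUMmer's anchoring strategy rests on matches that occur exactly once in each sequence, so they can serve as unambiguous alignment anchors. The brute-force sketch below illustrates that definition on tiny strings; the real system finds such matches in linear time with its 48-bit suffix array, and all names and inputs here are invented for the example:

```python
# Simplistic O(n^2) sketch of anchor finding: report substrings of a fixed
# length that occur exactly once in the reference AND exactly once in the
# query, making them safe anchors for a larger alignment.

def unique_matches(ref, qry, min_len):
    """Yield (ref_pos, qry_pos, length) for each length-`min_len` substring
    that is unique in both sequences."""
    def occurrences(text, pat):
        return [i for i in range(len(text) - len(pat) + 1)
                if text[i:i + len(pat)] == pat]
    seen = set()
    for i in range(len(ref) - min_len + 1):
        pat = ref[i:i + min_len]
        if pat in seen:
            continue
        seen.add(pat)
        r, q = occurrences(ref, pat), occurrences(qry, pat)
        if len(r) == 1 and len(q) == 1:   # unique in both: a usable anchor
            yield r[0], q[0], min_len

ref = "TTACGGATC"
qry = "GGATCAATT"
print(list(unique_matches(ref, qry, 5)))  # the shared 'GGATC' anchor
```

After anchoring, the full aligner chains compatible anchors and fills the gaps between them with conventional alignment, which is far cheaper than aligning the whole genomes directly.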