200,266 result(s) for "Sequence analysis"
Tackling the widespread and critical impact of batch effects in high-throughput data
Batch effects can lead to incorrect biological conclusions but are not widely considered. The authors show that batch effects are relevant to a range of high-throughput 'omics' data sets and are crucial to address. They also explain how batch effects can be mitigated. High-throughput technologies are widely used, for example to assay genetic variants, gene and protein expression, and epigenetic modifications. One often overlooked complication with such studies is batch effects, which occur because measurements are affected by laboratory conditions, reagent lots and personnel differences. This becomes a major problem when batch effects are correlated with an outcome of interest and lead to incorrect conclusions. Using both published studies and our own analyses, we argue that batch effects (as well as other technical and biological artefacts) are widespread and critical to address. We review experimental and computational approaches for doing so.
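The mitigation the abstract alludes to can be illustrated with a minimal sketch: per-batch mean-centering, a simplified location adjustment (real tools such as ComBat also model variance and biological covariates; all numbers below are invented):

```python
def center_by_batch(values, batches):
    """Subtract each batch's mean so batch-level shifts cancel out."""
    groups = {}
    for v, b in zip(values, batches):
        groups.setdefault(b, []).append(v)
    means = {b: sum(vs) / len(vs) for b, vs in groups.items()}
    return [v - means[b] for v, b in zip(values, batches)]

# Two batches measuring the same quantity, batch B shifted by +5:
expr = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
batches = ["A", "A", "A", "B", "B", "B"]
print(center_by_batch(expr, batches))  # → [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
```

After centering, the two batches are directly comparable; the danger the authors describe arises when the +5 shift is instead mistaken for a biological signal.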
Microbial resolution of whole genome shotgun and 16S amplicon metagenomic sequencing using publicly available NEON data
Microorganisms are ubiquitous in the biosphere, playing a crucial role in both the biogeochemistry of the planet and human health. However, identifying these microorganisms and defining their functions are challenging. Widely used approaches in comparative metagenomics, 16S amplicon sequencing and whole genome shotgun sequencing (WGS), have provided access to DNA sequencing analysis to identify microorganisms and evaluate diversity and abundance in various environments. However, advances in parallel high-throughput DNA sequencing in the past decade have introduced major hurdles, namely standardization of methods, data storage, reproducible interoperability of results, and data sharing. The National Ecological Observatory Network (NEON), established by the National Science Foundation, enables researchers to address queries on a regional to continental scale around a variety of environmental challenges, and provides high-quality, integrated, and standardized data from field sites across the U.S. As the amount of metagenomic data continues to grow, standardized procedures that allow results across projects to be assessed and compared are becoming increasingly important in the field of metagenomics. We demonstrate the feasibility of using publicly available NEON soil metagenomic sequencing datasets in combination with the open-access Metagenomics Rapid Annotation using Subsystem Technology (MG-RAST) server to illustrate the advantages of WGS compared with 16S amplicon sequencing. Four WGS and four 16S amplicon sequence datasets were selected for comparison, derived from surface soil samples collected at the same locations in Colorado between April and July 2014 and prepared by NEON investigators using standardized protocols. The dominant bacterial phyla detected across samples agreed between sequencing methodologies.
However, WGS yielded greater microbial resolution, increased accuracy, and allowed identification of more genera of bacteria, archaea, viruses, and eukaryota, and putative functional genes that would have gone undetected using 16S amplicon sequencing. NEON open data will be useful for future studies characterizing and quantifying complex ecological processes associated with changing aquatic and terrestrial ecosystems.
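Diversity comparisons of the kind described above are often summarized with an index such as Shannon's H'; a minimal sketch using invented genus-level counts (not NEON data), where the dataset resolving more taxa scores higher:

```python
import math

def shannon(counts):
    """Shannon diversity index H' = -sum(p_i * ln p_i) over taxon counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Hypothetical counts: the WGS-style profile resolves more genera.
counts_16s = [50, 30, 20]
counts_wgs = [40, 25, 15, 10, 10]
print(round(shannon(counts_16s), 3))  # → 1.03
print(round(shannon(counts_wgs), 3))  # → 1.458
```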
Single-cell meta-analysis of SARS-CoV-2 entry genes across tissues and demographics
Angiotensin-converting enzyme 2 (ACE2) and accessory proteases (TMPRSS2 and CTSL) are needed for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) cellular entry, and their expression may shed light on viral tropism and impact across the body. We assessed the cell-type-specific expression of ACE2, TMPRSS2 and CTSL across 107 single-cell RNA-sequencing studies from different tissues. ACE2, TMPRSS2 and CTSL are coexpressed in specific subsets of respiratory epithelial cells in the nasal passages, airways and alveoli, and in cells from other organs associated with coronavirus disease 2019 (COVID-19) transmission or pathology. We performed a meta-analysis of 31 lung single-cell RNA-sequencing studies with 1,320,896 cells from 377 nasal, airway and lung parenchyma samples from 228 individuals. This revealed cell-type-specific associations of age, sex and smoking with expression levels of ACE2, TMPRSS2 and CTSL. Expression of entry factors increased with age and in males, including in airway secretory cells and alveolar type 2 cells. Expression programs shared by ACE2+TMPRSS2+ cells in nasal, lung and gut tissues included genes that may mediate viral entry, key immune functions and epithelial–macrophage cross-talk, such as genes involved in the interleukin-6, interleukin-1, tumor necrosis factor and complement pathways. Cell-type-specific expression patterns may contribute to the pathogenesis of COVID-19, and our work highlights putative molecular pathways for therapeutic intervention. An integrated analysis of over 100 single-cell and single-nucleus transcriptomics studies illustrates severe acute respiratory syndrome coronavirus 2 viral entry gene coexpression patterns across different human tissues, and shows association of age, smoking status and sex with viral entry gene expression in respiratory cell populations.
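The coexpression criterion described above amounts to counting "double-positive" cells in a cell-by-gene count matrix; a toy sketch on made-up counts (not the paper's data):

```python
# Each dict holds read counts per entry gene for one cell (invented numbers).
cells = [
    {"ACE2": 2, "TMPRSS2": 1, "CTSL": 0},
    {"ACE2": 0, "TMPRSS2": 3, "CTSL": 5},
    {"ACE2": 1, "TMPRSS2": 0, "CTSL": 2},
    {"ACE2": 4, "TMPRSS2": 2, "CTSL": 1},
]

# A cell is ACE2+TMPRSS2+ if both genes have nonzero counts.
double_pos = sum(1 for c in cells if c["ACE2"] > 0 and c["TMPRSS2"] > 0)
print(f"{double_pos}/{len(cells)} cells are ACE2+TMPRSS2+")  # → 2/4
```

Real analyses additionally normalize for sequencing depth and model covariates such as age and sex, but the per-cell thresholding idea is the same.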
Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays
Genome sequencing of large numbers of individuals promises to advance the understanding, treatment, and prevention of human diseases, among other applications. We describe a genome sequencing platform that achieves efficient imaging and low reagent consumption with combinatorial probe anchor ligation chemistry to independently assay each base from patterned nanoarrays of self-assembling DNA nanoballs. We sequenced three human genomes with this platform, generating an average of 45-to 87-fold coverage per genome and identifying 3.2 to 4.5 million sequence variants per genome. Validation of one genome data set demonstrates a sequence accuracy of about 1 false variant per 100 kilobases. The high accuracy, affordable cost of $4400 for sequencing consumables, and scalability of this platform enable complete human genome sequencing for the detection of rare variants in large-scale genetic studies.
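The quoted accuracy of about 1 false variant per 100 kilobases can be put in perspective with a quick back-of-envelope calculation (the ~3.2 Gb haploid genome size is an assumption added here, not a figure from the paper):

```python
# Expected false variants genome-wide at the stated error rate.
genome_size = 3_200_000_000           # bases, approximate haploid human genome
false_variants_per_base = 1 / 100_000  # ~1 false variant per 100 kb

expected_false_variants = genome_size * false_variants_per_base
print(int(expected_false_variants))  # → 32000
```

Set against the 3.2 to 4.5 million variants called per genome, this is on the order of a 1% false-call burden, which is why validation against an orthogonal dataset matters.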
Repetitive DNA and next-generation sequencing: computational challenges and solutions
Key Points: New high-throughput sequencing technologies have spurred explosive growth in the use of sequencing to discover mutations and structural variants in the human genome and in the number of projects to sequence and assemble new genomes. Highly efficient algorithms have been developed to align next-generation sequences to genomes, and these algorithms use a variety of strategies to place repetitive reads. Ambiguous mapping of sequences that are derived from repetitive regions makes it difficult to identify true polymorphisms and to reconstruct transcripts. Short read lengths combined with mapping ambiguities lead to false reports of single-nucleotide polymorphisms, insertions, deletions and other sequence variants. When assembling a genome de novo, repetitive sequences can lead to erroneous rearrangements, deletions, collapsed repeats and other assembly errors. Long-range linking information from paired-end reads can overcome some of the difficulties in short-read assembly. Repeat sequences in DNA remain one of the most challenging aspects of next-generation sequencing data analysis and interpretation. This Review explains the problems and current strategies for handling repeats; ignoring repeats risks missing important biological information. Repetitive DNA sequences are abundant in a broad range of species, from bacteria to mammals, and they cover nearly half of the human genome. Repeats have always presented technical challenges for sequence alignment and assembly programs. Next-generation sequencing projects, with their short read lengths and high data volumes, have made these challenges more difficult. From a computational perspective, repeats create ambiguities in alignment and assembly, which, in turn, can produce biases and errors when interpreting results. Simply ignoring repeats is not an option, as this creates problems of its own and may mean that important biological phenomena are missed. We discuss the computational problems surrounding repeats and describe strategies used by current bioinformatics systems to solve them.
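The mapping ambiguity described above can be seen in miniature by exact-matching a short read against a reference containing a duplicated segment (sequences invented for illustration; real aligners use indexed, inexact matching):

```python
def exact_hits(read, reference):
    """Return every position where `read` matches `reference` exactly;
    a read drawn from a repeat maps to more than one place."""
    hits, start = [], reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits

# Toy reference in which the segment "GATTACA" occurs twice:
ref = "TTGATTACACCGGATTACATT"
print(exact_hits("GATTACA", ref))  # → [2, 12]  (ambiguous placement)
print(exact_hits("CCGG", ref))     # → [9]      (unique placement)
```

A variant caller seeing the first read cannot tell which copy of the repeat it came from, which is exactly how repeat-derived reads produce false variant calls.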
Benchmarking single-cell RNA-sequencing protocols for cell atlas projects
Single-cell RNA sequencing (scRNA-seq) is the leading technique for characterizing the transcriptomes of individual cells in a sample. The latest protocols are scalable to thousands of cells and are being used to compile cell atlases of tissues, organs and organisms. However, the protocols differ substantially with respect to their RNA capture efficiency, bias, scale and costs, and their relative advantages for different applications are unclear. In the present study, we generated benchmark datasets to systematically evaluate protocols in terms of their power to comprehensively describe cell types and states. We performed a multicenter study comparing 13 commonly used scRNA-seq and single-nucleus RNA-seq protocols applied to a heterogeneous reference sample resource. Comparative analysis revealed marked differences in protocol performance. The protocols differed in library complexity and their ability to detect cell-type markers, impacting their predictive value and suitability for integration into reference cell atlases. These results provide guidance both for individual researchers and for consortium projects such as the Human Cell Atlas. A multicenter study compares 13 commonly used single-cell RNA-seq protocols.
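The library-complexity differences reported above are often summarized as genes detected per cell; a toy sketch on invented count matrices (not the benchmark data):

```python
def genes_detected(cell_counts):
    """Genes with nonzero counts per cell: a simple complexity proxy."""
    return [sum(1 for c in cell if c > 0) for cell in cell_counts]

# Two hypothetical protocols, two cells each, four genes per cell:
protocol_a = [[3, 0, 1, 2], [0, 0, 4, 1]]
protocol_b = [[1, 1, 1, 1], [2, 0, 3, 5]]

print(genes_detected(protocol_a))  # → [3, 2]
print(genes_detected(protocol_b))  # → [4, 3]
```

A protocol that detects more genes per cell at matched sequencing depth gives better power to find cell-type markers, which is the axis on which the benchmarked protocols diverged.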
Next-generation transcriptome assembly
Key Points: The protocols used for library construction, sequencing and data pre-processing can have a great impact on the quality of an assembled transcriptome and the accuracy of gene expression quantification. Before starting an RNA sequencing (RNA-seq) experiment, one should carefully consider using protocols that are strand-specific, that remove ribosomal RNA and that do not require PCR amplification of the template. Strand-specific RNA-seq protocols are important for correctly assembling overlapping transcripts, especially for compact genomes. The reference-based, or ab initio, assembly strategy requires a reference genome and uses far fewer computing resources than the de novo strategy. However, the quality of the genome and the ability of the short-read aligner to align reads across introns will directly influence the accuracy of the assembled transcripts when using the reference-based strategy. The de novo assembly strategy does not use a reference genome but instead uses a De Bruijn graph to represent overlaps between sequences and assemble transcripts. Most de novo approaches require significant computing resources: random access memory (RAM) is the typical limitation. However, de novo assemblers can assemble trans-spliced genes and novel transcripts that are not present in the genome assembly. To take full advantage of the current assembly strategies, a combined assembly approach should be considered that leverages the strengths of reference-based and de novo assembly strategies. Most transcriptome assemblers are still being developed, and the results from these programs should be evaluated using unbiased quantitative metrics. Transcriptome assembly involves an informatics approach to solve an experimental limitation. As sequencing strategies continually improve, it may no longer be necessary in the near future to assemble transcriptomes, as the read length will be longer than any individual transcript. Advances in sequencing technologies, assembly algorithms and computing power are making it feasible to assemble the entire transcriptome from short RNA reads. The article reviews the transcriptome assembly strategies, their advantages and limitations and how to apply them effectively. Transcriptomics studies often rely on partial reference transcriptomes that fail to capture the full catalogue of transcripts and their variations. Recent advances in sequencing technologies and assembly algorithms have facilitated the reconstruction of the entire transcriptome by deep RNA sequencing (RNA-seq), even without a reference genome. However, transcriptome assembly from billions of RNA-seq reads, which are often very short, poses a significant informatics challenge. This Review summarizes the recent developments in transcriptome assembly approaches (reference-based, de novo and combined strategies) along with some perspectives on transcriptome assembly in the near future.
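The De Bruijn graph construction mentioned above can be sketched in a few lines: nodes are (k-1)-mers and each k-mer becomes an edge between its prefix and suffix (toy reads, not a real assembler):

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a De Bruijn graph: nodes are (k-1)-mers, k-mers become edges."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # prefix -> suffix edge
    return dict(graph)

# Overlapping short reads from the toy transcript "ATGGCGTGCA":
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
g = de_bruijn(reads, 4)
print(g["ATG"])  # → ['TGG']
```

Following edges from "ATG" spells out the original transcript; a real assembler would also collapse the duplicate edges that overlapping reads create and resolve branch points caused by repeats and shared exons.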
Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing
Background: The rapid evolution of 454 GS-FLX sequencing technology has not been accompanied by a reassessment of the quality and accuracy of the sequences obtained. Current strategies for decision-making and error-correction are based on an initial analysis by Huse et al. in 2007, for the older GS20 system based on experimental sequences. We analyze here the quality of 454 sequencing data and identify factors playing a role in sequencing error, through the use of an extensive dataset for Roche control DNA fragments. Results: We obtained a mean error rate for 454 sequences of 1.07%. More importantly, the error rate is not randomly distributed; it occasionally rose to more than 50% in certain positions, and its distribution was linked to several experimental variables. The main factors related to error are the presence of homopolymers, position in the sequence, size of the sequence and spatial localization in PT plates for insertion and deletion errors. These factors can be described by considering seven variables. No single variable can account for the error rate distribution, but most of the variation is explained by the combination of all seven variables. Conclusions: The pattern identified here calls for the use of internal controls and error-correcting base callers, to correct for errors, when available (e.g. when sequencing amplicons). For shotgun libraries, the use of both sequencing primers and deep coverage, combined with the use of random sequencing primer sites, should partly compensate for even high error rates, although it may prove more difficult than previously thought to distinguish between low-frequency alleles and errors.
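Homopolymers, the main error factor identified above, are straightforward to locate in a read; a minimal run-finder sketch (the length threshold of 3 is an arbitrary choice for illustration):

```python
def homopolymer_runs(seq, min_len=3):
    """Return (start, base, length) for each run of identical bases
    of at least `min_len`; long runs drive 454 insertion/deletion errors."""
    runs, i = [], 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        if j - i >= min_len:
            runs.append((i, seq[i], j - i))
        i = j
    return runs

print(homopolymer_runs("ACGTTTTTACGGGA"))  # → [(3, 'T', 5), (10, 'G', 3)]
```

Flagging such positions is one way an error-correcting base caller or downstream filter can weight its confidence in indel calls.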
Real-time, portable genome sequencing for Ebola surveillance
A nanopore DNA sequencer (the MinION) is used for real-time genomic surveillance of the Ebola virus epidemic in the field in Guinea; the authors demonstrate that it is possible to pack a genomic surveillance laboratory in a suitcase and transport it to the field for on-site virus sequencing, generating results within 24 hours of sample collection. The Ebola virus disease epidemic in West Africa is the largest on record, responsible for over 28,599 cases and more than 11,299 deaths [1]. Genome sequencing in viral outbreaks is desirable to characterize the infectious agent and determine its evolutionary rate. Genome sequencing also allows the identification of signatures of host adaptation, identification and monitoring of diagnostic targets, and characterization of responses to vaccines and treatments. The Ebola virus (EBOV) genome substitution rate in the Makona strain has been estimated at between 0.87 × 10⁻³ and 1.42 × 10⁻³ mutations per site per year. This is equivalent to 16–27 mutations in each genome, meaning that sequences diverge rapidly enough to identify distinct sub-lineages during a prolonged epidemic [2–7]. Genome sequencing provides a high-resolution view of pathogen evolution and is increasingly sought after for outbreak surveillance. Sequence data may be used to guide control measures, but only if the results are generated quickly enough to inform interventions [8]. Genomic surveillance during the epidemic has been sporadic owing to a lack of local sequencing capacity coupled with practical difficulties transporting samples to remote sequencing facilities [9]. To address this problem, here we devise a genomic surveillance system that utilizes a novel nanopore DNA sequencing instrument. In April 2015 this system was transported in standard airline luggage to Guinea and used for real-time genomic surveillance of the ongoing epidemic. We present sequence data and analysis of 142 EBOV samples collected during the period March to October 2015. We were able to generate results less than 24 h after receiving an Ebola-positive sample, with the sequencing process taking as little as 15–60 min. We show that real-time genomic surveillance is possible in resource-limited settings and can be established rapidly to monitor outbreaks.
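The quoted figure of 16–27 mutations per genome follows from multiplying the substitution rate by the genome length; a quick check, assuming an EBOV genome of ~18,959 bases (the length of the Makona reference is an assumption added here):

```python
# Expected substitutions per genome per year at the two quoted rates.
genome_len = 18_959            # bases (assumed EBOV genome length)
low, high = 0.87e-3, 1.42e-3   # substitutions per site per year

print(round(low * genome_len), round(high * genome_len))  # → 16 27
```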
DNA sequencing at 40: past, present and future
This review commemorates the 40th anniversary of DNA sequencing, a period in which we have already witnessed multiple technological revolutions and a growth in scale from a few kilobases to the first human genome, and now to millions of human and a myriad of other genomes. DNA sequencing has been extensively and creatively repurposed, including as a 'counter' for a vast range of molecular phenomena. We predict that in the long view of history, the impact of DNA sequencing will be on a par with that of the microscope. The history and future potential of DNA sequencing, including the development of the underlying technologies and the expansion of its areas of application, are reviewed. This year marks the 40th anniversary of the Sanger method for DNA sequencing, the most widely used sequencing method, pioneered by Fred Sanger and his team in 1977. Jay Shendure and colleagues review the evolution of sequencing technologies since their inception, highlighting the major milestones in the development, analyses and applications of genome sequencing over the past 40 years. Despite multiple technological revolutions and growth in scale, the authors see DNA sequencing as a relatively nascent technology in the grand scheme of scientific history. They review current emerging applications and discuss the continued evolution and future of DNA sequencing, from population-scale resequencing to networks of portable sensors used for real-time monitoring in environmental settings.