Catalogue Search | MBRL
Explore the vast range of titles available.
1,409 result(s) for "sequencing errors"
Sequencing accuracy and systematic errors of nanopore direct RNA sequencing
2024
Background
Direct RNA sequencing (dRNA-seq) on the Oxford Nanopore Technologies (ONT) platforms can produce reads covering up to full-length gene transcripts, while containing decipherable information about RNA base modifications and poly-A tail lengths. Although many published studies have been expanding the potential of dRNA-seq, its sequencing accuracy and error patterns remain understudied.
Results
We present the first comprehensive evaluation of sequencing accuracy and characterisation of systematic errors in dRNA-seq data from diverse organisms and synthetic in vitro transcribed RNAs. We found that for sequencing kits SQK-RNA001 and SQK-RNA002, the median read accuracy ranged from 87% to 92% across species, and deletions significantly outnumbered mismatches and insertions. Due to their high abundance in the transcriptome, heteropolymers and short homopolymers were the major contributors to the overall sequencing errors. We also observed systematic biases across all species at the levels of single nucleotides and motifs. In general, cytosine/uracil-rich regions were more likely to be erroneous than guanines and adenines. By examining raw signal data, we identified the underlying signal-level features potentially associated with the error patterns and their dependency on sequence contexts. While read quality scores can be used to approximate error rates at base and read levels, failure to detect DNA adapters may be a source of errors and data loss. By comparing distinct basecallers, we reason that some sequencing errors are attributable to signal insufficiency rather than algorithmic (basecalling) artefacts. Lastly, we generated dRNA-seq data using the latest SQK-RNA004 sequencing kit, released at the end of 2023, and found that although the overall read accuracy increased, the systematic errors remain largely identical to those of the previous kits.
Conclusions
As the first systematic investigation of dRNA-seq errors, this study offers a comprehensive overview of reproducible error patterns across diverse datasets, identifies potential signal-level insufficiency, and lays the foundation for error correction methods.
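For readers who want to reproduce this kind of accuracy profiling, a minimal sketch follows: it derives per-read accuracy and a mismatch/insertion/deletion breakdown from an alignment BAM using pysam. The file layout, the presence of NM tags, and the accuracy definition (1 − edit distance / aligned length) are assumptions for illustration, not the authors' pipeline.

```python
# Minimal sketch: per-read accuracy and error-type breakdown from alignments.
# Assumes a coordinate-sorted BAM with NM tags; not the paper's exact method.
import pysam

def read_error_profile(bam_path):
    """Yield (accuracy, mismatches, insertions, deletions) per aligned read."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            # CIGAR op codes: 0 = M (match/mismatch), 1 = I (insertion), 2 = D (deletion)
            ops = dict.fromkeys((0, 1, 2), 0)
            for op, length in read.cigartuples:
                if op in ops:
                    ops[op] += length
            nm = read.get_tag("NM")              # edit distance: mismatches + indel bases
            mismatches = nm - ops[1] - ops[2]
            aligned = ops[0] + ops[1] + ops[2]
            accuracy = 1 - nm / aligned if aligned else 0.0
            yield accuracy, mismatches, ops[1], ops[2]
```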
Journal Article
NGmerge: merging paired-end reads via novel empirically-derived models of sequencing errors
2018
Background
Advances in Illumina DNA sequencing technology have produced longer paired-end reads that increasingly have sequence overlaps. These reads can be merged into a single read that spans the full length of the original DNA fragment, allowing for error correction and accurate determination of read coverage. Extant merging programs utilize simplistic or unverified models for the selection of bases and quality scores for the overlapping region of merged reads.
Results
We first examined the baseline quality score–error rate relationship using sequence reads derived from PhiX. In contrast to numerous published reports, we found that the quality scores produced by Illumina were not substantially inflated above the theoretical values once the reference genome was corrected for unreported sequence variants. The PhiX reads were then used to create empirical models of sequencing errors in overlapping regions of paired-end reads, and these models were incorporated into a novel merging program, NGmerge. We demonstrate that NGmerge corrects errors and ambiguous bases better than other merging programs, and that it assigns quality scores for merged bases that accurately reflect the error rates. Our results also show that, contrary to published analyses, the sequencing errors of paired-end reads are not independent.
Conclusions
We provide a free and open-source program, NGmerge, that performs better than existing read-merging programs. NGmerge is available on GitHub (https://github.com/harvardinformatics/NGmerge) under the MIT License; it is written in C and supported on Linux.
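The core idea of quality-aware overlap merging can be sketched in a few lines. This is a generic illustration, not NGmerge's empirically derived model; the agreement/disagreement rules and all names are illustrative assumptions. Phred scores relate to error probability by Q = −10·log10(p).

```python
# Minimal sketch of quality-aware merging of the overlapping segment of a
# read pair. NGmerge instead uses empirically fitted quality tables.

def merge_overlap(seq1, qual1, seq2, qual2):
    """Merge two equal-length overlapping segments base by base (Phred quals)."""
    merged_seq, merged_qual = [], []
    for b1, q1, b2, q2 in zip(seq1, qual1, seq2, qual2):
        if b1 == b2:
            # Agreement: confidence rises; a common simple rule keeps the max.
            merged_seq.append(b1)
            merged_qual.append(max(q1, q2))
        else:
            # Disagreement: keep the higher-quality base, reduce confidence.
            merged_seq.append(b1 if q1 >= q2 else b2)
            merged_qual.append(abs(q1 - q2))
    return "".join(merged_seq), merged_qual

print(merge_overlap("ACGT", [30, 30, 12, 30], "ACAT", [28, 31, 25, 30]))
# ('ACAT', [30, 31, 13, 30])
```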
Journal Article
PacBio metabarcoding of Fungi and other eukaryotes
by Ave Tooming-Klunderud, Sten Anslan, Leho Tedersoo
in amplification bias, Bar codes, barcoding
2018
Second-generation, high-throughput sequencing methods have greatly improved our understanding of the ecology of soil microorganisms, yet the short barcodes (< 500 bp) provide limited taxonomic and phylogenetic information for species discrimination and taxonomic assignment.
Here, we utilized the third-generation Pacific Biosciences (PacBio) RSII and Sequel instruments to evaluate the suitability of full-length internal transcribed spacer (ITS) barcodes and longer rRNA gene amplicons for metabarcoding Fungi, Oomycetes and other eukaryotes in soil samples.
Metabarcoding revealed multiple errors and biases: Taq polymerase substitution errors and indels mis-incorporated when sequencing homopolymers constitute the major errors; sequence-length biases arise during PCR, library preparation, loading onto the sequencing instrument and quality filtering; and primer–template mismatches bias the taxonomic profile when regular or highly degenerate primers are used.
The RSII and Sequel platforms enable the sequencing of amplicons up to 3000 bp, but the sequence quality remains slightly inferior to Illumina sequencing especially in longer amplicons. The full ITS barcode and flanking rRNA small subunit gene greatly improve taxonomic identification at the species and phylum levels, respectively. We conclude that PacBio sequencing provides a viable alternative for metabarcoding of organisms that are of relatively low diversity, require > 500-bp barcode for reliable identification or when phylogenetic approaches are intended.
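Since homopolymer indels recur as a major error class here, a small sketch of flagging homopolymer runs in a sequence may be useful; the 4-bp threshold is an illustrative assumption.

```python
# Minimal sketch: locate homopolymer runs, the indel-prone contexts noted above.
import re

def homopolymer_runs(seq, min_len=4):
    """Return (start, end, base) for each homopolymer run of min_len or more."""
    return [(m.start(), m.end(), m.group()[0])
            for m in re.finditer(r"([ACGT])\1{%d,}" % (min_len - 1), seq)]

print(homopolymer_runs("ACGTTTTTACGGGGA"))  # [(3, 8, 'T'), (10, 14, 'G')]
```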
Journal Article
Illumina error profiles: resolving fine-scale variation in metagenomic sequencing data
by Ijaz, Umer Z.; Quince, Christopher; D’Amore, Rosalinda
in Algorithms, Antibiotic resistance, Archaea - genetics
2016
Background
Illumina’s sequencing platforms are currently the most utilised sequencing systems worldwide. The technology has rapidly evolved over recent years and provides high throughput at low costs with increasing read-lengths and true paired-end reads. However, data from any sequencing technology contains noise and our understanding of the peculiarities and sequencing errors encountered in Illumina data has lagged behind this rapid development.
Results
We conducted a systematic investigation of errors and biases in Illumina data based on the largest collection of in vitro metagenomic data sets to date. We evaluated the Genome Analyzer II, HiSeq and MiSeq and tested state-of-the-art low-input library preparation methods. Analysing in vitro metagenomic sequencing data allowed us to determine biases directly associated with the actual sequencing process. The position- and nucleotide-specific analysis revealed a substantial bias related to motifs (3-mers preceding errors) ending in “GG”. On average, the top three motifs were linked to 16% of all substitution errors. Furthermore, a preferential incorporation of ddGTPs was recorded. We hypothesise that all of these biases are related to the engineered polymerase and ddNTPs which are intrinsic to any sequencing-by-synthesis method. We show that quality-score-based error removal strategies can on average remove 69% of the substitution errors; however, the motif bias remains.
Conclusion
Single-nucleotide polymorphism changes in bacterial genomes can cause significant changes in phenotype, including antibiotic resistance and virulence; detecting them within metagenomes is therefore vital. Current error removal techniques are not designed to target the peculiarities encountered in Illumina sequencing data and other sequencing-by-synthesis methods, causing biases to persist and potentially affect any conclusions drawn from the data. In order to develop effective diagnostic and therapeutic approaches, we need to be able to identify systematic sequencing errors and distinguish these errors from true genetic variation.
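The motif analysis described above can be approximated with a simple tally of the k-mers immediately preceding each substitution error. This sketch assumes a reference sequence and a list of error positions are already in hand; it is not the authors' code.

```python
# Minimal sketch: tally the 3-mers immediately preceding substitution errors
# (in the study above, motifs ending in "GG" dominated).
from collections import Counter

def motif_counts(reference, error_positions, k=3):
    """Count the reference k-mer ending just before each error position."""
    counts = Counter()
    for pos in error_positions:
        if pos >= k:
            counts[reference[pos - k:pos]] += 1
    return counts

ref = "ACGGTACGGA"
print(motif_counts(ref, [4, 9]).most_common())  # [('CGG', 2)]
```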
Journal Article
Quality filtering of Illumina index reads mitigates sample cross-talk
by Wright, Erik Scott; Vetsigian, Kalin Horen
in Analysis, Animal Genetics and Genomics, Biomedical and Life Sciences
2016
Background
Multiplexing multiple samples during Illumina sequencing is a common practice and is rapidly growing in importance as the throughput of the platform increases. Misassignments during de-multiplexing, where sequences are associated with the wrong sample, are an overlooked error mode on the Illumina sequencing platform. They produce a low rate of cross-talk among multiplexed samples, which can nonetheless be detrimental in studies requiring the detection of rare variants or when multiplexing a large number of samples.
Results
We observed rates of cross-talk averaging 0.24 % when multiplexing 14 different samples with unique i5 and i7 index sequences. This cross-talk rate corresponded to 254,632 misassigned reads on a single lane of the Illumina HiSeq 2500. Notably, all types of misassignment occur at similar rates: incorrect i5, incorrect i7, and incorrect sequence reads. We demonstrate that misassignments can be nearly eliminated by quality filtering of index reads while preserving about 90 % of the original sequences.
Conclusions
Cross-talk among multiplexed samples is a significant error mode on the Illumina platform, especially if samples are only separated by a single unique index. Quality filtering of index sequences offers an effective solution for minimizing cross-talk among samples. Furthermore, we propose a straightforward method for verifying the extent of cross-talk between samples and optimizing quality score thresholds that does not require additional control samples and can even be performed post hoc on previous runs.
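The proposed mitigation reduces to a per-read quality gate on the index reads. A minimal sketch follows; the mean-quality statistic, the threshold of 30, and the function name are illustrative assumptions rather than the authors' exact criterion.

```python
# Minimal sketch: keep a read pair only if both index (barcode) reads pass a
# mean-quality threshold, the cross-talk mitigation described above.

def passes_index_filter(i5_quals, i7_quals, min_mean_q=30):
    """Return True if both index reads have mean Phred quality >= threshold."""
    mean = lambda qs: sum(qs) / len(qs)
    return mean(i5_quals) >= min_mean_q and mean(i7_quals) >= min_mean_q

print(passes_index_filter([35, 34, 36, 33], [12, 15, 20, 18]))  # False
```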
Journal Article
ConDeTri - A Content Dependent Read Trimmer for Illumina Data
2011
During the last few years, DNA and RNA sequencing have started to play an increasingly important role in biological and medical applications, driven by the greater volume of data yielded by new sequencing machines and the enormous decrease in sequencing costs. Illumina/Solexa sequencing in particular has had a growing impact on gathering data from model and non-model organisms. However, accurate and easy-to-use tools for quality filtering have not yet been established. We present ConDeTri, a method for content-dependent read trimming of next-generation sequencing data that uses the quality score of each individual base. The main focus of the method is to remove sequencing errors from reads so that sequencing reads can be standardized. Another aim is to make read trimming easy to incorporate into next-generation sequencing data processing and analysis pipelines. ConDeTri can process single-end and paired-end sequence data of arbitrary length, independently of sequencing coverage and without user interaction. It can trim and remove reads with low quality scores, saving computational time and memory during de novo assemblies. Low-coverage and large-genome sequencing projects will especially gain from trimming reads. The method can easily be incorporated into preprocessing and analysis pipelines for Illumina data.
Freely available on the web at http://code.google.com/p/condetri.
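As a rough illustration of quality-based trimming (ConDeTri's actual criteria are content dependent and more elaborate), here is a minimal 3'-end trimmer driven by per-base Phred scores; the threshold is an illustrative assumption.

```python
# Minimal sketch of quality-based 3' trimming from per-base Phred scores.

def trim_3prime(seq, quals, min_q=25):
    """Trim low-quality bases from the 3' end until a base >= min_q is found."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

print(trim_3prime("ACGTACGT", [38, 37, 36, 35, 30, 12, 8, 2]))
# ('ACGTA', [38, 37, 36, 35, 30])
```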
Journal Article
Hybrid-hybrid correction of errors in long reads with HERO
by Kang, Xiongbin; Xu, Jialu; Schönhuth, Alexander
in Accuracy, Algorithms, Animal Genetics and Genomics
2023
Although generally superior, hybrid approaches for correcting errors in third-generation sequencing (TGS) reads, using next-generation sequencing (NGS) reads, mistake haplotype-specific variants for errors in polyploid and mixed samples. We suggest HERO, as the first “hybrid-hybrid” approach, to make use of both de Bruijn graphs and overlap graphs for optimal catering to the particular strengths of NGS and TGS reads. Extensive benchmarking experiments demonstrate that HERO improves indel and mismatch error rates by on average 65% (27∼95%) and 20% (4∼61%). Using HERO prior to genome assembly significantly improves the assemblies in the majority of the relevant categories.
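Of the two graph structures HERO combines, the de Bruijn graph is easy to sketch. The following toy construction maps each (k−1)-mer to its observed successors; the value of k and the representation are illustrative assumptions, not HERO's implementation.

```python
# Minimal sketch: build a de Bruijn graph from short reads, one of the two
# graph structures combined by hybrid-hybrid correction (with overlap graphs).
from collections import defaultdict

def de_bruijn(reads, k=4):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

for node, succs in sorted(de_bruijn(["ACGTAC", "CGTACG"]).items()):
    print(node, "->", sorted(succs))
```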
Journal Article
Accounting for Errors in Low Coverage High-Throughput Sequencing Data When Constructing Genetic Maps Using Biparental Outcrossed Populations
by Bilton, Timothy P; Schofield, Matthew R; Black, Michael A
in Algorithms, Bioinformatics, Construction
2018
Next-generation sequencing is an efficient method that allows for substantially more markers than previous technologies, providing opportunities for building high-density genetic linkage maps, which facilitate the development of nonmodel species’ genomic assemblies and the investigation of their genes. However, constructing genetic maps using data generated via high-throughput sequencing technology (e.g., genotyping-by-sequencing) is complicated by the presence of sequencing errors and genotyping errors resulting from missing parental alleles due to low sequencing depth. If unaccounted for, these errors lead to inflated genetic maps. In addition, map construction in many species is performed using full-sibling family populations derived from the outcrossing of two individuals, where unknown parental phase and varying segregation types further complicate construction. We present a new methodology for modeling low-coverage sequencing data in the construction of genetic linkage maps using full-sibling populations of diploid species, implemented in a package called GUSMap. Our model is based on the Lander–Green hidden Markov model but extended to account for errors present in sequencing data. We were able to obtain accurate estimates of the recombination fractions and overall map distance using GUSMap, while most existing mapping packages produced inflated genetic maps in the presence of errors. Our results demonstrate the feasibility of using low-coverage sequencing data to produce genetic maps without requiring extensive filtering of potentially erroneous genotypes, provided that the associated errors are correctly accounted for in the model.
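The key modelling move, folding a sequencing error rate into the emission probabilities of a Lander–Green-style HMM, can be illustrated with a toy binomial emission term. The parameterisation below is an assumption for illustration, not GUSMap's actual model.

```python
# Minimal sketch: P(observed allele counts | true genotype) with a per-read
# error rate eps folded in, the kind of emission term added to the HMM.
from math import comb

def emission_prob(ref_count, alt_count, genotype, eps=0.01):
    """Binomial emission: genotype is the reference-allele dosage in {0, 1, 2}."""
    # A heterozygote stays at 0.5 because symmetric errors cancel out.
    p_ref = {0: eps, 1: 0.5, 2: 1 - eps}[genotype]
    n = ref_count + alt_count
    return comb(n, ref_count) * p_ref**ref_count * (1 - p_ref)**alt_count

# At depth 2, a heterozygote often shows only one allele, which naive callers
# misread as homozygous and which inflates maps if unmodelled:
print(emission_prob(2, 0, genotype=1))  # 0.25
```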
Journal Article
Characterization and mitigation of artifacts derived from NGS library preparation due to structure-specific sequences in the human genome
by Qian, Nian-Song; Yang, ChunYan; Chen, HuiJuan
in Algorithms, Animal Genetics and Genomics, Biomedical and Life Sciences
2024
Background
Hybridization capture-based targeted next-generation sequencing (NGS) is gaining importance in routine cancer clinical practice. DNA library preparation is a fundamental step in producing high-quality sequencing data. Numerous unexpected calls with low variant allele frequency were observed in libraries prepared using sonication fragmentation and enzymatic fragmentation. In this study, we investigated the characteristics of the artifact reads induced by sonication and enzymatic fragmentation. We also developed a bioinformatic algorithm to filter these sequencing errors.
Results
We used pairwise comparisons of somatic single nucleotide variants (SNVs) and insertions and deletions (indels) of the same tumor DNA samples prepared using both ultrasonic and enzymatic fragmentation protocols. Our analysis revealed that the number of artifact variants was significantly greater in the samples generated using enzymatic fragmentation than in those generated using sonication. Most of the artifacts derived from the sonication-treated libraries were chimeric artifact reads containing both cis- and trans-inverted repeat sequences of the genomic DNA. In contrast, chimeric artifact reads of endonuclease-treated libraries contained palindromic sequences with mismatched bases. Based on these distinctive features, we proposed a mechanistic hypothesis model, PDSM (pairing of partial single strands derived from a similar molecule), to explain how these sequencing errors arise during ultrasonication- and enzymatic-fragmentation-based library preparation. We developed a bioinformatic algorithm to generate a custom mutation “blacklist” within the targeted (BED) regions to reduce errors in downstream analyses.
Conclusions
We first proposed a mechanistic hypothesis model (PDSM) of sequencing errors caused by specific structures of inverted repeat sequences and palindromic sequences in the natural genome. This new hypothesis predicts the existence of chimeric reads that could not be explained by previous models, and provides a new direction for further improving NGS analysis accuracy. A bioinformatic algorithm, ArtifactsFinder, was developed and used to reduce the sequencing errors in libraries produced using sonication and enzymatic fragmentation.
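Downstream, the "blacklist" filtering step reduces to set membership over variant keys. A minimal sketch, with the record layout and names as illustrative assumptions (not ArtifactsFinder itself):

```python
# Minimal sketch: drop variant calls at precomputed artifact sites.

def filter_blacklisted(variants, blacklist):
    """Return variants whose (chrom, pos, ref, alt) key is not blacklisted."""
    blocked = set(blacklist)
    return [v for v in variants
            if (v["chrom"], v["pos"], v["ref"], v["alt"]) not in blocked]

calls = [{"chrom": "chr1", "pos": 12345, "ref": "C", "alt": "T"},
         {"chrom": "chr2", "pos": 67890, "ref": "G", "alt": "A"}]
print(filter_blacklisted(calls, [("chr1", 12345, "C", "T")]))  # keeps chr2 call
```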
Journal Article
An overlooked phenomenon: complex interactions of potential error sources on the quality of bacterial de novo genome assemblies
by Kardos, Gábor; Nagy, Nikoletta Andrea; Schmitt, Nicholas
in Accuracy, Analysis, Animal Genetics and Genomics
2024
Background
Parameters adversely affecting the contiguity and accuracy of assemblies from Illumina next-generation sequencing (NGS) are well described. However, past studies generally focused on their additive effects, overlooking potential interactions through which the parameters may exacerbate one another’s effects multiplicatively. To investigate whether they act interactively on de novo genome assembly quality, we simulated sequencing data for 13 bacterial reference genomes, with varying levels of error rate, sequencing depth, and PCR and optical duplicate ratios.
Results
We assessed the quality of assemblies from the simulated sequencing data with a number of contiguity and accuracy metrics, which we used to quantify both additive and multiplicative effects of the four parameters. We found that the tested parameters are engaged in complex interactions, exerting multiplicative, rather than additive, effects on assembly quality. Also, the ratio of non-repeated regions and GC% of the original genomes can shape how the four parameters affect assembly quality.
Conclusions
We provide a framework for consideration in future studies using de novo assembly of bacterial genomes, e.g. when choosing the optimal sequencing depth, balancing its positive effect on contiguity against its negative effect on accuracy that arises from its interaction with error rate. Furthermore, the properties of the genomes to be sequenced should also be taken into account, as they might influence the effects of the error sources themselves.
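Testing for multiplicative rather than additive effects is commonly done with interaction terms in a linear model. A toy sketch with statsmodels follows; the variable names and the made-up numbers are purely illustrative, not the study's data.

```python
# Minimal sketch: quantify additive vs. multiplicative (interaction) effects
# of two error sources on an assembly-quality metric with a linear model.
import pandas as pd
import statsmodels.formula.api as smf

# Toy, made-up simulation results for illustration only.
df = pd.DataFrame({
    "error_rate": [0.001, 0.001, 0.01, 0.01] * 3,
    "depth":      [20, 60, 20, 60] * 3,
    "n50":        [95, 120, 60, 70, 94, 119, 61, 72, 96, 121, 59, 69],
})
# 'error_rate * depth' expands to both main effects plus their interaction;
# a significant interaction coefficient indicates a multiplicative effect.
model = smf.ols("n50 ~ error_rate * depth", data=df).fit()
print(model.params)
```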
Journal Article