Catalogue Search | MBRL

Drastic reduction of false positive species in samples of insects by intersecting the default output of two popular metagenomic classifiers

by Garrido-Sanz, Lidia, Piñol, Josep, Senar, Miquel Àngel in Analysis, Biodiversity, Biological diversity

2022

The use of high-throughput sequencing to recover short DNA reads of many species has been widely applied in biodiversity studies, either as amplicon metabarcoding or shotgun metagenomics. These reads are assigned to taxa using classifiers. However, for various reasons, the results often contain many false positives. Here we focus on reducing the false positive species attributable to the classifiers. We benchmarked two popular classifiers, BLASTn followed by MEGAN6 (BM) and Kraken2 (K2), on shotgun-sequenced artificial single-species samples of insects. To reduce the number of misclassified reads, we combined the output of the two classifiers in two different ways: (1) by keeping only the reads that were attributed to the same species by both classifiers (intersection approach); and (2) by keeping the reads assigned to some species by either classifier (union approach). In addition, we applied an analytical detection limit to further reduce the number of false positive species. As expected, both metagenomic classifiers used with default parameters generated an unacceptably high number of misidentified species (tens with BM, hundreds with K2). The false positive species were not necessarily phylogenetically close to the true species, as some of them belonged to different orders of insects. The union approach failed to reduce the number of false positives, but the intersection approach eliminated most of them. The addition of an analytical detection limit of 0.001 further reduced the number to ca. 0.5 false positive species per sample. The misidentification of species by most classifiers undermines confidence in DNA-based methods for assessing the biodiversity of biological samples. Our approach to alleviating the problem is straightforward and significantly reduced the number of reported false positive species.
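The intersection and union approaches described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the read-to-species dictionaries, and the toy data are hypothetical, and the detection limit is assumed here to be a minimum fraction of the retained reads, which the abstract does not specify.

```python
from collections import Counter

def intersect_assignments(bm, k2):
    """Keep only reads that both classifiers assign to the same species."""
    return {read: sp for read, sp in bm.items() if k2.get(read) == sp}

def union_assignments(bm, k2):
    """Keep every (read, species) pair reported by either classifier."""
    return set(bm.items()) | set(k2.items())

def species_above_limit(species_per_read, limit=0.001):
    """Drop species whose share of the retained reads is below `limit`.

    Assumption: the detection limit is a fraction of retained reads.
    """
    counts = Counter(species_per_read)
    total = sum(counts.values())
    return {sp: n for sp, n in counts.items() if total and n / total >= limit}

# Hypothetical per-read species calls for a single-species sample.
bm = {"r1": "Apis mellifera", "r2": "Apis mellifera",
      "r3": "Bombus terrestris", "r4": "Apis mellifera"}
k2 = {"r1": "Apis mellifera", "r2": "Apis mellifera",
      "r3": "Drosophila melanogaster", "r5": "Apis cerana"}

agreed = intersect_assignments(bm, k2)
either = union_assignments(bm, k2)

print(sorted(set(agreed.values())))         # species kept by the intersection
print(sorted({sp for _, sp in either}))     # species kept by the union
print(species_above_limit(agreed.values())) # intersection plus detection limit
```

In this toy case the intersection retains only the species both tools agree on, while the union accumulates the false positives of each tool, which matches the qualitative behaviour reported in the abstract.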

Journal Article

A performance comparison of data and memory allocation strategies for sequence aligners on NUMA architectures

by Senar, Miquel Angel, Lenis, Josefina in Alignment, Computer Communication Networks, Computer Science

2017

Over the last several years, many sequence alignment tools have appeared and become popular owing to the rapid evolution of next-generation sequencing technologies. Researchers who use such tools are naturally interested in getting maximum performance when they run them on modern infrastructures. Today’s NUMA (non-uniform memory access) architectures present major challenges in getting such applications to scale well as more processors/cores are used. The memory system in NUMA systems is highly complex and may be the main cause of an application’s loss of performance. The existence of several memory banks in NUMA systems implies an increase in latency whenever a given processor accesses a remote bank. This effect is usually attenuated by strategies that increase the locality of memory accesses. However, NUMA systems may also suffer from contention problems that occur when concurrent accesses are concentrated on a small number of banks. Sequence alignment tools use large data structures to hold the reference genomes to which all reads are aligned. Therefore, these tools are very sensitive to performance problems related to the memory system. The main goal of this study is to explore the trade-offs between data locality and data dispersion in NUMA systems. We have performed experiments with several popular sequence alignment tools on two widely available NUMA systems to assess the performance of different memory allocation policies and data partitioning strategies. We find that no single method is best in all cases. However, we conclude that memory interleaving is the memory allocation strategy that provides the best performance when a large number of processors and memory banks are used. In the case of data partitioning, the best results are usually obtained when a larger number of partitions is used, sometimes combined with an interleave policy.
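As a rough illustration of the memory allocation policies this abstract compares, the sketch below builds Linux `numactl` command lines for an interleaved, a local, and a node-bound run. The abstract does not say which tooling the study used, so the use of `numactl` is an assumption, and the `numactl_command` helper and the `aligner` command are hypothetical placeholders. The sketch only composes the commands and does not execute them.

```python
import shlex

def numactl_command(aligner_cmd, policy, nodes="all"):
    """Prefix an aligner command with a numactl memory policy (assumed tooling)."""
    if policy == "interleave":
        # Spread memory pages round-robin across the given NUMA nodes.
        prefix = ["numactl", f"--interleave={nodes}"]
    elif policy == "local":
        # Allocate memory on the node where the allocating thread runs.
        prefix = ["numactl", "--localalloc"]
    elif policy == "bind":
        # Pin both CPUs and memory to the given nodes (e.g. one partition per node).
        prefix = ["numactl", f"--cpunodebind={nodes}", f"--membind={nodes}"]
    else:
        raise ValueError(f"unknown policy: {policy}")
    return prefix + list(aligner_cmd)

# Hypothetical aligner invocation; replace with a real tool and its arguments.
aligner = ["aligner", "ref.fa", "reads.fq"]

for policy, nodes in [("interleave", "all"), ("local", "all"), ("bind", "0")]:
    print(shlex.join(numactl_command(aligner, policy, nodes)))
```

The bound variant corresponds loosely to the data partitioning idea, where each partition of the work would be pinned to its own node.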

Journal Article

Resolving the full set of human polymorphic inversions and other complex variants from ultra-long read data

by Senar, Miquel Àngel, Yakymenko, Illya, Martínez-Urtaza, Jaime in Genomics

2025

Inversions are a unique type of balanced structural variant (SV) with important consequences in multiple organisms. However, despite considerable effort, inversions and other complex SVs remain poorly characterized due to the presence of large repeats. New techniques are finally allowing us to identify the full spectrum of human inversions, but the number of individuals analyzed is still quite limited. Here, we take advantage of Oxford Nanopore Technologies (ONT) long reads to characterize an exhaustive catalogue of 612 candidate inversions between 197 bp and 4.4 Mb in length and flanked by <190-kb-long inverted repeats (IRs). To do so, we developed a bioinformatic package to identify inversion alleles reliably from long-read data. Next, using a combination of different DNA extraction, library preparation, and ONT sequencing protocols, we showed that ultra-long reads (50-100 kb) and adaptive sampling are an efficient way to detect most human inversions. Lastly, by analyzing ONT data from 54 diverse individuals, 87-99% of the inversions could be genotyped in each sample, depending mainly on read and IR length and genome coverage. Both orientations were observed for 155 of the analyzed regions (frequency 0.01-0.49), which triples the number of polymorphic IR-mediated inversions studied in detail so far. Moreover, we found more than 300 additional independent SVs in the studied regions and resolved several complex rearrangements. Our work therefore provides an accurate benchmark of those inversions that typically escape most analyses, improving existing resources, such as the Pangenome. In addition, it demonstrates the potential of nanopore sequencing to determine the functional impact of missing human genomic variation.

Paper
