Catalogue Search | MBRL
18 result(s) for "Cunial, Fabio"
K-mer analysis of long-read alignment pileups for structural variant genotyping
by Cunial, Fabio; Sedlazeck, Fritz J.; Metcalf, Ginger A.
in 631/114/1314, 631/208/212, 631/208/727
2025
Accurately genotyping structural variant (SV) alleles is crucial to genomics research. We present a novel method (kanpig) for genotyping SVs that leverages variant graphs and k-mer vectors to rapidly generate accurate SV genotypes. Benchmarking against the latest SV datasets shows kanpig achieves a single-sample genotyping concordance of 82.1%, significantly outperforming existing tools, which average 66.3%. We explore kanpig’s use for multi-sample projects by testing on 47 genetically diverse samples and find kanpig accurately genotypes complex loci (e.g. SVs neighboring other SVs), and produces higher genotyping concordance than other tools. Kanpig requires only 43 seconds to process a single sample’s 20x long-reads and can be run on PacBio or Oxford Nanopore long-reads.
Accurately genotyping structural variant (SV) alleles is crucial to genomics research. Here the authors present a rapid and accurate method that avoids common errors seen with other genotypers, particularly for neighboring SVs within and across samples.
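The k-mer-vector idea described above can be illustrated with a minimal sketch (all names and the toy sequences are invented for illustration; this is not kanpig's actual implementation): count the k-mers of each candidate allele and of the read pileup, then score alleles by cosine similarity of the count vectors.

```python
from collections import Counter
from math import sqrt

def kmer_vector(seq, k=4):
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse k-mer count vectors."""
    dot = sum(a[kmer] * b[kmer] for kmer in a)  # missing keys count as 0
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# A deletion allele should match a pileup that also lacks the deleted segment.
ref_allele = "ACGTACGTTTTTACGTACGT"
del_allele = "ACGTACGTACGTACGT"   # the same sequence with 4 bp deleted
pileup     = "ACGTACGTACGTACGT"   # reads supporting the deletion
sim_del = cosine_similarity(kmer_vector(del_allele), kmer_vector(pileup))
sim_ref = cosine_similarity(kmer_vector(ref_allele), kmer_vector(pileup))
```

Here the deletion allele's k-mer vector matches the pileup far better than the reference allele's, which is the signal a k-mer-based genotyper exploits.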
Journal Article
Cue: a deep-learning framework for structural variant discovery and genotyping
by Cunial, Fabio; Meleshko, Dmitry; Garimella, Kiran
in 631/114/1305, 631/114/2785, 631/1647/794
2023
Structural variants (SVs) are a major driver of genetic diversity and disease in the human genome and their discovery is imperative to advances in precision medicine. Existing SV callers rely on hand-engineered features and heuristics to model SVs, which cannot scale to the vast diversity of SVs nor fully harness the information available in sequencing datasets. Here we propose an extensible deep-learning framework, Cue, to call and genotype SVs that can learn complex SV abstractions directly from the data. At a high level, Cue converts alignments to images that encode SV-informative signals and uses a stacked hourglass convolutional neural network to predict the type, genotype and genomic locus of the SVs captured in each image. We show that Cue outperforms the state of the art in the detection of several classes of SVs on synthetic and real short-read data and that it can be easily extended to other sequencing platforms, while achieving competitive performance.
Cue achieves versatile and performant structural variant calling and genotyping using a deep-learning approach.
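The alignment-to-image encoding described above can be sketched in miniature (a toy stand-in, not Cue's actual encoding; the channels, bin count, and tuple format are assumptions): bin a genomic region and build one channel of read coverage and one of discordant-read counts, yielding a matrix a convolutional network could consume.

```python
# Toy encoding of alignments into an "image": one channel counts read
# coverage per genomic bin, one counts discordant (SV-informative) reads.
def encode_region(alignments, region_start, region_end, bins=8):
    """alignments: list of (start, end, is_discordant) tuples.
    Returns a bins x 2 matrix: [coverage, discordant] per bin."""
    width = (region_end - region_start) / bins
    image = [[0, 0] for _ in range(bins)]
    for start, end, discordant in alignments:
        for b in range(bins):
            b_start = region_start + b * width
            b_end = b_start + width
            if start < b_end and end > b_start:  # alignment overlaps bin
                image[b][0] += 1
                if discordant:
                    image[b][1] += 1
    return image

# One concordant read spanning the region, one discordant read in its
# second half: the discordant channel lights up only where the SV signal is.
img = encode_region([(0, 100, False), (50, 100, True)], 0, 100, bins=4)
```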
Journal Article
A framework for space-efficient read clustering in metagenomic samples
by Cunial, Fabio; Mäkinen, Veli; Alanko, Jarno
in Algorithms, Bioinformatics, Biomedical and Life Sciences
2017
Background
A metagenomic sample is a set of DNA fragments, randomly extracted from multiple cells in an environment, belonging to distinct, often unknown species. Unsupervised metagenomic clustering aims at partitioning a metagenomic sample into sets that approximate taxonomic units, without using reference genomes. Since samples are large and steadily growing, space-efficient clustering algorithms are strongly needed.
Results
We design and implement a space-efficient algorithmic framework that solves a number of core primitives in unsupervised metagenomic clustering using just the bidirectional Burrows-Wheeler index and a union-find data structure on the set of reads. When run on a sample of total length n, with m reads of maximum length ℓ each, on an alphabet of total size σ, our algorithms take O(n(t + log σ)) time and just 2n + o(n) + O(max{ℓσ log n, K log m}) bits of space in addition to the index and to the union-find data structure, where K is a measure of the redundancy of the sample and t is the query time of the union-find data structure.
Conclusions
Our experimental results show that our algorithms are practical, they can exploit multiple cores by a parallel traversal of the suffix-link tree, and they are competitive both in space and in time with the state of the art.
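The clustering primitive — merging reads into approximate taxonomic units with a union-find structure — can be sketched as follows (a toy illustration: naive k-mer sharing stands in for the paper's BWT-index and suffix-link-tree traversal, and all names are invented):

```python
class UnionFind:
    """Union-find with path halving, as used to merge read clusters."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[ry] = rx

def cluster_reads(reads, k=8):
    """Merge reads that share at least one k-mer (a crude stand-in for
    the shared-substring criteria the framework evaluates in the index)."""
    uf = UnionFind(len(reads))
    seen = {}  # k-mer -> first read containing it
    for i, read in enumerate(reads):
        for j in range(len(read) - k + 1):
            kmer = read[j:j + k]
            if kmer in seen:
                uf.union(seen[kmer], i)
            else:
                seen[kmer] = i
    return uf

reads = ["ACGTACGTAC", "GTACGTACGG", "TTTTTTTTTT", "TTTTTTTTAA"]
uf = cluster_reads(reads)  # yields two clusters: {0, 1} and {2, 3}
```

The space savings in the paper come from replacing the explicit k-mer dictionary used here with queries on the bidirectional BWT index.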
Journal Article
LAF: Logic Alignment Free and its application to bacterial genomes classification
by Cunial, Fabio; Weitschek, Emanuel; Felici, Giovanni
in Algorithms, Bacterial genetics, Bacteriology
2015
Alignment-free algorithms can be used to estimate the similarity of biological sequences and hence are often applied to the phylogenetic reconstruction of genomes. Most of these algorithms rely on comparing the frequency of all the distinct substrings of fixed length (k-mers) that occur in the analyzed sequences.
In this paper, we present Logic Alignment Free (LAF), a method that combines alignment-free techniques and rule-based classification algorithms in order to assign biological samples to their taxa. This method searches for a minimal subset of k-mers whose relative frequencies are used to build classification models as disjunctive-normal-form logic formulas (if-then rules).
We apply LAF successfully to the classification of bacterial genomes to their corresponding taxonomy. In particular, we succeed in obtaining reliable classification at different taxonomic levels by extracting a handful of rules, each one based on the frequency of just a few k-mers.
State-of-the-art methods to adjust the frequency of k-mers to the character distribution of the underlying genomes have negligible impact on classification performance, suggesting that the signal of each class is strong and that LAF is effective in identifying it.
Journal Article
Analysis of the subsequence composition of biosequences
2012
Measuring the amount of information and of shared information in biological strings, as well as relating information to structure, function and evolution, are fundamental computational problems in the post-genomic era. Classical analyses of the information content of biosequences are grounded in Shannon's statistical telecommunication theory, while the recent focus is on suitable specializations of the notions introduced by Kolmogorov, Chaitin and Solomonoff, based on data compression and compositional redundancy. Symmetrically, classical estimates of mutual information based on string editing are currently being supplanted by compositional methods hinged on the distribution of controlled substructures. Current compositional analyses and comparisons of biological strings are almost exclusively limited to short sequences of contiguous solid characters. Comparatively little is known about longer and sparser components, both from the point of view of their effectiveness in measuring information and in separating biological strings from random strings, and from the point of view of their ability to classify and to reconstruct phylogenies. Yet, sparse structures are suspected to grasp long-range correlations and, at short range, they are known to encode signatures and motifs that characterize molecular families. In this thesis, we introduce and study compositional measures based on the repertoire of distinct subsequences of any length, but constrained to occur with a predefined maximum gap between consecutive symbols. Such measures highlight previously unknown laws that relate subsequence abundance to string length and to the allowed gap, across a range of structurally and functionally diverse polypeptides. Measures on subsequences are capable of separating only few amino acid strings from their random permutations, but they reveal that random permutations themselves amass along previously undetected, linear loci. 
This is perhaps the first time in which the vocabulary of all distinct subsequences of a set of structurally and functionally diverse polypeptides is systematically counted and analyzed. Another objective of this thesis is measuring the quality of phylogenies based on the composition of sparse structures. Specifically, we use a set of repetitive gapped patterns, called motifs, whose length and sparsity have never been considered before. We find that extremely sparse motifs in mitochondrial proteomes support phylogenies of comparable quality to state-of-the-art string-based algorithms. Moving from maximal motifs – motifs that cannot be made more specific without losing support – to a set of generators with decreasing size and redundancy, generally degrades classification, suggesting that redundancy itself is a key factor for the efficient reconstruction of phylogenies. This is perhaps the first time in which the composition of all motifs of a proteome is systematically used in phylogeny reconstruction on a large scale. Extracting all maximal motifs, or even their compact generators, is infeasible for entire genomes. In the last part of this thesis, we study the robustness of measures of similarity built around the dictionary of LZW – the variant of the LZ78 compression algorithm proposed by Welch – and of some of its recently introduced gapped variants. These algorithms use a very small vocabulary, they perform linearly in the input strings, and they can be made even faster than LZ in practice. We find that dissimilarity measures based on maximal strings in the dictionary of LZW support phylogenies that are comparable to state-of-the-art methods on test proteomes. Introducing a controlled proportion of gaps does not degrade classification, and allows discarding up to 20% of each input proteome during comparison.
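The central object of the thesis — the repertoire of distinct subsequences with a bounded gap between consecutive chosen symbols — can be made concrete with a brute-force enumerator (exponential, so only for toy strings; the function name and gap convention are assumptions, with the gap measured as the number of skipped characters):

```python
def distinct_gapped_subsequences(s, max_gap):
    """All distinct non-empty subsequences of s in which consecutive
    chosen positions skip at most max_gap characters of s."""
    found = set()
    def extend(prefix, last_pos):
        # next position may skip at most max_gap characters
        for nxt in range(last_pos + 1, min(len(s), last_pos + max_gap + 2)):
            found.add(prefix + s[nxt])
            extend(prefix + s[nxt], nxt)
    for start in range(len(s)):
        found.add(s[start])
        extend(s[start], start)
    return found

# With max_gap = 0 only contiguous substrings qualify; allowing one
# skipped character adds sparser subsequences such as "ac" in "abc".
contig = distinct_gapped_subsequences("abc", 0)
sparse = distinct_gapped_subsequences("abc", 1)
```

The abundance laws studied in the thesis concern how the size of this set grows with string length and with the allowed gap.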
Dissertation
Blended Length Genome Sequencing (blend-seq): Combining Short Reads with Low-Coverage Long Reads to Maximize Variant Discovery
2025
We introduce blend-seq, a workflow for combining data from traditional short-read sequencing pipelines with low-coverage long reads, to improve variant discovery for single samples without the full cost of high-coverage long reads. We demonstrate that with only 4x long-read coverage augmenting 30x short reads, we can improve SNP discovery across the genome, exceeding the performance of even high-coverage short reads (60x). For genotype-agnostic discovery of structural variants, we see a threefold improvement in recall while maintaining precision by using the low-coverage long reads on their own, and show how we can improve genotyping accuracy by adding in the short-read data. In addition, we demonstrate how the long reads can better phase these variants, incorporating long-context information in the genome to substantially outperform phasing with short reads alone. Our experiments highlight the complementary nature of short- and long-read technologies: the former contributing higher depth for genotyping and the latter better resolution of larger events or those in difficult regions.
Journal Article
K-mer analysis of long-read alignment pileups for structural variant genotyping
2024
Accurately genotyping structural variant (SV) alleles is crucial to genomics research. We present a novel method (kanpig) for genotyping SVs that leverages variant graphs and k-mer vectors to rapidly generate accurate SV genotypes. We benchmark kanpig against the latest SV benchmarks and show single-sample genotyping concordance of 82.1%, which is higher than existing genotypers averaging 66.3%. We explore kanpig's applicability to multi-sample projects by benchmarking project-level VCFs containing 47 genetically diverse samples and find kanpig accurately genotypes complex loci (e.g. SVs neighboring other SVs), achieving much higher genotyping concordance than other tools. Kanpig requires only 43 seconds to process a single sample's 20x long-reads and can be run on PacBio or ONT long-reads.
Journal Article
Fully-functional bidirectional Burrows-Wheeler indexes
2019
Given a string T on an alphabet of size σ, we describe a bidirectional Burrows-Wheeler index that takes O(|T| log σ) bits of space, and that supports the addition and removal of one character, on the left or right side of any substring of T, in constant time. Previously known data structures that used the same space allowed constant-time addition to any substring of T, but they could support removal only from specific substrings of T. We also describe an index that supports bidirectional addition and removal in O(log log |T|) time, and that occupies a number of words proportional to the number of left and right extensions of the maximal repeats of T. We use such fully-functional indexes to implement bidirectional, frequency-aware, variable-order de Bruijn graphs in small space, with no upper bound on their order, and supporting natural criteria for increasing and decreasing the order during traversal.
Fast and compact matching statistics analytics
2021
Fast, lightweight methods for comparing the sequence of ever larger assembled genomes from ever growing databases are increasingly needed in the era of accurate long reads and pan-genome initiatives. Matching statistics is a popular method for computing whole-genome phylogenies and for detecting structural rearrangements between two genomes, since it is amenable to fast implementations that require a minimal setup of data structures. However, current implementations use a single core, take too much memory to represent the result, and do not provide efficient ways to analyze the output in order to explore local similarities between the sequences. We develop practical tools for computing matching statistics between large-scale strings, and for analyzing its values, faster and using less memory than the state of the art. Specifically, we design a parallel algorithm for shared-memory machines that computes matching statistics 30 times faster with 48 cores in the cases that are most difficult to parallelize. We design a lossy compression scheme that shrinks the matching statistics array to a bitvector that takes from 0.8 to 0.2 bits per character, depending on the dataset and on the value of a threshold, and that achieves 0.04 bits per character in some variants. And we provide efficient implementations of range-maximum and range-sum queries that take a few tens of milliseconds while operating on our compact representations, and that allow computing key local statistics about the similarity between two strings. Our toolkit makes construction, storage, and analysis of matching statistics arrays practical for multiple pairs of the largest genomes available today, possibly enabling new applications in comparative genomics.
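The object being computed here can be pinned down with a simple quadratic baseline (the paper's tools compute this in compressed space and in parallel; this sketch only fixes the definition, and the function name is an assumption): MS[i] is the length of the longest prefix of S[i:] that occurs somewhere in the reference T.

```python
def matching_statistics(s, t):
    """MS[i] = length of the longest prefix of s[i:] occurring in t.
    Naive O(|s|^2 * |t|) baseline, for illustration only."""
    ms = []
    for i in range(len(s)):
        length = 0
        # grow the match while s[i:i+length+1] still occurs in t
        while i + length < len(s) and s[i:i + length + 1] in t:
            length += 1
        ms.append(length)
    return ms

ms = matching_statistics("banana", "bans")
```

The lossy compression scheme described above stores, instead of these integers, a bitvector marking where MS[i] crosses a threshold, which is why it can reach well under one bit per character.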
Fast Label Extraction in the CDAWG
2017
The compact directed acyclic word graph (CDAWG) of a string T of length n takes space proportional just to the number e of right extensions of the maximal repeats of T, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which e grows significantly more slowly than n. We reduce from O(m log log n) to O(m) the time needed to count the number of occurrences of a pattern of length m, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from O(m log log n + occ) to O(m + occ) in the time needed to locate all the occ occurrences of the pattern. We also reduce from O(k log log n) to O(k) the time needed to read the k characters of the label of an edge of the suffix tree of T, and we reduce from O(m log log n) to O(m) the time needed to compute the matching statistics between a query of length m and T, using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG.