Catalogue Search | MBRL

LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads

by Behsaz, Bahar , Yang, Chen , Jones, Steven J. M. in Algorithms , Alignment , Assemblies

2015

Abstract Background Owing to the complexity of the assembly problem, we do not yet have complete genome sequences. The difficulty in assembling reads into finished genomes is exacerbated by sequence repeats and the inability of short reads to capture sufficient genomic information to resolve those problematic regions. In this regard, established and emerging long read technologies show great promise, but their current associated higher error rates typically require computational base correction and/or additional bioinformatics pre-processing before they can be of value. Results We present LINKS, the Long Interval Nucleotide K-mer Scaffolder algorithm, a method that makes use of the sequence properties of nanopore sequence data and other error-containing sequence data, to scaffold high-quality genome assemblies, without the need for read alignment or base correction. Here, we show how the contiguity of an ABySS Escherichia coli K-12 genome assembly can be increased greater than five-fold by the use of beta-released Oxford Nanopore Technologies Ltd. long reads and how LINKS leverages long-range information in Saccharomyces cerevisiae W303 nanopore reads to yield assemblies whose resulting contiguity and correctness are on par with or better than that of competing applications. We also present the re-scaffolding of the colossal white spruce (Picea glauca) draft assembly (PG29, 20 Gbp) and demonstrate how LINKS scales to larger genomes. Conclusions This study highlights the present utility of nanopore reads for genome scaffolding in spite of their current limitations, which are expected to diminish as the nanopore sequencing technology advances. We expect LINKS to have broad utility in harnessing the potential of long reads in connecting high-quality sequences of small and large genome assembly drafts.

Journal Article

Share this book

Add to My Shelf

GoldPolish-target: targeted long-read genome assembly polishing

by Zhang, Emily , Wong, Johnathan , Coombe, Lauren in Algorithms , Animals , Assemblies

2025

Background Advanced long-read sequencing technologies, such as those from Oxford Nanopore Technologies and Pacific Biosciences, are finding a wide use in de novo genome sequencing projects. However, long reads typically have higher error rates relative to short reads. If left unaddressed, subsequent genome assemblies may exhibit high base error rates that compromise the reliability of downstream analysis. Several specialized error correction tools for genome assemblies have since emerged, employing a range of algorithms and strategies to improve base quality. However, despite these efforts, many genome assembly workflows still produce regions with elevated error rates, such as gaps filled with unpolished or ambiguous bases. To address this, we introduce GoldPolish-Target, a modular targeted sequence polishing pipeline. Coupled with GoldPolish, a linear-time genome assembly algorithm, GoldPolish-Target isolates and polishes user-specified assembly loci, offering a resource-efficient means for polishing targeted regions of draft genomes. Results Experiments using Drosophila melanogaster and Homo sapiens datasets demonstrate that GoldPolish-Target can reduce insertion/deletion (indel) and mismatch errors by up to 49.2% and 55.4% respectively, achieving base accuracy values upwards of 99.9% (Phred score Q > 30). This polishing accuracy is comparable to the current state-of-the-art, Medaka, while exhibiting up to 27-fold shorter run times and consuming 95% less memory, on average. Conclusion GoldPolish-Target, in contrast to most other polishing tools, offers the ability to target specific regions of a genome assembly for polishing, providing a computationally light-weight and highly scalable solution for base error correction.

Journal Article

Share this book

Add to My Shelf

Linear time complexity de novo long read genome assembly with GoldRush

by Zhang, Emily , Wong, Johnathan , Coombe, Lauren in 49/23 , 631/114/2785/2302 , 631/114/794

2023

Current state-of-the-art de novo long read genome assemblers follow the Overlap-Layout-Consensus paradigm. While read-to-read overlap – its most costly step – was improved in modern long read genome assemblers, these tools still often require excessive RAM when assembling a typical human dataset. Our work departs from this paradigm, foregoing all-vs-all sequence alignments in favor of a dynamic data structure implemented in GoldRush, a de novo long read genome assembly algorithm with linear time complexity. We tested GoldRush on Oxford Nanopore Technologies long sequencing read datasets with different base error profiles sourced from three human cell lines, rice, and tomato. Here, we show that GoldRush achieves assembly scaffold NGA50 lengths of 18.3-22.2, 0.3 and 2.6 Mbp, for the genomes of human, rice, and tomato, respectively, and assembles each genome within a day, using at most 54.5 GB of random-access memory, demonstrating the scalability of our genome assembly paradigm and its implementation. Current state-of-the-art de novo long read genome assemblers follow the Overlap-Layout-Consensus paradigm. GoldRush departs from this paradigm, generating highly contiguous assemblies with linear time complexity and using an order of magnitude less RAM than state-of-the-art methods.

Journal Article

Share this book

Add to My Shelf

Sealer: a scalable gap-closing application for finishing draft genomes

by Paulino, Daniel , Warren, René L. , Vandervalk, Benjamin P. in Algorithms , Bioinformatics , Biomedical and Life Sciences

2015

Background While next-generation sequencing technologies have made sequencing genomes faster and more affordable, deciphering the complete genome sequence of an organism remains a significant bioinformatics challenge, especially for large genomes. Low sequence coverage, repetitive elements and short read length make de novo genome assembly difficult, often resulting in sequence and/or fragment “gaps” – uncharacterized nucleotide (N) stretches of unknown or estimated lengths. Some of these gaps can be closed by re-processing latent information in the raw reads. Even though there are several tools for closing gaps, they do not easily scale up to processing billion base pair genomes. Results Here we describe Sealer, a tool designed to close gaps within assembly scaffolds by navigating de Bruijn graphs represented by space-efficient Bloom filter data structures. We demonstrate how it scales to successfully close 50.8 % and 13.8 % of gaps in human (3 Gbp) and white spruce (20 Gbp) draft assemblies in under 30 and 27 h, respectively – a feat that is not possible with other leading tools with the breadth of data used in our study. Conclusion Sealer is an automated finishing application that uses the succinct Bloom filter representation of a de Bruijn graph to close gaps in draft assemblies, including that of very large genomes. We expect Sealer to have broad utility for finishing genomes across the tree of life, from bacterial genomes to large plant genomes and beyond. Sealer is available for download at https://github.com/bcgsc/abyss/tree/sealer-release .

Journal Article

Share this book

Add to My Shelf

RResolver: efficient short-read repeat resolution within ABySS

by Wong, Johnathan , Coombe, Lauren , Birol, Inanç in Algorithms , Assembly , Bioinformatics

2022

Background De novo genome assembly is essential to modern genomics studies. As it is not biased by a reference, it is also a useful method for studying genomes with high variation, such as cancer genomes. De novo short-read assemblers commonly use de Bruijn graphs, where nodes are sequences of equal length k , also known as k-mers. Edges in this graph are established between nodes that overlap by k - 1 bases, and nodes along unambiguous walks in the graph are subsequently merged. The selection of k is influenced by multiple factors, and optimizing this value results in a trade-off between graph connectivity and sequence contiguity. Ideally, multiple k sizes should be used, so lower values can provide good connectivity in lesser covered regions and higher values can increase contiguity in well-covered regions. However, current approaches that use multiple k values do not address the scalability issues inherent to the assembly of large genomes. Results Here we present RResolver, a scalable algorithm that takes a short-read de Bruijn graph assembly with a starting k as input and uses a k value closer to that of the read length to resolve repeats. RResolver builds a Bloom filter of sequencing reads which is used to evaluate the assembly graph path support at branching points and removes paths with insufficient support. RResolver runs efficiently, taking only 26 min on average for an ABySS human assembly with 48 threads and 60 GiB memory. Across all experiments, compared to a baseline assembly, RResolver improves scaffold contiguity (NGA50) by up to 15% and reduces misassemblies by up to 12%. Conclusions RResolver adds a missing component to scalable de Bruijn graph genome assembly. By improving the initial and fundamental graph traversal outcome, all downstream ABySS algorithms greatly benefit by working with a more accurate and less complex representation of the genome. The RResolver code is integrated into ABySS and is available at https://github.com/bcgsc/abyss/tree/master/RResolver .

Journal Article

Share this book

Add to My Shelf

Antimicrobial peptides from Rana Lithobates catesbeiana: Gene structure and bioinformatic identification of novel forms from tadpoles

by Cameron, Caroline E. , Helbing, Caren C. , Houston, Simon in 38/91 , 631/208/199 , 631/250/2499

2019

Antimicrobial peptides (AMPs) exhibit broad-spectrum antimicrobial activity, and have promise as new therapeutic agents. While the adult North American bullfrog ( Rana [ Lithobates ] catesbeiana ) is a prolific source of high-potency AMPs, the aquatic tadpole represents a relatively untapped source for new AMP discovery. The recent publication of the bullfrog genome and transcriptomic resources provides an opportune bridge between known AMPs and bioinformatics-based AMP discovery. The objective of the present study was to identify novel AMPs with therapeutic potential using a combined bioinformatics and wet lab-based approach. In the present study, we identified seven novel AMP precursor-encoding transcripts expressed in the tadpole. Comparison of their amino acid sequences with known AMPs revealed evidence of mature peptide sequence conservation with variation in the prepro sequence. Two mature peptide sequences were unique and demonstrated bacteriostatic and bactericidal activity against Mycobacteria but not Gram-negative or Gram-positive bacteria. Nine known and seven novel AMP-encoding transcripts were detected in premetamorphic tadpole back skin, olfactory epithelium, liver, and/or tail fin. Treatment of tadpoles with 10 nM 3,5,3′-triiodothyronine for 48 h did not affect transcript abundance in the back skin, and had limited impact on these transcripts in the other three tissues. Gene mapping revealed considerable diversity in size (1.6–15 kbp) and exon number (one to four) of AMP-encoding genes with clear evidence of alternative splicing leading to both prepro and mature amino acid sequence diversity. These findings verify the accuracy and utility of the bullfrog genome assembly, and set a firm foundation for bioinformatics-based AMP discovery.

Journal Article

Share this book

Add to My Shelf

Assembly and annotation of the black spruce genome provide insights on spruce phylogeny and evolution of stress response

by Pandoh, Pawan , Kirk, Heather , Coombe, Lauren in Genomes , Phylogenetics

2024

Black spruce (Picea mariana [Mill.] B.S.P.) is a dominant conifer species in the North American boreal forest that plays important ecological and economic roles. Here, we present the first genome assembly of P. mariana with a reconstructed genome size of 18.3 Gbp and NG50 scaffold length of 36.0 kbp. A total of 66,332 protein-coding sequences were predicted in silico and annotated based on sequence homology. We analyzed the evolutionary relationships between P. mariana and 5 other spruces for which complete nuclear and organelle genome sequences were available. The phylogenetic tree estimated from mitochondrial genome sequences agrees with biogeography; specifically, P. mariana was strongly supported as a sister lineage to P. glauca and 3 other taxa found in western North America, followed by the European Picea abies. We obtained mixed topologies with weaker statistical support in phylogenetic trees estimated from nuclear and chloroplast genome sequences, indicative of ancient reticulate evolution affecting these 2 genomes. Clustering of protein-coding sequences from the 6 Picea taxa and 2 Pinus species resulted in 34,776 orthogroups, 560 of which appeared to be specific to P. mariana. Analysis of these specific orthogroups and dN/dS analysis of positive selection signatures for 497 single-copy orthogroups identified gene functions mostly related to plant development and stress response. The P. mariana genome assembly and annotation provides a valuable resource for forest genetics research and applications in this broadly distributed species, especially in relation to climate adaptation.

Journal Article

Share this book

Add to My Shelf

Genome and transcriptome analyses of the mountain pine beetle-fungal symbiont Grosmannia clavigera, a lodgepole pine pathogen

by Chan, Simon K , Tanguay, Philippe , Holt, Robert A in Animals , Antimicrobial agents , Bark beetles

2011

In western North America, the current outbreak of the mountain pine beetle (MPB) and its microbial associates has destroyed wide areas of lodgepole pine forest, including more than 16 million hectares in British Columbia. Grosmannia clavigera (Gc), a critical component of the outbreak, is a symbiont of the MPB and a pathogen of pine trees. To better understand the interactions between Gc, MPB, and lodgepole pine hosts, we sequenced the ∼30-Mb Gc genome and assembled it into 18 supercontigs. We predict 8,314 protein-coding genes, and support the gene models with proteome, expressed sequence tag, and RNA-seq data. We establish that Gc is heterothallic, and report evidence for repeat-induced point mutation. We report insights, from genome and transcriptome analyses, into how Gc tolerates conifer-defense chemicals, including oleoresin terpenoids, as they colonize a host tree. RNA-seq data indicate that terpenoids induce a substantial antimicrobial stress in Gc, and suggest that the fungus may detoxify these chemicals by using them as a carbon source. Terpenoid treatment strongly activated a ∼100-kb region of the Gc genome that contains a set of genes that may be important for detoxification of these host-defense chemicals. This work is a major step toward understanding the biological interactions between the tripartite MPB/fungus/forest system.

Journal Article

Share this book

Add to My Shelf

The genome of the forest insect pest Pissodes strobi reveals genome expansion and evidence of a Wolbachia endosymbiont

by Coombe, Lauren , Gagalova, Kristina K , Jones, Steven J M in Eggs , Forestry , Genetic testing

2022

The highly diverse insect family of true weevils, Curculionidae, includes many agricultural and forest pests. Pissodes strobi, commonly known as the spruce weevil or white pine weevil, is a major pest of spruce and pine forests in North America. Pissodes strobi larvae feed on the apical shoots of young trees, causing stunted growth and can destroy regenerating spruce or pine forests. Here, we describe the nuclear and mitochondrial Pissodes strobi genomes and their annotations, as well as the genome of an apparent Wolbachia endosymbiont. We report a substantial expansion of the weevil nuclear genome, relative to other Curculionidae species, possibly driven by an abundance of class II DNA transposons. The endosymbiont observed belongs to a group (supergroup A) of Wolbachia species that generally form parasitic relationships with their arthropod host.

Journal Article

Share this book

Add to My Shelf

aaHash: recursive amino acid sequence hashing

by Kazemi, Parham , Wong, Johnathan , Coombe, Lauren

2023

Abstract Motivation K-mer hashing is a common operation in many foundational bioinformatics problems. However, generic string hashing algorithms are not optimized for this application. Strings in bioinformatics use specific alphabets, a trait leveraged for nucleic acid sequences in earlier work. We note that amino acid sequences, with complexities and context that cannot be captured by generic hashing algorithms, can also benefit from a domain-specific hashing algorithm. Such a hashing algorithm can accelerate and improve the sensitivity of bioinformatics applications developed for protein sequences. Results Here, we present aaHash, a recursive hashing algorithm tailored for amino acid sequences. This algorithm utilizes multiple hash levels to represent biochemical similarities between amino acids. aaHash performs ∼10× faster than generic string hashing algorithms in hashing adjacent k-mers. Availability and implementation aaHash is available online at https://github.com/bcgsc/btllib and is free for academic use.

Journal Article

Share this book

Add to My Shelf

Language Selector

MBRLGlobalSearch

Language Selector

Catalogue Search | MBRL

Search Results Heading

Explore the vast range of titles available.

MBRLSearchResults

MBRLHappinessMeter