Catalogue Search | MBRL

The khmer software package: enabling efficient nucleotide sequence analysis [version 1; peer review: 2 approved, 1 approved with reservations]

by Charbonneau, Amanda , Guermond, Sarah , Hyer, Alex in Bioinformatics , Software Tool

2015

The khmer package is a freely available software library for working efficiently with fixed-length DNA words, or k-mers. khmer provides implementations of a probabilistic k-mer counting data structure, a compressible De Bruijn graph representation, De Bruijn graph partitioning, and digital normalization. khmer is implemented in C++ and Python, and is available under the BSD license at https://github.com/dib-lab/khmer/.

Journal Article
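
To make the idea of a probabilistic k-mer counting data structure concrete, here is a minimal sketch using a Count-Min Sketch, one common structure of this kind. It is an assumption chosen for illustration, not khmer's code or API; the names `kmers`, `CountMinSketch`, `width` and `depth` are hypothetical.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping fixed-length word (k-mer) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class CountMinSketch:
    """Approximate counter: fixed memory, may overcount, never undercounts."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        # One row of counters per hash function.
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # Derive one column per row by salting the hash with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for row, col in self._slots(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.tables[row][col] for row, col in self._slots(kmer))


sketch = CountMinSketch()
for word in kmers("ACGTACGT", 4):
    sketch.add(word)
# "ACGT" occurs twice, so the estimate is never less than 2.
print(sketch.count("ACGT"))
```

Memory use is fixed by `width` and `depth` rather than by the number of distinct k-mers, which is the trade-off such structures make in exchange for occasional overcounts.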

Measuring Genome Sizes Using Read-Depth, k-mers, and Flow Cytometry: Methodological Comparisons in Beetles (Coleoptera)

by Holmes, Valerie Renee , Burrus, Crystal , J Spencer Johnston in Flow cytometry , Genomes , Whole genome sequencing

2020

Measuring genome size across different species can yield important insights into evolution of the genome and allow for more informed decisions when designing next-generation genomic sequencing projects. New techniques for estimating genome size using shallow genomic sequence data have emerged which have the potential to augment our knowledge of genome sizes, yet these methods have only been used in a limited number of empirical studies. In this project, we compare estimation methods using next-generation sequencing (k-mer methods and average read depth of single-copy genes) to measurements from flow cytometry, a standard method for genome size measures, using ground beetles (Carabidae) and other members of the beetle suborder Adephaga as our test system. We also present a new protocol for using read-depth of single-copy genes to estimate genome size. Additionally, we report flow cytometry measurements for five previously unmeasured carabid species, as well as 21 new draft genomes and six new draft transcriptomes across eight species of adephagan beetles. No single sequence-based method performed well on all species, and all tended to underestimate the genome sizes, although only slightly in most samples. For one species, Bembidion sp. nr. transversale, most sequence-based methods yielded estimates half the size suggested by flow cytometry.

Journal Article
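
The abstract does not spell out the k-mer methods it compares. As a hedged illustration of the general idea, a widely used simplification estimates genome size as the total number of k-mers in the reads divided by the depth at the main peak of the k-mer histogram. The function name `estimate_genome_size` and the `min_depth` cutoff are hypothetical, and real tools also model errors, heterozygosity and repeats.

```python
from collections import Counter


def estimate_genome_size(kmer_counts, min_depth=2):
    """Estimate genome size from a mapping of k-mer -> occurrences in reads.

    Assumes a single dominant peak in the k-mer depth histogram and treats
    k-mers seen fewer than min_depth times as sequencing errors.
    """
    # Histogram: depth -> number of distinct k-mers observed at that depth.
    histogram = Counter(kmer_counts.values())
    usable = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The most common depth approximates the sequencing depth of unique sequence.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences divided by the peak depth approximates genome length.
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth
```

Repeats collapsed in the histogram and uneven coverage both bias this kind of estimate, which is consistent with the abstract's observation that sequence-based methods tended to underestimate genome size.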

VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data

by Ren, Jie , Lu, Yang Young , Ahlgren, Nathan A. in Bioinformatics , Biomedical and Life Sciences , Biomedicine

2017

Background: Identifying viral sequences in mixed metagenomes containing both viral and host contigs is a critical first step in analyzing the viral component of samples. Current tools for distinguishing prokaryotic virus and host contigs primarily use gene-based similarity approaches. Such approaches can significantly limit results, especially for short contigs that have few predicted proteins or lack proteins with similarity to previously known viruses. Methods: We have developed VirFinder, the first k-mer frequency based, machine learning method for virus contig identification that entirely avoids gene-based similarity searches. VirFinder instead identifies viral sequences based on our empirical observation that viruses and hosts have discernibly different k-mer signatures. VirFinder’s performance in correctly identifying viral sequences was tested by training its machine learning model on sequences from host and viral genomes sequenced before 1 January 2014 and evaluating on sequences obtained after 1 January 2014. Results: VirFinder had significantly better rates of identifying true viral contigs (true positive rates (TPRs)) than VirSorter, the current state-of-the-art gene-based virus classification tool, when evaluated with either contigs subsampled from complete genomes or assembled from a simulated human gut metagenome. For example, for contigs subsampled from complete genomes, VirFinder had 78-, 2.4-, and 1.8-fold higher TPRs than VirSorter for 1, 3, and 5 kb contigs, respectively, at the same false positive rates as VirSorter (0, 0.003, and 0.006, respectively); thus VirFinder works considerably better for small contigs than VirSorter. VirFinder furthermore identified several recently sequenced virus genomes (after 1 January 2014) that VirSorter did not and that have no nucleotide similarity to previously sequenced viruses, demonstrating VirFinder’s potential advantage in identifying novel viral sequences. Application of VirFinder to a set of human gut metagenomes from healthy and liver cirrhosis patients reveals higher viral diversity in healthy individuals than in cirrhosis patients. We also identified contig bins containing crAssphage-like contigs with higher abundance in healthy patients and a putative Veillonella genus prophage associated with cirrhosis patients. Conclusions: This innovative k-mer based tool complements gene-based approaches and will significantly improve prokaryotic viral sequence identification, especially for metagenomic-based studies of viral ecology.

Journal Article

Multiple comparative metagenomics using multiset k-mer counting

by Schbath, Sophie , Drezen, Erwan , Benoit, Gaëtan in Biodiversity , Bioinformatics , Biology

2016

Large-scale metagenomic projects aim to extract biodiversity knowledge between different environmental conditions. Current methods for comparing microbial communities face important limitations. Those based on taxonomical or functional assignation rely on a small subset of the sequences that can be associated with known organisms. On the other hand, de novo methods, which compare the whole sets of sequences, either do not scale up to ambitious metagenomic projects or do not provide precise and exhaustive results. These limitations motivated the development of a new de novo metagenomic comparative method, called Simka. This method computes a large collection of standard ecological distances by replacing species counts by k-mer counts. Simka scales up to today's metagenomic projects thanks to a new parallel k-mer counting strategy on multiple datasets. Experiments on public Human Microbiome Project datasets demonstrate that Simka captures the essential underlying biological structure. Simka was able to compute in a few hours both qualitative and quantitative ecological distances on hundreds of metagenomic samples (690 samples, 32 billion reads). We also demonstrate that analyzing metagenomes at the k-mer level is highly correlated with extremely precise de novo comparison techniques which rely on all-versus-all sequence alignment strategies or which are based on taxonomic profiling.

Journal Article
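
The central idea in this abstract, computing an ecological distance with k-mer counts in place of species counts, can be sketched in a few lines. The example below uses Bray-Curtis dissimilarity as one standard ecological distance. That choice, and the names `kmer_counts` and `bray_curtis`, are assumptions for illustration; this is not Simka's implementation or its parallel counting strategy.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every overlapping k-mer across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two k-mer count tables.

    0.0 means identical abundance profiles, 1.0 means no shared k-mers.
    """
    # Abundance shared by both samples, summed over the k-mers they have in common.
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    return 1.0 - 2.0 * shared / total


sample_a = kmer_counts(["ACGT"], 2)  # AC, CG, GT
sample_b = kmer_counts(["ACGA"], 2)  # AC, CG, GA
print(bray_curtis(sample_a, sample_b))
```

Because the distance is computed directly from k-mer abundances, no taxonomic assignment or reference database is needed, which matches the de novo approach the abstract describes.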

GWAS reveals a rapidly evolving candidate avirulence effector in the Cercospora leaf spot pathogen

by McDonald, Bruce A. , Chen, Chen , Neu, Enzo in Agricultural production , avirulence effector , Cercospora

2024

The major resistance gene BvCR4 recently bred into sugar beet hybrids provides a high level of resistance to Cercospora leaf spot caused by the fungal pathogen Cercospora beticola. The occurrence of pathogen strains that overcome BvCR4 was studied using field trials in Switzerland conducted under natural disease pressure. Virulence of a subset of these strains was evaluated in a field trial conducted under elevated artificial disease pressure. We created a new C. beticola reference genome and mapped whole genome sequences of 256 isolates collected in Switzerland and Germany. These were combined with virulence phenotypes to conduct three separate genome‐wide association studies (GWAS) to identify candidate avirulence genes. We identified a locus associated with avirulence containing a putative avirulence effector gene named AvrCR4. All virulent isolates either lacked AvrCR4 or had nonsynonymous mutations within the gene. AvrCR4 was present in all 74 isolates from non‐BvCR4 hybrids, whereas 33 of 89 isolates from BvCR4 hybrids carried a deletion. We also mapped genomic data from 190 publicly available US isolates to our new reference genome. The AvrCR4 deletion was found in only one of 95 unique isolates from non‐BvCR4 hybrids in the United States. AvrCR4 presents a unique example of an avirulence effector in which virulent alleles have only recently emerged. Most likely these were selected out of standing genetic variation after deployment of BvCR4. Identification of AvrCR4 will enable real‐time screening of C. beticola populations for the emergence and spread of virulent isolates. We found a candidate avirulence effector gene in the pathogen that causes Cercospora leaf spot on sugar beet; gene presence/absence is related to avirulence/virulence on resistant sugar beet hybrids.

Journal Article

PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme

by Zhang, Junying , Li, Aimin , Zhou, Zhongyin in Accuracy , Algorithms , Alignment

2014

Background: High-throughput transcriptome sequencing (RNA-seq) technology promises to discover novel protein-coding and non-coding transcripts, particularly the identification of long non-coding RNAs (lncRNAs) from de novo sequencing data. This requires tools that are not restricted by prior gene annotations, genomic sequences and high-quality sequencing. Results: We present an alignment-free tool called PLEK (predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme), which uses a computational pipeline based on an improved k-mer scheme and a support vector machine (SVM) algorithm to distinguish lncRNAs from messenger RNAs (mRNAs), in the absence of genomic sequences or annotations. The performance of PLEK was evaluated on well-annotated mRNA and lncRNA transcripts. 10-fold cross-validation tests on human RefSeq mRNAs and GENCODE lncRNAs indicated that our tool could achieve accuracy of up to 95.6%. We demonstrated the utility of PLEK on transcripts from other vertebrates using the model built from human datasets. PLEK attained >90% accuracy on most of these datasets. PLEK also performed well using a simulated dataset and two real de novo assembled transcriptome datasets (sequenced by PacBio and 454 platforms) with relatively high indel sequencing errors. In addition, PLEK is approximately eightfold faster than a newly developed alignment-free tool, named Coding-Non-Coding Index (CNCI), and 244 times faster than the most popular alignment-based tool, Coding Potential Calculator (CPC), when run in single-threaded mode. Conclusions: PLEK is an efficient alignment-free computational tool to distinguish lncRNAs from mRNAs in RNA-seq transcriptomes of species lacking reference genomes. PLEK is especially suitable for PacBio or 454 sequencing data and large-scale transcriptome data. Its open-source software can be freely downloaded from https://sourceforge.net/projects/plek/files/.

Journal Article

ACP-DL: A Deep Learning Long Short-Term Memory Model to Predict Anticancer Peptides Using High-Efficiency Feature Representation

by You, Zhu-Hong , Cheng, Li , Chen, Zhan-Heng in Accuracy , Amino acids , anticancer peptides

2019

Cancer is a well-known killer of human beings, which has led to countless deaths and misery. Anticancer peptides open a promising perspective for cancer treatment, and they have various attractive advantages. Conventional wet experiments are expensive and inefficient for finding and identifying novel anticancer peptides. There is an urgent need to develop a novel computational method to predict novel anticancer peptides. In this study, we propose a deep learning long short-term memory (LSTM) neural network model, ACP-DL, to effectively predict novel anticancer peptides. More specifically, to fully exploit peptide sequence information, we developed an efficient feature representation approach by integrating a binary profile feature and a k-mer sparse matrix of the reduced amino acid alphabet. Then we implemented a deep LSTM model to automatically learn how to distinguish anticancer peptides from non-anticancer peptides. To our knowledge, this is the first time that the deep LSTM model has been applied to predict anticancer peptides. It was demonstrated by cross-validation experiments that the proposed ACP-DL remarkably outperformed other comparison methods with high accuracy and satisfactory specificity on benchmark datasets. In addition, we also contributed two new anticancer peptide benchmark datasets, ACP740 and ACP240, in this work. The source code and datasets are available at https://github.com/haichengyi/ACP-DL.

Journal Article

RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification

by Treangen, Todd J. , Phillippy, Adam M. , Koren, Sergey in ancestry , Animal Genetics and Genomics , Bayesian analysis

2018

In order to determine the role of the database in taxonomic sequence classification, we examine the influence of the database over time on k-mer-based lowest common ancestor taxonomic classification. We present three major findings: the number of new species added to the NCBI RefSeq database greatly outpaces the number of new genera; as a result, more reads are classified with newer database versions, but fewer are classified at the species level; and Bayesian-based re-estimation mitigates this effect but struggles with novel genomes. These results suggest a need for new classification approaches specially adapted for large databases.

Journal Article

Transcription factor-binding k-mer analysis clarifies the cell type dependency of binding specificities and cis-regulatory SNPs in humans

by Tahara, Saeko , Tsuchiya, Takaho , Matsumoto, Hirotaka in Analysis , Animal Genetics and Genomics , Binding

2023

Background: Transcription factors (TFs) exhibit heterogeneous DNA-binding specificities in individual cells and whole organisms under natural conditions, and de novo motif discovery usually provides multiple motifs, even from a single chromatin immunoprecipitation-sequencing (ChIP-seq) sample. Despite the accumulation of ChIP-seq data and ChIP-seq-derived motifs, the diversity of DNA-binding specificities across different TFs and cell types remains largely unexplored. Results: Here, we applied MOCCS2, our k-mer-based motif discovery method, to a collection of human TF ChIP-seq samples across diverse TFs and cell types, and systematically computed profiles of TF-binding specificity scores for all k-mers. After quality control, we compiled a set of TF-binding specificity score profiles for 2,976 high-quality ChIP-seq samples, comprising 473 TFs and 398 cell types. Using these high-quality samples, we confirmed that the k-mer-based TF-binding specificity profiles reflected TF- or TF-family dependent DNA-binding specificities. We then compared the binding specificity scores of ChIP-seq samples with the same TFs but with different cell type classes and found that half of the analyzed TFs exhibited differences in DNA-binding specificities across cell type classes. Additionally, we devised a method to detect differentially bound k-mers between two ChIP-seq samples and detected k-mers exhibiting statistically significant differences in binding specificity scores. Moreover, we demonstrated that differences in the binding specificity scores between k-mers on the reference and alternative alleles could be used to predict the effect of variants on TF binding, as validated by in vitro and in vivo assay datasets. Finally, we demonstrated that binding specificity score differences can be used to interpret disease-associated non-coding single-nucleotide polymorphisms (SNPs) as TF-affecting SNPs and to provide candidates for the responsible TFs and cell types. Conclusions: Our study provides a basis for investigating the regulation of gene expression in a TF-, TF family-, or cell-type-dependent manner. Furthermore, our differential analysis of binding-specificity scores highlights noncoding disease-associated variants in humans.

Journal Article

Small Bugs, Big Data: Metagenomics for Arthropod Biodiversity Monitoring

by López Clinton, Samantha , Goodsell, Robert , Miraldo, Andreia in Arthropods , Bar codes , Big Data

2025

Obtaining genome‐wide data from complex samples, such as environmental material or bulk species collections, is increasingly feasible, yet inferring species presence and population genomic insights remains challenging. We applied metagenomic sequencing to 40 arthropod bulk samples collected with Malaise traps across Sweden and compared results with metabarcoding of the same material. Using a custom genome database, we achieved genus‐level classification largely consistent with metabarcoding. While metagenomics detected all genera identified by metabarcoding, conservative filtering thresholds designed to minimise false positives also excluded some true signals, particularly for low‐abundance taxa. Taxonomic overlap between methods was further constrained by limited reference database representation. Beyond taxonomic assignment, metagenomic sequencing yielded genome‐level information: we inferred haplotype diversity, heterozygosity and geographic population structure for several abundant species, including variable degrees of hybrid origin in red wood ants and the genetic distinctiveness of Gotland bumblebees. Finally, by‐catch plant DNA present in the bulk samples revealed plausible arthropod–plant interactions, several of which align with known ecological associations. Together, these results demonstrate the potential of metagenomics for biodiversity monitoring and population genomics, while underscoring the importance of filtering criteria and comprehensive reference databases. We used metagenomic sequencing of 40 bulk arthropod samples collected across Sweden to classify taxa and compare results with metabarcoding from the same samples. While metabarcoding was more sensitive for detecting low‐abundance taxa, taxonomic overlap between methods was strongly influenced by reference database representation. Metagenomics achieved genus‐level classifications broadly consistent with metabarcoding and further provided genome‐level insights into population structure, genetic diversity and hybridisation, as well as plausible arthropod–plant interactions through by‐catch DNA.

Journal Article
