Catalogue Search | MBRL

FADU: a Quantification Tool for Prokaryotic Transcriptomic Analyses

by Fraser, Claire M.; Adkins, Ricky S.; Mattick, John S. A. in bacteria, Comment On, differential expression

2021

Quantification tools for RNA sequencing (RNA-Seq) analyses are often designed and tested using human transcriptomics data sets, in which full-length transcript sequences are well annotated. For prokaryotic transcriptomics experiments, full-length transcript sequences are seldom known, and coding sequences must instead be used for quantification steps in RNA-Seq analyses. However, operons confound accurate quantification of coding sequences since a single transcript does not necessarily equate to a single gene. Here, we introduce FADU (Feature Aggregate Depth Utility), a quantification tool designed specifically for prokaryotic RNA-Seq analyses. FADU assigns partial count values proportional to the length of the fragment overlapping the target feature. To assess the ability of FADU to quantify genes in prokaryotic transcriptomics analyses, we compared its performance to those of eXpress, featureCounts, HTSeq, kallisto, and Salmon across three paired-end read data sets of (i) Ehrlichia chaffeensis, (ii) Escherichia coli, and (iii) the Wolbachia endosymbiont wBm. Across each of the three data sets, we find that FADU can more accurately quantify operonic genes by deriving proportional counts for multigene fragments within operons. FADU is available at https://github.com/IGS/FADU.

IMPORTANCE Most currently available quantification tools for transcriptomics analyses have been designed for human data sets, in which full-length transcript sequences, including the untranslated regions, are well annotated. In most prokaryotic systems, full-length transcript sequences have yet to be characterized, leading to prokaryotic transcriptomics analyses being performed based on only the coding sequences. In contrast to eukaryotes, prokaryotes contain polycistronic transcripts, and when genes are quantified based on coding sequences instead of transcript sequences, this leads to an increased abundance of improperly assigned ambiguous multigene fragments, specifically those mapping to multiple genes in operons. Here, we describe FADU, a quantification tool for prokaryotic RNA-Seq analyses designed to assign proportional counts with the purpose of better quantifying operonic genes while minimizing the pitfalls associated with improperly assigning fragment counts from ambiguous transcripts.
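
The proportional-count idea in this abstract can be illustrated with a short sketch. This is not FADU's actual implementation: it assumes each fragment contributes to a feature the fraction of its own length that overlaps that feature, and the function names, coordinates, and gene names below are hypothetical.

```python
from collections import defaultdict


def overlap_length(a_start, a_end, b_start, b_end):
    """Shared bases between two closed, 1-based intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def proportional_counts(fragments, features):
    """Give each feature a partial count per fragment.

    Assumption for illustration: the partial count equals
    (bases of the fragment overlapping the feature) / (fragment length).
    """
    counts = defaultdict(float)
    for frag_start, frag_end in fragments:
        frag_len = frag_end - frag_start + 1
        for name, (feat_start, feat_end) in features.items():
            shared = overlap_length(frag_start, frag_end, feat_start, feat_end)
            if shared:
                counts[name] += shared / frag_len
    return dict(counts)


# Two adjacent genes in a hypothetical operon.
features = {"geneA": (1, 100), "geneB": (101, 200)}
# The first fragment spans both genes equally; the second lies within geneA.
fragments = [(51, 150), (1, 50)]
counts = proportional_counts(fragments, features)
```

Under this assumption the fragment spanning both genes is split between them instead of being discarded as ambiguous or counted in full for each gene.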

Journal Article

mtR_find: A parallel processing tool to identify and annotate RNAs derived from the mitochondrial genome

by Mohideen, Asan Meera Sahib Haja; Babiak, Igor Szczepan; Johansen, Steinar Daae in Animals, Cancer, Datasets

2023

RNAs originating from mitochondrial genomes are abundant in transcriptomic datasets produced by high-throughput sequencing technologies, primarily in short-read outputs. Specific features of mitochondrial small RNAs (mt-sRNAs), such as non-templated additions, presence of length variants, sequence variants, and other modifications, necessitate the development of an appropriate tool for their effective identification and annotation. We have developed mtR_find, a tool to detect and annotate mitochondrial RNAs, including mt-sRNAs and mitochondria-derived long non-coding RNAs (mt-lncRNA). mtR_find uses a novel method to compute the count of RNA sequences from adapter-trimmed reads. When analyzing published datasets with mtR_find, we identified mt-sRNAs significantly associated with health conditions, such as hepatocellular carcinoma and obesity, and we discovered novel mt-sRNAs. Furthermore, we identified mt-lncRNAs in early development in mice. These examples show the immediate impact of mtR_find in extracting novel biological information from existing sequencing datasets. For benchmarking, the tool has been tested on a simulated dataset and the results were concordant. For accurate annotation of mitochondria-derived RNA, particularly mt-sRNA, we developed an appropriate nomenclature. mtR_find encompasses the mt-ncRNA transcriptomes in unprecedented resolution and simplicity, allowing re-analysis of the existing transcriptomic databases and the use of mt-ncRNAs as diagnostic or prognostic markers in the field of medicine.

Journal Article

Statistical Framework for eQTL Mapping Using RNA‐seq Data

by Sun, Wei in Algorithms, Allele-specific expression (ASE), Alleles

2012

RNA‐seq may replace gene expression microarrays in the near future. Using RNA‐seq, the expression of a gene can be estimated using the total number of sequence reads mapped to that gene, known as the total read count (TReC). Traditional expression quantitative trait locus (eQTL) mapping methods, such as linear regression, can be applied to TReC measurements after they are properly normalized. In this article, we show that eQTL mapping, by directly modeling TReC using discrete distributions, has higher statistical power than the two‐step approach: data normalization followed by linear regression. In addition, RNA‐seq provides information on allele‐specific expression (ASE) that is not available from microarrays. By combining the information from TReC and ASE, we can computationally distinguish cis‐ and trans‐eQTL and further improve the power of cis‐eQTL mapping. Both simulation and real data studies confirm the improved power of our new methods. We also discuss the design issues of RNA‐seq experiments. Specifically, we show that by combining TReC and ASE measurements, it is possible to minimize cost and retain the statistical power of cis‐eQTL mapping by reducing sample size while increasing the number of sequence reads per sample. In addition to RNA‐seq data, our method can also be employed to study the genetic basis of other types of sequencing data, such as chromatin immunoprecipitation followed by DNA sequencing data. In this article, we focus on eQTL mapping of a single gene using the association‐based method. However, our method establishes a statistical framework for future developments of eQTL mapping methods using RNA‐seq data (e.g., linkage‐based eQTL mapping), and the joint study of multiple genetic markers and/or multiple genes.

Journal Article

Gene dispersion is the key determinant of the read count bias in differential expression analysis of RNA-seq data

by Nam, Dougu; Yoon, Sora in Adult, Animal Genetics and Genomics, Bias

2017

Background In differential expression analysis of RNA-sequencing (RNA-seq) read count data for two sample groups, it is known that highly expressed genes (or longer genes) are more likely to be differentially expressed, which is called read count bias (or gene length bias). This bias has a great effect on the downstream Gene Ontology over-representation analysis. However, such a bias has not been systematically analyzed for different replicate types of RNA-seq data. Results We show that the dispersion coefficient of a gene in the negative binomial modeling of read counts is the critical determinant of the read count bias (and gene length bias) by mathematical inference and tests for a number of simulated and real RNA-seq datasets. We demonstrate that the read count bias is mostly confined to data with small gene dispersions (e.g., technical replicates and some genetically identical replicates such as cell lines or inbred animals), and many biological replicate data from unrelated samples do not suffer from such a bias except for genes with some small counts. It is also shown that the sample-permuting GSEA method yields a considerable number of false positives caused by the read count bias, while the preranked method does not. Conclusion We showed for the first time that small gene variance (similarly, dispersion) is the main cause of read count bias (and gene length bias), and we analyzed the read count bias for different replicate types of RNA-seq data and its effect on gene-set enrichment analysis.
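
One way to see why dispersion could matter here is the standard negative binomial variance, Var = mean + dispersion × mean², which gives a squared coefficient of variation of 1/mean + dispersion. The sketch below is only an illustration of that textbook relationship, not the authors' derivation; the function name and the example values are hypothetical.

```python
import math


def nb_cv(mean, dispersion):
    """Coefficient of variation of a negative binomial count.

    Uses Var = mean + dispersion * mean**2, so CV**2 = 1/mean + dispersion.
    """
    return math.sqrt(1.0 / mean + dispersion)


# Ratio of relative noise between a low-count gene (mean 10) and a
# high-count gene (mean 1000) under two illustrative dispersion values.
ratio_small_dispersion = nb_cv(10, 0.0) / nb_cv(1000, 0.0)
ratio_large_dispersion = nb_cv(10, 0.5) / nb_cv(1000, 0.5)
```

With dispersion near zero the relative noise falls steeply as the mean count rises, so high-count genes are easier to call as differentially expressed. With a large dispersion the relative noise is dominated by the dispersion term and depends far less on the count.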

Journal Article

ASSIGNMENT OF ENDOGENOUS RETROVIRUS INTEGRATION SITES USING A MIXTURE MODEL

by Poss, Mary; Hunter, David R.; Bao, Le

2017

Structural variation occurs in the genomes of individuals because of the different positions occupied by repetitive genome elements like endogenous retroviruses, or ERVs. The presence or absence of ERVs can be determined by identifying the junction with the host genome using high-throughput sequence technology and a clustering algorithm. The resulting data give the number of sequence reads assigned to each ERV-host junction sequence for each sampled individual. Variability in the number of reads from an individual integration site makes it difficult to determine whether a site is present for low read counts. We present a novel two-component mixture of negative binomial distributions to model these counts and assign a probability that a given ERV is present in a given individual. We explain how our approach is superior to existing alternatives, including another form of two-component mixture model and the much more common approach of selecting a threshold count for declaring the presence of an ERV. We apply our method to a data set of ERV integrations in mule deer (Odocoileus hemionus), a species for which no genomic resources are available, and demonstrate that the discovered patterns of shared integration sites contain information about animal relatedness.
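
The presence probability described in this abstract can be sketched with Bayes' rule over a two-component negative binomial mixture. This is a minimal illustration, not the authors' model or fitting procedure: the mean/size parameterization, the mixing weight, and all parameter values below are assumptions chosen for the example, not fitted estimates.

```python
import math


def nb_log_pmf(x, mean, size):
    """Log-probability of count x under a negative binomial with the given
    mean and size (inverse-dispersion) parameters."""
    p = size / (size + mean)
    return (math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
            + size * math.log(p) + x * math.log1p(-p))


def prob_present(x, weight_present, absent, present):
    """Posterior probability that a read count x came from the 'present'
    component rather than the 'absent' (background) component."""
    log_a = math.log1p(-weight_present) + nb_log_pmf(x, *absent)
    log_p = math.log(weight_present) + nb_log_pmf(x, *present)
    top = max(log_a, log_p)  # subtract the max for numerical stability
    num = math.exp(log_p - top)
    return num / (math.exp(log_a - top) + num)


# Hypothetical (mean, size) pairs: background reads vs. a true integration.
ABSENT = (1.0, 1.0)
PRESENT = (50.0, 2.0)

low = prob_present(0, 0.5, ABSENT, PRESENT)
high = prob_present(40, 0.5, ABSENT, PRESENT)
```

Unlike a fixed threshold, this gives every junction a graded probability, so intermediate read counts are reported as uncertain instead of being forced to present or absent.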

Journal Article

A fast and efficient count-based matrix factorization method for detecting cell types from single-cell RNAseq data

by Shang, Xuequn; Sun, Shiquan; Liu, Yang in Binomial distribution, Bioinformatics, Brain

2019

Background Single-cell RNA sequencing (scRNAseq) data always involve various unwanted variables, which can mask the true signal needed to identify cell types. A more efficient way of dealing with this issue is to extract low-dimensional information from high-dimensional gene expression data to represent cell-type structure. In the past two years, several powerful matrix factorization tools were developed for scRNAseq data, such as NMF, ZIFA, pCMF and ZINB-WaVE. But the existing approaches either are unable to directly model the raw counts of scRNAseq data or are very time-consuming when handling a large number of cells (e.g. n>500). Results In this paper, we developed a fast and efficient count-based matrix factorization method (single-cell negative binomial matrix factorization, scNBMF) based on the TensorFlow framework to infer the low-dimensional structure of cell types. To make our method scalable, we conducted a series of experiments on three public scRNAseq data sets: brain, embryonic stem, and pancreatic islet. The experimental results show that scNBMF is more powerful at detecting cell types and 10- to 100-fold faster than the bespoke scRNAseq tools. Conclusions In this paper, we proposed a fast and efficient count-based matrix factorization method, scNBMF, which is more powerful for detecting cell types. A series of experiments were performed on three public scRNAseq data sets. The results show that scNBMF is a more powerful tool in large-scale scRNAseq data analysis. scNBMF was implemented in R and Python, and the source code is freely available at https://github.com/sqsun.

Journal Article

Impact of data resolution on three-dimensional structure inference methods

by Lin, Shili; Park, Jincheol in Algorithms, Bioinformatics, Biomedical and Life Sciences

2016

Background Assays that are capable of detecting genome-wide chromatin interactions have produced massive amounts of data and led to great understanding of the chromosomal three-dimensional (3D) structure. As technology becomes more sophisticated, higher-and-higher resolution data are being produced, going from the initial 1 Megabase (Mb) resolution to the current 10 Kilobase (Kb) or even 1 Kb resolution. The availability of genome-wide interaction data necessitates development of analytical methods to recover the underlying 3D spatial chromatin structure, but challenges abound. Most of the methods were proposed for analyzing data at low resolution (1 Mb). Their behaviors are thus unknown for higher resolution data. For such data, one of the key features is the high proportion of “0” contact counts among all available data, in other words, the excess of zeros. Results To address the issue of excess of zeros, in this paper, we propose a truncated Random effect EXpression (tREX) method that can handle data at various resolutions. We then assess the performance of tREX and a number of leading existing methods for recovering the underlying chromatin 3D structure. This was accomplished by creating in-silico data to mimic multiple levels of resolution and submitting the methods to a “stress test”. Finally, we applied tREX and the comparison methods to a Hi-C dataset for which FISH measurements are available to evaluate estimation accuracy. Conclusion The proposed tREX method achieves consistently good performance in all 30 simulated settings considered. It is not only robust to resolution level and underlying parameters, but also insensitive to model misspecification. This conclusion is based on observations made in terms of 3D structure estimation accuracy and preservation of topologically associated domains. Application of the methods to the human lymphoblastoid cell line data on chromosomes 14 and 22 further substantiates the superior performance of tREX: the constructed 3D structure from tREX is consistent with the FISH measurements, and the corresponding distances predicted by tREX have higher correlation with the FISH measurements than any of the comparison methods. Software An open-source R-package is available at http://www.stat.osu.edu/~statgen/Software/tRex.

Journal Article

Computational Analysis of AmpSeq Data for Targeted, High-Throughput Genotyping of Amplicons

by Fresnedo-Ramírez, Jonathan; Cadle-Davidson, Lance; Yang, Shanshan in amplicon read counts, Bioinformatics, Computer applications

2019

Amplicon sequencing (AmpSeq) is a practical, intuitive strategy with a semi-automated computational pipeline for analysis of highly multiplexed PCR-derived sequences. This genotyping platform is particularly cost-effective when multiplexing 96 or more samples with a few amplicons up to thousands of amplicons. Amplicons can target from a single nucleotide to the upper limit of the sequencing platform. The flexibility of AmpSeq's wet lab methods makes it a tool of broad interest for diverse species, and AmpSeq excels in flexibility, high throughput, low cost, accuracy, and semi-automated analysis. Here we provide an open science framework procedure to output data from an AmpSeq project, with an emphasis on the bioinformatics pipeline to generate SNPs, haplotypes and presence/absence variants in a set of diverse genotypes. Open-access tutorial datasets with actual data and a containerized open-source software instance are provided to enable training in each of these genotyping applications. The pipelines presented here should be applicable to the analysis of various target-enriched (e.g., amplicon or sequence capture) Illumina sequence data.

Journal Article

LinkImputeR: user-guided genotype calling and imputation for non-model organisms

by Myles, Sean; Migicovsky, Zoë; Gardner, Kyle in Accuracy, Algorithms, Animal Genetics and Genomics

2017

Background Genomic studies such as genome-wide association and genomic selection require genome-wide genotype data. All existing technologies used to create these data result in missing genotypes, which are often then inferred using genotype imputation software. However, existing imputation methods most often make use only of genotypes that are successfully inferred after having passed a certain read depth threshold. Because of this, any read information for genotypes that did not pass the threshold, and were thus set to missing, is ignored. Most genomic studies also choose read depth thresholds and quality filters without investigating their effects on the size and quality of the resulting genotype data. Moreover, almost all genotype imputation methods require ordered markers and are therefore of limited utility in non-model organisms. Results Here we introduce LinkImputeR, a software program that exploits the read count information that is normally ignored, and makes use of all available DNA sequence information for the purposes of genotype calling and imputation. It is specifically designed for non-model organisms since it requires neither ordered markers nor a reference panel of genotypes. Using next-generation DNA sequence (NGS) data from apple, cannabis and grape, we quantify the effect of varying read count and missingness thresholds on the quantity and quality of genotypes generated from LinkImputeR. We demonstrate that LinkImputeR can increase the number of genotype calls by more than an order of magnitude, can improve genotyping accuracy by several percent and can thus improve the power of downstream analyses. Moreover, we show that the effects of quality and read depth filters can differ substantially between data sets and should therefore be investigated on a per-study basis. 
Conclusions By exploiting DNA sequence data that is normally ignored during genotype calling and imputation, LinkImputeR can significantly improve both the quantity and quality of genotype data generated from NGS technologies. It enables the user to quickly and easily examine the effects of varying thresholds and filters on the number and quality of the resulting genotype calls. In this manner, users can decide on thresholds that are most suitable for their purposes. We show that LinkImputeR can significantly augment the value and utility of NGS data sets, especially in non-model organisms with poor genomic resources.

Journal Article

Robust Linear Trend Test for Low-Coverage Next-Generation Sequence Data Controlling for Covariates

by Kim, Myeong-Kyu; Lee, Jung Yeon; Kim, Wonkuk in allele read counts, low-coverage, mixture model

2020

Low-coverage next-generation sequencing experiments assisted by statistical methods are popular in genetic association studies. Next-generation sequencing experiments produce genotype data that include allele read counts and read depths. For low sequencing depths, the genotypes tend to be highly uncertain; therefore, the uncertain genotypes are usually removed or imputed before performing a statistical analysis. This may result in an inflated type I error rate and in a loss of statistical power. In this paper, we propose a mixture-based penalized score association test adjusting for non-genetic covariates. The proposed score test statistic is based on a sandwich variance estimator so that it is robust under model misspecification between the covariates and the latent genotypes. The proposed method has the advantage of requiring neither external imputation nor elimination of uncertain genotypes. The results of our simulation study show that the type I error rates are well controlled and the proposed association test has reasonable statistical power. As an illustration, we apply our statistic to pharmacogenomics data for drug responsiveness among 400 epilepsy patients.

Journal Article
