Catalogue Search | MBRL

The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets

by Saito, Takaya , Rehmsmeier, Marc in Alternatives , Analysis , Bioinformatics

2015

Binary classifiers are routinely evaluated with performance measures such as sensitivity and specificity, and performance is frequently illustrated with Receiver Operating Characteristics (ROC) plots. Alternative measures such as positive predictive value (PPV) and the associated Precision/Recall (PRC) plots are used less frequently. Many bioinformatics studies develop and evaluate classifiers that are to be applied to strongly imbalanced datasets in which the number of negatives outweighs the number of positives significantly. While ROC plots are visually appealing and provide an overview of a classifier's performance across a wide range of specificities, one can ask whether ROC plots could be misleading when applied in imbalanced classification scenarios. We show here that the visual interpretability of ROC plots in the context of imbalanced datasets can be deceptive with respect to conclusions about the reliability of classification performance, owing to an intuitive but wrong interpretation of specificity. PRC plots, on the other hand, can provide the viewer with an accurate prediction of future classification performance due to the fact that they evaluate the fraction of true positives among positive predictions. Our findings have potential implications for the interpretation of a large number of studies that use ROC plots on imbalanced datasets.

Journal Article

Share this book

Add to My Shelf

Gnocis: An integrated system for interactive and reproducible analysis and modelling of cis-regulatory elements in Python 3

by Bredesen-Aa, Bjørn André , Rehmsmeier, Marc in Analysis , Artificial neural networks , Binding sites

2022

Gene expression is regulated through cis -regulatory elements (CREs), among which are promoters, enhancers, Polycomb/Trithorax Response Elements (PREs), silencers and insulators. Computational prediction of CREs can be achieved using a variety of statistical and machine learning methods combined with different feature space formulations. Although Python packages for DNA sequence feature sets and for machine learning are available, no existing package facilitates the combination of DNA sequence feature sets with machine learning methods for the genome-wide prediction of candidate CREs. We here present Gnocis, a Python package that streamlines the analysis and the modelling of CRE sequences by providing extensible APIs and implementing the glue required for combining feature sets and models for genome-wide prediction. Gnocis implements a variety of base feature sets, including motif pair occurrence frequencies and the k-spectrum mismatch kernel. It integrates with Scikit-learn and TensorFlow for state-of-the-art machine learning. Gnocis additionally implements a broad suite of tools for the handling and preparation of sequence, region and curve data, which can be useful for general DNA bioinformatics in Python. We also present Deep-MOCCA, a neural network architecture inspired by SVM-MOCCA that achieves moderate to high generalization without prior motif knowledge. To demonstrate the use of Gnocis, we applied multiple machine learning methods to the modelling of D. melanogaster PREs, including a Convolutional Neural Network (CNN), making this the first study to model PREs with CNNs. The models are readily adapted to new CRE modelling problems and to other organisms. In order to produce a high-performance, compiled package for Python 3, we implemented Gnocis in Cython. Gnocis can be installed using the PyPI package manager by running ‘ pip install gnocis ’. The source code is available on GitHub, at https://github.com/bjornbredesen/gnocis .

Journal Article

Share this book

Add to My Shelf

Gnocis: An integrated system for interactive and reproducible analysis and modelling of cis-regulatory elements in Python 3

by Marc Rehmsmeier , Bjørn André Bredesen-Aa

2022

Gene expression is regulated through cis-regulatory elements (CREs), among which are promoters, enhancers, Polycomb/Trithorax Response Elements (PREs), silencers and insulators. Computational prediction of CREs can be achieved using a variety of statistical and machine learning methods combined with different feature space formulations. Although Python packages for DNA sequence feature sets and for machine learning are available, no existing package facilitates the combination of DNA sequence feature sets with machine learning methods for the genome-wide prediction of candidate CREs. We here present Gnocis, a Python package that streamlines the analysis and the modelling of CRE sequences by providing extensible APIs and implementing the glue required for combining feature sets and models for genome-wide prediction. Gnocis implements a variety of base feature sets, including motif pair occurrence frequencies and the k-spectrum mismatch kernel. It integrates with Scikit-learn and TensorFlow for state-of-the-art machine learning. Gnocis additionally implements a broad suite of tools for the handling and preparation of sequence, region and curve data, which can be useful for general DNA bioinformatics in Python. We also present Deep-MOCCA, a neural network architecture inspired by SVM-MOCCA that achieves moderate to high generalization without prior motif knowledge. To demonstrate the use of Gnocis, we applied multiple machine learning methods to the modelling of D. melanogaster PREs, including a Convolutional Neural Network (CNN), making this the first study to model PREs with CNNs. The models are readily adapted to new CRE modelling problems and to other organisms. In order to produce a high-performance, compiled package for Python 3, we implemented Gnocis in Cython. Gnocis can be installed using the PyPI package manager by running 'pip install gnocis'. The source code is available on GitHub, at https://github.com/bjornbredesen/gnocis.

Journal Article

Share this book

Add to My Shelf

MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics

by Bredesen, Bjørn André , Rehmsmeier, Marc in Algorithms , Base Sequence , Bioinformatics

2021

Background Cis -regulatory elements (CREs) are DNA sequence segments that regulate gene expression. Among CREs are promoters, enhancers, Boundary Elements (BEs) and Polycomb Response Elements (PREs), all of which are enriched in specific sequence motifs that form particular occurrence landscapes. We have recently introduced a hierarchical machine learning approach (SVM-MOCCA) in which Support Vector Machines (SVMs) are applied on the level of individual motif occurrences, modelling local sequence composition, and then combined for the prediction of whole regulatory elements. We used SVM-MOCCA to predict PREs in Drosophila and found that it was superior to other methods. However, we did not publish a polished implementation of SVM-MOCCA, which can be useful for other researchers, and we only tested SVM-MOCCA with IUPAC motifs and PREs. Results We here present an expanded suite for modelling CRE sequences in terms of motif occurrence combinatorics—Motif Occurrence Combinatorics Classification Algorithms (MOCCA). MOCCA contains efficient implementations of several modelling methods, including SVM-MOCCA, and a new method, RF-MOCCA, a Random Forest–derivative of SVM-MOCCA. We used SVM-MOCCA and RF-MOCCA to model Drosophila PREs and BEs in cross-validation experiments, making this the first study to model PREs with Random Forests and the first study that applies the hierarchical MOCCA approach to the prediction of BEs. Both models significantly improve generalization to PREs and boundary elements beyond that of previous methods—including 4-spectrum and motif occurrence frequency Support Vector Machines and Random Forests—, with RF-MOCCA yielding the best results. Conclusion MOCCA is a flexible and powerful suite of tools for the motif-based modelling of CRE sequences in terms of motif composition. MOCCA can be applied to any new CRE modelling problems where motifs have been identified. MOCCA supports IUPAC and Position Weight Matrix (PWM) motifs. For ease of use, MOCCA implements generation of negative training data, and additionally a mode that requires only that the user specifies positives, motifs and a genome. MOCCA is licensed under the MIT license and is available on Github at https://github.com/bjornbredesen/MOCCA .

Journal Article

Share this book

Add to My Shelf

MGcount: a total RNA-seq quantification tool to address multi-mapping and multi-overlapping alignments ambiguity in non-coding transcripts

by Brocart, Gilles , Rehmsmeier, Marc , Alemany, Anna in Algorithms , Alignment , Annotations

2022

Background Total-RNA sequencing (total-RNA-seq) allows the simultaneous study of both the coding and the non-coding transcriptome. Yet, computational pipelines have traditionally focused on particular biotypes, making assumptions that are not fullfilled by total-RNA-seq datasets. Transcripts from distinct RNA biotypes vary in length, biogenesis, and function, can overlap in a genomic region, and may be present in the genome with a high copy number. Consequently, reads from total-RNA-seq libraries may cause ambiguous genomic alignments, demanding for flexible quantification approaches. Results Here we present Multi-Graph count (MGcount), a total-RNA-seq quantification tool combining two strategies for handling ambiguous alignments. First, MGcount assigns reads hierarchically to small-RNA and long-RNA features to account for length disparity when transcripts overlap in the same genomic position. Next, MGcount aggregates RNA products with similar sequences where reads systematically multi-map using a graph-based approach. MGcount outputs a transcriptomic count matrix compatible with RNA-sequencing downstream analysis pipelines, with both bulk and single-cell resolution, and the graphs that model repeated transcript structures for different biotypes. The software can be used as a python module or as a single-file executable program. Conclusions MGcount is a flexible total-RNA-seq quantification tool that successfully integrates reads that align to multiple genomic locations or that overlap with multiple gene features. Its approach is suitable for the simultaneous estimation of protein-coding, long non-coding and small non-coding transcript concentration, in both precursor and processed forms. Both source code and compiled software are available at https://github.com/hitaandrea/MGcount .

Journal Article

Share this book

Add to My Shelf

Correction: MGcount: a total RNA-seq quantification tool to address multi-mapping and multi-overlapping alignments ambiguity in non-coding transcripts

by Brocart, Gilles , Rehmsmeier, Marc , Alemany, Anna in Algorithms , Bioinformatics , Biomedical and Life Sciences

2022

Journal Article

Share this book

Add to My Shelf

High-Resolution Analysis of Parent-of-Origin Allelic Expression in the Arabidopsis Endosperm

by Wolff, Philip , Weinhofer, Isabelle , Roszak, Pawel in Agricultural Science , Alleles , Allelomorphism

2011

Genomic imprinting is an epigenetic phenomenon leading to parent-of-origin specific differential expression of maternally and paternally inherited alleles. In plants, genomic imprinting has mainly been observed in the endosperm, an ephemeral triploid tissue derived after fertilization of the diploid central cell with a haploid sperm cell. In an effort to identify novel imprinted genes in Arabidopsis thaliana, we generated deep sequencing RNA profiles of F1 hybrid seeds derived after reciprocal crosses of Arabidopsis Col-0 and Bur-0 accessions. Using polymorphic sites to quantify allele-specific expression levels, we could identify more than 60 genes with potential parent-of-origin specific expression. By analyzing the distribution of DNA methylation and epigenetic marks established by Polycomb group (PcG) proteins using publicly available datasets, we suggest that for maternally expressed genes (MEGs) repression of the paternally inherited alleles largely depends on DNA methylation or PcG-mediated repression, whereas repression of the maternal alleles of paternally expressed genes (PEGs) predominantly depends on PcG proteins. While maternal alleles of MEGs are also targeted by PcG proteins, such targeting does not cause complete repression. Candidate MEGs and PEGs are enriched for cis-proximal transposons, suggesting that transposons might be a driving force for the evolution of imprinted genes in Arabidopsis. In addition, we find that MEGs and PEGs are significantly faster evolving when compared to other genes in the genome. In contrast to the predominant location of mammalian imprinted genes in clusters, cluster formation was only detected for few MEGs and PEGs, suggesting that clustering is not a major requirement for imprinted gene regulation in Arabidopsis.

Journal Article

Share this book

Add to My Shelf

Response to Wang and Luo

by Rehmsmeier, Marc in Analysis , Arabidopsis - genetics , Biomedical and Life Sciences

2012

Doc number: 32 Abstract: This article is a response to Wang and Luo. See correspondence article http://www.biomedcentral.com/1741-7007/10/30 and the original research article http://www.biomedcentral.com/1741-7007/9/24 .

Journal Article

Share this book

Add to My Shelf

Polyploidization increases meiotic recombination frequency in Arabidopsis

by Levy, Avraham A , Rehmsmeier, Marc , Mittelsten Scheid, Ortrun in Arabidopsis - cytology , Arabidopsis - genetics , Arabidopsis thaliana

2011

Background Polyploidization is the multiplication of the whole chromosome complement and has occurred frequently in vascular plants. Maintenance of stable polyploid state over generations requires special mechanisms to control pairing and distribution of more than two homologous chromosomes during meiosis. Since a minimal number of crossover events is essential for correct chromosome segregation, we investigated whether polyploidy has an influence on the frequency of meiotic recombination. Results Using two genetically linked transgenes providing seed-specific fluorescence, we compared a high number of progeny from diploid and tetraploid Arabidopsis plants. We show that rates of meiotic recombination in reciprocal crosses of genetically identical diploid and autotetraploid Arabidopsis plants were significantly higher in tetraploids compared to diploids. Although male and female gametogenesis differ substantially in meiotic recombination frequency, both rates were equally increased in tetraploids. To investigate whether multivalent formation in autotetraploids was responsible for the increased recombination rates, we also performed corresponding experiments with allotetraploid plants showing strict bivalent pairing. We found similarly increased rates in auto- and allotetraploids, suggesting that the ploidy effect is independent of chromosome pairing configurations. Conclusions The evolutionary success of polyploid plants in nature and under domestication has been attributed to buffering of mutations and sub- and neo-functionalization of duplicated genes. Should the data described here be representative for polyploid plants, enhanced meiotic recombination, and the resulting rapid creation of genetic diversity, could have also contributed to their prevalence.

Journal Article

Share this book

Add to My Shelf

Robust de novo pathway enrichment with KeyPathwayMiner 5 version 1; peer review: 2 approved

by Ditzel, Henrik J , Rehmsmeier, Marc , Mollenhauer, Jan in Bioinformatics , Genomics , Software Tool

2016

Identifying functional modules or novel active pathways, recently termed de novo pathway enrichment, is a computational systems biology challenge that has gained much attention during the last decade. Given a large biological interaction network, KeyPathwayMiner extracts connected subnetworks that are enriched for differentially active entities from a series of molecular profiles encoded as binary indicator matrices. Since interaction networks constantly evolve, an important question is how robust the extracted results are when the network is modified. We enable users to study this effect through several network perturbation techniques and over a range of perturbation degrees. In addition, users may now provide a gold-standard set to determine how enriched extracted pathways are with relevant genes compared to randomized versions of the original network.

Journal Article

Share this book

Add to My Shelf

Language Selector

MBRLGlobalSearch

Language Selector

Catalogue Search | MBRL

Search Results Heading

Explore the vast range of titles available.

MBRLSearchResults

MBRLHappinessMeter