Catalogue Search | MBRL
Search Results
Explore the vast range of titles available.
32,489 result(s) for "Results and data"
InteractiVenn: a web-based tool for the analysis of sets through Venn diagrams
by da Silva, Felipe R; Heberle, Henry; Meirelles, Gabriela Vaz
in Algorithms; Bioinformatics; Biomarkers, Tumor - analysis
2015
Background
Set comparisons permeate a large number of data analysis workflows, in particular workflows in the biological sciences. Venn diagrams are frequently employed for such analyses, but current tools are limited.
Results
We have developed InteractiVenn, a more flexible tool for interacting with Venn diagrams of up to six sets. It offers a clean interface for Venn diagram construction and enables analysis of set unions while preserving the shape of the diagram. Set unions are useful for revealing differences and similarities among sets, and in our tool they may be guided by a tree or by a list of set unions. The tool also supports retrieving the elements of subsets, saving and loading sets for further analysis, and exporting the diagram in vector and image formats. InteractiVenn has been used to analyze two biological datasets, but it may serve set analysis in a broad range of domains.
Conclusions
InteractiVenn allows set unions in Venn diagrams to be explored thoroughly, consequently extending the ability to analyze combinations of sets with additional observations yielded by novel interactions between joined sets. InteractiVenn is freely available online at: www.interactivenn.net.
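The set algebra that the tool exposes interactively is easy to reproduce for small cases. As a minimal sketch (base R only, not InteractiVenn's own code; the gene-set contents are invented for illustration):

```r
# Three toy gene sets; any Venn region is a combination of these operations.
genes_a <- c("TP53", "BRCA1", "MYC", "EGFR")
genes_b <- c("BRCA1", "MYC", "KRAS")
genes_c <- c("MYC", "KRAS", "PTEN")

Reduce(intersect, list(genes_a, genes_b, genes_c))  # shared by all three sets
union(genes_a, genes_b)                             # a joined ("union") set
setdiff(genes_a, union(genes_b, genes_c))           # elements unique to set A
```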
Journal Article
Random forest versus logistic regression: a large-scale benchmark experiment
by Couronné, Raphael; Probst, Philipp; Boulesteix, Anne-Laure
in Algorithms; Artificial intelligence; Benchmarks
2018
Background and goal
The Random Forest (RF) algorithm for regression and classification has gained considerable popularity since its introduction in 2001. Meanwhile, it has grown into a standard classification approach competing with logistic regression (LR) in many innovation-friendly scientific fields.
Results
In this context, we present a large-scale benchmark experiment based on 243 real datasets comparing the prediction performance of the original version of RF with default parameters and of LR as binary classification tools. Most importantly, the design of our benchmark experiment is inspired by clinical trial methodology, thus avoiding common pitfalls and major sources of bias.
Conclusion
RF performed better than LR according to the considered accuracy measure in approximately 69% of the datasets. The mean difference between RF and LR was 0.029 (95% CI: [0.022, 0.038]) for accuracy, 0.041 (95% CI: [0.031, 0.053]) for the Area Under the Curve, and −0.027 (95% CI: [−0.034, −0.021]) for the Brier score, all measures thus suggesting a significantly better performance of RF. As a side result of our benchmarking experiment, we observed that the results were noticeably dependent on the inclusion criteria used to select the example datasets, emphasizing the importance of clear statements regarding this dataset selection process. We also stress that neutral studies similar to ours, based on a high number of datasets and carefully designed, will be necessary in the future to evaluate further variants, implementations, or parameters of random forests which may yield improved accuracy compared to the original version with default values.
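To make the comparison concrete, here is a hedged single-dataset sketch using the randomForest package and base R's glm; the dataset, split, and 0.5 threshold are illustrative choices, not the paper's 243-dataset protocol.

```r
# Compare default-parameter RF against LR on one binary task (illustrative only).
library(randomForest)

d <- subset(iris, Species != "setosa")     # reduce iris to a binary problem
d$y <- factor(d$Species == "versicolor")
d$Species <- NULL

set.seed(1)
train <- sample(nrow(d), 0.7 * nrow(d))

rf <- randomForest(y ~ ., data = d[train, ])             # RF, default parameters
lr <- glm(y ~ ., data = d[train, ], family = binomial)   # logistic regression

acc_rf <- mean(predict(rf, d[-train, ]) == d[-train, "y"])
acc_lr <- mean((predict(lr, d[-train, ], type = "response") > 0.5) ==
               (d[-train, "y"] == "TRUE"))
c(rf = acc_rf, lr = acc_lr)                              # test-set accuracies
```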
Journal Article
GOnet: a tool for interactive Gene Ontology analysis
2018
Background
Biological interpretation of gene/protein lists resulting from -omics experiments can be a complex task. A common approach consists of reviewing Gene Ontology (GO) annotations for entries in such lists and searching for enrichment patterns. Unfortunately, there is a gap between machine-readable output of GO software and its human-interpretable form. This gap can be bridged by allowing users to simultaneously visualize and interact with term-term and gene-term relationships.
Results
We created the open-source GOnet web application (available at http://tools.dice-database.org/GOnet/), which takes a list of gene or protein entries from human or mouse data and performs GO term annotation analysis (mapping of provided entries to GO subsets) or GO term enrichment analysis (scanning for GO categories overrepresented in the input list). The application is capable of producing parsable data formats and, importantly, interactive visualizations of the GO analysis results. The interactive results allow exploration of genes and GO terms as a graph that depicts the natural hierarchy of the terms and retains relationships between terms and genes/proteins. As a result, GOnet provides insight into the functional interconnection of the submitted entries.
Conclusions
The application can be used for GO analysis of any biological data sources resulting in gene/protein lists. It can be helpful for experimentalists as well as computational biologists working on biological interpretation of -omics data resulting in such lists.
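For orientation, "overrepresented" in enrichment analysis is usually quantified with a one-sided hypergeometric test per GO term. A generic R sketch with invented counts (this is the standard test, not necessarily GOnet's internal implementation):

```r
# One-sided hypergeometric over-representation test for a single GO term.
total_genes <- 20000   # annotated "universe"
term_genes  <- 150     # genes annotated to the GO term
input_genes <- 300     # genes in the submitted list
overlap     <- 12      # input genes carrying the term

# P(overlap of 12 or more) under random sampling without replacement
phyper(overlap - 1, term_genes, total_genes - term_genes,
       input_genes, lower.tail = FALSE)
```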
Journal Article
Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study
2019
Background
LC-MS technology makes it possible to measure the relative abundance of numerous molecular features of a sample in a single analysis. However, non-targeted metabolite profiling approaches in particular generate vast arrays of data that are prone to aberrations such as missing values. Whatever the reason for the missing values, a coherent and complete data matrix is always a prerequisite for accurate and reliable statistical analysis. Therefore, proper imputation strategies are needed that account for the missingness and reduce bias in the statistical analysis.
Results
Here we present our results from evaluating nine imputation methods at four different percentages of missing values of different origins. The performance of each imputation method was assessed by the Normalized Root Mean Squared Error (NRMSE). We demonstrate that random forest (RF) had the lowest NRMSE in the estimation of values Missing at Random (MAR) and Missing Completely at Random (MCAR). In the case of values absent due to a Missing Not at Random (MNAR) mechanism, left-truncated data were best imputed with minimum-value imputation. We also tested the imputation methods on datasets containing missing data of various origins, and RF was the most accurate method in all cases. The results were obtained by repeating the evaluation process 100 times on metabolomics datasets into which missing values were introduced to represent absent data of different origins.
Conclusion
The type and rate of missingness affect the performance and suitability of imputation methods. The RF-based imputation method performed best in most of the tested scenarios, including combinations of different types and rates of missingness. We therefore recommend random forest-based imputation for missing metabolomics data, especially when the types of missingness are not known in advance.
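A hedged sketch of one round of such an evaluation, using the missForest package as the RF imputation engine (the abstract does not name its exact toolchain) and NRMSE computed over the artificially removed entries:

```r
# Introduce MCAR missingness, impute with random forests, score by NRMSE.
library(missForest)

set.seed(1)
complete <- as.matrix(iris[, 1:4])            # stand-in for a feature matrix
holes    <- prodNA(complete, noNA = 0.10)     # remove 10% of entries at random

imputed <- as.matrix(missForest(holes)$ximp)

miss  <- is.na(holes)
nrmse <- sqrt(mean((imputed[miss] - complete[miss])^2) / var(complete[miss]))
nrmse                                         # 0 = perfect recovery
```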
Journal Article
Growthcurver: an R package for obtaining interpretable metrics from microbial growth curves
by Wagner, Andreas; Sprouffske, Kathleen
in Algorithms; Applications software; Bacteria - growth & development
2016
Background
Plate readers can measure the growth curves of many microbial strains in a high-throughput fashion. The hundreds of absorbance readings collected simultaneously for hundreds of samples create technical hurdles for data analysis.
Results
Growthcurver summarizes the growth characteristics of microbial growth curve experiments conducted in a plate reader. The data are fitted to a standard form of the logistic equation, and the resulting parameters have clear interpretations in terms of population-level characteristics such as doubling time, carrying capacity, and growth rate.
Conclusions
Growthcurver is an easy-to-use R package available for installation from the Comprehensive R Archive Network (CRAN). The source code is available under the GNU General Public License and can be obtained from GitHub (Sprouffske K, Growthcurver source code, 2016).
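The fitted model is the standard logistic form N(t) = K / (1 + ((K − N0)/N0) e^(−rt)). A hedged usage sketch follows; SummarizeGrowth() and the vals fields are taken from growthcurver's documented interface, but the readings below are fabricated:

```r
# Fit one well's (fabricated) absorbance readings and read off growth metrics.
library(growthcurver)

time_h <- seq(0, 24, by = 1)
od <- 0.05 + 0.9 / (1 + exp(-0.4 * (time_h - 10))) +   # logistic-shaped signal
      rnorm(length(time_h), sd = 0.01)                 # plus measurement noise

fit <- SummarizeGrowth(time_h, od)
fit$vals$r      # intrinsic growth rate
fit$vals$k      # carrying capacity
fit$vals$t_gen  # doubling time
```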
Journal Article
A comparison of reference-based algorithms for correcting cell-type heterogeneity in Epigenome-Wide Association Studies
by Breeze, Charles E.; Zheng, Shijie C.; Teschendorff, Andrew E.
in Algorithms; Bioinformatics; Biomedical and Life Sciences
2017
Background
Intra-sample cellular heterogeneity presents numerous challenges to the identification of biomarkers in large Epigenome-Wide Association Studies (EWAS). While a number of reference-based deconvolution algorithms have emerged, their potential remains underexplored and a comparative evaluation of these algorithms beyond tissues such as blood is still lacking.
Results
Here we present a novel framework for reference-based inference, which leverages cell-type-specific DNase Hypersensitive Site (DHS) information from the NIH Epigenomics Roadmap to construct an improved reference DNA methylation database. We show that this leads to a marginal but statistically significant improvement of cell-count estimates in whole blood as well as in mixtures involving epithelial cell types. Using this framework, we compare a widely used state-of-the-art reference-based algorithm (constrained projection) to two non-constrained approaches: CIBERSORT and a method based on robust partial correlations. We conclude that the widely used constrained projection technique may not always be optimal. Instead, we find that the method based on robust partial correlations is generally more robust across a range of different tissue types and for realistic noise levels. We call the combined algorithm, which uses DHS data and robust partial correlations for inference, EpiDISH (Epigenetic Dissection of Intra-Sample Heterogeneity). Finally, we demonstrate the added value of EpiDISH in an EWAS of smoking.
Conclusions
Estimating cell-type fractions and subsequent inference in EWAS may benefit from the use of non-constrained reference-based cell-type deconvolution methods.
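A hedged sketch with the EpiDISH Bioconductor package: epidish() with method = "RPC" matches the package's documented interface, though the reference object name should be treated as an assumption; the mixtures are simulated so the example is self-contained.

```r
# Deconvolve simulated blood mixtures with robust partial correlations (RPC).
library(EpiDISH)

data(centDHSbloodDMC.m)                   # DHS-based blood reference (package data)
ref <- centDHSbloodDMC.m

set.seed(1)
w <- matrix(runif(ncol(ref) * 5), ncol(ref), 5)   # 5 random mixing weight vectors
w <- sweep(w, 2, colSums(w), "/")                 # fractions sum to 1 per sample
beta.m <- pmin(pmax(ref %*% w, 0), 1)             # mixture methylation profiles

out <- epidish(beta.m = beta.m, ref.m = ref, method = "RPC")
round(out$estF, 2)                                # estimated cell-type fractions
```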
Journal Article
IPO: a tool for automated optimization of XCMS parameters
2015
Background
Untargeted metabolomics generates a huge amount of data. Software packages for automated data processing are crucial for successfully processing these data. A variety of such software packages exist, but the outcome of data processing strongly depends on algorithm parameter settings. If not carefully chosen, suboptimal parameter settings can easily lead to biased results. Therefore, parameter settings also require optimization. Several parameter optimization approaches have been proposed, but a software package for parameter optimization that is free of intricate experimental labeling steps, fast, and widely applicable is still missing.
Results
We implemented the software package IPO ('Isotopologue Parameter Optimization'), which is fast, free of labeling steps, and applicable to data from different kinds of samples, from different methods of liquid chromatography coupled to high-resolution mass spectrometry, and from different instruments.
IPO optimizes XCMS peak picking parameters by using natural, stable ¹³C isotopic peaks to calculate a peak picking score. Retention time correction is optimized by minimizing relative retention time differences within peak groups. Grouping parameters are optimized by maximizing the number of peak groups that show one peak from each injection of a pooled sample. The different parameter settings are generated by design of experiments, and the resulting scores are evaluated using response surface models. IPO was tested on three different datasets, each consisting of a training set and a test set. IPO resulted in an increase in reliable groups (146%–361%), a decrease in non-reliable groups (3%–8%), and a reduction of the retention time deviation to one third.
Conclusions
IPO was successfully applied to data derived from liquid chromatography coupled to high-resolution mass spectrometry from three studies with different sample types and different chromatographic methods and devices. We were also able to show the potential of IPO to increase the reliability of metabolomics data.
The source code is implemented in R, tested on Linux and Windows, and freely available for download at https://github.com/glibiseller/IPO. The training sets and test sets can be downloaded from https://health.joanneum.at/IPO.
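A hedged sketch of a minimal IPO run, loosely following the shape of the package vignette; the function and argument names are from IPO's documented API but should be treated as assumptions, and pooled_qc/ is a hypothetical directory of mzML files.

```r
# Optimize XCMS centWave peak picking parameters over a pooled-QC file set.
library(IPO)

params <- getDefaultXcmsSetStartingParams("centWave")  # starting parameter ranges
params$ppm <- c(5, 40)                                 # range to optimize over

files <- list.files("pooled_qc", pattern = "mzML$", full.names = TRUE)
opt <- optimizeXcmsSet(files = files, params = params)
opt$best_settings$parameters                           # optimized settings
```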
Journal Article
CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests
2017
Background
The random forests algorithm is a classifier with notable universality, a wide application range, and robustness against overfitting, but it still has some drawbacks. To improve the performance of random forests, this paper therefore addresses imbalanced data processing, feature selection, and parameter optimization.
Results
We propose the CURE-SMOTE algorithm for the imbalanced data classification problem. Experiments on imbalanced UCI data reveal that combining Clustering Using Representatives (CURE) with the original synthetic minority oversampling technique (SMOTE) is effective compared with the classification results on the original data and those obtained using random sampling, Borderline-SMOTE1, safe-level SMOTE, C-SMOTE, and k-means-SMOTE. Additionally, a hybrid RF (random forests) algorithm is proposed for feature selection and parameter optimization, using the minimum out-of-bag (OOB) error as its objective function. Simulation results on binary and higher-dimensional data indicate that the proposed hybrid RF algorithms (hybrid genetic-random forests, hybrid particle swarm-random forests, and hybrid fish swarm-random forests) achieve the minimum OOB error and show the best generalization ability.
Conclusion
The training set produced by the proposed CURE-SMOTE algorithm is closer to the original data distribution because it contains minimal noise; thus, better classification results are produced by this feasible and effective algorithm. Moreover, the hybrid algorithms' F-value, G-mean, AUC, and OOB scores show that they surpass the original RF algorithm. Hence, the hybrid algorithm provides a new way to perform feature selection and parameter optimization.
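For orientation, the plain SMOTE step that CURE-SMOTE builds on can be sketched from scratch in a few lines of R; this shows the basic minority-interpolation idea only, not the paper's CURE-enhanced variant.

```r
# Synthesize one minority sample by interpolating toward a nearest neighbor.
smote_one <- function(X, k = 5) {
  i  <- sample(nrow(X), 1)              # pick a random minority sample
  d  <- as.matrix(dist(X))[i, ]
  nn <- order(d)[2:(k + 1)]             # its k nearest minority neighbors
  j  <- sample(nn, 1)
  X[i, ] + runif(1) * (X[j, ] - X[i, ]) # a random point on the joining segment
}

set.seed(1)
minority  <- as.matrix(iris[iris$Species == "virginica", 1:4])
synthetic <- t(replicate(20, smote_one(minority)))   # 20 synthetic minority rows
```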
Journal Article
ROBOT: A Tool for Automating Ontology Workflows
by Douglass, Eric; Harris, Nomi L.; Mungall, Christopher J.
in Algorithms; Automation; Backup software
2019
Background
Ontologies are invaluable in the life sciences, but building and maintaining ontologies often requires a challenging number of distinct tasks such as running automated reasoners and quality control checks, extracting dependencies and application-specific subsets, generating standard reports, and generating release files in multiple formats. Similar to more general software development, automation is the key to executing and managing these tasks effectively and to releasing more robust products in standard forms.
For ontologies using the Web Ontology Language (OWL), the OWL API Java library is the foundation for a range of software tools, including the Protégé ontology editor. In the Open Biological and Biomedical Ontologies (OBO) community, we recognized the need to package a wide range of low-level OWL API functionality into a library of common higher-level operations and to make those operations available as a command-line tool.
Results
ROBOT (a recursive acronym for “ROBOT is an OBO Tool”) is an open-source library and command-line tool for automating ontology development tasks. The library can be called from any programming language that runs on the Java Virtual Machine (JVM). Most usage is through the command-line tool, which runs on macOS, Linux, and Windows. ROBOT provides ontology processing commands for a variety of tasks, including converting formats, running a reasoner, creating import modules, and running reports. These commands can be combined into larger workflows using a separate task execution system such as GNU Make, and workflows can be automatically executed within continuous integration systems.
Conclusions
ROBOT supports automation of a wide range of ontology development tasks, focusing on OBO conventions. It packages common high-level ontology development functionality into a convenient library, and makes it easy to configure, combine, and execute individual tasks in comprehensive, automated workflows. This helps ontology developers to efficiently create, maintain, and release high-quality ontologies, so that they can spend more time focusing on development tasks. It also helps guarantee that released ontologies are free of certain types of logical errors and conform to standard quality control checks, increasing the overall robustness and efficiency of the ontology development lifecycle.
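A hedged sketch of such a workflow, driven from R via system2(); the subcommands (reason, report, convert) and flags follow ROBOT's documentation, but the file names are invented and the robot executable is assumed to be on the PATH. In practice the same chain is usually written as GNU Make rules.

```r
# Reason over an edit file, run a QC report, and convert the release to OBO.
system2("robot", c("reason", "--reasoner", "ELK",
                   "--input", "ont-edit.owl", "--output", "ont-reasoned.owl"))
system2("robot", c("report", "--input", "ont-reasoned.owl",
                   "--output", "report.tsv"))    # standard quality-control report
system2("robot", c("convert", "--input", "ont-reasoned.owl",
                   "--output", "ont.obo"))       # release artifact in OBO format
```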
Journal Article
Integrating omics datasets with the OmicsPLS package
by Houwing-Duistermaat, Jeanine; Kiełbasa, Szymon M.; Uh, Hae-Won
in Algorithms; Analysis; Bioinformatics
2018
Background
With the exponential growth in available biomedical data, there is a need for data integration methods that can extract information about relationships between the data sets. However, these data sets might have very different characteristics, and for interpretable results, data-specific variation needs to be quantified. For this task, Two-way Orthogonal Partial Least Squares (O2PLS) has been proposed. To facilitate application and development of the methodology, free and open-source software is required; however, such software has been lacking for O2PLS.
Results
We introduce OmicsPLS, an open-source implementation of the O2PLS method in R. It can handle both low- and high-dimensional datasets efficiently. Generic methods for inspecting and visualizing results are implemented. Both a standard and a faster alternative cross-validation method are available for determining the number of components. A simulation study shows good performance of OmicsPLS compared to alternatives in terms of accuracy and CPU runtime. We demonstrate OmicsPLS by integrating genetic and glycomic data.
Conclusions
We propose the OmicsPLS R package: a free and open-source implementation of O2PLS for statistical data integration. OmicsPLS is available at https://cran.r-project.org/package=OmicsPLS and can be installed in R via install.packages("OmicsPLS").
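A hedged sketch of an O2PLS fit; o2m() with n joint and nx/ny data-specific components follows the package's documented interface, and the two matrices are simulated stand-ins for the genetic and glycomic sets.

```r
# Fit O2PLS on simulated paired data sets and inspect the decomposition.
library(OmicsPLS)

set.seed(1)
X <- matrix(rnorm(50 * 100), 50, 100)    # e.g. genetic features, 50 samples
Y <- matrix(rnorm(50 * 20),  50, 20)     # e.g. glycomic features, same samples
Y[, 1:5] <- Y[, 1:5] + X[, 1:5]          # induce some shared (joint) variation

fit <- o2m(X, Y, n = 2, nx = 1, ny = 1)  # 2 joint, 1 X-specific, 1 Y-specific
summary(fit)                             # variance explained per part
```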
Journal Article