Catalogue Search | MBRL
Explore the vast range of titles available.
16 result(s) for "Zaim, Samir Rachid"
binomialRF: interpretable combinatoric efficiency of random forests to identify biomarker interactions
2020
Background
In this era of data science-driven bioinformatics, machine learning research has focused on feature selection, as users want more interpretation and post-hoc analyses for biomarker detection. However, when a study has more features (i.e., transcripts) than samples (i.e., mouse or human samples), biomarker detection poses major statistical challenges, as traditional statistical techniques are underpowered in high dimensions. Second- and third-order interactions of these features pose a substantial combinatoric dimensional challenge. In computational biology, random forest (RF) classifiers are widely used due to their flexibility, powerful performance, ability to rank features, and robustness to the "P >> N" high-dimensional limitation that many matrix regression algorithms face. We propose binomialRF, a feature selection technique for RFs that provides an alternative interpretation of features using a correlated binomial distribution and scales efficiently to analyze multiway interactions.
Results
In both simulations and validation studies using datasets from the TCGA and UCI repositories, binomialRF showed computational gains (5 to 300 times faster) while maintaining competitive precision and recall in identifying biomarkers' main effects and interactions. In two clinical studies, the binomialRF algorithm prioritizes previously published, relevant pathological molecular mechanisms (features) with high classification precision and recall, using features alone as well as their statistical interactions alone.
Conclusion
binomialRF extends previous methods for identifying interpretable features in RFs, bringing them together under a correlated binomial distribution to create an efficient hypothesis-testing algorithm that identifies biomarkers' main effects and interactions. Preliminary results in simulations demonstrate computational gains while retaining competitive model selection and classification accuracies. Future work will extend this framework to incorporate ontologies that provide pathway-level feature selection from gene expression input data.
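The core idea — testing how often a feature is chosen as a splitting variable against a binomial null — can be sketched in a few lines. This is an illustrative simplification (an independent-trials binomial test on root splits, with invented toy data), not the published binomialRF implementation, which models the correlation between trees:

```python
import numpy as np
from scipy.stats import binomtest
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data: 60 samples, 20 features, only 3 of them informative.
X, y = make_classification(n_samples=60, n_features=20,
                           n_informative=3, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Count how often each feature is chosen as the root split across trees.
root_counts = np.zeros(X.shape[1], dtype=int)
for tree in rf.estimators_:
    root_counts[tree.tree_.feature[0]] += 1

# Null hypothesis: every feature is equally likely to be the root split.
p_null = 1.0 / X.shape[1]
pvals = np.array([
    binomtest(int(k), n=len(rf.estimators_), p=p_null,
              alternative="greater").pvalue
    for k in root_counts
])

# Features with small p-values are selected far more often than chance.
selected = np.flatnonzero(pvals < 0.05 / X.shape[1])  # Bonferroni
print(sorted(root_counts[selected].tolist(), reverse=True))
```

The published method replaces this naive independence assumption with a correlated binomial distribution, since trees in a forest share training data and are not independent trials.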
Journal Article
Evaluating single-subject study methods for personal transcriptomic interpretations to advance precision medicine
by Lussier, Yves A.; Berghout, Joanne; Kenost, Colleen
in Analysis; Biomedical and Life Sciences; Biomedicine
2019
Background
Gene expression profiling has benefited medicine by providing clinically relevant insights at the molecular candidate and systems levels. However, to adopt a more 'precision' approach that integrates individual variability, including 'omics data, into risk assessments, diagnoses, and therapeutic decision making, whole-transcriptome expression needs to be interpreted meaningfully for single subjects. We propose an "all-against-one" framework that uses biological replicates in isogenic conditions for testing differentially expressed genes (DEGs) in a single subject (ss) in the absence of an appropriate external reference standard or replicates. To evaluate the proposed "all-against-one" framework, we construct reference standards (RSs) with five conventional replicate-anchored analyses (NOISeq, DEGseq, edgeR, DESeq, DESeq2), while the remaining samples were treated separately as single-subject sample pairs for ss analyses (without replicates).
Results
Eight ss methods (NOISeq, DEGseq, edgeR, mixture model, DESeq, DESeq2, iDEG, and ensemble) for identifying genes with differential expression were compared in Yeast (parental line versus snf2 deletion mutant; n = 42/condition) and in the MCF7 breast-cancer cell line (baseline versus stimulated with estradiol; n = 7/condition). Receiver-operator characteristic (ROC) and precision-recall plots were determined for the eight ss methods against each of the five RSs in both datasets. Consistent with prior analyses of these data, ~50% and ~15% DEGs were obtained in the Yeast and MCF7 datasets respectively, regardless of the RS method. NOISeq, edgeR, and DESeq were the most concordant for creating an RS. Single-subject versions of NOISeq, DEGseq, and an ensemble learner achieved the best median ROC area under the curve for comparing two transcriptomes without replicates, regardless of the RS method and dataset (> 90% in Yeast, > 0.75 in MCF7). Further, distinct single-subject methods perform better at different proportions of DEGs.
Conclusions
The "all-against-one" framework provides an honest evaluation framework for single-subject DEG studies, since these methods are evaluated, by design, against reference standards produced by unrelated DEG methods. The ss-ensemble method was the only one to reliably produce higher accuracies in all conditions tested in this conservative evaluation framework. However, single-subject methods for identifying DEGs from paired samples need improvement, as no method performed with precision > 90% while attaining moderate levels of recall.
http://www.lussiergroup.org/publications/EnsembleBiomarker
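The replicate-anchored consensus at the heart of the framework — treating a gene as a reference-standard DEG only when a majority of external methods call it — can be sketched with hypothetical calls. The method names follow the abstract; the call vectors and the single-subject result are invented for illustration:

```python
import numpy as np

# Hypothetical DEG calls (1 = differentially expressed) for ten genes
# from three replicate-anchored methods; the vectors are invented.
calls = {
    "NOISeq": np.array([1, 1, 0, 0, 1, 0, 1, 0, 0, 1], dtype=bool),
    "edgeR":  np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0], dtype=bool),
    "DESeq":  np.array([1, 0, 0, 0, 1, 0, 1, 1, 0, 1], dtype=bool),
}

# Consensus reference standard: majority vote across the external methods.
votes = np.sum(list(calls.values()), axis=0)
reference = votes >= 2
print(reference.astype(int))  # [1 1 0 0 1 0 1 0 0 1]

# A single-subject method, evaluated "all-against-one" against this
# independent reference, can then be scored by precision and recall.
ss_calls = np.array([1, 1, 0, 0, 1, 1, 1, 0, 0, 0], dtype=bool)
tp = np.sum(ss_calls & reference)
precision = tp / ss_calls.sum()
recall = tp / reference.sum()
print(round(precision, 2), round(recall, 2))  # 0.8 0.8
```

The key design point is that the method under evaluation never contributes to the reference standard it is scored against.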
Journal Article
MOCHA’s advanced statistical modeling of scATAC-seq data enables functional genomic inference in large human cohorts
by McGrath, Imran; Li, Xiao-jun; Pebworth, Mark-Phillip
in 631/114/1314; 631/114/2114; 631/114/2415
2024
Single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) is being increasingly used to study gene regulation. However, major analytical gaps limit its utility in studying gene regulatory programs in complex diseases. In response, MOCHA (Model-based single cell Open CHromatin Analysis) presents major advances over existing analysis tools, including: 1) improved identification of sample-specific open chromatin, 2) statistical modeling of technical drop-out with zero-inflated methods, 3) mitigation of false positives in single-cell analysis, 4) identification of alternative transcription start site regulation, and 5) modules for inferring temporal gene regulatory networks from longitudinal data. These advances, in addition to open chromatin analyses, provide a robust framework after quality control and cell labeling to study gene regulatory programs in human disease. We benchmark MOCHA against four state-of-the-art tools to demonstrate its advances. We also construct cross-sectional and longitudinal gene regulatory networks, identifying potential mechanisms of COVID-19 response. MOCHA provides researchers with a robust analytical tool for functional genomic inference from scATAC-seq data.
Analytical gaps limit the utility of scATAC-seq for studying gene regulatory programs in human disease. Here, authors describe MOCHA, a robust analytical tool with advanced statistical modelling that enables functional genomic inference in large cross-sectional and longitudinal human studies.
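The zero-inflation idea behind the drop-out modeling can be illustrated with a toy simulation. This is a sketch of the general statistical notion, not MOCHA's actual model; the parameters are invented, and the Poisson rate `lam` is treated as known purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate zero-inflated counts, mimicking technical drop-out in
# scATAC-seq: with probability pi a value drops out to zero,
# otherwise it follows a Poisson(lam) distribution.
pi, lam, n = 0.4, 3.0, 100_000
dropout = rng.random(n) < pi
counts = np.where(dropout, 0, rng.poisson(lam, n))

# With lam treated as known, invert the zero-probability identity
#   P(count = 0) = pi + (1 - pi) * exp(-lam)
# to recover the drop-out rate from the observed fraction of zeros.
p0 = (counts == 0).mean()
pi_hat = (p0 - np.exp(-lam)) / (1 - np.exp(-lam))
print(round(pi_hat, 2))  # close to the true pi = 0.4
```

A full zero-inflated model would estimate `pi` and `lam` jointly by maximum likelihood; the point here is only that excess zeros beyond the Poisson expectation carry recoverable information about technical drop-out.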
Journal Article
Correction to: binomialRF: interpretable combinatoric efficiency of random forests to identify biomarker interactions
by Lussier, Yves A.; Berghout, Joanne; Chiu, Wesley
in Algorithms; Bioinformatics; Biomedical and Life Sciences
2020
[...] iterative random forests (iRF) [58] identify decision paths along random forests and capture their prevalence, thereby benefitting from a combinatoric feature-space reduction in the interaction search.
Rights and permissions: Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
Samir Rachid Zaim, Colleen Kenost, Joanne Berghout, Wesley Chiu, Liam Wilson, Hao Helen Zhang & Yves A. Lussier. BMC Bioinformatics 21, 495 (2020). Published: 02 November 2020. https://doi.org/10.1186/s12859-020-03822-w. The original article was published in BMC Bioinformatics 2020 21:374.
Journal Article
Personalized beyond Precision: Designing Unbiased Gold Standards to Improve Single-Subject Studies of Personal Genome Dynamics from Gene Products
2020
Background: Developing patient-centric baseline standards that enable the detection of clinically significant outlier gene products on a genome scale remains an unaddressed challenge required for advancing personalized medicine beyond the small pools of subjects implied by "precision medicine". This manuscript proposes a novel approach to reference standard development for evaluating the accuracy of single-subject analyses of transcriptomes, and offers extensions into proteomes and metabolomes. In evaluation frameworks for which the distributional assumptions of statistical testing imperfectly model genome dynamics of gene products, artefacts and biases are confounded with authentic signals. Model confirmation biases escalate when studies use the same analytical methods in the discovery sets and reference standards; in such studies, replicated biases are confounded with measures of accuracy. We hypothesized that developing method-agnostic reference standards would reduce such replication biases. We propose to evaluate discovery methods with a reference standard derived from a consensus of analytical methods distinct from the discovery one, to minimize statistical artefact biases. Our methods involve thresholding effect size and filtering results by expression level to improve consensus between analytical methods. We developed and released an R package, "referenceNof1", to facilitate the construction of robust reference standards. Results: Since RNA-Seq data analysis methods range from binomial and negative binomial assumptions to non-parametric analyses, their differences create statistical noise and make the reference standards method-dependent. In our experimental design, the accuracy of 30 distinct combinations of fold changes (FC) and expression counts (hereinafter "expression") was determined for five types of RNA analyses in two different datasets.
This design was applied to two distinct datasets: breast-cancer cell lines and a yeast study with isogenic biological replicates in two experimental conditions. Furthermore, the reference standard (RS) comprised all RNA analytical methods with the exception of the method whose accuracy was being tested. To mitigate biases towards a specific analytical method, the pairwise Jaccard Concordance Index between the observed results of distinct analytical methods was calculated for optimization. Optimization through thresholding effect size and expression level reduced the greatest discordances between distinct methods' analytical results and yielded a 65% increase in concordance. Conclusions: We have demonstrated that comparing the accuracies of different single-subject analysis methods for clinical optimization in transcriptomics requires a new evaluation framework. Reliable and robust reference standards, independent of the evaluated method, can be obtained under a limited number of parameter combinations: fold change (FC) range thresholds, expression-level cutoffs, and exclusion of the tested method from the RS development process. When applying anticonservative reference standard frameworks (e.g., using the same method for RS development and prediction), most of the concordant signal between prediction and Gold Standard (GS) cannot be confirmed by other methods, which we conclude reflects biased results. Statistical tests to determine DEGs from a single-subject study generate many biased results, requiring subsequent filtering to increase reliability. Conventional single-subject studies pertain to one or a few patients' measures over time and require a substantial extension of the conceptual framework to address the numerous measures in genome-wide analyses of gene products. The proposed referenceNof1 framework addresses some of the inherent challenges of improving transcriptome-scale single-subject analyses by providing a robust approach to constructing reference standards.
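The pairwise Jaccard Concordance Index used in the optimization is simply intersection over union of the DEG sets two methods report. A minimal sketch with invented gene lists:

```python
def jaccard(a, b):
    """Jaccard concordance: |A ∩ B| / |A ∪ B| for two DEG sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical DEG lists from two analytical methods.
degs_m1 = {"GENE1", "GENE2", "GENE3", "GENE5"}
degs_m2 = {"GENE2", "GENE3", "GENE4"}
print(round(jaccard(degs_m1, degs_m2), 2))  # 2 shared / 5 in union = 0.4
```

Thresholding fold change and expression level shrinks each method's DEG list toward its most confident calls, which is what drives the reported gain in pairwise concordance.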
Journal Article
Trimodal single-cell profiling reveals a novel pediatric CD8αα+ T cell subset and broad age-related molecular reprogramming across the T cell compartment
by Henrickson, Sarah E.; Graybuck, Lucas T.; Buckner, Jane H.
in 631/1647/2210/2211; 631/1647/514/1949; 631/1647/514/2254
2023
Age-associated changes in the T cell compartment are well described. However, limitations of current single-modal or bimodal single-cell assays, including flow cytometry, RNA-seq (RNA sequencing) and CITE-seq (cellular indexing of transcriptomes and epitopes by sequencing), have restricted our ability to deconvolve more complex cellular and molecular changes. Here, we profile >300,000 single T cells from healthy children (aged 11–13 years) and older adults (aged 55–65 years) by using the trimodal assay TEA-seq (single-cell analysis of mRNA transcripts, surface protein epitopes and chromatin accessibility), which revealed that molecular programming of T cell subsets shifts toward a more activated basal state with age. Naive CD4+ T cells, considered relatively resistant to aging, exhibited pronounced transcriptional and epigenetic reprogramming. Moreover, we discovered a novel CD8αα+ T cell subset lost with age that is epigenetically poised for rapid effector responses and has distinct inhibitory, costimulatory and tissue-homing properties. Together, these data reveal new insights into age-associated changes in the T cell compartment that may contribute to differential immune responses.
Using TEA-seq, Thomson et al. detail transcriptional and epigenetic alterations in the T cell compartment between healthy children and older adults, leading to the discovery of a novel pediatric CD8αα+ population poised for rapid effector responses.
Journal Article
Author Correction: Trimodal single-cell profiling reveals a novel pediatric CD8αα+ T cell subset and broad age-related molecular reprogramming across the T cell compartment
by Henrickson, Sarah E.; Graybuck, Lucas T.; Buckner, Jane H.
in 631/1647/2210/2211; 631/1647/514/1949; 631/1647/514/2254
2024
Journal Article
Interpretable and Robust Machine Learning for Precision Medicine
by Rachid Zaim, Samir
in Statistics
2021
This dissertation represents the unification of the body of research produced throughout my doctoral training, highlighting three major articles. These projects revolved around how refining and advancing algorithmic methodologies and frameworks in statistics and machine learning (ML) can improve experimental designs and analyses in genomics and transcriptomics, paving the road to interpretable and robust machine learning for precision medicine. The challenges of ML in the omics field lie in noisy signal-to-noise ratios and the curse of dimensionality. Throughout this dissertation, one constant theme is demonstrating how gene sets (ontologies) enable feature reduction and improve the signal-to-noise ratio. This dissertation can be succinctly described as ontology-anchored dimension reduction, combined with single-subject (N-of-1) analytics and machine learning applied to transcriptomics. The culmination of these projects is a final pilot study that brings these concepts together to create robust and interpretable machine learning classifiers for precision medicine that can be enriched to identify pathways and their interactions. In precision medicine, the goal is to deliver the right treatment, at the right time, for the right person. The aim of my doctoral research is to continue advancing precision medicine by developing cutting-edge statistical and machine learning software and frameworks to improve on the state-of-the-art technology available. Building upon the works of colleagues, advisors, and others, this dissertation represents comprehensive efforts from a variety of scientific domains such as informatics, computer science, biology, genetics, mathematics, and, last but not least, statistics. Common themes include experimental designs and evaluations, ontologies and knowledge graphs, large-scale significance testing, correlation structures, ensemble learners, and random forests.
The first chapter introduces the logistics of the scientific dissertation structure. In the second chapter, a numerical study illustrates the increased ability to detect individualized differential gene expression when signal is aggregated using gene ontologies to group genes by their biological processes. The third chapter borrows from machine learning and mathematics to optimize small-sample and single-subject studies in genomics, while Chapter 4 introduces a novel, effective, and scalable feature-selection machine learning algorithm that identifies differential gene products and interactions by combining random forests and correlated Bernoulli trials for large-scale hypothesis testing. The final chapter presents a pilot study that combines all these projects into a proof of concept of how to create robust and interpretable machine learning classifiers in small-sample studies for precision medicine. These techniques were all developed and applied to analyze Next Generation Sequencing (NGS) and RNA-sequencing data derived from samples in cohort studies, and their biological mechanisms were incorporated from gene ontologies. As is implicit in these works, they represent an interdisciplinary effort that is only possible in team science, allowing for creative solutions when the best minds in statistics, computer science, mathematics, biology, and medicine come together to work on the same problem.
Dissertation