7 result(s) for "High-dimensional statistics and omics data analysis"
A clustering-stratified cross-validation framework for validating omics survival models: application to head and neck cancer
Background: This study tackles the challenge of developing reliable prognostic models for time-to-event (TTE) outcomes using high-dimensional omics data in head and neck cancers. Resampling methods, particularly nested cross-validation, are considered standard for model hyperparameter selection and performance evaluation. When handling clustered data, balancing the random partition of the cross-validation folds is a candidate strategy for minimizing optimism bias and instability. This work compares the performance of three nested cross-validation implementations: random assignment of the folds, clustering-based resampling, and internal-external validation using a hold-out approach. Method: We analyzed two head and neck squamous cell carcinoma (HNSCC) cohorts, The Cancer Genome Atlas (TCGA) and SCANDARE (NCT03017573), with clinical data and transcriptomic data normalized as log-transcripts per million. Three model selection methods (LASSO, IPF-Lasso, and Priority-Lasso) were evaluated within five nested cross-validation frameworks: standard nested cross-validation; clustering-based nested cross-validation; nested cross-validation with ComBat correction; nested cross-validation for optimization combined with hold-out for validation; and nested cross-validation for optimization combined with hold-out and ComBat correction for validation. Predictive performance was assessed using 3-year AUC and Integrated Brier Score (IBS). Results: We analyzed data from 581 patients (mean age 61.0 years, 33.6% female) across TCGA-HNSC (n = 505) and SCANDARE (n = 76). Clustering analyses, using UMAP and k-means, identified three transcriptomic clusters. With stratified nested cross-validation (SNCV), the validation strategies showed reduced instability for LASSO (p < 0.001), IPF-Lasso (p < 0.001) and Priority-Lasso (p < 0.001) without apparent optimism in discrimination and calibration metrics, supporting its utility. As an application using IPF-Lasso Cox models with SNCV, we integrated clinical and transcriptomic data, selecting 35 prognostic variables for head and neck carcinomas. This model achieved a 3-year AUC of 0.71 and IBS of 0.08. Conclusion: Clustering-based nested cross-validation combined with stratified cross-validation offers a robust compromise for developing high-dimensional survival models and evaluating their predictive performance. This approach leverages clustering-derived stratification to balance heterogeneity in the dataset within cross-validation folds, although the training and test sets remain derived from the pooled dataset rather than fully independent cohorts.
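The idea of balancing cluster membership across cross-validation folds can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cluster_stratified_folds`, the round-robin assignment rule, and the toy labels are assumptions made for illustration, and the nested hyperparameter loop and the survival models themselves are omitted.

```python
import random
from collections import defaultdict


def cluster_stratified_folds(cluster_labels, n_folds=5, seed=0):
    """Assign every sample to a fold so that each cluster is spread evenly.

    cluster_labels: one cluster label per sample (e.g. from k-means).
    Returns a list of fold indices, one per sample.
    """
    rng = random.Random(seed)

    # Group sample indices by their cluster label.
    by_cluster = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        by_cluster[label].append(idx)

    folds = [None] * len(cluster_labels)
    offset = 0  # carried across clusters so overall fold sizes stay balanced
    for label in sorted(by_cluster):
        members = by_cluster[label]
        rng.shuffle(members)  # random order within the cluster
        for pos, idx in enumerate(members):
            folds[idx] = (offset + pos) % n_folds  # deal out round-robin
        offset += len(members)
    return folds


# Toy example: 15 samples in three hypothetical transcriptomic clusters.
labels = [0] * 7 + [1] * 5 + [2] * 3
folds = cluster_stratified_folds(labels, n_folds=3, seed=1)
for k in range(3):
    in_fold = [labels[i] for i, f in enumerate(folds) if f == k]
    print(f"fold {k}: size={len(in_fold)}, clusters={sorted(in_fold)}")
```

Each fold then serves once as the test set, with the remaining folds used for training. Because samples are dealt out cluster by cluster, every fold receives a near-equal share of each cluster.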
3D IntelliGenes: AI/ML application using multi-omics data for biomarker discovery and disease prediction with multi-dimensional visualization
Background: Cutting-edge artificial intelligence (AI) and machine learning (ML) techniques have proven effective at uncovering knowledge about disease-causing biomarkers and the biological underpinnings of a plethora of human diseases. However, the high-dimensional nature of multi-omics data presents numerous challenges in its effective presentation, annotation, and interpretation. Traditional 2D visualizations often fall short in capturing the intricate relationships between multi-omics features, hindering our ability to identify meaningful correlations. Methods: In this study, we focused on addressing such challenges by developing an innovative solution to better visualize results produced by AI/ML approaches on integrated clinical and multi-omics data for novel biomarker discovery and predictive analysis. We present 3D IntelliGenes, an advanced version of our previously published software with intuitive and interactive multi-dimensional visualizations of multi-omics data. It offers deeper insights, most importantly by capturing greater variability in the patient data by understanding both linear and non-linear structures, evaluating AI/ML model performance, and delineating the joint impact of biomarkers on the corresponding disease states. Results: The overall functionality of 3D IntelliGenes is divided into two modules, data clustering and feature plotting. The data clustering module creates configurable 3D scatter plots to visualize the structure-preserving distribution of disease states, AI/ML classifier bias in the form of type I/II errors, and patient similarity through a robust density-driven clustering algorithm. The feature plotting module supports the joint analysis of pairs of multi-omics features to analyze the interdependence and discriminative power of co-expressed biomarkers. Conclusion: We report the evaluated performance of 3D IntelliGenes using diverse cohorts of patients with cardiovascular and other diseases.
A novel statistical feature selection framework for biomarker discovery and cancer classification via multiomics integration
Background: Early cancer diagnosis is essential for improving prognosis and guiding treatment. However, the high dimensionality and complexity of omics data present major challenges. Computational approaches that extract stable biomarkers and enable reliable classification across cancer types and stages are needed. Methods: A novel feature selection method, sDCFE (synergistic Discriminative Cluster-based Feature Extraction), was developed by extending Fisher-like variance analysis with a median absolute deviation (MAD) regularization term and a cluster separation component to enhance robustness and interpretability. Features selected by sDCFE were compared with those obtained from XGBoost, and the intersected set of 82 genes was evaluated through functional enrichment (KEGG, Reactome, GO BP), survival analysis (Kaplan–Meier, Cox regression), and biomarker novelty assessment against six external resources. Hybrid classification models integrating XGBoost, sDCFE, and deep learning were applied to pancancer classification, and the framework was further extended to lung squamous cell carcinoma (LUSC) staging using RNA-seq and methylation data. Results: The overlap between sDCFE and XGBoost yielded 82 candidate biomarkers enriched in cancer-related pathways, including cell cycle regulation, immune signalling, and DNA repair. Novelty assessment stratified these genes into established, emerging, and novel categories. Six genes (HFE2, LOC339674, SERINC2, SFTA3, SOX2OT, and ACPP) emerged as the most promising candidates, supported by enrichment and survival associations across multiple cancers. The hybrid model achieved near-perfect pancancer classification on TCGA (accuracy = 99.3%, MCC = 0.992, AUC = 1.0) and demonstrated strong generalizability on PCAWG (accuracy = 94%, MCC = 0.929, AUC = 0.997). In the LUSC staging task, multiomics integration improved classification performance: the CNN-based model reached 84% accuracy, while logistic regression applied to sDCFE-ranked features achieved 88.5% accuracy with superior calibration, highlighting the robustness of the selected features. Conclusion: sDCFE provides a principled extension of Fisher-like methods, enabling stable and interpretable biomarker selection. When combined with XGBoost and deep learning, the framework achieves highly accurate and biologically grounded cancer classification across both cancer types and stages. The identification of novel and prognostic biomarkers, including HFE2, LOC339674, SERINC2, SFTA3, SOX2OT, and ACPP, underscores its translational potential. These results position the framework as a promising precision oncology tool to support early diagnosis, risk stratification, and treatment decision-making.
PAM clustering algorithm based on mutual information matrix for ATR-FTIR spectral feature selection and disease diagnosis
ATR-FTIR spectral data represent a valuable source of information in a wide range of pathologies, including neurological disorders, and can be used for disease discrimination. To this end, potential spectral biomarkers must be identified among all possible candidates, but the amount of information characterizing the spectral dataset and the redundancy among the data can make the selection of the most informative features cumbersome. Here, a novel approach is proposed to perform feature selection based on redundant information among spectral data. In particular, we consider the Partitioning Around Medoids (PAM) algorithm applied to a dissimilarity matrix obtained from a mutual information measure, in order to obtain groups of variables (wavenumbers) having similar patterns of pairwise dependence. An advantage of this grouping algorithm over other, more widely used clustering methods is that it facilitates the interpretation of results, since the centre of each cluster, the so-called medoid, corresponds to an observed data point. As a consequence, each medoid can be considered representative of all the wavenumbers belonging to the same cluster and retained in the subsequent statistical methods for disease prediction. An application to real data is finally reported to show the ability of the proposed approach to discriminate between patients affected by multiple sclerosis and healthy subjects.
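The grouping step described above can be sketched in a few lines. The abstract does not say how mutual information is turned into a dissimilarity or how the spectra are discretised, so the normalisation `1 - MI / max(H(x), H(y))`, the naive medoid initialisation (in place of PAM's BUILD phase), and the toy binary variables below are all assumptions for illustration only.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def mi_dissimilarity(x, y):
    """Assumed normalisation: 0 when fully dependent, 1 when independent."""
    denom = max(mutual_information(x, x), mutual_information(y, y))  # entropies
    if denom == 0:
        return 0.0  # both variables are constant
    return 1.0 - mutual_information(x, y) / denom


def total_cost(dist, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam(dist, k):
    """Simplified PAM: naive start, then greedy medoid swaps until no gain."""
    n = len(dist)
    medoids = list(range(k))  # naive initialisation, not the BUILD phase
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, labels


# Toy "wavenumbers": v0/v1 carry the same information, as do v2/v3.
variables = [
    [0, 0, 1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 0, 1, 0],
]
dist = [[mi_dissimilarity(a, b) for b in variables] for a in variables]
medoids, labels = pam(dist, k=2)
print("medoids:", medoids, "labels:", labels)
```

Each returned medoid is the index of an observed variable, so it can be kept directly as the representative feature of its cluster in a downstream classifier.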
Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges
Background: In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions. Methods: Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 “High-dimensional data” of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD. Results: The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided. Conclusions: This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.
Statistical learning approaches in the genetic epidemiology of complex diseases
In this paper, we give an overview of methodological issues related to the use of statistical learning approaches when analyzing high-dimensional genetic data. The focus is set on regression models and machine learning algorithms taking genetic variables as input and returning a classification or a prediction for the target variable of interest; for example, the present or future disease status, or the future course of a disease. After briefly explaining the basic motivation and principle of these methods, we review different procedures that can be used to evaluate the accuracy of the obtained models and discuss common flaws that may lead to over-optimistic conclusions with respect to their prediction performance and usefulness.
Multiset sparse partial least squares path modeling for high dimensional omics data analysis
Background: Recent technological developments have enabled the measurement of a plethora of biomolecular data from various omics domains, and research is ongoing on statistical methods to leverage these omics data to better model and understand biological pathways and genetic architectures of complex phenotypes. Current reviews report that the simultaneous analysis of multiple (i.e. three or more) high-dimensional omics data sources is still challenging and suitable statistical methods are unavailable. Frequently mentioned challenges are the failure to account for the hierarchical structure between omics domains and the difficulty of interpreting genome-wide results. This study aims to address these challenges. We propose multiset sparse Partial Least Squares path modeling (msPLS), a generalized penalized form of Partial Least Squares path modeling, for the simultaneous modeling of biological pathways across multiple omics domains. msPLS simultaneously models the effect of multiple molecular markers, from multiple omics domains, on the variation of multiple phenotypic variables, while accounting for the relationships between data sources, and provides sparse results. The sparsity in the model helps to provide interpretable results from analyses of hundreds of thousands of biomolecular variables. Results: With simulation studies, we quantified the ability of msPLS to discover associated variables among high-dimensional data sources. Furthermore, we analysed high-dimensional omics datasets to explore biological pathways associated with Marfan syndrome and with Chronic Lymphocytic Leukaemia. Additionally, we compared the results of msPLS to the results of Multi-Omics Factor Analysis (MOFA), which is an alternative method to analyse this type of data. Conclusions: msPLS is a multiset multivariate method for the integrative analysis of multiple high-dimensional omics data sources. It accounts for the relationship between multiple high-dimensional data sources while providing interpretable results through its sparse solutions. The biomarkers found by msPLS in the omics datasets can be interpreted in terms of biological pathways associated with the pathophysiology of Marfan syndrome and of Chronic Lymphocytic Leukaemia. Additionally, msPLS outperforms MOFA in terms of variation explained in the Chronic Lymphocytic Leukaemia dataset, while it identifies the two most important clinical markers for Chronic Lymphocytic Leukaemia. Availability: http://uva.csala.me/mspls and https://github.com/acsala/2018_msPLS