Search Results

19 results for "regularized variable selection"
Is Seeing Believing? A Practitioner’s Perspective on High-Dimensional Statistical Inference in Cancer Genomics Studies
Variable selection methods have been extensively developed for and applied to cancer genomics data to identify important omics features associated with complex disease traits, including cancer outcomes. However, the reliability and reproducibility of the findings are in question if valid inferential procedures are not available to quantify the uncertainty of the findings. In this article, we provide a gentle but systematic review of high-dimensional frequentist and Bayesian inferential tools under sparse models which can yield uncertainty quantification measures, including confidence (or Bayesian credible) intervals, p-values, and false discovery rates (FDR). Connections between the two realms in high-dimensional inference have been fully exploited under the "unpenalized loss function + penalty term" formulation for regularization methods and the "likelihood function × shrinkage prior" framework for regularized Bayesian analysis. In particular, we advocate for robust Bayesian variable selection in cancer genomics studies due to its ability to accommodate disease heterogeneity in the form of heavy-tailed errors and structured sparsity while providing valid statistical inference. The numerical results show that robust Bayesian analysis incorporating exact sparsity has yielded not only superior estimation and identification results but also valid Bayesian credible intervals at nominal coverage probabilities compared with alternative methods, especially in the presence of heavy-tailed model errors and outliers.
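As a concrete instance of the "unpenalized loss function + penalty term" versus "likelihood function × shrinkage prior" correspondence this review describes, the lasso solution coincides with the posterior mode under independent Laplace shrinkage priors. This is a standard identity, not a result taken from the article itself:

```latex
\hat{\beta}_{\text{lasso}}
 = \arg\min_{\beta}\Big\{ \tfrac{1}{2\sigma^{2}}\lVert y - X\beta\rVert_2^{2}
     + \lambda \lVert\beta\rVert_1 \Big\}
 = \arg\max_{\beta}\;
   \underbrace{\exp\!\Big(-\tfrac{\lVert y - X\beta\rVert_2^{2}}{2\sigma^{2}}\Big)}_{\text{likelihood}}
   \times
   \underbrace{\prod_{j} \tfrac{\lambda}{2}\, e^{-\lambda\lvert\beta_j\rvert}}_{\text{Laplace (shrinkage) prior}}
```

Minimizing the penalized loss is thus the same computation as maximizing the posterior; the penalty term is the negative log-prior.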
Re-evaluation of the comparative effectiveness of bootstrap-based optimism correction methods in the development of multivariable clinical prediction models
Background Multivariable prediction models are important statistical tools for providing synthetic diagnosis and prognostic algorithms based on patients' multiple characteristics. Their apparent measures of predictive accuracy usually carry overestimation biases (known as 'optimism') relative to the actual performance in external populations. Existing statistical evidence and guidelines suggest that three bootstrap-based bias correction methods are preferable in practice, namely Harrell's bias correction and the .632 and .632+ estimators. Although Harrell's method has been widely adopted in clinical studies, simulation-based evidence indicates that the .632+ estimator may perform better than the other two methods. However, the actual comparative effectiveness of these methods is still unclear due to limited numerical evidence. Methods We conducted extensive simulation studies to compare the effectiveness of these three bootstrapping methods, particularly under various model building strategies: conventional logistic regression, stepwise variable selection, Firth's penalized likelihood method, ridge, lasso, and elastic-net regression. We generated the simulation data based on the Global Utilization of Streptokinase and Tissue plasminogen activator for Occluded coronary arteries (GUSTO-I) trial Western dataset and considered how events per variable, event fraction, number of candidate predictors, and the regression coefficients of the predictors impacted the performances. The internal validity of C-statistics was evaluated. Results Under relatively large sample settings (roughly, events per variable ≥ 10), the three bootstrap-based methods were comparable and performed well. However, all three methods had biases under small sample settings, and the directions and sizes of the biases were inconsistent. In general, Harrell's and the .632 methods had overestimation biases as the event fraction became larger, while the .632+ method had a slight underestimation bias when the event fraction was very small. Although the bias of the .632+ estimator was relatively small, its root mean squared error (RMSE) was comparable to, and sometimes larger than, those of the other two methods, especially for the regularized estimation methods. Conclusions In general, the three bootstrap estimators were comparable, but the .632+ estimator performed relatively well under small sample settings, except when regularized estimation methods were adopted.
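Harrell's bias correction, the first of the three estimators compared, follows a simple recipe: estimate the optimism as the average gap between a bootstrap model's apparent performance and its performance on the original data, then subtract it. A minimal sketch for a binary outcome, using the AUC as the C-statistic and a plain logistic model (the dataset, resample count, and model are illustrative assumptions, not the paper's setup):

```python
# Harrell's bootstrap optimism correction for the C-statistic (AUC).
# X, y: NumPy arrays (n_samples x n_features, binary labels).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def optimism_corrected_auc(X, y, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])  # optimistic estimate
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))          # bootstrap resample indices
        Xb, yb = X[idx], y[idx]
        if len(np.unique(yb)) < 2:                     # skip one-class resamples
            continue
        mb = LogisticRegression(max_iter=1000).fit(Xb, yb)
        auc_boot = roc_auc_score(yb, mb.predict_proba(Xb)[:, 1])  # apparent, on resample
        auc_orig = roc_auc_score(y, mb.predict_proba(X)[:, 1])    # tested on original data
        optimism.append(auc_boot - auc_orig)           # per-resample overestimation
    return apparent - np.mean(optimism)                # bias-corrected C-statistic
```

The .632 and .632+ estimators differ in that they score bootstrap models only on the observations left out of each resample and then reweight; the loop structure stays the same.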
Detecting prognostic biomarkers of breast cancer by regularized Cox proportional hazards models
Background The successful identification of breast cancer (BRCA) prognostic biomarkers is essential for strategic intervention in BRCA patients. Recently, various methods have been proposed for exploring a small prognostic gene set that can distinguish the high-risk group from the low-risk group. Methods Regularized Cox proportional hazards (RCPH) models were proposed to discover prognostic biomarkers of BRCA from gene expression data. First, the maximum connected network of 1142 genes was constructed by mapping 956 differentially expressed genes (DEGs) and 677 previously BRCA-related genes into the gene regulatory network (GRN). Then, the 72 union genes of the four feature gene sets identified by the Lasso-RCPH, Enet-RCPH, L0-RCPH and SCAD-RCPH models were recognized as robust prognostic biomarkers. These biomarkers were validated by literature checks, the BRCA-specific GRN and functional enrichment analysis. Finally, a prognostic risk score (PRS) index for BRCA was established based on univariate and multivariate Cox regression analyses. Survival analysis was performed to investigate the PRS on 1080 BRCA patients as internal validation. In particular, a nomogram was constructed to express the relationship between the PRS and other clinical information on the discovery dataset. The PRS was also verified on 1848 BRCA patients from ten external validation datasets or collected cohorts. Results The nomogram highlighted the importance of the PRS in guiding the prognosis of BRCA patients. In addition, the PRS of 301 normal samples and 306 tumor samples from five independent datasets was significantly higher in tumors than in normal tissues (P < 0.05). The protein expression profiles of the three genes involved in the PRS model, i.e., ADRB1, SAV1 and TSPAN14, demonstrated that the latter two genes are more strongly stained in tumor specimens. More importantly, the high-risk group had worse survival than the low-risk group (P < 0.05) in both the internal and external validations. Conclusions The proposed pipelines for detecting and validating prognostic biomarker genes for BRCA are effective and efficient. Moreover, the proposed PRS is very promising as an important indicator for judging the prognosis of BRCA patients.
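A minimal sketch of the core mechanic, a penalized Cox fit followed by a risk score, using lifelines' CoxPHFitter as a stand-in for the paper's RCPH models. The file name, column names, and penalty level are illustrative assumptions; the paper's L0 and SCAD penalties are not available in lifelines, which offers elastic-net style penalties only:

```python
# Lasso-penalized Cox model plus a prognostic risk score (PRS)-style index.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("expression_with_survival.csv")   # hypothetical: gene columns
                                                   # plus "time" and "event"
cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)     # l1_ratio=1.0 -> lasso penalty
cph.fit(df, duration_col="time", event_col="event")

selected = cph.params_[cph.params_.abs() > 1e-8]   # genes surviving the penalty
prs = cph.predict_log_partial_hazard(df)           # linear predictor = risk score
high_risk = prs > prs.median()                     # split for survival comparison
```

The union-of-feature-sets step in the paper amounts to repeating the fit with different penalties and taking the union of the `selected` indices.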
BHCox: Bayesian heredity-constrained Cox proportional hazards models for detecting gene-environment interactions
Background Gene-environment (G × E) interactions play a critical role in understanding the etiology of diseases and exploring the factors that affect disease prognosis. There are several challenges in detecting G × E interactions for censored survival outcomes, such as high dimensionality, the complexity of environmental effects, and the specificity of survival analysis. Effect heredity, which incorporates the dependence between main effects and interactions into the analysis, has been widely applied in the study of interaction detection. However, it has not yet been applied to Bayesian Cox proportional hazards models for detecting interactions for censored survival outcomes. Results In this study, we propose Bayesian heredity-constrained Cox proportional hazards (BHCox) models with novel spike-and-slab and regularized horseshoe priors that incorporate effect heredity to identify and estimate the main and interaction effects. The no-U-turn sampler (NUTS) algorithm, as implemented in the R package brms, was used to fit the proposed model. Extensive simulations were performed to evaluate and compare our proposed approaches with alternative models; the simulation studies illustrated that BHCox models outperform the alternatives. We applied the proposed method to real data on non-small-cell lung cancer (NSCLC) and identified biologically plausible G × smoking interactions associated with the prognosis of patients with NSCLC. Conclusions In summary, BHCox can be used to detect main effects and interactions and thus has significant implications for the discovery of high-dimensional interactions in censored survival outcome data.
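For readers unfamiliar with the priors named here, a generic spike-and-slab formulation with a strong-heredity constraint looks as follows. This is written from standard definitions; the paper's exact priors and hyperparameters may differ:

```latex
\beta_j \mid \gamma_j \;\sim\; \gamma_j\,\mathcal{N}(0,\tau^{2}) + (1-\gamma_j)\,\delta_0,
\qquad
\beta_{jE} \mid \gamma_{jE} \;\sim\; \gamma_{jE}\,\mathcal{N}(0,\tau^{2}) + (1-\gamma_{jE})\,\delta_0,
\qquad
\gamma_{jE} \le \gamma_j
```

Here \(\beta_j\) is the main effect of gene \(j\), \(\beta_{jE}\) its interaction with the environmental exposure \(E\), \(\delta_0\) a point mass at zero, and the \(\gamma\)'s binary inclusion indicators. The constraint \(\gamma_{jE} \le \gamma_j\) encodes strong heredity: an interaction can enter the model only if the corresponding main effect does.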
A novel wavelength interval selection based on split regularized regression for spectroscopic data
Wavelength selection has become a critical step in the analysis of near-infrared (NIR) spectroscopy data, which exhibit high collinearity and a large number of spectral variables. In this study, a novel wavelength interval selection method based on split regularized regression and partial least squares (SplitReg-PLS) is developed. SplitReg-PLS is a two-step approach that combines the advantages of the SplitReg and PLS methods. SplitReg has the useful property of splitting the variables into groups and pooling the regularized estimation of the regression coefficients within each group. PLS regression, one of the most popular methods for multivariate calibration, is then performed on the group variables selected by SplitReg. The SplitReg-PLS method can automatically select successive, strongly correlated and interpretable spectral variables related to the response, which provides a flexible framework for variable selection. The performance of the proposed procedure is evaluated on three real NIR datasets. The results indicate that SplitReg-PLS is a good wavelength interval selection strategy.
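A rough sketch of the two-step shape of such a pipeline. The grouping heuristic below (scoring contiguous wavelength windows by correlation with the response) is an illustrative substitute for SplitReg, not its actual algorithm; only the second step, PLS on the selected intervals, mirrors the method directly:

```python
# Two-step interval selection sketch: pick wavelength windows, then fit PLS.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def select_intervals(X, y, width=20, n_keep=5):
    # Score each contiguous wavelength window by |correlation| with y.
    scores = []
    for start in range(0, X.shape[1] - width + 1, width):
        cols = slice(start, start + width)
        r = np.abs(np.corrcoef(X[:, cols].mean(axis=1), y)[0, 1])
        scores.append((r, cols))
    scores.sort(key=lambda t: t[0], reverse=True)
    return [cols for _, cols in scores[:n_keep]]

X = np.random.rand(100, 400)            # placeholder spectra (samples x wavelengths)
y = np.random.rand(100)                 # placeholder reference values
kept = select_intervals(X, y)
X_sel = np.hstack([X[:, c] for c in kept])
pls = PLSRegression(n_components=5).fit(X_sel, y)   # calibration on the intervals
```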
Inferring Diagnostic and Prognostic Gene Expression Signatures Across WHO Glioma Classifications: A Network-Based Approach
Tumor heterogeneity is a challenge to designing effective and targeted therapies. Glioma-type identification depends on specific molecular and histological features, which are defined by the official World Health Organization (WHO) classification of the central nervous system (CNS). These guidelines are constantly updated to support the diagnosis process, which affects all successive clinical decisions. In this context, the search for new potential diagnostic and prognostic targets, characteristic of each glioma type, is crucial to support the development of novel therapies. Based on The Cancer Genome Atlas (TCGA) glioma RNA-sequencing dataset, updated according to the 2016 and 2021 WHO guidelines, we proposed a 2-step variable selection approach for biomarker discovery. Our framework uses the graphical lasso algorithm to estimate sparse networks of genes carrying diagnostic information. These networks are then used as input to a regularized Cox survival regression model, allowing the identification of a smaller subset of genes with prognostic value. In each step, the results derived from the 2016 and 2021 classes were discussed and compared. For both WHO glioma classifications, our analysis identified potential biomarkers characteristic of each glioma type. Better results were obtained for the 2021 WHO CNS classification, thereby supporting recent efforts to include molecular data in glioma classification.
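A minimal sketch of step 1 of such a pipeline using scikit-learn's graphical lasso, assuming an expression matrix is already loaded; the file name, threshold, and hand-off to the Cox step are illustrative assumptions, not the paper's code:

```python
# Step 1: sparse gene network via graphical lasso; connected genes become
# candidates for step 2 (a penalized Cox survival model).
import numpy as np
from sklearn.covariance import GraphicalLassoCV

expr = np.load("glioma_expression.npy")        # hypothetical samples x genes matrix
gl = GraphicalLassoCV().fit(expr)              # sparse inverse-covariance estimate
prec = gl.precision_
adj = (np.abs(prec) > 1e-6) & ~np.eye(prec.shape[0], dtype=bool)
degree = adj.sum(axis=0)                       # number of network neighbors per gene
candidates = np.where(degree > 0)[0]           # genes connected in the network
# `candidates` would then enter the regularized Cox regression of step 2.
```

The appeal of this design is that the network step filters on conditional (partial-correlation) structure rather than marginal correlation, so the survival model starts from genes that carry non-redundant information.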
Bayesian Model Averaging and Regularized Regression as Methods for Data-Driven Model Exploration, with Practical Considerations
Methodological experts suggest that psychological and educational researchers should employ appropriate methods for data-driven model exploration, such as Bayesian Model Averaging and regularized regression, instead of conventional hypothesis-driven testing, if they want to explore the best prediction model. I discuss practical considerations regarding data-driven methods for end-user researchers without deep expertise in quantitative methods. I tested three data-driven methods, i.e., Bayesian Model Averaging, LASSO as a form of regularized regression, and stepwise regression, with datasets in psychology and education, and compared their performance in terms of cross-validity, which indicates robustness against overfitting, across different conditions. I employed functionalities widely available via R with default settings to provide information relevant to end users without advanced statistical knowledge. The results demonstrated that LASSO showed the best performance and that Bayesian Model Averaging outperformed stepwise regression when there were many candidate predictors to explore. Based on these findings, I discuss how to use data-driven model exploration methods appropriately across different situations from an end user's perspective.
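The study itself uses R packages with default settings; a comparable comparison of two of the three methods (LASSO and forward stepwise selection, scored by cross-validation) can be sketched in Python as follows. The synthetic dataset and settings are placeholders, and Bayesian Model Averaging is omitted because it has no standard scikit-learn counterpart:

```python
# Cross-validated comparison: LASSO vs. forward stepwise regression.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=200, n_features=50, n_informative=8,
                       noise=10.0, random_state=0)

lasso = LassoCV(cv=5)                              # penalty strength chosen by CV
stepwise = make_pipeline(                          # greedy forward selection
    SequentialFeatureSelector(LinearRegression(), direction="forward",
                              n_features_to_select=8),
    LinearRegression(),
)
for name, model in [("LASSO", lasso), ("stepwise", stepwise)]:
    score = cross_val_score(model, X, y, cv=5).mean()   # cross-validity proxy (R^2)
    print(name, round(score, 3))
```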
Forward Selection of Relevant Factors by Means of MDR-EFE Method
The suboptimal procedure under consideration, based on the MDR-EFE algorithm, provides sequential selection of factors that are relevant (in a specified sense) to the studied random response, which is in general non-binary. The model is not assumed to be linear, and the joint distribution of the factor vector and the response is unknown. The set of relevant factors has a specified cardinality. It is proved that, under certain conditions, this forward selection procedure gives a random set of factors that asymptotically (with probability tending to one as the number of observations grows to infinity) coincides with the "oracle" one. The latter means that the random set obtained with this algorithm approximates the collection of features that would be identified if the joint distribution of the feature vector and the response were known. For this purpose, statistical estimators of the prediction error functional of the studied response are proposed. They involve a new version of regularization, which makes it possible not only to guarantee a central limit theorem for the normalized estimators but also to find the convergence rate of their first two moments to the corresponding moments of the limiting Gaussian variable.
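The generic shape of such a forward selection procedure, greedily growing a factor set of fixed cardinality k by minimizing an estimated prediction error, can be sketched as follows. The error estimator below (cross-validated misclassification) is only a placeholder for the paper's regularized MDR-EFE functional estimator:

```python
# Generic forward selection of a fixed-cardinality factor set.
from sklearn.model_selection import cross_val_score

def forward_select(X, y, estimator, k):
    """Greedily add the factor that most reduces estimated prediction error."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        best_j, best_err = None, float("inf")
        for j in remaining:
            cols = selected + [j]
            err = 1.0 - cross_val_score(estimator, X[:, cols], y, cv=5).mean()
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)          # commit the best factor this round
        remaining.remove(best_j)
    return selected
```

The paper's contribution lies in the error estimator itself and its asymptotics, not in the greedy loop, which is standard.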
Examining the Spectral Separability of Prosopis glandulosa from Co-Existent Species Using Field Spectral Measurement and Guided Regularized Random Forest
The invasive taxa of Prosopis are rated among the world's 100 worst unwanted species, and a lack of spatial data about the invasion dynamics has made current control and monitoring methods unsuccessful. This study therefore tests the use of in situ spectroscopy data with a newly developed algorithm, guided regularized random forest (GRRF), to spectrally discriminate Prosopis from co-existent species (Acacia karroo, Acacia mellifera and Ziziphus mucronata) in the arid environment of South Africa. Results show that GRRF was able to reduce the high dimensionality of the spectroscopy data and select key wavelengths (n = 11) for discriminating amongst the species. These wavelengths are located at 356.3 nm, 468.5 nm, 531.1 nm, 665.2 nm, 1262.3 nm, 1354.1 nm, 1361.7 nm, 1376.9 nm, 1407.1 nm, 1410.9 nm and 1414.6 nm. Using these selected wavelengths increased the overall classification accuracy from 79.19% (Kappa = 0.7201) with all wavelengths to 88.59% (Kappa = 0.8524). Given the relatively high accuracies and ease of use, the GRRF method is worth considering for reducing the high dimensionality of spectroscopy data. However, this assertion should receive considerable additional testing and comparison before GRRF is accepted as a substitute for established dimensionality reduction methods.
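An approximate sketch of the workflow's end result in Python. True GRRF penalizes split gains per feature using importances from a guide forest (as in the R package RRF); the top-k importance shortcut below is only an illustration of the select-then-refit comparison, with placeholder data:

```python
# Importance-guided wavelength subset selection, then accuracy/Kappa comparison.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

def evaluate(X_tr, X_te, y_tr, y_te):
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
    pred = rf.predict(X_te)
    return accuracy_score(y_te, pred), cohen_kappa_score(y_te, pred), rf

X = np.random.rand(150, 300)                   # placeholder spectra
y = np.random.randint(0, 4, 150)               # placeholder species labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

acc_all, kappa_all, rf = evaluate(X_tr, X_te, y_tr, y_te)        # all wavelengths
top = np.argsort(rf.feature_importances_)[-11:]                  # 11, as in the study
acc_sel, kappa_sel, _ = evaluate(X_tr[:, top], X_te[:, top], y_tr, y_te)
```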
Classification tree algorithm for grouped variables
We consider the problem of predicting a categorical variable based on groups of inputs. Some methods have already been proposed to build classification rules based on groups of variables (e.g. the group lasso for logistic regression). However, to our knowledge, no tree-based approach has been proposed to tackle this issue. Here, we propose the Tree Penalized Linear Discriminant Analysis (TPLDA) algorithm, a new tree-based approach that constructs a classification rule based on groups of variables. It splits a node by repeatedly selecting a group and then applying a regularized linear discriminant analysis based on this group; the process is repeated until some stopping criterion is satisfied. A pruning strategy is proposed to select an optimal tree. Compared to existing multivariate classification tree methods, the proposed method is computationally less demanding and the resulting trees are more easily interpretable. Furthermore, TPLDA automatically provides a measure of importance for each group of variables; this score makes it possible to rank groups of variables with respect to their ability to predict the response and can also be used to perform group variable selection. The good performance of the proposed algorithm and its interest in terms of prediction accuracy, interpretation and group variable selection are demonstrated and compared to alternative reference methods through simulations and applications on real datasets. A sketch of a single group-wise split appears below.
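The sketch fits a regularized LDA per variable group at one node and keeps the best group, the core move TPLDA repeats down the tree. Group definitions and the scoring rule (cross-validated accuracy) are illustrative assumptions, not the paper's exact criterion:

```python
# One TPLDA-style node split: try a shrinkage-regularized LDA on each
# variable group and keep the group that discriminates the node best.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def best_group_split(X, y, groups):
    """groups: list of column-index arrays, one per predefined variable group."""
    best = None
    for g in groups:
        lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
        score = cross_val_score(lda, X[:, g], y, cv=3).mean()   # group's split quality
        if best is None or score > best[0]:
            best = (score, g, lda.fit(X[:, g], y))
    return best   # (score, winning group, fitted regularized LDA for the split)
```

Recursing this function on the child nodes, with a stopping criterion and pruning, yields the full tree; the per-group scores accumulated across splits give the group importance measure the abstract mentions.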