452 result(s) for "Bootstrapping (Statistics)"
Increasing transparency in machine learning through bootstrap simulation and shapely additive explanations
Machine learning methods are widely used within the medical field. However, the reliability and efficacy of these models are difficult to assess, making it difficult for researchers to identify which machine-learning model to apply to their dataset. We assessed whether variance calculations of model metrics (e.g., AUROC, sensitivity, specificity) through bootstrap simulation and SHapley Additive exPlanations (SHAP) could increase model transparency and improve model selection. Data from the England National Health Services Heart Disease Prediction Cohort was used. After comparison of model metrics for XGBoost, Random Forest, Artificial Neural Network, and Adaptive Boosting, XGBoost was used as the machine-learning model of choice in this study. Bootstrap simulation (N = 10,000) was used to empirically derive the distribution of model metrics and covariate Gain statistics, and SHAP was used to provide explanations for the machine-learning output. For the XGBoost modeling method, we observed across the 10,000 completed simulations that the AUROC ranged from 0.771 to 0.947 (a difference of 0.176), the balanced accuracy from 0.688 to 0.894 (a difference of 0.205), the sensitivity from 0.632 to 0.939 (a difference of 0.307), and the specificity from 0.595 to 0.944 (a difference of 0.394). Among the 10,000 simulations, the Gain for Angina ranged from 0.225 to 0.456 (a difference of 0.231), for Cholesterol from 0.148 to 0.326 (a difference of 0.178), for maximum heart rate (MaxHR) from 0.081 to 0.200 (a difference of 0.119), and for Age from 0.059 to 0.157 (a difference of 0.098). Using simulations to empirically evaluate the variability of model metrics, and explanatory algorithms to check whether covariates match the literature, is necessary for increased transparency, reliability, and utility of machine learning methods. These variance statistics, combined with model accuracy statistics, can help researchers identify the best model for a given dataset.
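A minimal Python sketch of the bootstrap simulation this abstract describes; the synthetic dataset, the model (scikit-learn gradient boosting standing in for XGBoost), and the reduced replicate count are placeholder assumptions, not the authors' pipeline:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Placeholder data standing in for the heart-disease cohort.
    X, y = make_classification(n_samples=1000, n_features=11, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    rng = np.random.default_rng(0)
    aurocs = []
    for _ in range(200):  # the paper uses N = 10,000 replicates
        idx = rng.integers(0, len(X_tr), len(X_tr))  # resample training rows with replacement
        model = GradientBoostingClassifier().fit(X_tr[idx], y_tr[idx])
        aurocs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

    # Empirical distribution of AUROC across bootstrap replicates.
    print(np.min(aurocs), np.max(aurocs), np.percentile(aurocs, [2.5, 97.5]))

The same loop can collect sensitivity, specificity, or per-covariate importance scores to obtain their empirical distributions as well.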
MPBoot: fast phylogenetic maximum parsimony tree inference and bootstrap approximation
Background The nonparametric bootstrap is widely used to measure the branch support of phylogenetic trees. However, bootstrapping is computationally expensive and remains a bottleneck in phylogenetic analyses. Recently, an ultrafast bootstrap approximation (UFBoot) approach was proposed for maximum likelihood analyses. However, such an approach is still missing for maximum parsimony. Results To close this gap we present MPBoot, an adaptation and extension of UFBoot to compute branch supports under the maximum parsimony principle. MPBoot works for both uniform and non-uniform cost matrices. Our analyses on biological DNA and protein data showed that under uniform cost matrices, MPBoot runs on average 4.7 (DNA) to 7 (protein) times faster (range: 1.2–20.7) than the standard parsimony bootstrap implemented in PAUP*, but 1.6 (DNA) to 4.1 (protein) times slower than the standard bootstrap with a fast search routine in TNT (fast-TNT). However, for non-uniform cost matrices MPBoot is 5 (DNA) to 13 (protein) times faster (range: 0.3–63.9) than fast-TNT. We note that MPBoot achieves better scores more frequently than PAUP* and fast-TNT. However, this effect is less pronounced if an intensive but slower search in TNT is invoked. Moreover, experiments on large-scale simulated data show that while both PAUP* and TNT bootstrap estimates are too conservative, MPBoot bootstrap estimates appear more unbiased. Conclusions MPBoot provides an efficient alternative to the standard maximum parsimony bootstrap procedure. It shows favorable performance in terms of run time, the capability of finding a maximum parsimony tree, and high bootstrap accuracy on simulated as well as empirical data sets. MPBoot is easy to use, open source, and available at http://www.cibiv.at/software/mpboot.
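The core resampling step that the phylogenetic bootstrap (and hence MPBoot) is built on is sampling alignment columns with replacement. A toy illustration of that step alone, not MPBoot's implementation:

    import numpy as np

    # Toy alignment: rows = taxa, columns = sites.
    alignment = np.array([list("ACGTACGT"),
                          list("ACGTACGA"),
                          list("ACTTACGA")])

    rng = np.random.default_rng(0)
    n_sites = alignment.shape[1]

    for b in range(3):  # one pseudo-alignment per bootstrap replicate
        sites = rng.integers(0, n_sites, n_sites)  # sample columns with replacement
        pseudo = alignment[:, sites]
        # In a real analysis a tree would be inferred from `pseudo`; the branch
        # support is the fraction of replicates whose tree contains that branch.
        print("".join(pseudo[0]))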
Clinical prediction models and the multiverse of madness
Background Each year, thousands of clinical prediction models are developed to make predictions (e.g. estimated risk) to inform individual diagnosis and prognosis in healthcare. However, most are not reliable for use in clinical practice. Main body We discuss how the creation of a prediction model (e.g. using regression or machine learning methods) is dependent on the sample and size of data used to develop it—were a different sample of the same size used from the same overarching population, the developed model could be very different even when the same model development methods are used. In other words, for each model created, there exists a multiverse of other potential models for that sample size and, crucially, an individual’s predicted value (e.g. estimated risk) may vary greatly across this multiverse. The more an individual’s prediction varies across the multiverse, the greater the instability. We show how small development datasets lead to more different models in the multiverse, often with vastly unstable individual predictions, and explain how this can be exposed by using bootstrapping and presenting instability plots. We recommend healthcare researchers seek to use large model development datasets to reduce instability concerns. This is especially important to ensure reliability across subgroups and improve model fairness in practice. Conclusions Instability is concerning as an individual’s predicted value is used to guide their counselling, resource prioritisation, and clinical decision making. If different samples lead to different models with very different predictions for the same individual, then this should cast doubt on using a particular model for that individual. Therefore, visualising, quantifying and reporting the instability in individual-level predictions is essential when proposing a new model.
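A minimal sketch of the bootstrap instability check the authors describe: refit the model on bootstrap resamples of the development data and track how much each individual's prediction moves. The logistic model and synthetic data are placeholders:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=5, random_state=1)
    original = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

    rng = np.random.default_rng(1)
    boot_preds = []
    for _ in range(200):
        idx = rng.integers(0, len(X), len(X))  # bootstrap the development data
        m = LogisticRegression().fit(X[idx], y[idx])
        boot_preds.append(m.predict_proba(X)[:, 1])  # predict for the SAME individuals
    boot_preds = np.array(boot_preds)

    # Wide per-individual ranges signal unstable predictions (the "multiverse").
    instability = boot_preds.max(axis=0) - boot_preds.min(axis=0)
    print("worst-case individual prediction range:", instability.max().round(3))

An instability plot is then simply each individual's original prediction on the x-axis against their bootstrap predictions on the y-axis.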
Bootstrap-quantile ridge estimator for linear regression with applications
Bootstrap is a simple, yet powerful method of estimation based on the concept of random sampling with replacement. The ridge regression using a biasing parameter has become a viable alternative to the ordinary least squares regression model for the analysis of data where predictors are collinear. This paper develops a nonparametric bootstrap-quantile approach for the estimation of the ridge parameter in the linear regression model. The proposed method is illustrated using some popular and widely used ridge estimators, but the idea can be extended to any ridge estimator. Monte Carlo simulations are carried out to compare the performance of the proposed estimators with their baseline counterparts. It is demonstrated empirically that the MSEs obtained from our suggested bootstrap-quantile approach are substantially smaller than those of the baseline estimators, especially when collinearity is high. Application to real data sets reveals the suitability of the idea.
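A sketch of the general idea under stated assumptions: compute a classical ridge biasing parameter (here the Hoerl–Kennard estimator, one common baseline) on each bootstrap resample and plug a quantile of the resulting distribution into the ridge fit. The quantile choice, baseline estimator, and data are illustrative, not the paper's exact recipe:

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 100, 4
    X = rng.normal(size=(n, p))
    X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)  # induce collinearity
    y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

    def hoerl_kennard_k(X, y):
        """Classical k = p * sigma^2 / ||beta_OLS||^2."""
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        resid = y - X @ b
        sigma2 = resid @ resid / (len(y) - X.shape[1])
        return X.shape[1] * sigma2 / (b @ b)

    ks = []
    for _ in range(500):
        idx = rng.integers(0, n, n)  # nonparametric bootstrap resample
        ks.append(hoerl_kennard_k(X[idx], y[idx]))
    k_boot = np.quantile(ks, 0.5)  # a chosen quantile of the bootstrap distribution

    # Ridge fit using the bootstrap-quantile biasing parameter.
    beta_ridge = np.linalg.solve(X.T @ X + k_boot * np.eye(p), X.T @ y)
    print(round(k_boot, 4), beta_ridge.round(3))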
Re-evaluation of the comparative effectiveness of bootstrap-based optimism correction methods in the development of multivariable clinical prediction models
Background Multivariable prediction models are important statistical tools for providing synthetic diagnosis and prognostic algorithms based on patients’ multiple characteristics. Their apparent measures of predictive accuracy usually have overestimation biases (known as ‘optimism’) relative to the actual performance for external populations. Existing statistical evidence and guidelines suggest that three bootstrap-based bias correction methods are preferable in practice, namely Harrell’s bias correction and the .632 and .632+ estimators. Although Harrell’s method has been widely adopted in clinical studies, simulation-based evidence indicates that the .632+ estimator may perform better than the other two methods. However, the actual comparative effectiveness of these methods is still unclear due to limited numerical evidence. Methods We conducted extensive simulation studies to compare the effectiveness of these three bootstrapping methods, particularly using various model building strategies: conventional logistic regression, stepwise variable selection, Firth’s penalized likelihood method, ridge, lasso, and elastic-net regression. We generated the simulation data based on the Global Utilization of Streptokinase and Tissue plasminogen activator for Occluded coronary arteries (GUSTO-I) trial Western dataset and considered how events per variable, event fraction, number of candidate predictors, and the regression coefficients of the predictors impacted performance. The internal validity of C-statistics was evaluated. Results Under relatively large sample settings (roughly, events per variable ≥ 10), the three bootstrap-based methods were comparable and performed well. However, all three methods had biases under small sample settings, and the directions and sizes of the biases were inconsistent. In general, Harrell’s and the .632 methods had overestimation biases when the event fraction became larger. In addition, the .632+ method had a slight underestimation bias when the event fraction was very small. Although the bias of the .632+ estimator was relatively small, its root mean squared error (RMSE) was comparable to or sometimes larger than those of the other two methods, especially for the regularized estimation methods. Conclusions In general, the three bootstrap estimators were comparable, but the .632+ estimator performed relatively well under small sample settings, except when the regularized estimation methods were adopted.
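A minimal sketch of Harrell's bootstrap optimism correction for the C-statistic, the first of the three estimators compared; the logistic model, data, and replicate count are placeholders:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    X, y = make_classification(n_samples=300, n_features=8, random_state=3)
    apparent = roc_auc_score(y, LogisticRegression().fit(X, y).predict_proba(X)[:, 1])

    rng = np.random.default_rng(3)
    optimism = []
    for _ in range(200):
        idx = rng.integers(0, len(X), len(X))          # bootstrap resample
        m = LogisticRegression().fit(X[idx], y[idx])
        boot_c = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])  # on the resample
        orig_c = roc_auc_score(y, m.predict_proba(X)[:, 1])            # on the original data
        optimism.append(boot_c - orig_c)

    # Harrell's bias correction: apparent performance minus average optimism.
    print("optimism-corrected C-statistic:", round(apparent - np.mean(optimism), 4))

The .632 and .632+ estimators instead combine the apparent performance with out-of-bag performance using fixed or adaptive weights.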
Exchangeably Weighted Bootstraps of General Markov U-Process
We explore an exchangeably weighted bootstrap of the general function-indexed empirical U-processes in the Markov setting, which is a natural higher-order generalization of the weighted bootstrap empirical processes. Our findings give rise to a considerable variety of bootstrap resampling strategies. This paper aims to provide theoretical justifications for the consistency of the exchangeably weighted bootstrap in the Markov setup. General structural conditions on the classes of functions (possibly unbounded) and the underlying distributions are required to establish our results. This paper provides the first general theoretical study of the bootstrap of empirical U-processes in the Markov setting. Potential applications include the symmetry test, Kendall's tau, and the test of independence.
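One concrete member of the exchangeably weighted family is the Dirichlet-weighted (Bayesian) bootstrap of a U-statistic such as Kendall's tau. A sketch under i.i.d. sampling, a simplification of the paper's Markov setting; all choices here are illustrative:

    import numpy as np

    rng = np.random.default_rng(4)
    n = 60
    x = rng.normal(size=n)
    y = 0.5 * x + rng.normal(size=n)

    # Pairwise concordance signs: the kernel of the Kendall's tau U-statistic.
    i_idx, j_idx = np.triu_indices(n, k=1)
    signs = np.sign((x[i_idx] - x[j_idx]) * (y[i_idx] - y[j_idx]))

    taus = []
    for _ in range(500):
        w = rng.dirichlet(np.ones(n)) * n       # exchangeable weights with mean 1
        pw = w[i_idx] * w[j_idx]                # product weights for order-2 kernel
        taus.append(np.sum(pw * signs) / np.sum(pw))

    print(np.quantile(taus, [0.025, 0.975]))    # weighted-bootstrap CI for tau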
Periodically correlated time series and the Variable Bandpass Periodic Block Bootstrap
This research introduces a novel approach to resampling periodically correlated time series, called the Variable Bandpass Periodic Block Bootstrap, which uses bandpass filters for frequency separation, and then examines the significant advantages of this new method. While bootstrapping allows estimation of a statistic’s sampling distribution by resampling the original data with replacement, and block bootstrapping is a model-free resampling strategy for correlated time series data, both fail to preserve correlations in periodically correlated time series. Existing extensions of the block bootstrap help preserve the correlation structures of periodically correlated processes but suffer from flaws and inefficiencies. Analyses of time series data containing cyclic, seasonal, or periodically correlated principal components, often seen in annual, daily, or other cyclostationary processes, benefit from separating these components. The Variable Bandpass Periodic Block Bootstrap uses bandpass filters to separate a periodically correlated component from interference such as noise at other uncorrelated frequencies. A simulation study is presented, demonstrating near-universal improvements obtained from the Variable Bandpass Periodic Block Bootstrap when compared with prior block bootstrapping methods for periodically correlated time series.
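A sketch of the two ingredients under stated assumptions: isolate the periodic component with a bandpass filter, then resample it in period-length, phase-aligned blocks. The filter design, block length, and statistic are illustrative, not the paper's exact procedure:

    import numpy as np
    from scipy.signal import butter, filtfilt

    rng = np.random.default_rng(5)
    fs, period = 100, 20                        # sampling rate; cycle length in samples
    t = np.arange(2000)
    series = np.sin(2 * np.pi * t / period) + 0.5 * rng.normal(size=t.size)

    # Bandpass around the periodic component's frequency (fs / period Hz).
    f0 = fs / period
    b, a = butter(4, [0.8 * f0, 1.2 * f0], btype="band", fs=fs)
    periodic = filtfilt(b, a, series)           # isolated periodically correlated component

    # Period-length blocks at matching phase preserve the cyclic correlation.
    blocks = periodic.reshape(-1, period)       # each row is one full cycle
    amps = []
    for _ in range(500):
        rows = rng.integers(0, blocks.shape[0], blocks.shape[0])
        amps.append(blocks[rows].ravel().std())  # statistic of one bootstrap series

    print(np.quantile(amps, [0.025, 0.975]))    # bootstrap CI for the component's amplitude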
Hi-LASSO: High-performance Python and Apache Spark packages for feature selection with high-dimensional data
High-dimensional LASSO (Hi-LASSO) is a powerful feature selection tool for high-dimensional data. Our previous study showed that Hi-LASSO outperformed other state-of-the-art LASSO methods. However, the substantial cost of bootstrapping and the lack of experiments with a parametric statistical test for feature selection have impeded the application of Hi-LASSO in practice. In this paper, the Python package and its Spark library are designed efficiently, in a parallel manner, for use on real-world problems, and they provide parametric statistical tests for feature selection on high-dimensional data. We demonstrate Hi-LASSO's superior performance through various intensive experiments in practical settings. With these packages, feature selection with Hi-LASSO can be performed efficiently and easily. The Hi-LASSO packages are publicly available at https://github.com/datax-lab/Hi-LASSO under the MIT license. The packages can be easily installed with Python pip, and additional documentation is available at https://pypi.org/project/hi-lasso and https://pypi.org/project/Hi-LASSO-spark.
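Hi-LASSO's own API is documented at the links above; as a generic illustration of the underlying idea (LASSO fits on bootstrap samples over random predictor subspaces, aggregated into per-feature selection frequencies), not the package's actual interface:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                           noise=5.0, random_state=6)

    rng = np.random.default_rng(6)
    selected = np.zeros(X.shape[1])
    B = 100
    for _ in range(B):
        idx = rng.integers(0, len(X), len(X))              # bootstrap the rows
        feats = rng.choice(X.shape[1], 20, replace=False)  # random predictor subspace
        m = Lasso(alpha=0.5).fit(X[idx][:, feats], y[idx])
        selected[feats[np.abs(m.coef_) > 1e-8]] += 1       # count nonzero selections

    # Features selected far more often than chance are the stable ones; a
    # binomial-style test on these frequencies gives the parametric cutoff.
    print("top features:", np.argsort(selected)[::-1][:5])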
Which method is optimal for estimating variance components and their variability in generalizability theory? Evidence from a set of unified rules for the bootstrap method
The purpose of this study is to compare the performance of four estimation methods (the traditional method, the jackknife method, the bootstrap method, and the MCMC method), find the optimal one, and make a set of unified rules for the bootstrap method. Based on four types of simulated data (normal, dichotomous, polytomous, and skewed), this study estimates and compares the variance components and their variability across the four estimation methods when using a p×i design in generalizability theory. The estimated variance components are vc.p, vc.i, and vc.pi, and their variability is expressed through estimated standard errors (SE(vc.p), SE(vc.i), and SE(vc.pi)) and confidence intervals (CI(vc.p), CI(vc.i), and CI(vc.pi)). For the normal data, all four methods can accurately estimate the variance components and their variability. For the dichotomous data, the |RPB| of SE(vc.i) for the traditional method is 128.5714, and the |RPB| of SE(vc.i), SE(vc.pi), and CI(vc.i) for the jackknife method are 42.8571, 43.6893, and 40.5000, all larger than 25 and therefore not accurate. For the polytomous data, the |RPB| of SE(vc.i) and CI(vc.i) for the MCMC method are 59.6612 and 45.2500, which are larger than 25 and not accurate. For the skewed data, the |RPB| of SE(vc.p), SE(vc.i), and SE(vc.pi) for the traditional and MCMC methods are over 25 and thus not accurate. Only the bootstrap method can estimate the variance components and their variability accurately across the different data distributions. Nonetheless, a divide-and-conquer strategy must be used when adopting the bootstrap method. The bootstrap method is optimal among the four methods and shows cross-distribution superiority over the other three; however, a set of unified rules for the divide-and-conquer strategy is needed, under which the bootstrap method is optimal when using boot-p for p (person), boot-pi for i (item), and boot-i for pi (person × item).
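A sketch of p×i variance-component estimation and person-level ("boot-p") resampling as described above; the ANOVA mean-square identities are standard, while the data and replicate count are synthetic assumptions:

    import numpy as np

    rng = np.random.default_rng(7)
    n_p, n_i = 100, 20
    person = rng.normal(0, 1.0, (n_p, 1))    # true vc.p = 1.0
    item = rng.normal(0, 0.5, (1, n_i))      # true vc.i = 0.25
    scores = person + item + rng.normal(0, 1.0, (n_p, n_i))  # true vc.pi = 1.0

    def variance_components(M):
        """vc.p, vc.i, vc.pi from the p×i ANOVA mean squares."""
        p, i = M.shape
        gm = M.mean()
        ms_p = i * np.sum((M.mean(axis=1) - gm) ** 2) / (p - 1)
        ms_i = p * np.sum((M.mean(axis=0) - gm) ** 2) / (i - 1)
        resid = M - M.mean(axis=1, keepdims=True) - M.mean(axis=0, keepdims=True) + gm
        ms_pi = np.sum(resid ** 2) / ((p - 1) * (i - 1))
        return (ms_p - ms_pi) / i, (ms_i - ms_pi) / p, ms_pi

    # boot-p: resample persons (rows) with replacement and re-estimate each time.
    boot = [variance_components(scores[rng.integers(0, n_p, n_p)]) for _ in range(500)]
    vc_p = np.array([b[0] for b in boot])
    print("vc.p estimate:", round(variance_components(scores)[0], 3))
    print("bootstrap SE(vc.p):", round(vc_p.std(ddof=1), 3))

Resampling items (boot-i) or person-item cells instead changes only which axis the index array is applied to, which is where the divide-and-conquer rules come in.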