Catalogue Search | MBRL
Explore the vast range of titles available.
2,247 result(s) for "Inference after selection"
Exact Post-Selection Inference for Sequential Regression Procedures
by Tibshirani, Robert; Taylor, Jonathan; Tibshirani, Ryan J.
in Confidence interval; Confidence intervals; equations
2016
We propose new inference tools for forward stepwise regression, least angle regression, and the lasso. Assuming a Gaussian model for the observation vector y, we first describe a general scheme to perform valid inference after any selection event that can be characterized as y falling into a polyhedral set. This framework allows us to derive conditional (post-selection) hypothesis tests at any step of forward stepwise or least angle regression, or any step along the lasso regularization path, because, as it turns out, selection events for these procedures can be expressed as polyhedral constraints on y. The p-values associated with these tests are exactly uniform under the null distribution, in finite samples, yielding exact Type I error control. The tests can also be inverted to produce confidence intervals for appropriate underlying regression parameters. The R package selectiveInference, freely available on the CRAN repository, implements the new inference tools described in this article. Supplementary materials for this article are available online.
Journal Article
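The polyhedral scheme described in the abstract above admits a compact sketch: conditional on a selection event {Ay ≤ b}, the statistic η'y follows a truncated Gaussian law, which yields an exactly uniform p-value. A minimal NumPy/SciPy version, assuming an isotropic covariance σ²I (function and variable names are ours, not taken from the paper or the selectiveInference package):

```python
import numpy as np
from scipy.stats import norm

def polyhedral_pvalue(y, A, b, eta, sigma):
    """One-sided p-value for H0: eta'mu = 0, conditional on {A y <= b}."""
    eta_y = eta @ y
    c = eta / (eta @ eta)          # assumes Cov(y) = sigma^2 * I
    z = y - c * eta_y              # component of y independent of eta'y
    Ac, Az = A @ c, A @ z
    # truncation limits for eta'y implied by A(z + c * t) <= b
    lower = np.max(((b - Az) / Ac)[Ac < 0], initial=-np.inf)
    upper = np.min(((b - Az) / Ac)[Ac > 0], initial=np.inf)
    sd = sigma * np.sqrt(eta @ eta)
    num = norm.cdf(upper / sd) - norm.cdf(eta_y / sd)
    den = norm.cdf(upper / sd) - norm.cdf(lower / sd)
    return num / den               # uniform on [0, 1] under H0, given selection
```

For a one-dimensional selection event such as {y₁ ≥ 0}, this reduces to the usual truncated-normal tail ratio.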
Parsimonious model selection using information theory
by Brook, Barry W.; Richards, Shane A.; Yates, Luke A.
in Bayesian analysis; Bayesian theory; Criteria
2021
Information-theoretic approaches to model selection, such as Akaike’s information criterion (AIC) and cross validation, provide a rigorous framework to select among candidate hypotheses in ecology, yet the persistent concern of overfitting undermines the interpretation of inferred processes. A common misconception is that overfitting is due to the choice of criterion or model score, despite research demonstrating that selection uncertainty associated with score estimation is the predominant influence. Here we introduce a novel selection rule that identifies a parsimonious model by directly accounting for estimation uncertainty, while still retaining an information-theoretic interpretation. The new rule, which is a modification of the existing one-standard-error rule, mitigates overfitting and reduces the likelihood that spurious effects will be included in the selected model, thereby improving its inferential properties. We present the rule and illustrative examples in the context of maximum-likelihood estimation and Kullback-Leibler discrepancy, although the rule is applicable in a more general setting, including Bayesian model selection and other types of discrepancy.
Journal Article
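The abstract above modifies the one-standard-error rule; the classic version it starts from is easy to state in code. A sketch (function and variable names are ours; scores are on a lower-is-better scale such as AIC or cross-validation error):

```python
import numpy as np

def one_se_rule(scores, ses, n_params):
    """Classic one-standard-error rule: among candidate models, choose the
    most parsimonious one whose score lies within one standard error of
    the best (lowest) score."""
    scores, ses, n_params = map(np.asarray, (scores, ses, n_params))
    best = np.argmin(scores)
    threshold = scores[best] + ses[best]
    eligible = np.flatnonzero(scores <= threshold)
    return eligible[np.argmin(n_params[eligible])]
```

The paper's modification replaces this fixed one-standard-error margin with a rule that accounts for score-estimation uncertainty directly.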
Sharp Simultaneous Confidence Intervals for the Means of Selected Populations with Application to Microarray Data Analysis
2007
Simultaneous inference for a large number, N, of parameters is a challenge. In some situations, such as microarray experiments, researchers are only interested in making inference for the K parameters corresponding to the K most extreme estimates. Hence it seems important to construct simultaneous confidence intervals for these K parameters. The naive simultaneous confidence intervals for the K means (applied directly without taking into account the selection) have low coverage probabilities. We take an empirical Bayes approach (or an approach based on the random effect model) to construct simultaneous confidence intervals with good coverage probabilities. For N = 10,000 and K = 100, typical for microarray data, our confidence intervals could be 77% shorter than the naive K-dimensional simultaneous intervals.
Journal Article
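The coverage failure described in the abstract above is easy to reproduce by simulation: even per-interval (not simultaneous) coverage of the naive intervals for the K most extreme of N estimates falls well below the nominal level. A sketch under a simple random-effects model (all parameter values are our choices, not the paper's):

```python
import numpy as np

def naive_coverage(N=10_000, K=100, trials=200, seed=0):
    """Average per-interval coverage of naive 95% intervals for the means
    of the K most extreme of N estimates, under a random-effects model."""
    rng = np.random.default_rng(seed)
    mu = rng.standard_normal(N)              # true means, mu_i ~ N(0, 1)
    total = 0.0
    for _ in range(trials):
        x = mu + rng.standard_normal(N)      # estimates, x_i ~ N(mu_i, 1)
        top = np.argsort(x)[-K:]             # the K largest estimates
        hit = (mu[top] >= x[top] - 1.96) & (mu[top] <= x[top] + 1.96)
        total += hit.mean()
    return total / trials
```

The returned coverage sits far below 0.95 because the selected estimates are biased away from their means; the empirical Bayes intervals of the paper are built to correct exactly this.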
Marginal Screening of 2 × 2 Tables in Large-Scale Case-Control Studies
2019
Assessing the statistical significance of risk factors when screening large numbers of 2 × 2 tables that cross-classify disease status with each type of exposure poses a challenging multiple testing problem. The problem is especially acute in large-scale genomic case-control studies. We develop a potentially more powerful and computationally efficient approach (compared with existing methods, including Bonferroni and permutation testing) by taking into account the presence of complex dependencies between the 2 × 2 tables. Our approach gains its power by exploiting Monte Carlo simulation from the estimated null distribution of a maximally selected log-odds ratio. We apply the method to case-control data from a study of a large collection of genetic variants related to the risk of early onset stroke.
Journal Article
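A generic Monte Carlo sketch of the core idea in the abstract above: simulate the null distribution of the maximally selected log-odds ratio across many 2 × 2 tables and compare the observed maximum to it. This independent-tables version ignores the between-table dependence the paper exploits, and the names and the Haldane 0.5 correction are our choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_log_or(exp_cases, n_cases, exp_ctrls, n_ctrls, eps=0.5):
    """Maximum absolute log-odds ratio over the tables (Haldane-corrected)."""
    a = exp_cases + eps
    b = n_cases - exp_cases + eps
    c = exp_ctrls + eps
    d = n_ctrls - exp_ctrls + eps
    return np.max(np.abs(np.log((a * d) / (b * c))))

def mc_pvalue(exp_cases, n_cases, exp_ctrls, n_ctrls, n_sim=2000):
    """Monte Carlo p-value for the maximally selected log-odds ratio,
    simulating each table independently under a pooled-exposure null."""
    observed = max_abs_log_or(exp_cases, n_cases, exp_ctrls, n_ctrls)
    p0 = (exp_cases + exp_ctrls) / (n_cases + n_ctrls)  # pooled exposure rate
    hits = 0
    for _ in range(n_sim):
        sim_cases = rng.binomial(n_cases, p0)
        sim_ctrls = rng.binomial(n_ctrls, p0)
        hits += max_abs_log_or(sim_cases, n_cases, sim_ctrls, n_ctrls) >= observed
    return (hits + 1) / (n_sim + 1)
```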
Uniform Asymptotic Inference and the Bootstrap after Model Selection
by Wasserman, Larry; Tibshirani, Ryan J.; Tibshirani, Rob
in Asymptotic methods; Asymptotic properties; Normality
2018
Recently, Tibshirani et al. [J. Amer. Statist. Assoc. 111 (2016) 600–620] proposed a method for making inferences about parameters defined by model selection, in a typical regression setting with normally distributed errors. Here, we study the large sample properties of this method, without assuming normality. We prove that the test statistic of Tibshirani et al. (2016) is asymptotically valid, as the number of samples n grows and the dimension d of the regression problem stays fixed. Our asymptotic result holds uniformly over a wide class of nonnormal error distributions. We also propose an efficient bootstrap version of this test that is provably (asymptotically) conservative, and in practice, often delivers shorter intervals than those from the original normality-based approach. Finally, we prove that the test statistic of Tibshirani et al. (2016) does not enjoy uniform validity in a high-dimensional setting, when the dimension d is allowed to grow.
Journal Article
In Defense of the Indefensible
by Shojaie, Ali; Witten, Daniela; Zhao, Sen
in Confidence intervals; Least squares; Maximum likelihood method
2021
A great deal of interest has recently focused on conducting inference on the parameters in a high-dimensional linear model. In this paper, we consider a simple and very naïve two-step procedure for this task, in which we (i) fit a lasso model in order to obtain a subset of the variables, and (ii) fit a least squares model on the lasso-selected set. Conventional statistical wisdom tells us that we cannot make use of the standard statistical inference tools for the resulting least squares model (such as confidence intervals and p-values), since we peeked at the data twice: once in running the lasso, and again in fitting the least squares model. However, in this paper, we show that under a certain set of assumptions, with high probability, the set of variables selected by the lasso is identical to the one selected by the noiseless lasso and is hence deterministic. Consequently, the naïve two-step approach can yield asymptotically valid inference. We utilize this finding to develop the naïve confidence interval, which can be used to draw inference on the regression coefficients of the model selected by the lasso, as well as the naïve score test, which can be used to test the hypotheses regarding the full-model regression coefficients.
Journal Article
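The two-step procedure in the abstract above is short enough to sketch end to end: a plain coordinate-descent lasso selects the variables, then ordinary least squares with classical standard errors is fit on the selected set. The lasso implementation, penalty value, and simulated data below are our choices, not the paper's:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate-descent lasso for 0.5*||y - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding coordinate j
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta

def naive_two_step(X, y, lam):
    """Step (i): lasso selects a variable set S.
    Step (ii): OLS on X_S with classical standard errors."""
    S = np.flatnonzero(lasso_cd(X, y, lam) != 0)
    Xs = X[:, S]
    bhat, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ bhat
    s2 = resid @ resid / (len(y) - len(S))
    se = np.sqrt(s2 * np.diag(np.linalg.inv(Xs.T @ Xs)))
    return S, bhat, se
```

The paper's point is that, under its assumptions, the classical intervals bhat ± z·se from step (ii) are asymptotically valid despite the data-dependent selection in step (i).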
ROCKET
by Kolar, Mladen; Barber, Rina Foygel
in Complex variables; Computer simulation; Confidence intervals
2018
Understanding complex relationships between random variables is of fundamental importance in high-dimensional statistics, with numerous applications in biological and social sciences. Undirected graphical models are often used to represent dependencies between random variables, where an edge between two random variables is drawn if they are conditionally dependent given all the other measured variables. A large body of literature exists on methods that estimate the structure of an undirected graphical model; however, little is known about the distributional properties of the estimators beyond the Gaussian setting. In this paper, we focus on inference for edge parameters in a high-dimensional transelliptical model, which generalizes Gaussian and nonparanormal graphical models. We propose ROCKET, a novel procedure for estimating parameters in the latent inverse covariance matrix. We establish asymptotic normality of ROCKET in an ultra high-dimensional setting under mild assumptions, without relying on oracle model selection results. ROCKET requires the same number of samples that are known to be necessary for obtaining a √n-consistent estimator of an element in the precision matrix under a Gaussian model. Hence, it is an optimal estimator under a much larger family of distributions. The result hinges on a tight control of the sparse spectral norm of the nonparametric Kendall’s tau estimator of the correlation matrix, which is of independent interest. Empirically, ROCKET outperforms the nonparanormal and Gaussian models in terms of achieving accurate inference on simulated data. We also compare the three methods on real data (daily stock returns), and find that the ROCKET estimator is the only method whose behavior across subsamples agrees with the distribution predicted by the theory.
Journal Article
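The Kendall's tau correlation estimate that the abstract above builds on has a simple form: the sample tau matrix is mapped through the sine transform sin(πτ/2), which recovers the latent correlation under elliptical and transelliptical models. A sketch (function name is ours):

```python
import numpy as np
from scipy.stats import kendalltau

def latent_correlation(X):
    """Kendall's-tau-based estimate of the latent correlation matrix,
    via the sine transform sin(pi * tau / 2)."""
    p = X.shape[1]
    R = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            tau, _ = kendalltau(X[:, i], X[:, j])
            R[i, j] = R[j, i] = np.sin(np.pi * tau / 2)
    return R
```

Because ranks are invariant to monotone transformations of each coordinate, this estimator is unchanged if any column of X is replaced by a monotone transform of itself, which is what makes it suitable beyond the Gaussian setting.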
Uniform post-selection inference for least absolute deviation regression and other Z-estimation problems
by Kato, K.; Belloni, A.; Chernozhukov, V.
in Asymptotic methods; Decision making models; Estimating techniques
2015
We develop uniformly valid confidence regions for regression coefficients in a high-dimensional sparse median regression model with homoscedastic errors. Our methods are based on a moment equation that is immunized against nonregular estimation of the nuisance part of the median regression function by using Neyman's orthogonalization. We establish that the resulting instrumental median regression estimator of a target regression coefficient is asymptotically normally distributed uniformly with respect to the underlying sparse model and is semiparametrically efficient. We also generalize our method to a general nonsmooth Z-estimation framework where the number of target parameters is possibly much larger than the sample size. We extend Huber's results on asymptotic normality to this setting, demonstrating uniform asymptotic normality of the proposed estimators over rectangles, constructing simultaneous confidence bands on all of the target parameters, and establishing asymptotic validity of the bands uniformly over underlying approximately sparse models.
Journal Article
Asymptotic post-selection inference for the Akaike information criterion
2018
Ignoring the model selection step in inference after selection is harmful. In this paper we study the asymptotic distribution of estimators after model selection using the Akaike information criterion. First, we consider the classical setting in which a true model exists and is included in the candidate set of models. We exploit the overselection property of this criterion in constructing a selection region, and we obtain the asymptotic distribution of estimators and linear combinations thereof conditional on the selected model. The limiting distribution depends on the set of competitive models and on the smallest overparameterized model. Second, we relax the assumption on the existence of a true model and obtain uniform asymptotic results. We use simulation to study the resulting post-selection distributions and to calculate confidence regions for the model parameters, and we also apply the method to a diabetes dataset.
Journal Article
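For reference, the AIC selection step that the analysis in the abstract above conditions on is a one-liner per model: AIC = n·log(RSS/n) + 2k for a Gaussian linear model with k free parameters. A sketch comparing nested models (the data and model path are our illustration):

```python
import numpy as np

def aic_linear(X, y):
    """AIC for a Gaussian linear model fit by least squares."""
    n = len(y)
    bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ bhat) ** 2)
    k = X.shape[1] + 1            # coefficients plus the error variance
    return n * np.log(rss / n) + 2 * k

# select among nested models by minimum AIC
rng = np.random.default_rng(0)
n = 100
X = rng.standard_normal((n, 4))
y = 2 * X[:, 0] + rng.standard_normal(n)   # only the first column matters
aics = [aic_linear(X[:, : k + 1], y) for k in range(4)]
selected = int(np.argmin(aics))
```

Because the 2k penalty is mild, AIC sometimes picks a model strictly larger than the truth; this overselection property is what the paper exploits in constructing its selection region.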
Weak Signal Identification and Inference in Penalized Likelihood Models for Categorical Responses
2023
Penalized likelihood models are widely used to simultaneously select variables and estimate model parameters. However, the existence of weak signals can lead to inaccurate variable selection, biased parameter estimation, and invalid inference. Thus, identifying weak signals accurately and making valid inferences are crucial in penalized likelihood models. We develop a unified approach to identify weak signals and make inferences in penalized likelihood models, including the special case when the responses are categorical. To identify weak signals, we use the estimated selection probability of each covariate as a measure of the signal strength and formulate a signal identification criterion. To construct confidence intervals, we propose a two-step inference procedure. Extensive simulation studies show that the proposed procedure outperforms several existing methods. We illustrate the proposed method by applying it to the Practice Fusion diabetes data set.
Journal Article
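The "estimated selection probability of each covariate" used as a signal-strength measure in the abstract above can be approximated generically by a bootstrap selection frequency. A sketch using a marginal regression t-test as the stand-in selector (the paper's actual selector is a penalized likelihood fit; everything below is our simplification):

```python
import numpy as np
from scipy.stats import t as tdist

rng = np.random.default_rng(2)

def selection_probability(X, y, B=200, alpha=0.05):
    """Bootstrap frequency with which each covariate is 'selected', here by a
    marginal regression t-test (a simplified stand-in for a penalized fit)."""
    n, p = X.shape
    crit = tdist.ppf(1 - alpha / 2, df=n - 2)
    counts = np.zeros(p)
    for _ in range(B):
        idx = rng.integers(0, n, n)                  # bootstrap resample
        Xb, yb = X[idx], y[idx]
        yc = yb - yb.mean()
        for j in range(p):
            xj = Xb[:, j] - Xb[:, j].mean()
            bh = xj @ yc / (xj @ xj)                 # marginal slope
            res = yc - bh * xj
            se = np.sqrt(res @ res / (n - 2) / (xj @ xj))
            counts[j] += abs(bh / se) > crit
    return counts / B
```

Strong signals yield frequencies near 1 and pure noise stays near the test size; covariates in between are the weak signals the paper's criterion is designed to flag.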