127 results for "Ryan Tibshirani"
Best Subset, Forward Stepwise or Lasso? Analysis and Recommendations Based on Extensive Comparisons
In exciting recent work, Bertsimas, King and Mazumder (Ann. Statist. 44 (2016) 813–852) showed that the classical best subset selection problem in regression modeling can be formulated as a mixed integer optimization (MIO) problem. Using recent advances in MIO algorithms, they demonstrated that best subset selection can now be solved at much larger problem sizes than what was thought possible in the statistics community. They presented empirical comparisons of best subset with other popular variable selection procedures, in particular, the lasso and forward stepwise selection. Surprisingly (to us), their simulations suggested that best subset consistently outperformed both methods in terms of prediction accuracy. Here, we present an expanded set of simulations to shed more light on these comparisons. The summary is roughly as follows:
  • neither best subset nor the lasso uniformly dominates the other, with best subset generally performing better in very high signal-to-noise ratio (SNR) regimes, and the lasso better in low SNR regimes;
  • for a large proportion of the settings considered, best subset and forward stepwise perform similarly, but in certain cases in the high SNR regime, best subset performs better;
  • forward stepwise and best subset tend to yield sparser models (when tuned on a validation set), especially in the high SNR regime;
  • the relaxed lasso (actually, a simplified version of the original relaxed estimator defined in Meinshausen (Comput. Statist. Data Anal. 52 (2007) 374–393)) is the overall winner, performing just about as well as the lasso in low SNR scenarios, and nearly as well as best subset in high SNR scenarios.
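A minimal sketch, in Python, of the kind of validation-tuned comparison described above (assuming scikit-learn; the data, the λ and γ grids, and the helper relaxed_coef are illustrative, and the "relaxed lasso" below follows the simplified form mentioned in the abstract: a convex combination of the lasso solution and a least squares refit on its active set):

```python
# Illustrative sketch (not the paper's code): tune the lasso / simplified relaxed
# lasso on a validation set, as in the comparisons described in the abstract.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p, s = 100, 20, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:s] = 1.0
y = X @ beta + rng.standard_normal(n)          # training data (synthetic)
Xv = rng.standard_normal((n, p))
yv = Xv @ beta + rng.standard_normal(n)        # validation data (synthetic)

def relaxed_coef(X, y, lam, gamma):
    """Simplified relaxed lasso: blend the lasso solution with a least
    squares refit restricted to the lasso's active set."""
    b_lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    active = np.flatnonzero(b_lasso)
    b_ls = np.zeros_like(b_lasso)
    if active.size:
        b_ls[active] = LinearRegression(fit_intercept=False).fit(X[:, active], y).coef_
    return gamma * b_lasso + (1 - gamma) * b_ls

# Choose (lambda, gamma) by validation-set prediction error; gamma = 1 is the
# plain lasso, gamma = 0 is the least squares refit on the active set.
grid = [(lam, g) for lam in np.logspace(-2, 0, 10) for g in (0.0, 0.5, 1.0)]
best = min(grid, key=lambda t: np.mean((yv - Xv @ relaxed_coef(X, y, *t)) ** 2))
print("selected (lambda, gamma):", best)
```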
Adaptive Piecewise Polynomial Estimation via Trend Filtering
We study trend filtering, a recently proposed tool of Kim et al. [SIAM Rev. 51 (2009) 339-360] for nonparametric regression. The trend filtering estimate is defined as the minimizer of a penalized least squares criterion, in which the penalty term sums the absolute kth order discrete derivatives over the input points. Perhaps not surprisingly, trend filtering estimates appear to have the structure of kth degree spline functions, with adaptively chosen knot points (we say "appear" here as trend filtering estimates are not really functions over continuous domains, and are only defined over the discrete set of inputs). This brings to mind comparisons to other nonparametric regression tools that also produce adaptive splines; in particular, we compare trend filtering to smoothing splines, which penalize the sum of squared derivatives across input points, and to locally adaptive regression splines [Ann. Statist. 25 (1997) 387-413], which penalize the total variation of the kth derivative. Empirically, we discover that trend filtering estimates adapt to the local level of smoothness much better than smoothing splines, and further, they exhibit a remarkable similarity to locally adaptive regression splines. We also provide theoretical support for these empirical findings; most notably, we prove that (with the right choice of tuning parameter) the trend filtering estimate converges to the true underlying function at the minimax rate for functions whose kth derivative is of bounded variation. This is done via an asymptotic pairing of trend filtering and locally adaptive regression splines, which have already been shown to converge at the minimax rate [Ann. Statist. 25 (1997) 387-413]. At the core of this argument is a new result tying together the fitted values of two lasso problems that share the same outcome vector, but have different predictor matrices.
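For concreteness, the penalized least squares criterion described in this abstract can be written (in our notation, for evenly spaced inputs; a standard paraphrase, not a quotation from the paper) as

$$\hat{\beta} \;=\; \operatorname*{argmin}_{\beta \in \mathbb{R}^n} \; \frac{1}{2}\sum_{i=1}^{n}(y_i - \beta_i)^2 \;+\; \lambda\,\|D^{(k+1)}\beta\|_1,$$

where $D^{(k+1)}$ is the discrete difference operator of order $k+1$, the discrete analogue of penalizing the total variation of the kth derivative; for example, for $k = 1$ (piecewise linear fits), $(D^{(2)}\beta)_i = \beta_i - 2\beta_{i+1} + \beta_{i+2}$.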
Distribution-Free Predictive Inference for Regression
We develop a general framework for distribution-free predictive inference in regression, using conformal inference. The proposed methodology allows for the construction of a prediction band for the response variable using any estimator of the regression function. The resulting prediction band preserves the consistency properties of the original estimator under standard assumptions, while guaranteeing finite-sample marginal coverage even when these assumptions do not hold. We analyze and compare, both empirically and theoretically, the two major variants of our conformal framework: full conformal inference and split conformal inference, along with a related jackknife method. These methods offer different tradeoffs between statistical accuracy (length of resulting prediction intervals) and computational efficiency. As extensions, we develop a method for constructing valid in-sample prediction intervals called rank-one-out conformal inference, which has essentially the same computational efficiency as split conformal inference. We also describe an extension of our procedures for producing prediction bands with locally varying length, to adapt to heteroscedasticity in the data. Finally, we propose a model-free notion of variable importance, called leave-one-covariate-out or LOCO inference. Accompanying this article is an R package conformalInference that implements all of the proposals we have introduced. In the spirit of reproducibility, all of our empirical results can also be easily (re)generated using this package.
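A minimal sketch of the split conformal variant discussed above (illustrative Python following the standard split conformal recipe, not code from the paper's conformalInference R package; the linear regression base learner and the helper name split_conformal_interval are our own choices):

```python
# Minimal split conformal sketch: fit on one half of the data, calibrate the
# interval width with absolute residuals on the other half.
import numpy as np
from sklearn.linear_model import LinearRegression

def split_conformal_interval(X, y, x_test, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    idx = rng.permutation(n)
    fit_idx, cal_idx = idx[: n // 2], idx[n // 2:]

    model = LinearRegression().fit(X[fit_idx], y[fit_idx])     # fit on first half
    resid = np.abs(y[cal_idx] - model.predict(X[cal_idx]))     # calibration residuals

    # Conformal quantile: the ceil((n2 + 1)(1 - alpha))-th smallest residual.
    # (If that rank exceeds n2, the exact interval is the whole real line;
    # here we simply fall back to the largest residual for brevity.)
    n2 = len(cal_idx)
    k = int(np.ceil((n2 + 1) * (1 - alpha)))
    q = np.sort(resid)[min(k, n2) - 1]

    pred = model.predict(x_test.reshape(1, -1))[0]
    return pred - q, pred + q
```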
Exact Post-Selection Inference for Sequential Regression Procedures
We propose new inference tools for forward stepwise regression, least angle regression, and the lasso. Assuming a Gaussian model for the observation vector y, we first describe a general scheme to perform valid inference after any selection event that can be characterized as y falling into a polyhedral set. This framework allows us to derive conditional (post-selection) hypothesis tests at any step of forward stepwise or least angle regression, or any step along the lasso regularization path, because, as it turns out, selection events for these procedures can be expressed as polyhedral constraints on y. The p-values associated with these tests are exactly uniform under the null distribution, in finite samples, yielding exact Type I error control. The tests can also be inverted to produce confidence intervals for appropriate underlying regression parameters. The R package selectiveInference, freely available on the CRAN repository, implements the new inference tools described in this article. Supplementary materials for this article are available online.
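Schematically (our paraphrase of the polyhedral framework, with notation not drawn from the article itself), a selection event such as "these variables, with these signs, were chosen after k steps" is expressed as

$$\{\,y : \hat{A}(y) = A\,\} \;=\; \{\,y : \Gamma y \le u\,\},$$

for a matrix $\Gamma$ and vector $u$ determined by the selected model. Conditional on this event, a linear contrast $\eta^\top y$ of the Gaussian vector $y$ follows a normal distribution truncated to a computable interval, and the corresponding truncated-Gaussian CDF transform of $\eta^\top y$ is exactly uniform under the null; this is the source of the finite-sample p-values and, by inversion, the confidence intervals.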
Predictive Inference with the Jackknife+
This paper introduces the jackknife+, which is a novel method for constructing predictive confidence intervals. Whereas the jackknife outputs an interval centered at the predicted response of a test point, with the width of the interval determined by the quantiles of leave-one-out residuals, the jackknife+ also uses the leave-one-out predictions at the test point to account for the variability in the fitted regression function. Assuming exchangeable training samples, we prove that this crucial modification permits rigorous coverage guarantees regardless of the distribution of the data points, for any algorithm that treats the training points symmetrically. Such guarantees are not possible for the original jackknife and we demonstrate examples where the coverage rate may actually vanish. Our theoretical and empirical analysis reveals that the jackknife and the jackknife+ intervals achieve nearly exact coverage and have similar lengths whenever the fitting algorithm obeys some form of stability. Further, we extend the jackknife+ to K-fold cross validation and similarly establish rigorous coverage properties. Our methods are related to cross-conformal prediction proposed by Vovk (Ann. Math. Artif. Intell. 74 (2015) 9–28) and we discuss connections.
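A minimal sketch of the jackknife+ construction described above (illustrative Python assuming scikit-learn; the helper name and the linear regression base learner are placeholders, not code from the paper):

```python
# Jackknife+ sketch: combine leave-one-out residuals with leave-one-out
# predictions at the test point, then take lower/upper empirical quantiles.
import numpy as np
from sklearn.linear_model import LinearRegression

def jackknife_plus_interval(X, y, x_test, alpha=0.1):
    n = len(y)
    lo_vals, hi_vals = [], []
    for i in range(n):
        keep = np.delete(np.arange(n), i)
        model = LinearRegression().fit(X[keep], y[keep])           # leave-one-out fit
        r_i = abs(y[i] - model.predict(X[i].reshape(1, -1))[0])    # LOO residual
        mu_i = model.predict(x_test.reshape(1, -1))[0]             # LOO prediction at x_test
        lo_vals.append(mu_i - r_i)
        hi_vals.append(mu_i + r_i)

    # Quantile ranks floor(alpha(n+1)) and ceil((1-alpha)(n+1)), clipped to [1, n].
    k_lo = max(int(np.floor(alpha * (n + 1))), 1)
    k_hi = min(int(np.ceil((1 - alpha) * (n + 1))), n)
    return np.sort(lo_vals)[k_lo - 1], np.sort(hi_vals)[k_hi - 1]
```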
A Significance Test for the Lasso
In the sparse linear regression setting, we consider testing the significance of the predictor variable that enters the current lasso model, in the sequence of models visited along the lasso solution path. We propose a simple test statistic based on lasso fitted values, called the covariance test statistic, and show that when the true model is linear, this statistic has an Exp(1) asymptotic distribution under the null hypothesis (the null being that all truly active variables are contained in the current lasso model). Our proof of this result for the special case of the first predictor to enter the model (i.e., testing for a single significant predictor variable against the global null) requires only weak assumptions on the predictor matrix X. On the other hand, our proof for a general step in the lasso path places further technical assumptions on X and the generative model, but still allows for the important high-dimensional case p > n, and does not necessarily require that the current lasso model achieves perfect recovery of the truly active variables. Of course, for testing the significance of an additional variable between two nested linear models, one typically uses the chi-squared test, comparing the drop in residual sum of squares (RSS) to a $\chi_1^2$ distribution. But when this additional variable is not fixed, and has been chosen adaptively or greedily, this test is no longer appropriate: adaptivity makes the drop in RSS stochastically much larger than $\chi_1^2$ under the null hypothesis. Our analysis explicitly accounts for adaptivity, as it must, since the lasso builds an adaptive sequence of linear models as the tuning parameter λ decreases. In this analysis, shrinkage plays a key role: though additional variables are chosen adaptively, the coefficients of lasso active variables are shrunken due to the ℓ₁ penalty. Therefore, the test statistic (which is based on lasso fitted values) is in a sense balanced by these two opposing properties, adaptivity and shrinkage, and its null distribution is tractable and asymptotically Exp(1).
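For reference, the covariance test statistic at the kth step of the lasso path has the form (our transcription, under the usual notation: σ² is the noise variance, λ_{k+1} the next knot in the path, A the active set just before it, and β̃_A the lasso solution fit on X_A alone at λ_{k+1}):

$$T_k \;=\; \frac{\big\langle y,\, X\hat{\beta}(\lambda_{k+1})\big\rangle \;-\; \big\langle y,\, X_A\tilde{\beta}_A(\lambda_{k+1})\big\rangle}{\sigma^2},$$

and the main result is that $T_k$ is asymptotically $\mathrm{Exp}(1)$ under the null described above.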
Uniform Asymptotic Inference and the Bootstrap After Model Selection
Recently, Tibshirani et al. [J. Amer. Statist. Assoc. 111 (2016) 600–620] proposed a method for making inferences about parameters defined by model selection, in a typical regression setting with normally distributed errors. Here, we study the large sample properties of this method, without assuming normality. We prove that the test statistic of Tibshirani et al. (2016) is asymptotically valid, as the number of samples n grows and the dimension d of the regression problem stays fixed. Our asymptotic result holds uniformly over a wide class of nonnormal error distributions. We also propose an efficient bootstrap version of this test that is provably (asymptotically) conservative, and in practice, often delivers shorter intervals than those from the original normality-based approach. Finally, we prove that the test statistic of Tibshirani et al. (2016) does not enjoy uniform validity in a high-dimensional setting, when the dimension d is allowed to grow.
Degrees of Freedom in Lasso Problems
We derive the degrees of freedom of the lasso fit, placing no assumptions on the predictor matrix X. Like the well-known result of Zou, Hastie and Tibshirani [Ann. Statist. 35 (2007) 2173-2192], which gives the degrees of freedom of the lasso fit when X has full column rank, we express our result in terms of the active set of a lasso solution. We extend this result to cover the degrees of freedom of the generalized lasso fit for an arbitrary predictor matrix X (and an arbitrary penalty matrix D). Though our focus is degrees of freedom, we establish some intermediate results on the lasso and generalized lasso that may be interesting on their own.
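In symbols (our paraphrase of the form the lasso result takes), the fit $X\hat{\beta}$ has degrees of freedom

$$\mathrm{df}(X\hat{\beta}) \;=\; \mathbb{E}\big[\mathrm{rank}(X_A)\big],$$

where $A$ is the active set of a lasso solution at the given tuning parameter value; when $X$ has full column rank, $\mathrm{rank}(X_A) = |A|$, recovering the expected active-set size of Zou, Hastie and Tibshirani.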
The Solution Path of the Generalized Lasso
We present a path algorithm for the generalized lasso problem. This problem penalizes the ℓ₁ norm of a matrix D times the coefficient vector, and has a wide range of applications, dictated by the choice of D. Our algorithm is based on solving the dual of the generalized lasso, which greatly facilitates computation of the path. For D = I (the usual lasso), we draw a connection between our approach and the well-known LARS algorithm. For an arbitrary D, we derive an unbiased estimate of the degrees of freedom of the generalized lasso fit. This estimate turns out to be quite intuitive in many applications.
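Concretely, the generalized lasso problem referred to above is

$$\hat{\beta} \;=\; \operatorname*{argmin}_{\beta} \; \frac{1}{2}\|y - X\beta\|_2^2 \;+\; \lambda\,\|D\beta\|_1,$$

where the penalty matrix $D$ determines the application: $D = I$ recovers the usual lasso, first-difference matrices give the fused lasso, and higher-order discrete difference matrices give trend filtering.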