65,127 results for "Computational statistics"
Data Science in Statistics Curricula: Preparing Students to "Think with Data"
A growing number of students are completing undergraduate degrees in statistics and entering the workforce as data analysts. In these positions, they are expected to understand how to use databases and other data warehouses, scrape data from Internet sources, program solutions to complex problems in multiple languages, and think algorithmically as well as statistically. These data science topics have not traditionally been a major component of undergraduate programs in statistics. Consequently, a curricular shift is needed to address additional learning outcomes. The goal of this article is to motivate the importance of data science proficiency and to provide examples and resources for instructors to implement data science in their own statistics curricula. We provide case studies from seven institutions. These varied approaches to teaching data science demonstrate curricular innovations to address new needs. Also included here are examples of assignments designed for courses that foster engagement of undergraduates with data and data science. [Received November 2014. Revised July 2015.]
Probabilistic Integration
A research frontier has emerged in scientific computation, wherein discretisation error is regarded as a source of epistemic uncertainty that can be modelled. This raises several statistical challenges, including the design of statistical methods that enable the coherent propagation of probabilities through a (possibly deterministic) computational work-flow, in order to assess the impact of discretisation error on the computer output. This paper examines the case for probabilistic numerical methods in routine statistical computation. Our focus is on numerical integration, where a probabilistic integrator is equipped with a full distribution over its output that reflects the fact that the integrand has been discretised. Our main technical contribution is to establish, for the first time, rates of posterior contraction for one such method. Several substantial applications are provided for illustration and critical evaluation, including examples from statistical modelling, computer graphics and a computer model for an oil reservoir.
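For readers new to the idea, the sketch below illustrates the basic probabilistic-integration construction with a toy Bayesian quadrature routine: a Gaussian-process prior is placed on the integrand, conditioning on a handful of evaluations yields a Gaussian posterior over the integral on [0,1], and the posterior variance reflects the fact that the integrand was only discretised. The RBF kernel, lengthscale, and grid-based approximation of the kernel means are illustrative assumptions of this sketch, not the paper's methods (closed forms exist for common kernel/measure pairs).

import numpy as np

def rbf(a, b, ell=0.2):
    """Squared-exponential kernel between two 1-D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def bayesian_quadrature(f, nodes, ell=0.2, jitter=1e-10):
    """Posterior mean and variance of the integral of f over [0, 1] under a GP(0, rbf) prior."""
    grid = np.linspace(0.0, 1.0, 1001)            # fine grid to approximate kernel-mean integrals
    y = f(nodes)
    K = rbf(nodes, nodes, ell) + jitter * np.eye(nodes.size)
    z = rbf(nodes, grid, ell).mean(axis=1)        # z_i ~= integral of k(x_i, x) dx
    kk = rbf(grid, grid, ell).mean()              # ~= double integral of k(x, x') dx dx'
    post_mean = z @ np.linalg.solve(K, y)         # posterior mean of the integral
    post_var = kk - z @ np.linalg.solve(K, z)     # posterior variance (up to grid/jitter error)
    return post_mean, post_var

# e.g. bayesian_quadrature(np.sin, np.linspace(0.0, 1.0, 8))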
Regression shrinkage and selection via the lasso: a retrospective
In the paper I give a brief review of the basic idea and some history, and then discuss some developments since the original paper on regression shrinkage and selection via the lasso.
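As a refresher on the mechanism behind the shrinkage and selection, here is a minimal cyclic coordinate-descent solver for the lasso objective (1/(2n))||y - Xb||^2 + lam*||b||_1, built around the scalar soft-thresholding operator. It assumes standardised columns and omits an intercept and convergence checks; it is a sketch of the standard algorithm, not code from the paper.

import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the scalar building block of lasso shrinkage."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/(2n))||y - Xb||^2 + lam * ||b||_1 (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n            # per-column curvature X_j'X_j / n
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]           # form the partial residual without coordinate j
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]           # put the updated coordinate back
    return beta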
Logistic Regression: From Art to Science
A high quality logistic regression model contains various desirable properties: predictive power, interpretability, significance, robustness to error in data and sparsity, among others. To achieve these competing goals, modelers incorporate these properties iteratively as they hone in on a final model. In the period 1991–2015, algorithmic advances in Mixed-Integer Linear Optimization (MILO) coupled with hardware improvements have resulted in an astonishing 450 billion factor speedup in solving MILO problems. Motivated by this speedup, we propose modeling logistic regression problems algorithmically with a mixed integer nonlinear optimization (MINLO) approach in order to explicitly incorporate these properties in a joint, rather than sequential, fashion. The resulting MINLO is flexible and can be adjusted based on the needs of the modeler. Using both real and synthetic data, we demonstrate that the overall approach is generally applicable and provides high quality solutions in realistic timelines as well as a guarantee of suboptimality. When the MINLO is infeasible, we obtain a guarantee that imposing distinct statistical properties is simply not feasible.
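To make the "joint rather than sequential" point concrete, the toy routine below solves a cardinality-constrained logistic regression by brute-force enumeration of support sets, so sparsity is imposed inside the optimisation instead of being bolted on afterwards. It is only a stand-in for the paper's MINLO formulation (which a MILO/MINLO solver would handle at realistic scale); the use of scikit-learn with a large C to approximate an unpenalised fit is an assumption of this sketch.

from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def best_subset_logistic(X, y, k):
    """Brute-force solution of: minimise logistic loss(beta) subject to ||beta||_0 <= k."""
    best_loss, best_support, best_model = np.inf, None, None
    for support in combinations(range(X.shape[1]), k):
        cols = list(support)
        # a very large C makes the default l2 penalty negligible (approximately unpenalised fit)
        model = LogisticRegression(C=1e6, max_iter=1000).fit(X[:, cols], y)
        loss = log_loss(y, model.predict_proba(X[:, cols]))
        if loss < best_loss:
            best_loss, best_support, best_model = loss, support, model
    return best_support, best_model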
Second-generation PLINK: rising to the challenge of larger and richer datasets
Background: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and more scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. Findings: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). Conclusions: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
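As background on one of the functions named above, here is a compact pure-Python sketch of the exact Hardy-Weinberg equilibrium test in the style of Wigginton, Cutler, and Abecasis (2005), using the standard recurrence over heterozygote counts. It illustrates the statistic itself, not PLINK's bit-parallel, constant-space implementation, and the mid-p variants are omitted.

def hwe_exact_pvalue(n_het, n_hom_rare, n_hom_common):
    """Exact Hardy-Weinberg equilibrium test p-value for one biallelic variant.

    n_het        -- observed heterozygotes
    n_hom_rare   -- homozygotes for the rarer allele
    n_hom_common -- homozygotes for the commoner allele
    """
    n = n_het + n_hom_rare + n_hom_common        # genotyped samples
    n_rare = 2 * n_hom_rare + n_het              # copies of the rarer allele
    # feasible heterozygote counts share the parity of the rare-allele count
    h = n_rare % 2
    probs = {h: 1.0}                             # unnormalised null probabilities
    while h + 2 <= n_rare:
        hom_r = (n_rare - h) // 2                # rare homozygotes at heterozygote count h
        hom_c = n - h - hom_r                    # common homozygotes at heterozygote count h
        # recurrence: P(h + 2) / P(h) = 4 * hom_r * hom_c / ((h + 2) * (h + 1))
        probs[h + 2] = probs[h] * 4.0 * hom_r * hom_c / ((h + 2.0) * (h + 1.0))
        h += 2
    total = sum(probs.values())
    p_obs = probs[n_het] / total
    # two-sided exact p-value: total mass of outcomes no more probable than the observed one
    return sum(p for p in probs.values() if p / total <= p_obs + 1e-12) / total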
OPTIMAL COMPUTATIONAL AND STATISTICAL RATES OF CONVERGENCE FOR SPARSE NONCONVEX LEARNING PROBLEMS
We provide theoretical analysis of the statistical and computational properties of penalized M-estimators that can be formulated as the solution to a possibly nonconvex optimization problem. Many important estimators fall in this category, including least squares regression with nonconvex regularization, generalized linear models with nonconvex regularization and sparse elliptical random design regression. For these problems, it is intractable to calculate the global solution due to the nonconvex formulation. In this paper, we propose an approximate regularization path-following method for solving a variety of learning problems with nonconvex objective functions. Under a unified analytic framework, we simultaneously provide explicit statistical and computational rates of convergence for any local solution attained by the algorithm. Computationally, our algorithm attains a global geometric rate of convergence for calculating the full regularization path, which is optimal among all first-order algorithms. Unlike most existing methods that only attain geometric rates of convergence for one single regularization parameter, our algorithm calculates the full regularization path with the same iteration complexity. In particular, we provide a refined iteration complexity bound to sharply characterize the performance of each stage along the regularization path. Statistically, we provide sharp sample complexity analysis for all the approximate local solutions along the regularization path. In particular, our analysis improves upon existing results by providing a more refined sample complexity bound as well as an exact support recovery result for the final estimator. These results show that the final estimator attains an oracle statistical property due to the usage of nonconvex penalty.
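For intuition about path following with a nonconvex penalty, the sketch below runs plain proximal gradient descent over a decreasing sequence of regularisation parameters with warm starts, using the MCP penalty for least-squares regression. It is a generic illustration of the idea and does not reproduce the paper's approximate path-following algorithm or its stage-wise guarantees; the penalty choice, step-size rule, and iteration counts are assumptions of this sketch.

import numpy as np

def mcp_prox(x, lam, gamma, t):
    """Proximal operator of the MCP penalty (concavity gamma) with step size t < gamma."""
    shrunk = np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0) / (1.0 - t / gamma)
    return np.where(np.abs(x) <= gamma * lam, shrunk, x)

def mcp_path(X, y, lambdas, gamma=3.0, n_iter=500):
    """Proximal gradient over a decreasing lambda sequence, warm-started at each step."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n            # Lipschitz constant of the least-squares gradient
    t = min(1.0 / L, 0.9 * gamma)                # keep t < gamma so the prox formula stays valid
    beta = np.zeros(p)
    path = []
    for lam in lambdas:                          # largest lambda first
        for _ in range(n_iter):
            grad = -X.T @ (y - X @ beta) / n
            beta = mcp_prox(beta - t * grad, lam, gamma, t)
        path.append(beta.copy())
    return np.array(path)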
New Weighted Portmanteau Statistics for Time Series Goodness of Fit Testing
We exploit ideas from high-dimensional data analysis to derive new portmanteau tests that are based on the trace of the square of the mth order autocorrelation matrix. The resulting statistics are weighted sums of the squares of the sample autocorrelation coefficients that, unlike many other tests appearing in the literature, are numerically stable even when the number of lags considered is relatively close to the sample size. The statistics behave asymptotically as a linear combination of chi-squared random variables and their asymptotic distribution can be approximated by a gamma distribution. The proposed tests are modified to check for nonlinearity and to check the adequacy of a fitted nonlinear model. Simulation evidence indicates that the proposed goodness of fit tests tend to have higher power than other tests appearing in the literature, particularly in detecting long-memory nonlinear models. The efficacy of the proposed methods is demonstrated by investigating nonlinear effects in Apple, Inc., and Nikkei-300 daily returns during the 2006-2007 calendar years. The supplementary materials for this article are available online.
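To show what a weighted portmanteau statistic looks like in practice, the sketch below computes a weighted Ljung-Box-type statistic with linearly decaying weights (m - k + 1)/m on the squared sample autocorrelations and approximates its null distribution with a two-moment gamma fit, in the spirit of the tests described above. The specific weights and the absence of a degrees-of-freedom adjustment for fitted models are simplifying assumptions of this sketch, not the authors' exact recipe.

import numpy as np
from scipy import stats

def weighted_ljung_box(x, m):
    """Weighted portmanteau statistic for the first m lags of a raw (unfitted) series."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    denom = xc @ xc
    lags = np.arange(1, m + 1)
    rho = np.array([xc[k:] @ xc[:-k] / denom for k in lags])   # sample autocorrelations
    w = (m - lags + 1) / m                                     # linearly decaying weights
    Q = n * (n + 2) * np.sum(w * rho ** 2 / (n - lags))
    # two-moment gamma approximation to the null distribution of sum_k w_k * chi2_1
    shape = w.sum() ** 2 / (2 * (w ** 2).sum())
    scale = 2 * (w ** 2).sum() / w.sum()
    pval = stats.gamma.sf(Q, a=shape, scale=scale)
    return Q, pval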