171 result(s) for "high-dimensionality"
An overview of the estimation of large covariance and precision matrices
The estimation of large covariance and precision matrices is fundamental in modern multivariate analysis. Such problems arise in the statistical analysis of large panel economic and financial data. The covariance matrix reveals marginal correlations between variables, while the precision matrix encodes conditional correlations between pairs of variables given the remaining variables. In this paper, we provide a selective review of several recent developments on the estimation of large covariance and precision matrices. We focus on two general approaches: a rank-based method and a factor-model-based method. Theories and applications of both approaches are presented. These methods are expected to be widely applicable to the analysis of economic and financial data.
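The distinction this abstract draws can be made concrete in a few lines of NumPy: the covariance matrix captures marginal association, while its inverse, the precision matrix, encodes conditional association through partial correlations. A minimal sketch on simulated data (the low-dimensional setting with p much smaller than n is an illustrative choice so that the sample covariance is invertible; the high-dimensional regime the paper addresses requires the regularized estimators it reviews):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with correlated coordinates; p << n so the sample covariance
# is invertible (large p would require regularization, as in the paper).
n, p = 500, 5
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))

# Sample covariance: entry (i, j) reflects the marginal association of X_i, X_j.
S = np.cov(X, rowvar=False)

# Precision matrix: a zero off-diagonal entry (i, j) means X_i and X_j are
# conditionally uncorrelated given all remaining coordinates.
Theta = np.linalg.inv(S)

# Partial correlations implied by the precision matrix.
partial_corr = -Theta / np.sqrt(np.outer(np.diag(Theta), np.diag(Theta)))
np.fill_diagonal(partial_corr, 1.0)
print(np.round(partial_corr, 3))
```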
JUST INTERPOLATE
In the absence of explicit regularization, Kernel “Ridgeless” Regression with nonlinear kernels has the potential to fit the training data perfectly. It has been observed empirically, however, that such interpolated solutions can still generalize well on test data. We isolate a phenomenon of implicit regularization for minimum-norm interpolated solutions which is due to a combination of high dimensionality of the input data, curvature of the kernel function and favorable geometric properties of the data such as an eigenvalue decay of the empirical covariance and kernel matrices. In addition to deriving a data-dependent upper bound on the out-of-sample error, we present experimental evidence suggesting that the phenomenon occurs in the MNIST dataset.
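As a rough illustration of the estimator studied in this abstract, the sketch below fits kernel "ridgeless" regression by applying a pseudoinverse to the kernel matrix, which yields the minimum-norm interpolant when no ridge penalty is added; the Gaussian kernel, bandwidth, and synthetic data are placeholder choices rather than the paper's setup:

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    # Pairwise squared Euclidean distances, then the RBF kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

rng = np.random.default_rng(1)
n, d = 100, 10
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

K = gaussian_kernel(X, X)
# "Ridgeless" fit: no lambda * I term; the pseudoinverse returns the
# minimum-norm coefficient vector among all interpolating solutions.
alpha = np.linalg.pinv(K) @ y

X_test = rng.standard_normal((20, d))
y_pred = gaussian_kernel(X_test, X) @ alpha

# The training data are fit (numerically) exactly.
print(np.max(np.abs(K @ alpha - y)))
```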
A review of feature selection methods in medical applications
Feature selection is a preprocessing technique that identifies the key features of a given problem. It has traditionally been applied in a wide range of problems that include biological data processing, finance, and intrusion detection systems. In particular, feature selection has been successfully used in medical applications, where it can not only reduce dimensionality but also help us understand the causes of a disease. We describe some basic concepts related to medical applications and provide some necessary background information on feature selection. We review the most recent feature selection methods developed for and applied in medical problems, covering prolific research fields such as medical imaging, biomedical signal processing, and DNA microarray data analysis. A case study of two medical applications that includes actual patient data is used to demonstrate the suitability of applying feature selection methods in medical problems and to illustrate how these methods work in real-world scenarios.
[Figure omitted: general steps of the feature selection approaches and their main benefits.]
Highlights:
  • A survey of feature selection methods developed and/or applied to medical applications.
  • Background information for researchers who are not familiar enough with certain terms.
  • A case study of two medical applications to demonstrate the adequacy of feature selection in this domain.
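For readers new to the topic, a minimal filter-style feature selection example with scikit-learn is sketched below; the synthetic data and the mutual-information criterion are generic illustrations, not the medical case studies reviewed in the paper:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic high-dimensional classification data: 1000 features, 20 informative.
X, y = make_classification(n_samples=200, n_features=1000, n_informative=20,
                           random_state=0)

# Rank features by mutual information with the label and keep the top 20.
selector = SelectKBest(score_func=mutual_info_classif, k=20)
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)                      # (200, 20)
print(selector.get_support(indices=True))   # indices of the selected features
```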
A hybrid clustering and boosting tree feature selection (CBTFS) method for credit risk assessment with high-dimensionality
To address the problem of high dimensionality in credit risk assessment, a hybrid clustering and boosting tree feature selection method is proposed. In the hybrid methodology, an improved minimum spanning tree model is first used to remove redundant and irrelevant features. Then three embedded feature selection approaches (i.e., Random Forest, XGBoost, and AdaBoost) are used to further enhance the feature-ranking efficiency and obtain better prediction performance by applying the optimal features. For verification purposes, two real-world credit datasets are used to demonstrate the effectiveness of the proposed hybrid clustering and boosting tree feature selection (CBTFS) methodology. Experimental results demonstrate that the proposed method is superior to other classic feature selection methods, indicating that CBTFS can serve as a promising tool for handling high-dimensional data in credit risk assessment. First published online 12 February 2025.
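The embedded-ranking stage described above can be sketched roughly with scikit-learn by averaging feature importances across tree ensembles; the synthetic data, the parameter values, and the omission of the minimum-spanning-tree clustering step are simplifications, not the authors' exact CBTFS procedure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Stand-in for a high-dimensional credit dataset.
X, y = make_classification(n_samples=500, n_features=200, n_informative=15,
                           random_state=0)

rankers = [RandomForestClassifier(n_estimators=200, random_state=0),
           AdaBoostClassifier(n_estimators=200, random_state=0)]

# Average the embedded importance scores across the ensembles.
importances = np.mean([r.fit(X, y).feature_importances_ for r in rankers], axis=0)

# Keep the 20 features with the largest averaged importance.
top = np.argsort(importances)[::-1][:20]
print(sorted(top.tolist()))
```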
GAUSSIAN APPROXIMATIONS AND MULTIPLIER BOOTSTRAP FOR MAXIMA OF SUMS OF HIGH-DIMENSIONAL RANDOM VECTORS
We derive a Gaussian approximation result for the maximum of a sum of high-dimensional random vectors. Specifically, we establish conditions under which the distribution of the maximum is approximated by that of the maximum of a sum of the Gaussian random vectors with the same covariance matrices as the original vectors. This result applies when the dimension of random vectors (p) is large compared to the sample size (n); in fact, p can be much larger than n, without restricting correlations of the coordinates of these vectors. We also show that the distribution of the maximum of a sum of the random vectors with unknown covariance matrices can be consistently estimated by the distribution of the maximum of a sum of the conditional Gaussian random vectors obtained by multiplying the original vectors with i.i.d. Gaussian multipliers. This is the Gaussian multiplier (or wild) bootstrap procedure. Here too, p can be large or even much larger than n. These distributional approximations, either Gaussian or conditional Gaussian, yield a high-quality approximation to the distribution of the original maximum, often with approximation error decreasing polynomially in the sample size, and hence are of interest in many applications. We demonstrate how our Gaussian approximations and the multiplier bootstrap can be used for modern high-dimensional estimation, multiple hypothesis testing, and adaptive specification testing. All these results contain nonasymptotic bounds on approximation errors.
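The multiplier (wild) bootstrap described in this abstract admits a very short implementation: redraw i.i.d. standard normal multipliers, form the multiplier-weighted sum, and record its maximum coordinate. A sketch with illustrative choices of n, p, the number of bootstrap draws, and the 95% level:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 2000          # p much larger than n, as in the regime discussed above
X = rng.standard_normal((n, p)) * rng.uniform(0.5, 2.0, size=p)

# Test statistic: maximum coordinate of the normalized sum.
T = np.max(np.abs(X.sum(axis=0)) / np.sqrt(n))

# Multiplier bootstrap: multiply centered rows by i.i.d. standard normals.
Xc = X - X.mean(axis=0)
B = 500
W = np.empty(B)
for b in range(B):
    e = rng.standard_normal(n)
    W[b] = np.max(np.abs(e @ Xc) / np.sqrt(n))

# Bootstrap critical value at the 95% level.
print(T, np.quantile(W, 0.95))
```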
DISTANCE-BASED AND RKHS-BASED DEPENDENCE METRICS IN HIGH DIMENSION
In this paper, we study distance covariance, Hilbert–Schmidt covariance (aka Hilbert–Schmidt independence criterion [In Advances in Neural Information Processing Systems (2008) 585–592]) and related independence tests under the high dimensional scenario. We show that the sample distance/Hilbert–Schmidt covariance between two random vectors can be approximated by the sum of squared componentwise sample cross-covariances up to an asymptotically constant factor, which indicates that the standard distance/Hilbert–Schmidt covariance based test can only capture linear dependence in high dimension. Under the assumption that the components within each high dimensional vector are weakly dependent, the distance correlation based t test developed by Székely and Rizzo (J. Multivariate Anal. 117 (2013) 193–213) for independence is shown to have trivial limiting power when the two random vectors are nonlinearly dependent but componentwise uncorrelated. This new and surprising phenomenon, which appears to be discovered and carefully studied here for the first time, is further confirmed in our simulation study. As a remedy, we propose tests based on an aggregation of marginal sample distance/Hilbert–Schmidt covariances and show their superior power behavior against their joint counterparts in simulations. We further extend the distance correlation based t test to those based on Hilbert–Schmidt covariance and marginal distance/Hilbert–Schmidt covariance. A novel unified approach is developed to analyze the studentized sample distance/Hilbert–Schmidt covariance as well as the studentized sample marginal distance covariance under both the null and alternative hypotheses. Our theoretical and simulation results shed light on the limitations of distance/Hilbert–Schmidt covariance when used jointly in the high dimensional setting and suggest the aggregation of marginal distance/Hilbert–Schmidt covariance as a useful alternative.
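A compact sketch of the sample distance covariance (the V-statistic form built from double-centered pairwise distance matrices) that the abstract analyzes; the dimensions and the particular nonlinear dependence used in the comparison are illustrative:

```python
import numpy as np
from scipy.spatial.distance import cdist

def distance_covariance(X, Y):
    """Sample distance covariance (V-statistic form) between the rows of X and Y."""
    A = cdist(X, X)                     # pairwise Euclidean distances within X
    B = cdist(Y, Y)                     # pairwise Euclidean distances within Y
    # Double-center each distance matrix.
    A = A - A.mean(axis=0) - A.mean(axis=1)[:, None] + A.mean()
    B = B - B.mean(axis=0) - B.mean(axis=1)[:, None] + B.mean()
    return np.sqrt(np.mean(A * B))

rng = np.random.default_rng(0)
n = 300
X = rng.standard_normal((n, 50))
Y = X ** 2 + 0.1 * rng.standard_normal((n, 50))   # nonlinear, componentwise dependence
Z = rng.standard_normal((n, 50))                  # independent of X

print(distance_covariance(X, Y), distance_covariance(X, Z))
```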
Feature selection for high-dimensional classification using a competitive swarm optimizer
When solving many machine learning problems such as classification, there is often a large number of input features. However, not all features are relevant for solving the problem, and including irrelevant features can sometimes deteriorate the learning performance. Therefore, it is essential to select the most relevant features, which is known as feature selection. Many feature selection algorithms have been developed, including evolutionary algorithms and particle swarm optimization (PSO) algorithms, to find a subset of the most important features for accomplishing a particular machine learning task. However, traditional PSO does not perform well on large-scale optimization problems, which degrades its effectiveness for feature selection when the number of features increases dramatically. In this paper, we propose to use a recent PSO variant, the competitive swarm optimizer (CSO), which was designed for large-scale optimization, to solve high-dimensional feature selection problems. In addition, the CSO, which was originally developed for continuous optimization, is adapted to perform feature selection, which can be considered a combinatorial optimization problem. An archive technique is also introduced to reduce computational cost. Experiments on six benchmark datasets demonstrate that, compared to a canonical PSO-based method and a state-of-the-art PSO variant for feature selection, the proposed CSO-based feature selection algorithm not only selects a much smaller number of features but also achieves better classification performance.
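The pairwise-competition update at the core of CSO can be sketched compactly for feature selection: particles are paired at random, the loser of each pair learns from the winner and from the swarm mean, and a threshold turns each continuous position into a feature mask. The fitness function, threshold, and parameter values below are simplified placeholders, not the paper's exact algorithm (which also uses an archive to cut computational cost):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           random_state=0)

def fitness(pos):
    mask = pos > 0.5                       # continuous position -> feature mask
    if not mask.any():
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=5)
    acc = cross_val_score(clf, X[:, mask], y, cv=3).mean()
    return acc - 0.01 * mask.mean()        # penalize large feature subsets

n_particles, dim, phi = 20, X.shape[1], 0.1
pos = rng.uniform(0, 1, (n_particles, dim))
vel = np.zeros((n_particles, dim))

for it in range(10):
    fit = np.array([fitness(q) for q in pos])
    mean_pos = pos.mean(axis=0)
    order = rng.permutation(n_particles)
    for a, b in zip(order[::2], order[1::2]):
        win, lose = (a, b) if fit[a] >= fit[b] else (b, a)
        r1, r2, r3 = rng.uniform(size=(3, dim))
        # Loser learns from the winner and from the swarm mean; winner survives.
        vel[lose] = r1 * vel[lose] + r2 * (pos[win] - pos[lose]) \
                    + phi * r3 * (mean_pos - pos[lose])
        pos[lose] = np.clip(pos[lose] + vel[lose], 0, 1)

best = pos[np.argmax([fitness(q) for q in pos])]
print("selected features:", np.flatnonzero(best > 0.5))
```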
Feature selection in image analysis: a survey
Image analysis is a prolific field of research that has been broadly studied over the last decades and successfully applied to a great number of disciplines. Since the advent of Big Data, the number of digital images has been growing explosively, and a large amount of multimedia data is publicly available. Not only is it necessary to deal with this increasing number of images, but also to know which features to extract from them, and feature selection can help in this scenario. The goal of this paper is to survey the most recent feature selection methods developed and/or applied to image analysis, covering the most popular fields such as image classification and image segmentation. Finally, an experimental evaluation on several popular datasets using well-known feature selection methods is presented, bearing in mind that the aim is not to determine the best feature selection method but to facilitate comparative studies for the research community.
A comprehensive survey of anomaly detection techniques for high dimensional big data
Anomaly detection in high-dimensional data is becoming a fundamental research problem with various real-world applications. However, many existing anomaly detection techniques fail to retain sufficient accuracy on so-called "big data", characterised by high-volume and high-velocity data generated by a variety of sources. The combination of these two problems can be referred to as the "curse of big dimensionality", which affects existing techniques in terms of both performance and accuracy. To address this gap and to understand the core problem, it is necessary to identify the unique challenges brought by anomaly detection under both high dimensionality and big data. Hence, this survey aims to document the state of anomaly detection in high-dimensional big data by representing the unique challenges using a triangular model of vertices: the problem (big dimensionality), techniques/algorithms (anomaly detection), and tools (big data applications/frameworks). Works that fall directly into any of the vertices, or are closely related to them, are considered for review. Furthermore, the limitations of traditional approaches and current strategies for high-dimensional data are discussed, along with recent techniques and applications for big data required to optimize anomaly detection.
Large covariance estimation by thresholding principal orthogonal complements
The paper deals with the estimation of a high dimensional covariance with a conditional sparsity structure and fast diverging eigenvalues. By assuming a sparse error covariance matrix in an approximate factor model, we allow for the presence of some cross-sectional correlation even after taking out common but unobservable factors. We introduce the principal orthogonal complement thresholding method 'POET' to explore such an approximate factor structure with sparsity. The POET-estimator includes the sample covariance matrix, the factor-based covariance matrix, the thresholding estimator and the adaptive thresholding estimator as specific examples. We provide mathematical insights when the factor analysis is approximately the same as the principal component analysis for high dimensional data. The rates of convergence of the sparse residual covariance matrix and the conditional sparse covariance matrix are studied under various norms. It is shown that the effect of estimating the unknown factors vanishes as the dimensionality increases. The uniform rates of convergence for the unobserved factors and their factor loadings are derived. The asymptotic results are also verified by extensive simulation studies. Finally, a real data application on portfolio allocation is presented.
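The two-step structure of POET, estimating the low-rank factor part from the leading principal components and then thresholding the residual "principal orthogonal complement", can be sketched as follows; the number of factors and the hard threshold are illustrative choices, whereas the paper uses adaptive thresholding and data-driven selection of both:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 200, 400, 3

# Simulated approximate factor model: X = F B' + U.
F = rng.standard_normal((n, K))
B = rng.standard_normal((p, K))
U = 0.5 * rng.standard_normal((n, p))
X = F @ B.T + U

S = np.cov(X, rowvar=False)

# Step 1: spectral decomposition; keep the K leading principal components.
eigval, eigvec = np.linalg.eigh(S)
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]
low_rank = eigvec[:, :K] @ np.diag(eigval[:K]) @ eigvec[:, :K].T

# Step 2: threshold the principal orthogonal complement (residual covariance),
# keeping the diagonal untouched.
resid = S - low_rank
tau = 0.1                                    # illustrative hard threshold
off = np.where(np.abs(resid) >= tau, resid, 0.0)
np.fill_diagonal(off, np.diag(resid))

Sigma_POET = low_rank + off
print(np.linalg.matrix_rank(low_rank), np.count_nonzero(off) / off.size)
```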