Catalogue Search | MBRL

A comprehensive survey of anomaly detection techniques for high dimensional big data

by Branch, Philip , Thudumu, Srikanth , Jin, Jiong in Accuracy , Algorithms , Anomalies

2020

Anomaly detection in high dimensional data is becoming a fundamental research problem that has various applications in the real world. However, many existing anomaly detection techniques fail to retain sufficient accuracy due to so-called “big data” characterised by high-volume and high-velocity data generated by a variety of sources. This phenomenon of having both problems together can be referred to as the “curse of big dimensionality,” which affects existing techniques in terms of both performance and accuracy. To address this gap and to understand the core problem, it is necessary to identify the unique challenges brought by anomaly detection with both high dimensionality and big data problems. Hence, this survey aims to document the state of anomaly detection in high dimensional big data by representing the unique challenges using a triangular model of vertices: the problem (big dimensionality), techniques/algorithms (anomaly detection), and tools (big data applications/frameworks). Authors’ work that falls directly into any of the vertices or is closely related to them is taken into consideration for review. Furthermore, the limitations of traditional approaches and current strategies for high dimensional data are discussed along with recent techniques and applications on big data required for the optimization of anomaly detection.

Journal Article

Simplifying the representation of complex free-energy landscapes using sketch-map

by Tribello, Gareth A , Ceriotti, Michele , Parrinello, Michele in Algorithms , Dimensionality , Dimensionality reduction

2011

A new scheme, sketch-map, for obtaining a low-dimensional representation of the region of phase space explored during an enhanced dynamics simulation is proposed. We show evidence, from an examination of the distribution of pairwise distances between frames, that some features of the free-energy surface are inherently high-dimensional. This makes dimensionality reduction problematic because the data does not satisfy the assumptions made in conventional manifold learning algorithms. We therefore propose that when dimensionality reduction is performed on trajectory data one should think of the resultant embedding as a quickly sketched set of directions rather than a road map. In other words, the embedding tells one about the connectivity between states but does not provide the vectors that correspond to the slow degrees of freedom. This realization informs the development of sketch-map, which endeavors to reproduce the proximity information from the high-dimensionality description in a space of lower dimensionality even when a faithful embedding is not possible.

Journal Article

A SELECTIVE OVERVIEW OF VARIABLE SELECTION IN HIGH DIMENSIONAL FEATURE SPACE

by Lv, Jinchi , Fan, Jianqing in Algorithms , Consistent estimators , Dimensionality

2010

High dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discoveries. The traditional idea of best subset selection methods, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality. They have been widely applied for simultaneously selecting important variables and estimating their effects in high dimensional statistical inference. In this article, we present a brief account of the recent developments of theory, methods, and implementations for high dimensional variable selection. What limits of the dimensionality such methods can handle, what the role of penalty functions is, and what the statistical properties are rapidly drive the advances of the field. The properties of non-concave penalized likelihood and its roles in high dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high dimensional variable selection, with emphasis on independence screening and two-scale methods.

Journal Article

Hierarchical Clustering and Dimensionality Reduction for SARS-CoV-2 Genome Analysis Across Highly Affected Nations

by V, Venkataramanan , M, Dillibabu , J, Srinivasan

2025

The global pandemic caused by the novel coronavirus SARS-CoV-2 has prompted extensive research into its genetic diversity to support drug development and vaccination strategies. In this study, we analyze the genetic similarity patterns of SARS-CoV-2 genome sequences from six severely affected nations: USA, Italy, Spain, France, Germany, and the UK. A total of 359 complete human host SARS-CoV-2 genome sequences, ranging from 29,538 to 29,987 base pairs, are processed using k-mer representation, with k = 2 (dinucleotides) and k = 3 (codons). These representations are converted into 50-dimensional feature vectors. To identify intrinsic patterns within this high-dimensional dataset, we apply agglomerative hierarchical clustering using average linkage. A Silhouette score of 0.48 and a Hopkins statistic of 0.85 indicate moderate clustering tendency and structure. Four primary clusters are identified, highlighting notable genomic similarities. Specifically, sequences from the USA, Spain, and Italy predominantly group together, suggesting shared genetic traits. To further aid interpretation, we apply dimensionality reduction techniques—Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE)—which project the high-dimensional feature vectors into 2-dimensional space. Visualizations confirm the clustering structure, with USA, Spain, and Italy forming a distinct and tight cluster, while sequences from France, Germany, and the UK show more dispersed patterns. This study provides a quantitative and visual understanding of SARS-CoV-2 genetic diversity across heavily impacted nations. The combination of k-mer-based feature encoding, hierarchical clustering, and dimensionality reduction offers actionable insights that may inform more targeted therapeutic and vaccine design strategies.
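The k-mer encoding step described in this abstract can be illustrated with a minimal sketch. The function names `kmer_frequencies` and `encode` are hypothetical and not taken from the study. The sketch simply concatenates normalised counts for all 16 dinucleotides and all 64 trinucleotides, which gives 80 values per sequence; the abstract does not say how the authors arrive at 50-dimensional vectors, so that step is not reproduced here.

```python
from itertools import product


def kmer_frequencies(sequence, k):
    """Return normalised k-mer frequencies over ACGT, in lexicographic k-mer order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    # Slide a window of width k along the sequence.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def encode(sequence):
    """Concatenate dinucleotide (k = 2) and trinucleotide (k = 3) frequencies."""
    return kmer_frequencies(sequence, 2) + kmer_frequencies(sequence, 3)
```

Vectors produced this way could then be passed to any hierarchical clustering or PCA routine; the choice of library for those steps is left open here.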

Journal Article

Demixed principal component analysis of neural population data

by Qi, Xue-Lian , Constantinidis, Christos , Romo, Ranulfo in Animals , Datasets as Topic , Decision Making - physiology

2016

Neurons in higher cortical areas, such as the prefrontal cortex, are often tuned to a variety of sensory and motor variables, and are therefore said to display mixed selectivity. This complexity of single neuron responses can obscure what information these areas represent and how it is represented. Here we demonstrate the advantages of a new dimensionality reduction technique, demixed principal component analysis (dPCA), that decomposes population activity into a few components. In addition to systematically capturing the majority of the variance of the data, dPCA also exposes the dependence of the neural representation on task parameters such as stimuli, decisions, or rewards. To illustrate our method we reanalyze population data from four datasets comprising different species, different cortical areas and different experimental tasks. In each case, dPCA provides a concise way of visualizing the data that summarizes the task-dependent features of the population response in a single figure.
Many neuroscience experiments today involve using electrodes to record from the brain of an animal, such as a mouse or a monkey, while the animal performs a task. The goal of such experiments is to understand how a particular brain region works. However, modern experimental techniques allow the activity of hundreds of neurons to be recorded simultaneously. Analysing such large amounts of data then becomes a challenge in itself. This is particularly true for brain regions such as the prefrontal cortex that are involved in the cognitive processes that allow an animal to acquire knowledge. Individual neurons in the prefrontal cortex encode many different types of information relevant to a given task. Imagine, for example, that an animal has to select one of two objects to obtain a reward. The same group of prefrontal cortex neurons will encode the object presented to the animal, the animal’s decision and its confidence in that decision. This simultaneous representation of different elements of a task is called a ‘mixed’ representation, and is difficult to analyse. Kobak, Brendel et al. have now developed a data analysis tool that can ‘demix’ neural activity. The tool breaks down the activity of a population of neurons into its individual components. Each of these relates to only a single aspect of the task and is thus easier to interpret. Information about stimuli, for example, is distinguished from information about the animal’s confidence levels. Kobak, Brendel et al. used the demixing tool to reanalyse existing datasets recorded from several different animals, tasks and brain regions. In each case, the tool provided a complete, concise and transparent summary of the data. The next steps will be to apply the analysis tool to new datasets to see how well it performs in practice. At a technical level, the tool could also be extended in a number of different directions to enable it to deal with more complicated experimental designs in future.

Journal Article

ON THE RATE OF CONVERGENCE OF FULLY CONNECTED DEEP NEURAL NETWORK REGRESSION ESTIMATES

by Langer, Sophie , Kohler, Michael in Artificial neural networks , Computer architecture , Convergence

2021

Recent results in nonparametric regression show that deep learning, that is, neural network estimates with many hidden layers, is able to circumvent the so-called curse of dimensionality provided that suitable restrictions on the structure of the regression function hold. One key feature of the neural networks used in these results is that their network architecture has a further constraint, namely the network sparsity. In this paper, we show that similar results can also be obtained for least squares estimates based on simple fully connected neural networks with ReLU activation functions. Here, either the number of neurons per hidden layer is fixed and the number of hidden layers tends to infinity suitably fast for sample size tending to infinity, or the number of hidden layers is bounded by some logarithmic factor in the sample size and the number of neurons per hidden layer tends to infinity suitably fast for sample size tending to infinity. The proof is based on new approximation results concerning deep neural networks.

Journal Article

Autoencoders and their applications in machine learning: a survey

by Salehi, Elaheh Sadat , Daneshfar, Fatemeh , Berahmand, Kamal in Algorithms , Anomalies , Artificial Intelligence

2024

Autoencoders have become an actively researched topic in unsupervised learning due to their ability to learn data features and act as a dimensionality reduction method. Despite the rapid evolution of autoencoder methods, there has yet to be a complete study that provides a full autoencoder roadmap for both stimulating technical improvements and orienting newcomers to the field. In this paper, we present a comprehensive survey of autoencoders, starting with an explanation of the principle of the conventional autoencoder and its primary development process. We then provide a taxonomy of autoencoders based on their structures and principles and thoroughly analyze and discuss the related models. Furthermore, we review the applications of autoencoders in various fields, including machine vision, natural language processing, complex networks, recommender systems, speech processing, anomaly detection, and others. Lastly, we summarize the limitations of current autoencoder algorithms and discuss the future directions of the field.

Journal Article

A review of unsupervised feature selection methods

by Martínez-Trinidad, José Fco , Carrasco-Ochoa, J. Ariel , Solorio-Fernández, Saúl in Algorithms , Artificial intelligence , Classification

2020

In recent years, unsupervised feature selection methods have attracted considerable interest in many research areas; this is mainly due to their ability to identify and select relevant features without needing class label information. In this paper, we provide a comprehensive and structured review of the most relevant and recent unsupervised feature selection methods reported in the literature. We present a taxonomy of these methods and describe their main characteristics and the fundamental ideas they are based on. Additionally, we summarize the advantages and disadvantages of the general lines into which we have categorized the methods analyzed in this review. Moreover, an experimental comparison among the most representative methods of each approach is also presented. Finally, we discuss some important open challenges in this research area.

Journal Article

Feature Selection for Varying Coefficient Models With Ultrahigh-Dimensional Covariates

by Li, Runze , Liu, Jingyuan , Wu, Rongling in Body mass index , Conditional correlation , Correlation coefficients

2014

This article is concerned with feature screening and variable selection for varying coefficient models with ultrahigh-dimensional covariates. We propose a new feature screening procedure for these models based on the conditional correlation coefficient. We systematically study the theoretical properties of the proposed procedure, and establish its sure screening property and ranking consistency. To enhance the finite sample performance of the proposed procedure, we further develop an iterative feature screening procedure. Monte Carlo simulation studies were conducted to examine the performance of the proposed procedures. In practice, we advocate a two-stage approach for varying coefficient models. The two-stage approach consists of (a) reducing the ultrahigh dimensionality by using the proposed procedure and (b) applying regularization methods for dimension-reduced varying coefficient models to make statistical inferences on the coefficient functions. We illustrate the proposed two-stage approach by a real data example. Supplementary materials for this article are available online.

Journal Article

High-Dimensional Classification Using Features Annealed Independence Rules

by Fan, Yingying , Fan, Jianqing in 62F12 , 62G08 , 62J12

2008

Classification using high-dimensional features arises frequently in many contemporary statistical studies such as tumor classification using microarray or other high-throughput data. The impact of dimensionality on classification is poorly understood. In a seminal paper, Bickel and Levina [Bernoulli 10 (2004) 989-1010] show that the Fisher discriminant performs poorly due to diverging spectra and they propose to use the independence rule to overcome the problem. We first demonstrate that even for the independence classification rule, classification using all the features can be as poor as random guessing due to noise accumulation in estimating population centroids in high-dimensional feature space. In fact, we demonstrate further that almost all linear discriminants can perform as poorly as random guessing. Thus, it is important to select a subset of important features for high-dimensional classification, resulting in Features Annealed Independence Rules (FAIR). The conditions under which all the important features can be selected by the two-sample t-statistic are established. The choice of the optimal number of features, or equivalently, the threshold value of the test statistics, is proposed based on an upper bound of the classification error. Simulation studies and real data analysis support our theoretical results and demonstrate convincingly the advantage of our new classification procedure.
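The idea of ranking features by a two-sample t-statistic and then applying an independence rule to the retained features can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, the number of retained features `m` is supplied by the caller (the error-bound-based choice described in the abstract is not implemented), and the unequal-variance form of the t-statistic is used.

```python
from math import sqrt
from statistics import mean, variance


def t_statistics(class0, class1):
    """Two-sample t-statistic (unequal-variance form) for each feature column."""
    n0, n1 = len(class0), len(class1)
    stats = []
    for col0, col1 in zip(zip(*class0), zip(*class1)):
        se = sqrt(variance(col0) / n0 + variance(col1) / n1)
        stats.append((mean(col1) - mean(col0)) / se if se > 0 else 0.0)
    return stats


def select_features(class0, class1, m):
    """Indices of the m features with the largest absolute t-statistic (m is an assumption)."""
    t = t_statistics(class0, class1)
    ranked = sorted(range(len(t)), key=lambda j: abs(t[j]), reverse=True)
    return sorted(ranked[:m])


def independence_rule(class0, class1, features):
    """Build a classifier that treats the selected features as independent."""
    n0, n1 = len(class0), len(class1)
    params = []
    for j in features:
        col0 = [row[j] for row in class0]
        col1 = [row[j] for row in class1]
        pooled = ((n0 - 1) * variance(col0) + (n1 - 1) * variance(col1)) / (n0 + n1 - 2)
        params.append((j, mean(col0), mean(col1), pooled))

    def classify(x):
        # Sum of per-feature contributions, each scaled by the pooled variance.
        score = sum((m1 - m0) * (x[j] - (m0 + m1) / 2) / v
                    for j, m0, m1, v in params if v > 0)
        return 1 if score > 0 else 0

    return classify
```

Each class is passed as a list of rows (one list of feature values per sample, at least two samples per class); `independence_rule(class0, class1, select_features(class0, class1, m))` returns a function that labels a new row as 0 or 1.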

Journal Article
