Catalogue Search | MBRL
Explore the vast range of titles available.
39,879 result(s) for "Data points"
Mitigating the Multicollinearity Problem and Its Machine Learning Approach: A Review
by Cheng, Wai Khuen; Chan, Jireh Yi-Le; Leow, Steven Mun Hong
in Artificial intelligence, Big Data, Data analysis
2022
Technology has driven big-data collection across many fields, such as genomics and business intelligence, resulting in a significant increase in the number of variables and data points (observations) collected and stored. Although this presents opportunities to better model the relationship between predictors and response variables, it also causes serious problems during data analysis, one of which is multicollinearity. The two main approaches used to mitigate multicollinearity are variable selection methods and modified estimator methods. However, variable selection may negate the effort of collecting more data, since new variables may eventually be dropped from the model, while recent studies suggest that machine learning optimization approaches handle multicollinear data better than statistical estimators. This study therefore details the chronological development of methods for mitigating the effects of multicollinearity and offers up-to-date recommendations.
Journal Article
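As a concrete illustration of the two mitigation families contrasted in this review, here is a minimal sketch (not from the paper; it assumes NumPy, scikit-learn, and statsmodels) that flags collinear predictors with variance inflation factors and then fits a ridge estimator, one of the modified-estimator methods:

# Minimal sketch: diagnosing multicollinearity with VIF, then using a
# ridge (modified) estimator. Synthetic data; not from the reviewed paper.
import numpy as np
from sklearn.linear_model import Ridge
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2 * x1 - x2 + 0.5 * x3 + rng.normal(scale=0.1, size=n)

# Variance inflation factor: values well above ~10 signal multicollinearity.
for i in range(X.shape[1]):
    print(f"VIF(x{i + 1}) = {variance_inflation_factor(X, i):.1f}")

# Ridge regression shrinks the coefficients and stabilizes the estimates.
model = Ridge(alpha=1.0).fit(X, y)
print("ridge coefficients:", model.coef_)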
A Review of Local Outlier Factor Algorithms for Outlier Detection in Big Data Streams
2021
Outlier detection is a statistical procedure that aims to find suspicious events or items that deviate from the normal pattern of a dataset. It has drawn considerable interest in the fields of data mining and machine learning, and it is important in many applications, including fraud detection in credit card transactions and network intrusion detection. There are two general types of outlier detection: global and local. Global outliers fall outside the normal range for an entire dataset, whereas local outliers may fall within the normal range for the entire dataset but outside the normal range for the surrounding data points. This paper addresses local outlier detection. The best-known technique for local outlier detection is the Local Outlier Factor (LOF), a density-based technique. Many LOF algorithms exist for static data environments; however, these algorithms cannot be applied directly to data streams, an important type of big data. In general, local outlier detection algorithms for data streams are still deficient, and better algorithms need to be developed that can effectively analyze high-velocity data streams to detect local outliers. This paper presents a literature review of local outlier detection algorithms in static and stream environments, with an emphasis on LOF algorithms. It collects and categorizes existing local outlier detection algorithms and analyzes their characteristics. Furthermore, the paper discusses the advantages and limitations of those algorithms and proposes several promising directions for developing improved local outlier detection methods for data streams.
Journal Article
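For readers who want to try the static-environment baseline the survey builds on, a minimal LOF sketch using scikit-learn follows; the streaming variants the paper reviews require incremental algorithms not shown here:

# Minimal static LOF example with scikit-learn; synthetic data.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=-6, high=6, size=(5, 2))
X = np.vstack([inliers, outliers])

# LOF compares each point's local density with that of its neighbours;
# fit_predict returns -1 for points judged to be local outliers.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)
print("flagged as outliers:", np.where(labels == -1)[0])
print("LOF scores (more negative = more anomalous):",
      lof.negative_outlier_factor_[labels == -1])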
Improving big citizen science data: Moving beyond haphazard sampling
by Major, Richard E.; Rowley, Jodi J. L.; Callaghan, Corey T.
in Bias, Biodiversity, Biology and Life Sciences
2019
Citizen science is mainstream: millions of people contribute data to a growing array of citizen science projects annually, forming massive datasets that will drive research for years to come. Many citizen science projects implement a "leaderboard" framework, ranking contributions by the number of records or species to encourage further participation. But is every data point equally "valuable"? Citizen scientists collect data with distinct spatial and temporal biases, leading to unfortunate gaps and redundancies that create statistical and informational problems for downstream analyses. Up to this point, the haphazard structure of the data has been seen as an unfortunate but unchangeable aspect of citizen science data. However, we argue here that this issue can actually be addressed: we provide a very simple, tractable framework that broadscale citizen science projects could adopt to allow citizen scientists to optimize the marginal value of their efforts, increasing the overall collective knowledge.
Journal Article
A literature review on one-class classification and its potential applications in big data
by Abdollah Zadeh, Azadeh; Seliya, Naeem; Khoshgoftaar, Taghi M.
in Application, Big Data, Class imbalance
2021
In severely imbalanced datasets, traditional binary or multi-class classification typically leads to bias towards the class(es) with the much larger number of instances, making it very difficult to model and detect instances of the minority class. One-class classification (OCC) is an approach that detects abnormal data points relative to the instances of the known class, and it can serve to address issues related to severely imbalanced datasets, which are especially common in big data. We present a detailed survey of OCC-related literature published over approximately the last decade. We group the different works into three categories: outlier detection, novelty detection, and deep learning and OCC. We closely examine and evaluate selected works on OCC so that a good cross section of approaches, methods, and application domains is represented in the survey. Commonly used OCC techniques for outlier detection and for novelty detection, respectively, are discussed. One area we observed to be largely omitted from the OCC-related literature is the application context of big data and its inherently associated problems, such as severe class imbalance, class rarity, noisy data, feature selection, and data reduction. We feel the survey will be appreciated by researchers working in these areas of big data.
Journal Article
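As a hedged illustration of the OCC idea, the following sketch trains scikit-learn's OneClassSVM on the known class only and flags deviating points; it is a generic example, not one of the surveyed methods:

# One-class classification sketch: fit on the known (majority) class
# only, then flag points that deviate from it. Synthetic data.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(4)
normal_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # known class
test = np.vstack([rng.normal(size=(10, 2)),          # more normal points
                  rng.uniform(-8, 8, size=(5, 2))])  # abnormal points

# nu bounds the fraction of training points treated as outliers.
occ = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normal_train)
pred = occ.predict(test)  # +1 = looks like the known class, -1 = abnormal
print("predictions:", pred)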
A new approach to data processing: density-based spatial clustering of applications with noise (DBSCAN) clustering using game theory
2025
Due to the unpredictable growth of data in various fields, rapid clustering of big data is needed in order to identify the hidden structure of data and discover the relationships between objects. Among clustering methods, density-based methods offer acceptable processing speed for high-dimensional big data. However, some methods rely on fixed parameters that are not optimal for all regions of the data, and the complexity of these clustering methods depends strongly on the number of objects. In this paper, a clustering method is presented that improves clustering performance and reduces parameter sensitivity: using game theory and the concepts of Nash equilibrium and dense games, the optimal clustering parameter is selected and noise points are distinguished from cluster points. The method includes (1) searching the grid for cells that contain no cluster, (2) identifying players via high-density data points in order to determine the parameters, (3) combining clusters to construct the game, and (4) merging nearby clusters. The performance of the proposed method was evaluated on four large synthetic datasets and eight real datasets, both labeled and unlabeled. The results indicate the superiority of the proposed method over the SOM, k-means, DBSCAN, and SCGPSC methods in terms of accuracy, purity, and processing time.
Journal Article
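The game-theoretic parameter selection is the paper's contribution and is not reproduced here; the sketch below shows plain DBSCAN in scikit-learn, making explicit the eps and min_samples parameters such a method would tune, with noise points labelled -1:

# Baseline DBSCAN with scikit-learn; the paper's Nash-equilibrium-based
# parameter selection is not public here and is not implemented.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1_000, centers=4, cluster_std=0.5,
                  random_state=0)

# eps: neighbourhood radius; min_samples: density threshold for a core
# point. Points belonging to no dense region are labelled -1 (noise).
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters found:", n_clusters,
      "| noise points:", int((db.labels_ == -1).sum()))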
Efficient algorithm for big data clustering on single machine
by Alguliyev, Rasim M.; Sukhostat, Lyudmila V.; Aliguliyev, Ramiz M.
in Accelerometers, Algorithms, Big Data
2020
Big data analysis requires large computing power, which is not always available, so new clustering algorithms capable of processing such data are needed. This study proposes a new parallel clustering algorithm based on the k-means algorithm that significantly reduces the exponential growth of computations. The proposed algorithm splits a dataset into batches while preserving the characteristics of the initial dataset, increasing the clustering speed. The idea is to define cluster centroids for each batch and then cluster those centroids themselves; each data point is assigned to the cluster with the nearest resulting centroid. Real large datasets are used in experiments to evaluate the effectiveness of the approach, which is compared with k-means and a modification of it. The experiments show that the proposed algorithm is a promising tool for clustering large datasets in comparison with the k-means algorithm.
Journal Article
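A rough sketch of the two-level batch idea described in the abstract, not the authors' code: cluster each batch with k-means, cluster the resulting batch centroids, and assign every point to the nearest final centroid (assumes NumPy and scikit-learn):

# Two-level batch k-means sketch, following the abstract's description.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(100_000, 8))
k, batch_size = 10, 10_000

# Level 1: k-means on each batch, keeping only the batch centroids.
batch_centroids = []
for start in range(0, len(X), batch_size):
    batch = X[start:start + batch_size]
    km = KMeans(n_clusters=k, n_init=3, random_state=0).fit(batch)
    batch_centroids.append(km.cluster_centers_)
batch_centroids = np.vstack(batch_centroids)

# Level 2: cluster the collected centroids to get the final k centers.
final = KMeans(n_clusters=k, n_init=10, random_state=0).fit(batch_centroids)

# Each data point belongs to the cluster with the nearest final centroid.
labels = final.predict(X)
print("cluster sizes:", np.bincount(labels))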
A transcriptomics data-driven gene space accurately predicts liver cytopathology and drug-induced liver injury
by Wennerberg, Krister; Grafström, Roland C.; Kaski, Samuel
in 631/337/2019, 692/53/2423, 692/699/1503/1607
2017
Predicting unanticipated harmful effects of chemicals and drug molecules is a difficult and costly task. Here we utilize a ‘big data compacting and data fusion’ concept to capture diverse adverse outcomes on cellular and organismal levels. The approach generates, from a transcriptomics data set, a ‘predictive toxicogenomics space’ (PTGS) tool composed of 1,331 genes distributed over 14 overlapping cytotoxicity-related gene space components. Involving ∼2.5 × 10⁸ data points and 1,300 compounds to construct and validate the PTGS, the tool serves to: explain dose-dependent cytotoxicity effects, provide a virtual cytotoxicity probability estimate intrinsic to omics data, predict chemically induced pathological states in the liver resulting from repeated dosing of rats, and, furthermore, predict human drug-induced liver injury (DILI) from hepatocyte experiments. Analysing 68 DILI-annotated drugs, the PTGS tool outperforms and complements existing tests, leading to a hitherto unseen level of DILI prediction accuracy.
Predicting the hepatotoxic effects of new drugs is still a challenge. Using toxicogenomics data, the authors here define a predictive toxicogenomics space (PTGS), the component gene space capturing dose-dependent cytotoxicity, and demonstrate that it can be used to accurately predict drug-induced liver pathology, including human drug-induced liver injury, from in vitro data.
Journal Article
Adaptive Radial Basis Function Partition of Unity Interpolation: A Bivariate Algorithm for Unstructured Data
2021
In this article we present a new adaptive algorithm for solving 2D interpolation problems on large scattered data sets through the radial basis function partition of unity method. Unlike other, time-consuming schemes, this adaptive method can efficiently deal with scattered data points whose density varies strongly across the domain. This is achieved by decomposing the underlying domain into subdomains of variable size so as to guarantee a suitable number of points within each of them. The localization of these points is done by means of an efficient search procedure that depends on a partition of the domain into square cells. For each subdomain, the adaptive process identifies a predefined neighborhood consisting of one or more levels of neighboring cells, which allows us to quickly find all the subdomain points. The algorithm is further devised to select optimal local shape parameters for the radial basis function interpolants via leave-one-out cross validation and maximum likelihood estimation. Numerical experiments show good performance of this adaptive algorithm on test examples with different data distributions, and the efficacy of our interpolation scheme is also demonstrated by solving real-world applications.
Journal Article
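As a simple point of reference, the sketch below interpolates 2D scattered data with SciPy's RBFInterpolator; the paper's adaptive partition-of-unity decomposition and its LOOCV/MLE shape-parameter selection are not part of this minimal example:

# Minimal 2D scattered-data RBF interpolation with SciPy.
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(3)
pts = rng.uniform(0, 1, size=(500, 2))            # scattered data points
vals = np.sin(6 * pts[:, 0]) * np.cos(6 * pts[:, 1])

# 'neighbors' restricts each evaluation to nearby centres, a cheap
# stand-in for the locality that partition-of-unity subdomains provide.
interp = RBFInterpolator(pts, vals, neighbors=50, kernel="thin_plate_spline")

grid = np.mgrid[0:1:50j, 0:1:50j].reshape(2, -1).T
approx = interp(grid)
exact = np.sin(6 * grid[:, 0]) * np.cos(6 * grid[:, 1])
print("max abs error on grid:", np.abs(approx - exact).max())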
FAIR Data Point: A FAIR-Oriented Approach for Metadata Publication
by da Silva Santos, Luiz Olavo Bonino; Burger, Kees; Wilkinson, Mark D.
in Access control, Application programming interface, Data points
2023
Metadata, i.e. data about other digital objects, play an important role in FAIR, with a direct relation to all FAIR principles. In this paper we present and discuss the FAIR Data Point (FDP), a software architecture that aims to define a common approach for publishing semantically rich and machine-actionable metadata according to the FAIR principles. We present the core components and features of the FDP, its approach to metadata provision, the criteria for evaluating whether an application adheres to the FDP specifications, and the service to register, index, and search the metadata content of available FDPs.
Journal Article
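A hedged sketch of how a client might consume FDP metadata: FDPs serve machine-actionable metadata as RDF over HTTP, so an HTTP request with content negotiation plus an RDF parser suffices; the endpoint URL below is hypothetical, and 'requests' and 'rdflib' are assumed to be installed:

# Fetch an FDP's metadata record as Turtle and list the typed resources
# (e.g., the FDP itself, catalogs) it describes. Hypothetical endpoint.
import requests
from rdflib import Graph
from rdflib.namespace import RDF

FDP_URL = "https://example.org/fdp"  # hypothetical FDP instance

# Content negotiation: ask the FDP for its metadata as Turtle.
resp = requests.get(FDP_URL, headers={"Accept": "text/turtle"}, timeout=30)
resp.raise_for_status()

graph = Graph()
graph.parse(data=resp.text, format="turtle")

for subject, _, rdf_class in graph.triples((None, RDF.type, None)):
    print(subject, "->", rdf_class)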
Enhanced interpretation of novel datasets by summarizing clustering results using deep-learning based linguistic models
2025
In today’s technology-driven era, the proliferation of data is inevitable across various domains. Within engineering, the sciences, and business, particularly in the context of big data, analysis of such data can extract actionable insights that can revolutionize a field. In data management and analysis, patterns or groups of interconnected data points, commonly referred to as clusters, frequently emerge. These clusters are distinct subsets of closely related data points, with characteristics that set them apart from other clusters within the same dataset. Across disciplines such as physics, biology, business, and sales, clustering is important for understanding the essential characteristics of novel datasets, developing complex statistical models, and testing various hypotheses. However, interpreting the characteristics and physical implications of the clusters generated by different clustering algorithms is challenging for researchers unfamiliar with these algorithms’ inner workings. This research addresses the difficulty of comprehending data clustering, cluster attributes, and evaluation metrics, especially for individuals lacking proficiency in clustering or related disciplines such as statistics. The primary objective of this study is to simplify cluster analysis by furnishing users or analysts from diverse domains with succinct linguistic synopses of clustering results, avoiding the need for intricate numerical or mathematical terms. Deep learning techniques based on large language models, such as encoder-decoders (for example, the T5 model) and generative pre-trained transformers (GPTs), are employed to achieve this. The study constructs a summarization model capable of ingesting data clusters and producing a condensed overview of the contained insights in a simplified, easily understandable linguistic format. The evaluation revealed a clear preference among evaluators for the summaries generated by GPT, with T5 summaries following closely behind; both scored well on fluency, capturing the original content in a human-like manner. In contrast, while it provides a structured framework for summarization, the linguistic protoform-based approach fell short of the quality and coherence of the GPT and T5 summaries.
Journal Article
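As a minimal sketch of the summarization step with one of the two model families compared in the study (a T5 encoder-decoder via Hugging Face transformers, which is assumed to be installed); the cluster report below is invented for the demo:

# Summarize a flat textual rendering of one cluster's attributes with T5.
from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")

# Invented cluster statistics, rendered as plain text for the model.
cluster_report = (
    "Cluster 3 contains 412 data points. Mean age is 34.2 years, mean "
    "income is 58,300 dollars, and 71 percent of members made a purchase "
    "in the last month. The cluster is compact, with low variance in age."
)

# T5 expects the task prefix 'summarize: ' before the input text.
print(summarizer("summarize: " + cluster_report,
                 max_length=40, min_length=10)[0]["summary_text"])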