75 result(s) for "class‐imbalance problem"
The balancing trick: Optimized sampling of imbalanced datasets—A brief survey of the recent State of the Art
This survey paper focuses on one of the primary issues currently challenging data mining researchers who experiment on real-world datasets: imbalanced class distribution, which generates a bias toward the majority class because of insufficient training samples from the minority class. Current machine learning and deep learning algorithms are trained on datasets in which certain categories are insufficiently represented, while other classes have surplus samples because data from those categories are readily available. Conventional solutions suggest undersampling of the majority class and/or oversampling of the minority class to balance the class distribution prior to the learning phase. Though this problem of uneven class distribution is, by and large, ignored by researchers focusing on the learning technology, a need has now arisen for incorporating balance correction and data pruning procedures within the learning process itself. This paper surveys a plethora of conventional and recent techniques that address this issue through intelligent representations of the majority- and minority-class samples given as input to the learning module. The application of nature-inspired evolutionary algorithms to intelligent sampling is examined, as are hybrid sampling strategies that select and retain the difficult-to-learn samples and discard the easy-to-learn ones. The findings of various researchers are summarized, and possibilities and challenges for future research directions are outlined.
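As a concrete illustration of the conventional balancing step the survey discusses, the following minimal Python sketch (assuming the imbalanced-learn package; it is not the survey's own code) applies random oversampling of the minority class and random undersampling of the majority class to a toy 9:1 dataset:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler

    # Toy dataset with roughly a 9:1 class ratio.
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    print(Counter(y))

    # Balance the class distribution before the learning phase.
    X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
    X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
    print(Counter(y_over), Counter(y_under))
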
Impact of random oversampling and random undersampling on the performance of prediction models developed using observational health data
Background: There is currently no consensus on the impact of class imbalance methods on the performance of clinical prediction models. We aimed to empirically investigate the impact of random oversampling and random undersampling, two commonly used class imbalance methods, on the internal and external validation performance of prediction models developed using observational health data. Methods: We developed and externally validated prediction models for various outcomes of interest within a target population of people with pharmaceutically treated depression across four large observational health databases. We used three different classifiers (lasso logistic regression, random forest, XGBoost) and varied the target imbalance ratio. We evaluated the impact on model performance in terms of discrimination and calibration. Discrimination was assessed using the area under the receiver operating characteristic curve (AUROC), and calibration was assessed using calibration plots. Results: We developed and externally validated a total of 1,566 prediction models. On internal and external validation, random oversampling and random undersampling generally did not result in higher AUROCs. Moreover, we found overestimated risks, although this miscalibration could largely be corrected by recalibrating the models towards the imbalance ratios in the original dataset. Conclusions: Overall, we found that random oversampling or random undersampling generally does not improve the internal and external validation performance of prediction models developed in large observational health databases. Based on our findings, we do not recommend applying random oversampling or random undersampling when developing prediction models in large observational health databases.
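The kind of comparison reported here can be sketched in a few lines; the snippet below (an illustration with plain logistic regression on synthetic data, not the study's pipeline) shows how random oversampling typically leaves the AUROC roughly unchanged while inflating the average predicted risk well above the observed outcome rate:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import RandomOverSampler

    X, y = make_classification(n_samples=20000, weights=[0.95, 0.05], random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

    plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    X_os, y_os = RandomOverSampler(random_state=1).fit_resample(X_tr, y_tr)
    oversampled = LogisticRegression(max_iter=1000).fit(X_os, y_os)

    for name, model in [("no resampling", plain), ("oversampled", oversampled)]:
        p = model.predict_proba(X_te)[:, 1]
        print(name,
              "AUROC=%.3f" % roc_auc_score(y_te, p),
              "mean predicted risk=%.3f" % p.mean(),
              "observed rate=%.3f" % y_te.mean())
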
An Oversampling Method for Class Imbalance Problems on Large Datasets
Several oversampling methods have been proposed for solving the class imbalance problem. However, most of them require searching the k-nearest neighbors to generate synthetic objects, which makes them time-consuming and therefore unsuitable for large datasets. In this paper, an oversampling method for large class imbalance problems that does not require the k-nearest neighbor search is proposed. According to our experiments on large datasets with different levels of imbalance, the proposed method is at least twice as fast as the fastest method reported in the literature while obtaining similar oversampling quality.
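The core idea of avoiding the k-nearest-neighbor search can be illustrated in a few lines of Python; the sketch below simply interpolates between randomly paired minority instances and is an illustration only, not the method proposed in the paper:

    import numpy as np

    def random_pair_oversample(X_minority, n_new, seed=0):
        """Generate n_new synthetic minority samples by linear interpolation
        between two randomly chosen minority instances (no k-NN search)."""
        rng = np.random.default_rng(seed)
        i = rng.integers(0, len(X_minority), size=n_new)
        j = rng.integers(0, len(X_minority), size=n_new)
        lam = rng.random((n_new, 1))
        return X_minority[i] + lam * (X_minority[j] - X_minority[i])

    # Example: make 500 synthetic samples from a small 2-D minority set.
    X_min = np.random.default_rng(1).normal(size=(50, 2))
    X_syn = random_pair_oversample(X_min, n_new=500)
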
Deep-HAR: an ensemble deep learning model for recognizing the simple, complex, and heterogeneous human activities
The recognition of human activities has become a prominent emerging research problem with widely covered application areas in surveillance, wellness management, healthcare, and many more. In real life, activity recognition is challenging because the activities human beings perform are not only simple but also complex and heterogeneous in nature. Most existing approaches address the problem of recognizing only simple, straightforward activities (e.g. walking, running, standing, sitting), whereas recognizing complex and heterogeneous human activities is a challenging research problem addressed by only a limited number of existing works. In this paper, we propose a novel Deep-HAR model that ensembles Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for recognizing simple, complex, and heterogeneous activities. Here, the CNNs are used for extracting features, whereas the RNNs are used for finding useful patterns in time-series sequential data. The activity recognition performance of the proposed model was evaluated using three different publicly available datasets, namely WISDM, PAMAP2, and KU-HAR. Through extensive experiments, we demonstrate that the proposed model performs well in recognizing all types of activities, achieving accuracies of 99.98%, 99.64%, and 99.98% for simple, complex, and heterogeneous activities, respectively.
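A CNN-plus-RNN stack of the kind described can be assembled in a few lines of Keras; the sketch below (layer sizes, window length, and class count are assumptions, not the authors' Deep-HAR configuration) feeds windowed sensor data through convolutional feature extractors and an LSTM before the softmax classifier:

    import tensorflow as tf
    from tensorflow.keras import layers

    n_timesteps, n_channels, n_classes = 128, 3, 6   # hypothetical sensor windows

    model = tf.keras.Sequential([
        layers.Input(shape=(n_timesteps, n_channels)),
        layers.Conv1D(64, kernel_size=5, activation="relu"),   # feature extraction
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(64, kernel_size=5, activation="relu"),
        layers.LSTM(64),                                        # temporal patterns
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
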
Dealing with the Class Imbalance Problem in the Detection of Fake Job Descriptions
In recent years, the detection of fake job descriptions has become increasingly necessary because social networking has changed the way people access burgeoning information in the internet age. Identifying fraud in job descriptions can help jobseekers to avoid many of the risks of job hunting. However, the problem of detecting fake job descriptions comes up against the problem of class imbalance when the number of genuine jobs exceeds the number of fake jobs. This causes a reduction in the predictability and performance of traditional machine learning models. We therefore present an efficient framework that uses an oversampling technique called FJD-OT (Fake Job Description Detection Using Oversampling Techniques) to improve the predictability of detecting fake job descriptions. In the proposed framework, we apply several techniques including the removal of stop words and the use of a tokenizer to preprocess the text data in the first module. We then use a bag of words in combination with the term frequency-inverse document frequency (TF-IDF) approach to extract the features from the text data to create the feature dataset in the second module. Next, our framework applies k-fold cross-validation, a commonly used technique to test the effectiveness of machine learning models, that splits the experimental dataset [the Employment Scam Aegean (ESA) dataset in our study] into training and test sets for evaluation. The training set is passed through the third module, an oversampling module in which the SVMSMOTE method is used to balance data before training the classifiers in the last module. The experimental results indicate that the proposed approach significantly improves the predictability of fake job description detection on the ESA dataset based on several popular performance metrics.
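A pipeline of this shape is easy to sketch with scikit-learn and imbalanced-learn; the snippet below mirrors the described modules (TF-IDF features, SVMSMOTE applied to the training folds only, a classifier, and k-fold cross-validation), but the classifier choice and parameters are assumptions rather than the FJD-OT implementation:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from imblearn.over_sampling import SVMSMOTE
    from imblearn.pipeline import Pipeline

    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english")),   # preprocessing + features
        ("smote", SVMSMOTE(random_state=0)),                # oversamples training folds only
        ("clf", LogisticRegression(max_iter=1000)),         # assumed classifier
    ])

    # With the ESA job descriptions loaded as `texts` and labels as `labels`:
    # scores = cross_val_score(pipe, texts, labels, scoring="f1",
    #                          cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
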
Data Sampling Methods to Deal With the Big Data Multi-Class Imbalance Problem
The class imbalance problem has been a hot topic in the machine learning community in recent years, and in the era of big data and deep learning it remains in force. Much work has been performed to deal with the class imbalance problem, with the random sampling methods (over- and undersampling) being the most widely employed approaches. Moreover, sophisticated sampling methods have been developed, including the Synthetic Minority Over-sampling Technique (SMOTE), and they have also been combined with cleaning techniques such as Edited Nearest Neighbor or Tomek's Links (SMOTE+ENN and SMOTE+TL, respectively). In the big data context, it is noticeable that the class imbalance problem has been addressed by adapting traditional techniques, while intelligent approaches have been relatively ignored. Thus, this work analyzes the capabilities and possibilities of heuristic sampling methods for deep learning neural networks in the big data domain, with particular attention to the cleaning strategies. The study is developed on big, multi-class imbalanced datasets obtained from hyperspectral remote sensing images. The effectiveness of a hybrid approach on these datasets is analyzed, in which the dataset is first processed with SMOTE and an Artificial Neural Network (ANN) is trained with those data; the network's output is then processed with ENN to eliminate output noise, after which the ANN is trained again with the resulting dataset. The obtained results suggest that the best classification outcome is achieved when the cleaning strategies are applied to the ANN output rather than to the input feature space only. Consequently, the classifier's nature clearly needs to be considered when classical class imbalance approaches are adapted to deep learning and big data scenarios.
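For reference, the classical input-space combinations mentioned above (SMOTE+ENN and SMOTE+TL) are available off the shelf; the sketch below uses imbalanced-learn on a synthetic multi-class dataset and does not reproduce the paper's ANN-output variant:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.combine import SMOTEENN, SMOTETomek

    X, y = make_classification(n_samples=5000, n_classes=3, n_informative=6,
                               weights=[0.80, 0.15, 0.05], random_state=0)

    X_enn, y_enn = SMOTEENN(random_state=0).fit_resample(X, y)      # SMOTE + ENN
    X_tl, y_tl = SMOTETomek(random_state=0).fit_resample(X, y)      # SMOTE + TL
    print(Counter(y), Counter(y_enn), Counter(y_tl))
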
Predicting pathological response to neoadjuvant chemoradiotherapy in locally advanced rectal cancer with two step feature selection and ensemble learning
Patients with locally advanced rectal cancer (LARC) show substantial individual variability and a pronounced imbalance in response distribution to neoadjuvant chemoradiotherapy (nCRT), posing significant challenges to treatment response prediction. This study aims to identify effective predictive biomarkers and develop an ensemble learning-based prediction model to assess the response of LARC patients to nCRT. A two-step feature selection method was developed to identify predictive biomarkers by deriving stable reversal gene pairs through within-sample relative expression orderings (REOs) from LARC patients undergoing nCRT. Preliminary screening utilized four methods (MDFS, Boruta, MCFS, and VSOLassoBag) to form a candidate feature set. Secondary screening ranked these features by permutation importance, applying Incremental Feature Selection (IFS) with an Extreme Gradient Boosting (XGBoost) classifier to determine the final predictive gene pairs. The ensemble model BoostForest, combining boosting and bagging, served as the predictive framework, with SHAP employed for interpretability. Through two-step feature selection, the 32-gene pair signature (32-GPS) was established as the final predictive biomarker. In the test set, the model achieved an area under the precision-recall curve (AUPRC) of 0.983 and an accuracy of 0.988. In the validation cohort, the AUPRC was 0.785 with an accuracy of 0.898, indicating strong model performance. The study further demonstrated that BoostForest achieved superior overall performance compared to Random Forest, Support Vector Machine (SVM), and XGBoost. To evaluate the effectiveness of the 32-GPS, its performance was compared with two alternative feature sets: the lasso-gene pair signature (lasso-GPS), derived through lasso regression, and the 15-shared gene pair signature (15-SGPS), consisting of gene pairs identified by all four feature selection methods. The 32-GPS demonstrated superior performance in both comparisons. The two-step feature selection method identified robust predictive biomarkers, and BoostForest outperformed Random Forest, Support Vector Machine, and XGBoost in classification performance and predictive capability.
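The second screening step (permutation-importance ranking followed by incremental feature selection with XGBoost) can be sketched as below; the data, scoring metric, and hyperparameters are placeholders, and BoostForest and the REO-based gene pairs are not reproduced here:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=400, n_features=30, n_informative=8,
                               weights=[0.8, 0.2], random_state=0)

    ranker = XGBClassifier(n_estimators=200, eval_metric="logloss").fit(X, y)
    imp = permutation_importance(ranker, X, y, n_repeats=5, random_state=0)
    order = np.argsort(imp.importances_mean)[::-1]          # rank candidate features

    # Incremental Feature Selection: grow the feature set in ranked order and
    # keep the subset with the best cross-validated AUPRC.
    best_k, best_score = 1, -np.inf
    for k in range(1, len(order) + 1):
        score = cross_val_score(XGBClassifier(n_estimators=200, eval_metric="logloss"),
                                X[:, order[:k]], y, cv=5,
                                scoring="average_precision").mean()
        if score > best_score:
            best_k, best_score = k, score
    selected_features = order[:best_k]
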
A synthetic neighborhood generation based ensemble learning for the imbalanced data classification
Constructing effective classifiers from imbalanced datasets has emerged as one of the main challenges in the data mining community, due to its increased prevalence in various real-world domains. Ensemble solutions are quite often applied in this field for their ability to provide better classification performance than a single classifier. However, most existing methods adopt data sampling to train the base classifiers on balanced datasets rather than to directly enhance diversity, so the performance of the final classifier can be limited. This paper suggests a new ensemble learning method that can address the class imbalance problem and promote diversity simultaneously. Inspired by the localized generalization error model, it generates synthetic samples located within a local area of the training samples and trains the base classifiers on the union of the original training samples and the synthetic neighborhood samples. By controlling the number of generated samples, the base classifiers can be trained with balanced datasets. Meanwhile, as the generated samples can extend different parts of the original input space and can be quite different from the original training samples, the obtained base classifiers are guaranteed to be accurate and diverse. A thorough experimental study on 36 benchmark datasets was performed, and the results demonstrate that the proposed method delivers significantly better performance than state-of-the-art ensemble solutions for imbalanced problems.
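The synthetic-neighborhood idea can be illustrated as follows; this is a simplified sketch (uniform perturbations around randomly chosen minority samples, decision trees as base learners), not the localized generalization error model used in the paper:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    X_min = X[y == 1]
    n_new = int((y == 0).sum() - (y == 1).sum())   # samples needed to balance
    radius = 0.1 * X.std(axis=0)                   # assumed local neighborhood size

    ensemble = []
    for seed in range(10):
        idx = rng.integers(0, len(X_min), size=n_new)
        X_syn = X_min[idx] + rng.uniform(-1.0, 1.0, (n_new, X.shape[1])) * radius
        X_bal = np.vstack([X, X_syn])              # originals + synthetic neighbors
        y_bal = np.concatenate([y, np.ones(n_new, dtype=int)])
        ensemble.append(DecisionTreeClassifier(random_state=seed).fit(X_bal, y_bal))

    def predict(X_test):
        """Majority vote over the diverse, balanced base classifiers."""
        votes = np.mean([clf.predict(X_test) for clf in ensemble], axis=0)
        return (votes >= 0.5).astype(int)
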
Two-Stage Hybrid Data Classifiers Based on SVM and kNN Algorithms
The paper considers a solution to the problem of developing two-stage hybrid SVM-kNN classifiers with the aim of increasing data classification quality by refining classification decisions near the class boundary defined by the SVM classifier. In the first stage, an SVM classifier with default parameter values is developed; its training dataset is designed on the basis of the initial dataset, and either a binary SVM algorithm or a one-class SVM algorithm is used. Based on the results of training the SVM classifier, two variants of the training dataset are formed for developing the kNN classifier: one that uses all objects from the original training dataset located inside the strip dividing the classes, and one that uses only those objects from the initial training dataset that are located inside the area containing all misclassified objects from the class-dividing strip. In the second stage, the kNN classifier is developed using the new training dataset mentioned above, and the values of its parameters are determined during training to maximize data classification quality. The data classification quality of the two-stage hybrid SVM-kNN classifier was assessed using various indicators on the test dataset. If the kNN classifier improves the quality of classification near the class boundary defined by the SVM classifier, the two-stage hybrid SVM-kNN classifier is recommended for further use. The experimental results obtained with various datasets confirm the feasibility of using two-stage hybrid SVM-kNN classifiers in the data classification problem.
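The two-stage routing can be sketched with scikit-learn as below; the strip threshold of |decision value| < 1 and the classifier settings are assumptions for illustration, not the parameter tuning described in the paper:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=3000, weights=[0.85, 0.15], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Stage 1: SVM with default parameter values.
    svm = SVC(kernel="rbf").fit(X_tr, y_tr)

    # Training objects inside the class-dividing strip feed the kNN classifier.
    in_strip = np.abs(svm.decision_function(X_tr)) < 1.0
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr[in_strip], y_tr[in_strip])

    # Stage 2: refine decisions for test points that fall inside the strip.
    pred = svm.predict(X_te)
    near_boundary = np.abs(svm.decision_function(X_te)) < 1.0
    pred[near_boundary] = knn.predict(X_te[near_boundary])
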
OBMI: oversampling borderline minority instances by a two-stage Tomek link-finding procedure for class imbalance problem
Mitigating the impact of class-imbalanced datasets on classifiers poses a challenge to the machine learning community. Conventional classifiers do not perform well, as they are habitually biased toward the majority class. Among existing solutions, the synthetic minority oversampling technique (SMOTE) has shown great potential, aiming to improve the dataset rather than the classifier. However, SMOTE still needs improvement because it oversamples each minority instance equally. Based on the consensus that instances far from the borderline contribute less to classification, a refined method for oversampling borderline minority instances (OBMI) is proposed in this paper using a two-stage Tomek link-finding procedure. In the oversampling stage, the pairs of between-class instances nearest to each other are first found to form Tomek links. Then, the minority instances in these Tomek links are extracted as base instances. Finally, new minority instances are generated, each linearly interpolated between a base instance and one of its minority neighbors. To address the overlap caused by oversampling, Tomek links are employed again in the cleaning stage to remove the borderline instances from both classes. OBMI is compared with ten baseline methods on 17 benchmark datasets. The results show that it performs better on most of the selected datasets in terms of the F1-score and G-mean. Statistical analysis also indicates its higher Friedman ranking.
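The two-stage Tomek-link procedure can be sketched as follows; this is an illustrative reimplementation on synthetic data (neighbor counts and dataset are assumptions), not the authors' OBMI code:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.neighbors import NearestNeighbors
    from imblearn.under_sampling import TomekLinks

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

    # Stage 1a: Tomek links are cross-class pairs that are mutual nearest neighbors.
    nearest = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X, return_distance=False)[:, 1]
    links = [i for i in range(len(X)) if nearest[nearest[i]] == i and y[i] != y[nearest[i]]]
    base = [i for i in links if y[i] == 1]           # borderline minority base instances

    # Stage 1b: interpolate between each base instance and one of its minority neighbors.
    X_min = X[y == 1]
    nn_min = NearestNeighbors(n_neighbors=6).fit(X_min)
    X_new = []
    for i in base:
        neighbors = nn_min.kneighbors(X[i].reshape(1, -1), return_distance=False)[0, 1:]
        j = rng.choice(neighbors)
        X_new.append(X[i] + rng.random() * (X_min[j] - X[i]))
    X_over = np.vstack([X, np.array(X_new)]) if X_new else X
    y_over = np.concatenate([y, np.ones(len(X_new), dtype=int)])

    # Stage 2: clean the overlap by removing Tomek-link members from both classes.
    X_clean, y_clean = TomekLinks(sampling_strategy="all").fit_resample(X_over, y_over)
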