Catalogue Search | MBRL

Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy

by Varotto, Giulia , Susi, Gianluca , Panzica, Ferruccio in Classification , Convulsions & seizures , Datasets

2021

Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of patients with focal epilepsies who underwent surgery. Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered. Results: Both resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic resampling method.
Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
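
The threshold-based metrics named in this abstract (F-measure, Geometric Mean, Balanced Accuracy) can all be derived from the four counts of a binary confusion matrix. The sketch below is only an illustration under common textbook definitions; the function name `imbalance_metrics` and the toy labels are assumptions, not taken from the study, and AUC is omitted because it needs classifier scores rather than hard predictions.

```python
import math

def imbalance_metrics(y_true, y_pred, positive=1):
    """F-measure, geometric mean and balanced accuracy for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard against empty denominators so degenerate inputs return 0.0.
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall on the minority class
    specificity = tn / (tn + fp) if tn + fp else 0.0  # recall on the majority class
    precision = tp / (tp + fp) if tp + fp else 0.0

    denom = precision + sensitivity
    return {
        "f_measure": 2 * precision * sensitivity / denom if denom else 0.0,
        "g_mean": math.sqrt(sensitivity * specificity),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Hypothetical toy labels: 2 positive (minority) and 8 negative (majority) samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = imbalance_metrics(y_true, y_pred)
```

On these toy labels plain accuracy is 0.7, while balanced accuracy is 0.625 and the F-measure is 0.4, which shows why such metrics are preferred when one class dominates.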

Journal Article


Robust password security: a genetic programming approach with imbalanced dataset handling

by Baressi Šegota, Sandi , Car, Zlatan , Anđelić, Nikola in Access control , Accuracy , Artificial intelligence

2024

Developing a method for determining password strength using artificial intelligence (AI) is crucial as it enhances cybersecurity by providing a more robust defense against unauthorized access. AI can analyze complex patterns and trends, allowing for the identification of weak passwords and potential vulnerabilities more effectively than traditional methods. This proactive approach helps users and organizations strengthen their security posture, reducing the risk of data breaches and unauthorized intrusions. In this paper, the genetic programming symbolic classifier (GPSC) was applied to a publicly available dataset to obtain a set of symbolic expressions (SEs) for password strength classification with high classification accuracy. One of the problems with the dataset was an imbalance between classes, so various oversampling/undersampling techniques were utilized. The optimal GPSC hyperparameter values were found using the random hyperparameter value search method. The algorithm was trained using fivefold cross-validation (5FCV). To evaluate the obtained SEs, the evaluation metrics accuracy, area under the receiver operating characteristic curve, precision, recall, and F1-score were used. The SEs obtained on the balanced dataset variations achieved high classification accuracy (0.99), and applying all SEs to the entire original imbalanced dataset achieved the same accuracy.

Journal Article


Machine Learning Algorithms Analysis of Synthetic Minority Oversampling Technique (SMOTE): Application to Credit Default Prediction

by Emmanuel de-Graft Johnson Owusu-Ansah , Yaa Kyere Adwubi , Doamekpor, Richard in credit scoring, smote, oversampling, undersampling, class imbalance, machine learning algorithms

2024

Credit default prediction is an important problem in financial risk management. It aims to determine the possibility of borrowers failing on their loan commitments. However, the dataset used to guide the machine learning modeling procedure for data-driven support suffers from class imbalance. Class imbalance in machine learning is an unbalanced distribution of classes within a dataset. This problem often arises in classification tasks if the distribution of classes or labels in a dataset is not uniform. To overcome this issue, the data can be resampled by adding entries to the minority class or removing entries from the majority class. The present study examines the efficacy of classification algorithms employing various data balancing approaches. The dataset was collected from a well-known commercial bank in Ghana. To resolve the imbalance, three data balancing approaches were used: under-sampling, oversampling, and the synthetic minority oversampling technique (SMOTE). The findings show that, with the exception of the SMOTE dataset, XGBoost consistently outperformed the other classifiers across the other datasets in terms of AUC. Random forest, decision tree, and logistic regression all performed well and might be utilized as alternatives to XGBoost classifiers for developing credit scoring models. The findings demonstrate that classifiers trained on balanced datasets have higher sensitivity scores than those trained on the original skewed dataset, while maintaining their capacity to differentiate between defaulters and non-defaulters. This demonstrates the value of data balancing strategies in increasing models' ability to anticipate minority class occurrences. Hence, the major finding, that oversampling outperforms under-sampling across classifiers and evaluation measures, is affirmed.
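
The core idea of SMOTE mentioned in this abstract is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The following is a minimal sketch of that idea only, not the study's pipeline and not a library implementation; the function name `smote_sketch`, the parameter defaults, and the toy points are assumptions made for illustration.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours.

    Assumes numeric features and at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class samples with two numeric features each.
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]]
synthetic = smote_sketch(minority, n_new=5)
```

Each synthetic point lies on a line segment between two real minority samples, so new points stay inside the region the minority class already occupies rather than duplicating existing rows.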

Journal Article


Innovating Intrusion Detection Classification Analysis for an Imbalanced Data Sample

by Humayed, Abdulmalik A. , Adam, Yagoub Abbker , Isah, Ibrahim in Algorithms , Big Data , Case studies

2025

This work is designed to assist researchers and interested learners in comprehending and putting deep machine learning classification approaches into practice. It aimed to simplify, facilitate, and advance classification methodology skills. To make it easier for the users to understand, it employed a methodical approach. The categorization assessment measures seek to give the fundamentals of these measures and demonstrate how they operate to function as a comprehensive resource for academics interested in this area. Intrusion detection and threat analysis (IDAT) is a particularly unpleasant cybersecurity issue. In this study, IDAT is identified as a case study, and a real-sample dataset that was used for institutional and community awareness was generated by the researchers. This review shows that, to solve a classification problem, it is crucial to use the output of classification in terms of performance measurements, encompassing both conventional criteria and contemporary metrics. This study focused on addressing the dynamic of classification assessment capabilities for using both scalars and visual metrics, and to fix imbalanced dataset difficulties. In conclusion, this review is a useful tool for researchers, especially when they are working on big data preprocessing, handling imbalanced data for multiclass assessment, and ML classification.

Journal Article


A Comparison of Undersampling, Oversampling, and SMOTE Methods for Dealing with Imbalanced Classification in Educational Data Mining

by Wongvorachan, Tarid , He, Surina , Bulut, Okan in Accuracy , Algorithms , Bias

2023

Educational data mining is capable of producing useful data-driven applications (e.g., early warning systems in schools or the prediction of students’ academic achievement) based on predictive models. However, the class imbalance problem in educational datasets could hamper the accuracy of predictive models as many of these models are designed on the assumption that the predicted class is balanced. Although previous studies proposed several methods to deal with the imbalanced class problem, most of them focused on the technical details of how to improve each technique, while only a few focused on the application aspect, especially for the application of data with different imbalance ratios. In this study, we compared several sampling techniques to handle the different ratios of the class imbalance problem (i.e., moderately or extremely imbalanced classifications) using the High School Longitudinal Study of 2009 dataset. For our comparison, we used random oversampling (ROS), random undersampling (RUS), and the combination of the synthetic minority oversampling technique for nominal and continuous (SMOTE-NC) and RUS as a hybrid resampling technique. We used the Random Forest as our classification algorithm to evaluate the results of each sampling technique. Our results show that random oversampling for moderately imbalanced data and hybrid resampling for extremely imbalanced data seem to work best. The implications for educational data mining applications and suggestions for future research are discussed.
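
Random oversampling (ROS) and random undersampling (RUS), two of the techniques compared in this abstract, can be sketched in a few lines. The code below is an illustrative assumption of how each might be written with the standard library only; the function names and toy data are hypothetical, and the hybrid SMOTE-NC plus RUS procedure from the study is not reproduced here.

```python
import random
from collections import Counter

def _group_by_class(X, y):
    """Collect the rows belonging to each class label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return groups

def random_undersample(X, y, seed=0):
    """Drop rows at random until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        for row in rng.sample(rows, target):  # sampling without replacement
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out

def random_oversample(X, y, seed=0):
    """Duplicate rows at random until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:  # originals plus sampled duplicates
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out

# Hypothetical toy data: 8 majority rows (label 0) and 2 minority rows (label 1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_rus, y_rus = random_undersample(X, y)
X_ros, y_ros = random_oversample(X, y)
rus_counts = Counter(y_rus)
ros_counts = Counter(y_ros)
```

RUS shrinks the data to 2 rows per class and discards majority information, while ROS grows it to 8 rows per class by repeating minority rows. In practice, resampling should be applied to the training split only, so that evaluation data keep the original class distribution.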

Journal Article


The balancing trick: Optimized sampling of imbalanced datasets—A brief survey of the recent State of the Art

by Susan, Seba , Kumar, Amitesh in class‐imbalance problem , hybrid sampling , imbalanced data

2021

This survey paper focuses on one of the current primary issues challenging data mining researchers experimenting on real‐world datasets. The problem is that of imbalanced class distribution that generates a bias toward the majority class due to insufficient training samples from the minority class. The current machine learning and deep learning algorithms are trained on datasets that are insufficiently represented in certain categories. On the other hand, some other classes have surplus samples due to the ready availability of data from these categories. Conventional solutions suggest undersampling of the majority class and/or oversampling of the minority class for balancing the class distribution prior to the learning phase. Though this problem of uneven class distribution is, by and large, ignored by researchers focusing on the learning technology, a need has now arisen for incorporating balance correction and data pruning procedures within the learning process itself. This paper surveys a plethora of conventional and recent techniques that address this issue through intelligent representations of samples from the majority and minority classes, that are given as input to the learning module. The application of nature‐inspired evolutionary algorithms to intelligent sampling is examined, and so are hybrid sampling strategies that select and retain the difficult‐to‐learn samples and discard the easy‐to‐learn samples. The findings by various researchers are summarized to a logical end, and various possibilities and challenges for future directions in research are outlined.

Journal Article


Resampling imbalanced data for network intrusion detection datasets

by Bagui, Sikha , Li, Kunqi in Adaptive sampling , Adjustment , Artificial neural networks

2021

Machine learning plays an increasingly significant role in the building of Network Intrusion Detection Systems. However, machine learning models trained with imbalanced cybersecurity data cannot recognize minority data, hence attacks, effectively. One way to address this issue is to use resampling, which adjusts the ratio between the different classes, making the data more balanced. This research looks at resampling’s influence on the performance of Artificial Neural Network multi-class classifiers. The resampling methods, random undersampling, random oversampling, random undersampling and random oversampling, random undersampling with Synthetic Minority Oversampling Technique, and random undersampling with Adaptive Synthetic Sampling Method were used on benchmark Cybersecurity datasets, KDD99, UNSW-NB15, UNSW-NB17 and UNSW-NB18. Macro precision, macro recall, macro F1-score were used to evaluate the results. The patterns found were: First, oversampling increases the training time and undersampling decreases the training time; second, if the data is extremely imbalanced, both oversampling and undersampling increase recall significantly; third, if the data is not extremely imbalanced, resampling will not have much of an impact; fourth, with resampling, mostly oversampling, more of the minority data (attacks) were detected.
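
Macro-averaged precision, recall, and F1-score, the metrics used in this abstract, give every class equal weight regardless of its size, which is why they expose missed minority classes. The sketch below assumes the common convention in which macro F1 is the unweighted mean of the per-class F1 values; the function name `macro_scores` and the toy labels are hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1) over all observed classes."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # A class that is never predicted (or never present) scores 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Hypothetical traffic labels: the single "probe" sample is missed entirely.
y_true = ["normal"] * 6 + ["dos"] * 3 + ["probe"]
y_pred = ["normal"] * 6 + ["dos", "dos", "normal"] + ["normal"]
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
```

Plain accuracy on these toy labels is 0.8, yet macro recall is only about 0.56 because the missed minority class counts as much as the large "normal" class.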

Journal Article


Hybrid clustering strategies for effective oversampling and undersampling in multiclass classification

by Salehi, Amirreza , Khedmati, Majid in 639/705/1041 , 639/705/1042 , 639/705/1046

2025

Multiclass imbalance is a challenging problem in real-world datasets, where certain classes may have a low number of samples because they correspond to rare occurrences. To address the challenge of multiclass imbalance, this paper introduces a novel hybrid cluster-based oversampling and undersampling (HCBOU) technique. By clustering and separating classes into majority and minority categories, this algorithm retains the most information during undersampling while generating efficient data in the minority class. The classification is carried out using one-vs-one and one-vs-all decomposition schemes. Extensive experimentation was carried out on 30 datasets to evaluate the proposed algorithm's performance. The results were subsequently compared with those of several state-of-the-art algorithms. Based on the results, the proposed algorithm outperforms the competing algorithms under different scenarios. Finally, the HCBOU algorithm demonstrated robust performance across varying class imbalance levels, highlighting its effectiveness in handling imbalanced datasets.
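
The one-vs-one and one-vs-all decomposition schemes mentioned in this abstract turn a multiclass task into a set of binary tasks. The sketch below only enumerates those binary problems for a hypothetical list of class names; it does not implement HCBOU or any classifier, and the function names are assumptions.

```python
from itertools import combinations

def one_vs_all(classes):
    """One binary problem per class: that class against all other classes pooled."""
    return [(c, tuple(o for o in classes if o != c)) for c in classes]

def one_vs_one(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

# Hypothetical class names for a four-class task.
classes = ["A", "B", "C", "D"]
ova = one_vs_all(classes)
ovo = one_vs_one(classes)
```

For K classes, one-vs-all yields K binary problems and one-vs-one yields K(K-1)/2, so the four classes here give 4 and 6 problems respectively. Note that one-vs-all tends to make each binary problem more imbalanced, since a single class is set against all the others combined.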

Journal Article


CDSMOTE: class decomposition and synthetic minority class oversampling technique for imbalanced-data classification

by Moreno-Garcia, Carlos Francisco , Elyan, Eyad , Jayne, Chrisina in Artificial Intelligence , Classification , Computational Biology/Bioinformatics

2021

Class-imbalanced datasets are common across several domains such as health, banking, security, and others. The dominance of majority class instances (negative class) often results in biased learning models, and therefore, classifying such datasets requires employing some methods to combat the problem. In this paper, we propose a new hybrid approach aiming at reducing the dominance of the majority class instances using class decomposition and increasing the minority class instances using an oversampling method. Unlike other undersampling methods, which suffer data loss, our method preserves the majority class instances, yet significantly reduces their dominance, resulting in a more balanced dataset and hence improving the results. A large-scale experiment using 60 public datasets was carried out to validate the proposed methods. The results across three standard evaluation metrics are comparable or superior to those of other common and state-of-the-art techniques.

Journal Article


Federated learning model for credit card fraud detection with data balancing techniques

by Fouad, Khaled M. , Elbably, Doaa L. , Abdul Salam, Mustafa in Accuracy , Artificial Intelligence , Classification

2024

In recent years, credit card transaction fraud has resulted in massive losses for both consumers and banks. Consequently, both cardholders and banks need a strong fraud detection system to reduce cardholder losses. Credit card fraud detection (CCFD) is an important method of fraud prevention. However, there are many challenges in developing an ideal fraud detection system for banks. First, due to data security and privacy concerns, various banks and other financial institutions are typically not permitted to exchange their transaction datasets. These issues make it difficult for traditional systems to learn and detect fraud patterns. Therefore, this paper proposes federated learning for CCFD over different frameworks (TensorFlow federated, PyTorch). Second, there is a significant imbalance in credit card transactions across all banks, with a small percentage of fraudulent transactions greatly outnumbered by the majority of valid ones. In order to demonstrate the urgent need for a comprehensive investigation of class imbalance management techniques to develop a powerful model to identify fraudulent transactions, the dataset must be balanced. In order to address the issue of class imbalance, this study also seeks to give a comparative analysis of several individual and hybrid resampling techniques. In several experimental studies, the effectiveness of various resampling techniques in combination with classification approaches has been compared. In this study, it is found that the hybrid resampling methods perform well for machine learning classification models compared to deep learning classification models. The experimental results show that the best accuracies for the Random Forest (RF), Logistic Regression (LR), K-Nearest Neighbors (KNN), Decision Tree (DT), and Gaussian Naive Bayes (NB) classifiers are 99.99%, 94.61%, 99.96%, 99.98%, and 91.47%, respectively.
The comparative results show that RF outperforms NB, LR, DT, and KNN on the performance measures (accuracy, recall, precision, and F-score). RF achieves the minimum loss values with all resampling techniques, and the results, when utilizing the proposed models on the entire skewed dataset, achieved preferable outcomes to the unbalanced dataset. Furthermore, the PyTorch framework achieves higher prediction accuracy for the federated learning model than the TensorFlow federated framework but with more computational time.

Journal Article
