Catalogue Search | MBRL
Search Results
Explore the vast range of titles available.
1,676 result(s) for "oversampling"
SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary
by Garcia, Salvador; Herrera, Francisco; Fernandez, Alberto
in Algorithms; Artificial intelligence; Machine learning
2018
The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data, due to the simplicity of its design and its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. It has inspired several approaches to counter the issue of class imbalance, and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data and is featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges to extend SMOTE for Big Data problems.
Journal Article
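The core SMOTE idea this abstract describes, interpolating between a minority sample and one of its k nearest minority neighbours, can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the `smote` helper, its parameters, and the toy data are illustrative, and production use would typically rely on a library such as imbalanced-learn.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    between each minority point and one of its k nearest minority
    neighbours (the core SMOTE idea)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances among minority samples
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # exclude each point itself
    nn = np.argsort(d, axis=1)[:, :k]        # k nearest neighbours per point
    base = rng.integers(0, n, size=n_new)    # pick a base minority point
    neigh = nn[base, rng.integers(0, min(k, n - 1), size=n_new)]
    gap = rng.random((n_new, 1))             # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# toy minority class: 6 points in 2-D
X_min = np.array([[0., 0.], [1., 0.], [0., 1.],
                  [1., 1.], [2., 1.], [1., 2.]])
X_syn = smote(X_min, n_new=10, k=3, rng=0)
print(X_syn.shape)  # (10, 2)
```

Because every synthetic point is a convex combination of two real minority points, the generated samples stay inside the minority region rather than being arbitrary noise.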
An Insider Data Leakage Detection Using One-Hot Encoding, Synthetic Minority Oversampling and Machine Learning Techniques
2021
Insider threats are malicious acts carried out by authorized employees within an organization. They represent a major cybersecurity challenge for private and public organizations, as an insider attack can cause far more damage to organizational assets than an external attack. Most existing approaches in the field of insider threats focus on detecting general insider attack scenarios. However, insider attacks can be carried out in different ways, and the most dangerous is a data leakage attack executed by a malicious insider before leaving an organization. This paper proposes a machine learning-based model for detecting such serious insider threat incidents. The proposed model addresses the possible bias in detection results that can arise from an inappropriate encoding process by employing feature scaling and one-hot encoding techniques. Furthermore, the imbalance in the utilized dataset is addressed using the synthetic minority oversampling technique (SMOTE). Well-known machine learning algorithms are employed to identify the most accurate classifier for detecting data leakage events executed by malicious insiders during the sensitive period before they leave an organization. We provide a proof of concept for our model by applying it to the CMU-CERT Insider Threat Dataset and comparing its performance with the ground truth. The experimental results show that our model detects insider data leakage events with an AUC-ROC value of 0.99, outperforming existing approaches validated on the same dataset. The proposed model provides effective methods to address possible bias and class imbalance issues in devising an effective insider data leakage detection system.
Journal Article
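The one-hot encoding and feature-scaling steps this abstract names can be made concrete with a small sketch. This is not the paper's code: the `one_hot` and `min_max_scale` helpers and the toy role labels are hypothetical, shown only to illustrate the preprocessing idea.

```python
import numpy as np

def one_hot(values):
    """One-hot encode a 1-D sequence of categorical labels.
    Returns the encoded matrix and the category order used."""
    cats = sorted(set(values))
    idx = {c: i for i, c in enumerate(cats)}
    out = np.zeros((len(values), len(cats)))
    out[np.arange(len(values)), [idx[v] for v in values]] = 1.0
    return out, cats

def min_max_scale(X):
    """Scale each feature column to [0, 1]; constant columns map to 0."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span

roles = ["engineer", "hr", "engineer", "admin"]
encoded, cats = one_hot(roles)
print(cats)        # ['admin', 'engineer', 'hr']
print(encoded[0])  # [0. 1. 0.]
```

Encoding categories as independent binary columns, rather than as arbitrary integers, avoids imposing a spurious ordering that could bias a downstream classifier.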
Machine learning-based network intrusion detection for big and imbalanced data using oversampling, stacking feature embedding and feature extraction
by Talukder, Md. Alamin; Moni, Mohammad Ali; Uddin, Md Ashraf
in Abnormalities; Accuracy; Applied behavior analysis
2024
Cybersecurity has emerged as a critical global concern. Intrusion Detection Systems (IDS) play a critical role in protecting interconnected networks by detecting malicious actors and activities. Machine Learning (ML)-based behavior analysis within the IDS has considerable potential for detecting dynamic cyber threats, identifying abnormalities, and identifying malicious conduct within the network. However, as the volume of data grows, dimension reduction becomes an increasingly difficult task when training ML models. Addressing this, our paper introduces a novel ML-based network intrusion detection model that uses Random Oversampling (RO) to address data imbalance, Stacking Feature Embedding based on clustering results, and Principal Component Analysis (PCA) for dimension reduction, and is specifically designed for large and imbalanced datasets. The model's performance is carefully evaluated on three cutting-edge benchmark datasets: UNSW-NB15, CIC-IDS-2017, and CIC-IDS-2018. On the UNSW-NB15 dataset, our trials show that the RF and ET models achieve accuracy rates of 99.59% and 99.95%, respectively. Furthermore, using the CIC-IDS-2017 dataset, the DT, RF, and ET models reach 99.99% accuracy, while the DT and RF models obtain 99.94% accuracy on CIC-IDS-2018. These results consistently outperform the state of the art, indicating significant progress in the field of network intrusion detection. This achievement demonstrates the efficacy of the suggested methodology, which can be used in practice to accurately monitor and identify network traffic intrusions, thereby blocking possible threats.
Journal Article
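Random Oversampling, the imbalance remedy named in this abstract, simply duplicates minority-class rows, sampled with replacement, until class counts match the majority class. A minimal NumPy sketch; the `random_oversample` helper and toy data are illustrative, not the authors' code.

```python
import numpy as np

def random_oversample(X, y, rng=None):
    """Random Oversampling (RO): duplicate minority-class rows,
    sampled with replacement, until every class matches the
    majority-class count."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts_X, parts_y = [X], [y]
    for c, n in zip(classes, counts):
        if n < target:
            idx = np.flatnonzero(y == c)
            extra = rng.choice(idx, size=target - n, replace=True)
            parts_X.append(X[extra])
            parts_y.append(y[extra])
    return np.concatenate(parts_X), np.concatenate(parts_y)

X = np.arange(10, dtype=float).reshape(5, 2)
y = np.array([0, 0, 0, 0, 1])     # 4:1 imbalance
X_bal, y_bal = random_oversample(X, y, rng=0)
print(np.bincount(y_bal))         # [4 4]
```

Unlike SMOTE, RO introduces no new feature values, only exact duplicates, which is cheap but can encourage overfitting to the repeated minority rows.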
Improving AdaBoost-based Intrusion Detection System (IDS) Performance on CIC IDS 2017 Dataset
by Yulianto, Arif; Sukarno, Parman; Suwastika, Novian Anggis
in Classifiers; Datasets; Feature selection
2019
This paper considers the use of the Synthetic Minority Oversampling Technique (SMOTE), Principal Component Analysis (PCA), and Ensemble Feature Selection (EFS) to improve the performance of an AdaBoost-based Intrusion Detection System (IDS) on the recent and challenging CIC IDS 2017 dataset [1]. Previous research [1] proposed the use of an AdaBoost classifier to cope with the new dataset. However, due to several problems, such as the imbalance of the training data and inappropriate selection of classification methods, its performance is still inferior. In this research, we aim to construct an intrusion detection approach with improved performance; to handle the imbalance of the training data, SMOTE is selected. Moreover, PCA and EFS are applied as feature selection methods to select important attributes from the dataset. The evaluation results show that the proposed AdaBoost classifier using PCA and SMOTE yields an Area Under the Receiver Operating Characteristic curve (AUROC) of 92%, and the AdaBoost classifier using EFS and SMOTE produces an accuracy, precision, recall, and F1 score of 81.83%, 81.83%, 100%, and 90.01%, respectively.
Journal Article
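The PCA dimension reduction used in this and the previous entry can be sketched as an SVD of the mean-centred data matrix, projecting onto the directions of largest variance. The `pca_reduce` helper and synthetic data below are illustrative only, not the papers' pipelines.

```python
import numpy as np

def pca_reduce(X, k):
    """Project X onto its top-k principal components via SVD
    of the mean-centred data matrix."""
    Xc = X - X.mean(axis=0)
    # rows of Vt are the principal directions, ordered by variance
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
# 200 samples in 5-D where most variance lies in the first 2 dims
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 0.1, 0.1, 0.1])
Z = pca_reduce(X, k=2)
print(Z.shape)  # (200, 2)
```

Because singular values come out in descending order, the first retained component always carries at least as much variance as the second, which is what makes truncation a principled reduction.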
Data oversampling and imbalanced datasets: an investigation of performance for machine learning and feature engineering
2024
The classification of imbalanced datasets is a prominent task in text mining and machine learning. The number of samples in each class is not uniformly distributed: one class contains a large number of samples while the other has few. Imbalanced datasets lead to model overfitting, resulting in poor performance. In this study, we compare different oversampling techniques, namely the synthetic minority oversampling technique (SMOTE), support vector machine SMOTE (SVM-SMOTE), Borderline SMOTE, K-means SMOTE, and adaptive synthetic (ADASYN) oversampling, to address the issue of imbalanced datasets and enhance the performance of machine learning models. Preprocessing significantly enhances the quality of input data by reducing noise and redundant or unnecessary data, enabling the models to identify crucial patterns and extract significant, pertinent information from the preprocessed data. This study preprocesses the data using various top-level preprocessing steps. Furthermore, two imbalanced Twitter datasets are used to compare the performance of the oversampling techniques with six machine learning models: random forest (RF), SVM, K-nearest neighbor (KNN), AdaBoost (ADA), logistic regression (LR), and decision tree (DT). In addition, the bag of words (BoW) and term frequency-inverse document frequency (TF-IDF) feature extraction approaches are used to extract features from the tweets. The experiments indicate that SMOTE and ADASYN perform much better than the other techniques, providing higher accuracy. Overall, SVM with a 'linear' kernel attains the highest accuracy and recall, 99.67% and 1.00, on the ADASYN-oversampled datasets and 99.57% accuracy on the SMOTE-oversampled dataset with TF-IDF features. In 10-fold cross-validation experiments, the SVM model achieved a mean accuracy of 97.40 with a standard deviation of 0.008. Our approach achieved 2.62% greater accuracy compared to other current methods.
Journal Article
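The TF-IDF feature extraction this abstract mentions weights each term by its in-document frequency and discounts terms that appear across many documents. A minimal stdlib sketch; the `tfidf` helper and toy documents are illustrative, not the study's pipeline (real systems typically use a library vectorizer with smoothing).

```python
import math
from collections import Counter

def tfidf(docs):
    """Minimal TF-IDF: length-normalised term frequency scaled by
    inverse document frequency log(N / df)."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # document frequency: in how many documents each term appears
    df = Counter(t for toks in tokenized for t in set(toks))
    return [{t: c / len(toks) * math.log(n / df[t])
             for t, c in Counter(toks).items()}
            for toks in tokenized]

docs = ["smote balances data", "adasyn balances data", "smote again"]
weights = tfidf(docs)
# "again" appears in only one document, so it carries more weight
# than "balances", which appears in two
```

Terms present in every document get an IDF of zero under this simple variant, which is why library implementations usually add smoothing.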
Evaluation performance recall and F2 score of credit card fraud detection unbalanced dataset using SMOTE oversampling technique
2021
Unbalanced data is an interesting research topic and continues to be studied because of its uniqueness. Unbalanced data requires special treatment to make the data balanced. In this paper, we investigate performance on an unbalanced dataset using diverse oversampling proportions. We use SMOTE to generate new synthetic data, then classify using the random forest algorithm. In our experiments we generate new samples at 20%, 40%, 60%, 80%, and 100% of the majority class, so that the data is balanced up to 50%:50%. For each newly generated dataset, we train the classifier and evaluate its performance. We show that the highest F2 scores are 85.34 and 84.93: generating new data at 60% of the majority class yields an F2 score of 85.34, while generating at 100% of the majority class yields an F2 score of 84.93.
Journal Article
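The F2 score reported here is the F-beta measure with beta = 2, which weights recall more heavily than precision; that suits fraud detection, where missing a fraudulent transaction is costlier than a false alarm. A small sketch with illustrative numbers (not figures from the paper):

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta=2 (the F2 score) weights recall roughly
    four times as heavily as precision in the harmonic mean."""
    b2 = beta * beta
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# e.g. a classifier with precision 0.60 and recall 0.95
print(round(f_beta(0.60, 0.95), 4))  # 0.8507
```

With beta = 1 this reduces to the familiar F1 score; raising beta shifts the optimum toward classifiers that catch more of the minority class even at the cost of extra false positives.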
ASN-SMOTE: a synthetic minority oversampling method with adaptive qualified synthesizer selection
2022
Oversampling is a promising preprocessing technique for imbalanced datasets, which generates new minority instances to balance the dataset. However, improperly generated minority instances, i.e., noise instances, may interfere with the learning of the classifier and impact it negatively. Given this, in this paper we propose a simple and effective oversampling approach known as ASN-SMOTE, based on k-nearest neighbors and the synthetic minority oversampling technique (SMOTE). ASN-SMOTE first filters noise in the minority class by determining whether the nearest neighbor of each minority instance belongs to the minority or majority class. After that, ASN-SMOTE uses the nearest majority instance of each minority instance to effectively perceive the decision boundary, inside which qualified minority instances are selected adaptively for each minority instance by the proposed adaptive neighbor selection scheme to synthesize new minority instances. To substantiate its effectiveness, ASN-SMOTE has been applied to three different classifiers, and comprehensive experiments have been conducted on 24 imbalanced benchmark datasets. ASN-SMOTE is also extensively compared with nine notable oversampling algorithms. The results show that ASN-SMOTE achieves the best results on the majority of datasets. The ASN-SMOTE implementation is available at https://www.github.com/yixinkai123/ASN-SMOTE/.
Journal Article
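The noise-filtering step this abstract describes, keeping only minority instances whose nearest neighbour is also a minority instance, can be sketched as follows. The `filter_minority_noise` helper and toy data are illustrative, not the released ASN-SMOTE implementation.

```python
import numpy as np

def filter_minority_noise(X, y, minority=1):
    """Keep only minority samples whose single nearest neighbour
    (over the whole dataset) is also a minority sample, the
    noise-filtering step described for ASN-SMOTE."""
    min_idx = np.flatnonzero(y == minority)
    keep = []
    for i in min_idx:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the point itself
        if y[np.argmin(d)] == minority:
            keep.append(i)
    return np.array(keep)

X = np.array([[0.0], [0.1], [5.0], [5.1], [5.2]])
y = np.array([1, 1, 0, 0, 1])   # last minority point sits in majority territory
print(filter_minority_noise(X, y))  # [0 1]
```

Filtering before synthesis matters because interpolating from a noise instance would place new synthetic points deep inside the majority region.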
WISEST: Weighted Interpolation for Synthetic Enhancement Using SMOTE with Thresholds
by Suganuma, Takuo; Matsui, Ryotaro; Guillen, Luis
in Algorithms; Artificial intelligence; Benchmarks
2025
Imbalanced learning occurs when rare but critical events are missed because classifiers are trained primarily on majority-class samples. This paper introduces WISEST, a locality-aware weighted-interpolation algorithm that generates synthetic minority samples within a controlled threshold near class boundaries. Benchmarked on more than a hundred real-world imbalanced datasets, such as those from KEEL, with different imbalance ratios, noise levels, and geometries, as well as security and IoT sets (IoT-23 and BoT-IoT), WISEST consistently improved minority detection in at least one metric on about half of those datasets, achieving up to a 25% relative increase in recall and up to an 18% increase in F1 compared to the original training and other approaches. In most cases, however, WISEST's gains come at a trade-off in accuracy and precision, depending on the dataset and classifier. These results indicate that WISEST is a practical and robust option when minority support and borderline structure permit safe synthesis, although no single sampler uniformly outperforms the others across all datasets.
Journal Article
Smart pathological brain detection by synthetic minority oversampling technique, extreme learning machine, and Jaya algorithm
by Zhao, Guihu; Yu-Dong, Zhang; Govindaraj, Vishnu Varthanan
in Algorithms; Brain; Genetic algorithms
2018
Pathological brain detection is an automated computer-aided diagnosis for brain images. This study provides a novel method to achieve this goal. We first used synthetic minority oversampling to balance the dataset. Then, our system was built on three components: wavelet packet Tsallis entropy, an extreme learning machine (ELM), and the Jaya algorithm. Ten repetitions of K-fold cross-validation showed that our method achieved perfect classification on two small datasets, and achieved a sensitivity of 99.64 ± 0.52%, a specificity of 99.14 ± 1.93%, and an accuracy of 99.57 ± 0.57% on a 255-image dataset. Our method performs better than six state-of-the-art approaches. Moreover, the Jaya algorithm performs better than the genetic algorithm, particle swarm optimization, and the bat algorithm as the ELM training method.
Journal Article
Oversampling Techniques for Bankruptcy Prediction: Novel Features from a Transaction Dataset
2018
In recent years, weakened by slowing economic growth, many enterprises have fallen into crisis caused by financial difficulties. Bankruptcy prediction, as a machine learning task, is of great utility for financial institutions, fund managers, lenders, governments, and economic stakeholders. Due to the small number of bankrupt companies compared to non-bankrupt companies, bankruptcy prediction faces the problem of imbalanced data. This study first presents a bankruptcy prediction framework. Then, five oversampling techniques are used to deal with the imbalance problem on an experimental dataset collected from Korean companies over two years, from 2016 to 2017. Experimental results show that using oversampling techniques to balance the dataset in the training stage can enhance the performance of bankruptcy prediction; the best overall Area Under the Curve (AUC) of this framework reaches 84.2%. Next, the study extracts more features by combining the financial dataset with a transaction dataset to further improve performance, achieving 84.4% AUC.
Journal Article