Catalogue Search | MBRL

A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning

by Kamalov, Firuz , Atiya, Amir F. , Elreedy, Dina in Artificial Intelligence , Computer Science , Control

2024

Class imbalance occurs when the class distribution is not equal. Namely, one class is under-represented (minority class), and the other class has significantly more samples in the data (majority class). The class imbalance problem is prevalent in many real world applications. Generally, the under-represented minority class is the class of interest. The synthetic minority over-sampling technique (SMOTE) method is considered the most prominent method for handling unbalanced data. The SMOTE method generates new synthetic data patterns by performing linear interpolation between minority class samples and their K nearest neighbors. However, the SMOTE generated patterns do not necessarily conform to the original minority class distribution. This paper develops a novel theoretical analysis of the SMOTE method by deriving the probability distribution of the SMOTE generated samples. To the best of our knowledge, this is the first work deriving a mathematical formulation for the SMOTE patterns’ probability distribution. This allows us to compare the density of the generated samples with the true underlying class-conditional density, in order to assess how representative the generated samples are. The derived formula is verified by computing it on a number of densities versus densities computed and estimated empirically.

Journal Article

Share this book

Add to My Shelf

Tomek Link and SMOTE Approaches for Machine Fault Classification with an Imbalanced Dataset

by Swana, Elsie Fezeka , Bokoro, Pitshou , Doorsamy, Wesley in Algorithms , Analysis , Bayes Theorem

2022

Data-driven methods have prominently featured in the progressive research and development of modern condition monitoring systems for electrical machines. These methods have the advantage of simplicity when it comes to the implementation of effective fault detection and diagnostic systems. Despite their many advantages, the practical implementation of data-driven approaches still faces challenges such as data imbalance. The lack of sufficient and reliable labeled fault data from machines in the field often poses a challenge in developing accurate supervised learning-based condition monitoring systems. This research investigates the use of a Naïve Bayes classifier, support vector machine, and k-nearest neighbors together with synthetic minority oversampling technique, Tomek link, and the combination of these two resampling techniques for fault classification with simulation and experimental imbalanced data. A comparative analysis of these techniques is conducted for different imbalanced data cases to determine the suitability thereof for condition monitoring on a wound-rotor induction generator. The precision, recall, and f1-score matrices are applied for performance evaluation. The results indicate that the technique combining the synthetic minority oversampling technique with the Tomek link provides the best performance across all tested classifiers. The k-nearest neighbors, together with this combination resampling technique yielded the most accurate classification results. This research is of interest to researchers and practitioners working in the area of condition monitoring in electrical machines, and the findings and presented approach of the comparative analysis will assist with the selection of the most suitable technique for handling imbalanced fault data. This is especially important in the practice of condition monitoring on electrical rotating machines, where fault data are very limited.

Journal Article

Share this book

Add to My Shelf

Enhancing Credit Card Fraud Detection: An Ensemble Machine Learning Approach

by Ashawa, Moses , Osamor, Jude , Adejoh, John in Algorithms , Classifiers , Computational linguistics

2024

In the era of digital advancements, the escalation of credit card fraud necessitates the development of robust and efficient fraud detection systems. This paper delves into the application of machine learning models, specifically focusing on ensemble methods, to enhance credit card fraud detection. Through an extensive review of existing literature, we identified limitations in current fraud detection technologies, including issues like data imbalance, concept drift, false positives/negatives, limited generalisability, and challenges in real-time processing. To address some of these shortcomings, we propose a novel ensemble model that integrates a Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Random Forest (RF), Bagging, and Boosting classifiers. This ensemble model tackles the dataset imbalance problem associated with most credit card datasets by implementing under-sampling and the Synthetic Over-sampling Technique (SMOTE) on some machine learning algorithms. The evaluation of the model utilises a dataset comprising transaction records from European credit card holders, providing a realistic scenario for assessment. The methodology of the proposed model encompasses data pre-processing, feature engineering, model selection, and evaluation, with Google Colab computational capabilities facilitating efficient model training and testing. Comparative analysis between the proposed ensemble model, traditional machine learning methods, and individual classifiers reveals the superior performance of the ensemble in mitigating challenges associated with credit card fraud detection. Across accuracy, precision, recall, and F1-score metrics, the ensemble outperforms existing models. This paper underscores the efficacy of ensemble methods as a valuable tool in the battle against fraudulent transactions. The findings presented lay the groundwork for future advancements in the development of more resilient and adaptive fraud detection systems, which will become crucial as credit card fraud techniques continue to evolve.

Journal Article

Share this book

Add to My Shelf

DBSMOTE: Density-Based Synthetic Minority Over-sampling TEchnique

by Sinapiromsaran, Krung , Lursinsap, Chidchanok , Bunkhumpornpat, Chumphol

2012

Journal Article

Share this book

Add to My Shelf

Enhancing Network Intrusion Detection Using an Ensemble Voting Classifier for Internet of Things

by Farooqi, Ashfaq Hussain , Akhtar, Shahzaib , Abbass, Waseem in Accuracy , Algorithms , Analysis

2023

In the context of 6G technology, the Internet of Everything aims to create a vast network that connects both humans and devices across multiple dimensions. The integration of smart healthcare, agriculture, transportation, and homes is incredibly appealing, as it allows people to effortlessly control their environment through touch or voice commands. Consequently, with the increase in Internet connectivity, the security risk also rises. However, the future is centered on a six-fold increase in connectivity, necessitating the development of stronger security measures to handle the rapidly expanding concept of IoT-enabled metaverse connections. Various types of attacks, often orchestrated using botnets, pose a threat to the performance of IoT-enabled networks. Detecting anomalies within these networks is crucial for safeguarding applications from potentially disastrous consequences. The voting classifier is a machine learning (ML) model known for its effectiveness as it capitalizes on the strengths of individual ML models and has the potential to improve overall predictive performance. In this research, we proposed a novel classification technique based on the DRX approach that combines the advantages of the Decision tree, Random forest, and XGBoost algorithms. This ensemble voting classifier significantly enhances the accuracy and precision of network intrusion detection systems. Our experiments were conducted using the NSL-KDD, UNSW-NB15, and CIC-IDS2017 datasets. The findings of our study show that the DRX-based technique works better than the others. It achieved a higher accuracy of 99.88% on the NSL-KDD dataset, 99.93% on the UNSW-NB15 dataset, and 99.98% on the CIC-IDS2017 dataset, outperforming the other methods. Additionally, there is a notable reduction in the false positive rates to 0.003, 0.001, and 0.00012 for the NSL-KDD, UNSW-NB15, and CIC-IDS2017 datasets.

Journal Article

Share this book

Add to My Shelf

A comprehensive evaluation of oversampling techniques for enhancing text classification performance

by Taskiran, Salimkan Fatma , Turkoglu, Bahaeddin , Kaya, Ersin in 639/705/1042 , 639/705/117 , Humanities and Social Sciences

2025

Class imbalance is a common and critical challenge in text classification tasks, where the underrepresentation of certain classes often impairs the ability of classifiers to learn minority class patterns effectively. According to the “garbage in, garbage out” principle, even high-performing models may fail when trained on skewed distributions. To address this issue, this study investigates the impact of oversampling techniques, specifically the Synthetic Minority Over-sampling Technique (SMOTE) and thirty of its variants, on two benchmark text classification datasets: TREC and Emotions. Each dataset was vectorized using the MiniLMv2 transformer model to obtain semantically rich representations, and classification was performed using six machine learning algorithms. The balanced and imbalanced scenarios were compared in terms of F1-Score and Balanced Accuracy. This work constitutes, to the best of our knowledge, the first large-scale, systematic benchmarking of SMOTE-based oversampling methods in the context of transformer-embedded text classification. Furthermore, statistical significance of the observed performance differences was validated using the Friedman test. The results provide practical insights into the selection of oversampling techniques tailored to dataset characteristics and classifier sensitivity, supporting more robust and fair learning in imbalanced natural language processing tasks.

Journal Article

Share this book

Add to My Shelf

LVQ-SMOTE – Learning Vector Quantization based Synthetic Minority Over–sampling Technique for biomedical data

by Atsushi Otsuka , Yusuke Kajiwara , Munehiro Nakamura in Algorithms , Analysis , Biochemistry

2013

Background Over-sampling methods based on Synthetic Minority Over-sampling Technique (SMOTE) have been proposed for classification problems of imbalanced biomedical data. However, the existing over-sampling methods achieve slightly better or sometimes worse result than the simplest SMOTE. In order to improve the effectiveness of SMOTE, this paper presents a novel over-sampling method using codebooks obtained by the learning vector quantization. In general, even when an existing SMOTE applied to a biomedical dataset, its empty feature space is still so huge that most classification algorithms would not perform well on estimating borderlines between classes. To tackle this problem, our over-sampling method generates synthetic samples which occupy more feature space than the other SMOTE algorithms. Briefly saying, our over-sampling method enables to generate useful synthetic samples by referring to actual samples taken from real-world datasets. Results Experiments on eight real-world imbalanced datasets demonstrate that our proposed over-sampling method performs better than the simplest SMOTE on four of five standard classification algorithms. Moreover, it is seen that the performance of our method increases if the latest SMOTE called MWMOTE is used in our algorithm. Experiments on datasets for β -turn types prediction show some important patterns that have not been seen in previous analyses. Conclusions The proposed over-sampling method generates useful synthetic samples for the classification of imbalanced biomedical data. Besides, the proposed over-sampling method is basically compatible with basic classification algorithms and the existing over-sampling methods.

Journal Article

Share this book

Add to My Shelf

Water quality prediction: a data-driven approach exploiting advanced machine learning algorithms with data augmentation

by K, Karthick , Manikandan, R. , Krishnan, S. in machine learning , principal component analysis (pca) , synthetic minority over-sampling technique (smote)

2024

Water quality assessment plays a crucial role in various aspects, including human health, environmental impact, agricultural productivity, and industrial processes. Machine learning (ML) algorithms offer the ability to automate water quality evaluation and allow for effective and rapid assessment of parameters associated with water quality. This article proposes an ML-based classification model for water quality prediction. The model was tested with 14 ML algorithms and considers 20 features that represent various substances present in water samples and their concentrations. The dataset used in the study comprises 7,996 samples, and the model development involves several stages, including data preprocessing, Yeo–Johnson transformation for data normalization, principal component analysis (PCA) for feature selection, and the application of the synthetic minority over-sampling technique (SMOTE) to address class imbalance. Performance metrics, such as accuracy, precision, recall, and F1 score, are provided for each algorithm with and without SMOTE. LightGBM, XGBoost, CatBoost, and Random Forest were identified as the best-performing algorithms. XGBoost achieved the highest accuracy of 96.31% without SMOTE and had a precision of 0.933. The application of SMOTE enhanced the performance of CatBoost. These findings provide valuable insights for ML-based water quality assessment, aiding researchers and professionals in decision-making and management.

Journal Article

Share this book

Add to My Shelf

SORAG: Synthetic Data Over-Sampling Strategy on Multi-Label Graphs

by Kim, Kyoung-Sook , Duan, Yijun , Yu, Hai-tao in Algorithms , Best practice , Classification

2022

In many real-world networks of interest in the field of remote sensing (e.g., public transport networks), nodes are associated with multiple labels, and node classes are imbalanced; that is, some classes have significantly fewer samples than others. However, the research problem of imbalanced multi-label graph node classification remains unexplored. This non-trivial task challenges the existing graph neural networks (GNNs) because the majority class can dominate the loss functions of GNNs and result in the overfitting of the majority class features and label correlations. On non-graph data, minority over-sampling methods (such as the synthetic minority over-sampling technique and its variants) have been demonstrated to be effective for the imbalanced data classification problem. This study proposes and validates a new hypothesis with unlabeled data over-sampling, which is meaningless for imbalanced non-graph data; however, feature propagation and topological interplay mechanisms between graph nodes can facilitate the representation learning of imbalanced graphs. Furthermore, we determine empirically that ensemble data synthesis through the creation of virtual minority samples in the central region of a minority and generation of virtual unlabeled samples in the boundary region between a minority and majority is the best practice for the imbalanced multi-label graph node classification task. Our proposed novel data over-sampling framework is evaluated using multiple real-world network datasets, and it outperforms diverse, strong benchmark models by a large margin.

Journal Article

Share this book

Add to My Shelf

A Multi-Class Intrusion Detection System for DDoS Attacks in IoT Networks Using Deep Learning and Transformers

by Tariq, Noshina , Wahab, Sheikh Abdul , Mylonas, Alexios in Accuracy , Automation , Connectivity

2025

The rapid proliferation of Internet of Things (IoT) devices has significantly increased vulnerability to Distributed Denial of Service (DDoS) attacks, which can severely disrupt network operations. DDoS attacks in IoT networks disrupt communication and compromise service availability, causing severe operational and economic losses. In this paper, we present a Deep Learning (DL)-based Intrusion Detection System (IDS) tailored for IoT environments. Our system employs three architectures—Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), and Transformer-based models—to perform binary, three-class, and 12-class classification tasks on the CiC IoT 2023 dataset. Data preprocessing includes log normalization to stabilize feature distributions and SMOTE-based oversampling to mitigate class imbalance. Experiments on the CIC-IoT 2023 dataset show that, in the binary classification task, the DNN achieved 99.2% accuracy, the CNN 99.0%, and the Transformer 98.8%. In three-class classification (benign, DDoS, and non-DDoS), all models attained near-perfect performance (approximately 99.9–100%). In the 12-class scenario (benign plus 12 attack types), the DNN, CNN, and Transformer reached 93.0%, 92.7%, and 92.5% accuracy, respectively. The high precision, recall, and ROC-AUC values corroborate the efficacy and generalizability of our approach for IoT DDoS detection. Comparative analysis indicates that our proposed IDS outperforms state-of-the-art methods in terms of detection accuracy and efficiency. These results underscore the potential of integrating advanced DL models into IDS frameworks, thereby providing a scalable and effective solution to secure IoT networks against evolving DDoS threats. Future work will explore further enhancements, including the use of deeper Transformer architectures and cross-dataset validation, to ensure robustness in real-world deployments.

Journal Article

Share this book

Add to My Shelf

Language Selector

MBRLGlobalSearch

Language Selector

Catalogue Search | MBRL

Search Results Heading

Explore the vast range of titles available.

MBRLSearchResults

MBRLHappinessMeter