663 results for "SMOTE"
Comparative Analysis of Resampling Techniques for Class Imbalance in Financial Distress Prediction Using XGBoost
One of the key challenges in financial distress data is class imbalance, where the data are characterized by a highly imbalanced ratio between the number of distressed and non-distressed samples. This study examines eight resampling techniques for improving distress prediction using the XGBoost algorithm. The study was performed on a dataset acquired from the CSMAR database, containing 26,383 firm-quarter samples from 639 Chinese A-share listed companies (2007–2024), with only 12.1% of the cases being distressed. Results show that the standard Synthetic Minority Oversampling Technique (SMOTE) enhanced the F1-score (up to 0.73) and Matthews Correlation Coefficient (MCC, up to 0.70), while SMOTE-Tomek and Borderline-SMOTE further boosted recall at a slight cost to precision. These oversampling and hybrid methods also maintained reasonable computational efficiency. However, Random Undersampling (RUS), though yielding high recall (0.85), suffered from low precision (0.46) and weaker generalization, but was the fastest method. Among all techniques, Bagging-SMOTE achieved balanced performance (AUC 0.96, F1 0.72, PR-AUC 0.80, MCC 0.68) using a minority-to-majority ratio of 0.15, demonstrating that ensemble-based resampling can improve robustness with minimal impact on the original class distribution, albeit at higher computational cost. These comparative findings highlight that no single approach fits all use cases, and technique selection should align with specific goals: techniques favoring recall (e.g., Bagging-SMOTE, SMOTE-Tomek) are suited for early warning, conservative techniques (e.g., Tomek Links) help reduce false positives in risk-sensitive applications, and efficient methods such as RUS are preferable when computational speed is a priority.
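As a rough illustration of such a comparison, the following minimal sketch pairs several imbalanced-learn resamplers with XGBoost on synthetic data. The class ratio mimics the abstract's 12.1% distress rate, but the data, hyperparameters, and metric loop are illustrative assumptions, not the paper's setup.

```python
# Hedged sketch: comparing resampling techniques ahead of XGBoost on
# synthetic imbalanced data (the CSMAR data are not public here).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, matthews_corrcoef
from imblearn.over_sampling import SMOTE, BorderlineSMOTE
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek
from xgboost import XGBClassifier

X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.879], random_state=42)  # ~12.1% minority
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

samplers = {
    "SMOTE": SMOTE(random_state=42),
    "Borderline-SMOTE": BorderlineSMOTE(random_state=42),
    "SMOTE-Tomek": SMOTETomek(random_state=42),
    "Tomek Links": TomekLinks(),
    "RUS": RandomUnderSampler(random_state=42),
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_tr, y_tr)  # resample training data only
    model = XGBClassifier(n_estimators=300, eval_metric="logloss").fit(X_res, y_res)
    pred = model.predict(X_te)
    print(f"{name}: F1={f1_score(y_te, pred):.3f}  MCC={matthews_corrcoef(y_te, pred):.3f}")
```

The Bagging-SMOTE setup could plausibly be approximated with imblearn's BalancedBaggingClassifier, which in recent versions accepts a sampler argument, though the paper's exact ensemble construction is not specified in the abstract.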
A Comparison of Undersampling, Oversampling, and SMOTE Methods for Dealing with Imbalanced Classification in Educational Data Mining
Educational data mining is capable of producing useful data-driven applications (e.g., early warning systems in schools or the prediction of students’ academic achievement) based on predictive models. However, the class imbalance problem in educational datasets could hamper the accuracy of predictive models as many of these models are designed on the assumption that the predicted class is balanced. Although previous studies proposed several methods to deal with the imbalanced class problem, most of them focused on the technical details of how to improve each technique, while only a few focused on the application aspect, especially for the application of data with different imbalance ratios. In this study, we compared several sampling techniques to handle the different ratios of the class imbalance problem (i.e., moderately or extremely imbalanced classifications) using the High School Longitudinal Study of 2009 dataset. For our comparison, we used random oversampling (ROS), random undersampling (RUS), and the combination of the synthetic minority oversampling technique for nominal and continuous (SMOTE-NC) and RUS as a hybrid resampling technique. We used the Random Forest as our classification algorithm to evaluate the results of each sampling technique. Our results show that random oversampling for moderately imbalanced data and hybrid resampling for extremely imbalanced data seem to work best. The implications for educational data mining applications and suggestions for future research are discussed.
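A minimal sketch of the hybrid idea described above, assuming the imbalanced-learn package: SMOTE-NC (which handles mixed nominal/continuous features) followed by random undersampling, chained before a Random Forest. The toy data and sampling ratios are assumptions, not the HSLS:09 setup.

```python
# Hedged sketch: hybrid resampling (SMOTE-NC oversampling, then random
# undersampling) feeding a Random Forest; column 0 is marked categorical.
import numpy as np
from imblearn.over_sampling import SMOTENC
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([rng.integers(0, 3, n),            # nominal feature
                     rng.normal(size=n), rng.normal(size=n)])  # continuous
y = (rng.random(n) < 0.05).astype(int)                 # ~5%: "extreme" imbalance

pipe = Pipeline([
    ("smotenc", SMOTENC(categorical_features=[0],      # index of nominal column
                        sampling_strategy=0.5, random_state=0)),
    ("rus", RandomUnderSampler(sampling_strategy=1.0, random_state=0)),
    ("rf", RandomForestClassifier(random_state=0)),
])
pipe.fit(X, y)  # samplers apply during fit only, never at predict time
```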
Data oversampling and imbalanced datasets: an investigation of performance for machine learning and feature engineering
The classification of imbalanced datasets is a prominent task in text mining and machine learning. The number of samples in each class is not uniformly distributed; one class contains a large number of samples while the other has a small number. Imbalanced datasets cause models to overfit, resulting in poor performance. In this study, we compare different oversampling techniques, namely the synthetic minority oversampling technique (SMOTE), support vector machine SMOTE (SVM-SMOTE), Borderline-SMOTE, K-means SMOTE, and adaptive synthetic (ADASYN) oversampling, to address the issue of imbalanced datasets and enhance the performance of machine learning models. Preprocessing significantly enhances the quality of input data by reducing noise and redundant or unnecessary data. This enables machines to identify crucial patterns and extract significant, pertinent information from the preprocessed data. This study preprocesses the data using various top-level preprocessing steps. Furthermore, two imbalanced Twitter datasets are used to compare the performance of the oversampling techniques with six machine learning models, including random forest (RF), SVM, K-nearest neighbor (KNN), AdaBoost (ADA), logistic regression (LR), and decision tree (DT). In addition, the bag of words (BoW) and term frequency–inverse document frequency (TF-IDF) feature extraction approaches are used to extract features from the tweets. The experiments indicate that SMOTE and ADASYN perform much better than the other techniques, providing higher accuracy. Overall, SVM with a 'linear' kernel tends to attain the highest accuracy (99.67%) and recall (1.00) on the ADASYN-oversampled datasets, and 99.57% accuracy on the SMOTE-oversampled dataset with TF-IDF features. In 10-fold cross-validation experiments, the SVM model achieved a mean accuracy of 97.40% with a standard deviation of 0.008. Our approach achieved 2.62% greater accuracy than other current methods.
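A minimal sketch of the TF-IDF + oversampling + linear SVM pipeline, under the assumption that imbalanced-learn's SMOTE is applied to the sparse TF-IDF matrix. The placeholder corpus and labels are invented for illustration; SVMSMOTE, BorderlineSMOTE, KMeansSMOTE, or ADASYN could be dropped in where SMOTE appears.

```python
# Hedged sketch: TF-IDF features + SMOTE + linear-kernel SVM at toy scale.
# The corpus is a placeholder, not the paper's Twitter data.
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE

random.seed(1)
words = "win game team play score fail crash error bug issue".split()
texts = [" ".join(random.choices(words, k=8)) for _ in range(1000)]
labels = [0] * 950 + [1] * 50                       # ~5% minority class

X = TfidfVectorizer().fit_transform(texts)          # sparse TF-IDF matrix
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, stratify=labels,
                                          random_state=1)

X_res, y_res = SMOTE(random_state=1).fit_resample(X_tr, y_tr)  # train split only
clf = SVC(kernel="linear").fit(X_res, y_res)
pred = clf.predict(X_te)
print("accuracy:", accuracy_score(y_te, pred), "recall:", recall_score(y_te, pred))
```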
Interpretable Machine Learning Framework for Early Depression Detection Using Socio-Demographic Features with Dual Feature Selection and SMOTE
Depression is the most widespread psychological disorder globally, impacting individuals across all age groups; when left undiagnosed or untreated, it significantly elevates the risk of severe outcomes, including suicidality. This study explores the efficacy of eight machine learning (ML) classifiers utilizing socio-demographic and psychosocial data to discern signs of depression. A depression dataset available on GitHub was acquired, comprising 604 instances with 30 predictors and 1 target variable indicating depression status. Preprocessing included normalization, handling missing values, and encoding categorical variables. Two feature selection methodologies, Analysis of Variance (ANOVA) and Boruta, were employed to extract pertinent features. ANOVA selected 19 features, while Boruta retained 13 for model training. To address class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) was utilized to enhance prediction accuracy (ACC). Results demonstrate that Logistic Regression (LR), combined with ANOVA feature selection, exhibits superior performance, achieving an ACC of 92.56% and an AUC of 92.69%. With Boruta, LR achieved an ACC of 91.74% and an AUC of 91.65%. Without feature selection, LR yielded an ACC of 87.75%, a precision of 91.73%, and an AUC of 89.98%. SHapley Additive exPlanations (SHAP) analysis revealed that anxiety (ANXI) is the most influential predictor within the ML model designed for depression prediction. This study identifies the most effective model for predicting depression through evaluation metrics, while also addressing societal biases and supporting clinicians with interpretable insights for early intervention.
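A minimal sketch of the ANOVA → SMOTE → Logistic Regression chain, assuming scikit-learn's f_classif for the ANOVA step and an imbalanced-learn Pipeline so that SMOTE runs only inside each training fold. The synthetic stand-in data match only the abstract's sample and feature counts; the class weighting is invented.

```python
# Hedged sketch: ANOVA feature selection + SMOTE + logistic regression,
# cross-validated. make_classification stands in for the GitHub dataset.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=604, n_features=30, n_informative=10,
                           weights=[0.7], random_state=7)  # stand-in data

pipe = Pipeline([
    ("anova", SelectKBest(f_classif, k=19)),   # keep the 19 top-ranked features
    ("smote", SMOTE(random_state=7)),          # balance classes inside each CV fold
    ("lr", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean())
```

Placing SMOTE inside the pipeline keeps synthetic samples out of the validation folds, which is the usual way to avoid optimistic leakage.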
Hybrid Churn Prediction Model Using SMOTE SVM Resampling and Genetic Algorithm Optimized ELM
Consumer churn prediction and early warning have emerged as a major area of study in enterprise customer relationship management due to the advancement of big data technologies and intelligent decision-making systems. To improve the accuracy and stability of churn prediction models, a novel resampling algorithm based on support vector machines and the synthetic minority oversampling technique was designed. This technique alleviates the class imbalance problem by extracting support vector boundary samples and using the synthetic minority oversampling technique to interpolate new samples in their neighborhoods. Next, a genetic algorithm is introduced to globally optimize the input weights and hidden-layer biases of an extreme learning machine, and a classifier is constructed to improve the model's generalization performance and convergence stability. Finally, the resampling algorithm is integrated with the classifier to construct a complete consumer churn prediction model. According to the performance test results, the suggested churn prediction model achieved the highest mean average precision (98.5%) and F1 value (0.98) on the training set. The lowest mean square errors were 0.025 and 0.028 on the training and test sets, respectively. Practical application tests indicated that the prediction accuracy of the proposed model on five typical datasets was up to 93.17%, with an average prediction time as short as 0.83 seconds. In summary, the suggested methodology can successfully raise the accuracy and real-time performance of enterprise churn-customer identification, providing powerful support for intelligent customer management and precise marketing decision-making in practical application scenarios.
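The resampling idea resembles imblearn's SVMSMOTE, which likewise targets boundary regions found by an SVM; the sketch below pairs it with a bare-bones extreme learning machine whose input weights and biases are left random, since the paper's genetic-algorithm search is not reproduced here. Everything in the sketch is an assumption about the general technique, not the authors' implementation.

```python
# Hedged sketch: SVM-guided SMOTE resampling feeding a minimal ELM.
# A genetic algorithm would tune W and b; here they are simply random.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SVMSMOTE

X, y = make_classification(n_samples=5000, weights=[0.9], random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)
X_res, y_res = SVMSMOTE(random_state=3).fit_resample(X_tr, y_tr)

rng = np.random.default_rng(3)
W = rng.normal(size=(X_res.shape[1], 200))   # input weights (GA would tune these)
b = rng.normal(size=200)                     # hidden biases (GA would tune these)
H = np.tanh(X_res @ W + b)                   # hidden-layer activations
beta = np.linalg.pinv(H) @ y_res             # output weights via pseudoinverse
pred = (np.tanh(X_te @ W + b) @ beta > 0.5).astype(int)
print("test accuracy:", (pred == y_te).mean())
```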
A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning
Class imbalance occurs when the class distribution is not equal: one class is under-represented (minority class), and the other has significantly more samples in the data (majority class). The class imbalance problem is prevalent in many real-world applications, and the under-represented minority class is generally the class of interest. The synthetic minority oversampling technique (SMOTE) is considered the most prominent method for handling imbalanced data. SMOTE generates new synthetic data patterns by performing linear interpolation between minority class samples and their K nearest neighbors. However, the SMOTE-generated patterns do not necessarily conform to the original minority class distribution. This paper develops a novel theoretical analysis of the SMOTE method by deriving the probability distribution of the SMOTE-generated samples. To the best of our knowledge, this is the first work deriving a mathematical formulation for the SMOTE patterns' probability distribution. This allows us to compare the density of the generated samples with the true underlying class-conditional density, in order to assess how representative the generated samples are. The derived formula is verified by computing it for a number of densities and comparing the results against densities estimated empirically.
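The interpolation step described here has a standard one-line form (this is the textbook SMOTE generation rule, not the paper's derived density):

```latex
% SMOTE generation rule: x_i is a minority sample, x_i^{(k)} is one of its
% K nearest minority neighbors chosen uniformly at random.
x_{\mathrm{new}} = x_i + U \,\bigl(x_i^{(k)} - x_i\bigr), \qquad U \sim \mathrm{Uniform}(0,1)
```

Each synthetic point is therefore uniform on the segment joining the pair, so the aggregate SMOTE density is a mixture of such segment distributions, which is why it need not match the true class-conditional density.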
The effect of feature extraction and data sampling on credit card fraud detection
Training a machine learning algorithm on a class-imbalanced dataset can be a difficult task, a process that could prove even more challenging under conditions of high dimensionality. Feature extraction and data sampling are among the most popular preprocessing techniques. Feature extraction is used to derive a richer set of reduced dataset features, while data sampling is used to mitigate class imbalance. In this paper, we investigate these two preprocessing techniques, using a credit card fraud dataset and four ensemble classifiers (Random Forest, CatBoost, LightGBM, and XGBoost). Within the context of feature extraction, the Principal Component Analysis (PCA) and Convolutional Autoencoder (CAE) methods are evaluated. With regard to data sampling, the Random Undersampling (RUS), Synthetic Minority Oversampling Technique (SMOTE), and SMOTE Tomek methods are evaluated. The F1 score and Area Under the Receiver Operating Characteristic Curve (AUC) metrics serve as measures of classification performance. Our results show that the implementation of the RUS method followed by the CAE method leads to the best performance for credit card fraud detection.
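A minimal sketch of the sampling-then-extraction pipeline, with PCA standing in for the convolutional autoencoder (the paper's best pairing was RUS + CAE; PCA is the lighter of the two extraction methods it evaluates) and simulated fraud-like data in place of the credit card dataset.

```python
# Hedged sketch: random undersampling, then feature extraction, then an
# ensemble classifier, echoing the paper's pipeline with placeholder data.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=50_000, n_features=30, weights=[0.998],
                           random_state=5)           # heavily imbalanced, fraud-like
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=5)

X_rus, y_rus = RandomUnderSampler(random_state=5).fit_resample(X_tr, y_tr)
pca = PCA(n_components=10).fit(X_rus)                # reduced feature representation
clf = RandomForestClassifier(random_state=5).fit(pca.transform(X_rus), y_rus)

scores = clf.predict_proba(pca.transform(X_te))[:, 1]
print("AUC:", roc_auc_score(y_te, scores),
      "F1:", f1_score(y_te, scores > 0.5))
```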
Anomaly detection in IoT-based healthcare: machine learning for enhanced security
Internet of Things (IoT) integration in healthcare improves patient care while also making healthcare delivery systems more effective and economical. To fully realize the advantages of IoT in healthcare, it is imperative to overcome issues with data security, interoperability, and ethical considerations. IoT sensors periodically measure patients' health-related data and share it with a server for further evaluation. At the server, machine learning algorithms are applied that help in the early diagnosis of diseases and issue alerts when vital signs fall outside the normal range. Various cyber attacks can be launched on IoT devices, compromising the security and privacy of applications such as healthcare. In this paper, we utilize the publicly available Canadian Institute for Cybersecurity (CIC) IoT dataset to model machine learning techniques for efficient detection of anomalous network traffic. The dataset consists of 33 types of IoT attacks, divided into 7 main categories. In the current study, the dataset is preprocessed, and a balanced representation of classes is used to generate non-biased supervised machine learning models (Random Forest, Adaptive Boosting, Logistic Regression, Perceptron, Deep Neural Network). These models are analyzed further by eliminating highly correlated features, reducing dimensionality, minimizing overfitting, and speeding up training times. Random Forest was found to perform optimally across binary and multiclass classification of IoT attacks, with an approximate accuracy of 99.55% under both the reduced and the full feature space. This improvement was complemented by a reduction in computational response time, which is essential for real-time attack detection and response.
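As an illustration of one preprocessing step the study names, the sketch below drops highly correlated features before fitting a Random Forest. The frame, the 0.95 threshold, and the labels are placeholder assumptions, since the CIC IoT dataset is not reproduced here.

```python
# Hedged sketch: eliminating highly correlated features, then fitting a
# Random Forest; df and label are stand-ins for the CIC IoT frames.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(9)
df = pd.DataFrame(rng.normal(size=(2000, 8)),
                  columns=[f"f{i}" for i in range(8)])
df["f7"] = df["f0"] * 0.99 + rng.normal(scale=0.01, size=2000)  # near-duplicate
label = rng.integers(0, 2, 2000)

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
drop = [c for c in upper.columns if (upper[c] > 0.95).any()]    # e.g. f7 here
clf = RandomForestClassifier(random_state=9).fit(df.drop(columns=drop), label)
print("dropped:", drop)
```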
Enhancing liver disease diagnosis with hybrid SMOTE-ENN balanced machine learning models—an empirical analysis of Indian patient liver disease datasets
The liver is one of the vital organs of the human body, performing some of the most crucial biological processes, such as protein and biochemical synthesis, which are required for digestion and cleansing. A large number of patients suffer from liver disease, and it has become a life-threatening issue around the world. Annually, around 2 million people die of liver disease; this accounts for around 4% of all deaths, driven by factors such as obesity, undiagnosed hepatitis, and excessive alcohol consumption, which accumulate and deteriorate the liver's condition. Immediate action is necessary for timely diagnosis of the ailment before irreversible damage is done. This work evaluates several traditional and prominent machine learning algorithms, namely Logistic Regression, K-Nearest Neighbor, Support Vector Machine, Gaussian Naïve Bayes, Decision Tree, Random Forest, AdaBoost, Extreme Gradient Boosting, and LightGBM, for diagnosing and predicting chronic liver disease. Real-world datasets often have imbalanced class distributions, causing classifiers to perform poorly, with low accuracy, precision, and recall values and high misclassification; the Indian Patient Liver Disease (ILPD) datasets also face this imbalance issue. This work presents two hybrid models, SMOTEENN-KNN and SMOTEENN-AdaBoost, which robustly handle the problem of imbalance in real-world datasets while improving the accuracy of liver disease prediction. We also designed a hybrid model combining Recursive Feature Elimination (RFE) for feature selection, SMOTE-ENN to tackle data imbalance, and ensemble learning for enhanced predictions. The proposed hybrid ensemble model was evaluated on the ILPD and BUPA Liver Disorder datasets, achieving an overall accuracy of 93.2% on the ILPD dataset and 95.4% on the BUPA dataset. The Brier score loss is 0.032 for the ILPD dataset and 0.031 for the BUPA Liver Disorder dataset. This work highlights the potential of data balancing techniques and ensemble models to improve predictive accuracy in liver disease diagnosis.
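A minimal sketch of the RFE → SMOTE-ENN → ensemble recipe, assuming scikit-learn's RFE and imbalanced-learn's SMOTEENN in one pipeline. The stand-in data only mimic the ILPD dataset's approximate size and imbalance; the feature counts are invented.

```python
# Hedged sketch: RFE feature selection, SMOTE-ENN resampling, and AdaBoost
# chained in an imblearn Pipeline; make_classification stands in for ILPD.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=583, n_features=10, weights=[0.72],
                           random_state=11)    # ILPD-like size and imbalance

pipe = Pipeline([
    ("rfe", RFE(LogisticRegression(max_iter=1000), n_features_to_select=6)),
    ("smote_enn", SMOTEENN(random_state=11)),  # oversample, then clean with ENN
    ("ada", AdaBoostClassifier(random_state=11)),
])
print(cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean())
```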