Catalogue Search | MBRL
153 result(s) for "Oversampling technique"
An Insider Data Leakage Detection Using One-Hot Encoding, Synthetic Minority Oversampling and Machine Learning Techniques
2021
Insider threats are malicious acts that can be carried out by an authorized employee within an organization. Insider threats represent a major cybersecurity challenge for private and public organizations, as an insider attack can cause extensive damage to organization assets much more than external attacks. Most existing approaches in the field of insider threat focused on detecting general insider attack scenarios. However, insider attacks can be carried out in different ways, and the most dangerous one is a data leakage attack that can be executed by a malicious insider before his/her leaving an organization. This paper proposes a machine learning-based model for detecting such serious insider threat incidents. The proposed model addresses the possible bias of detection results that can occur due to an inappropriate encoding process by employing the feature scaling and one-hot encoding techniques. Furthermore, the imbalance issue of the utilized dataset is also addressed utilizing the synthetic minority oversampling technique (SMOTE). Well known machine learning algorithms are employed to detect the most accurate classifier that can detect data leakage events executed by malicious insiders during the sensitive period before they leave an organization. We provide a proof of concept for our model by applying it on CMU-CERT Insider Threat Dataset and comparing its performance with the ground truth. The experimental results show that our model detects insider data leakage events with an AUC-ROC value of 0.99, outperforming the existing approaches that are validated on the same dataset. The proposed model provides effective methods to address possible bias and class imbalance issues for the aim of devising an effective insider data leakage detection system.
Journal Article
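The abstract above names SMOTE without describing it. As a rough illustration only, the core idea (interpolating between a minority sample and one of its nearest minority neighbours) can be sketched as follows. This is a minimal standard-library sketch, not the paper's pipeline; the function name `smote_sketch`, the toy points, and the parameter values are all assumptions made for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    Hypothetical helper for illustration; needs at least two samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to base.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class (made-up values, not taken from any dataset above).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
```

Each synthetic point lies on the segment between two real minority samples, so the new data stays inside the region the minority class already occupies.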
Optimizing Rare Disease Gait Classification through Data Balancing and Generative AI: Insights from Hereditary Cerebellar Ataxia
2024
The interpretability of gait analysis studies in people with rare diseases, such as those with primary hereditary cerebellar ataxia (pwCA), is frequently limited by the small sample sizes and unbalanced datasets. The purpose of this study was to assess the effectiveness of data balancing and generative artificial intelligence (AI) algorithms in generating synthetic data reflecting the actual gait abnormalities of pwCA. Gait data of 30 pwCA (age: 51.6 ± 12.2 years; 13 females, 17 males) and 100 healthy subjects (age: 57.1 ± 10.4; 60 females, 40 males) were collected at the lumbar level with an inertial measurement unit. Subsampling, oversampling, synthetic minority oversampling, generative adversarial networks, and conditional tabular generative adversarial networks (ctGAN) were applied to generate datasets to be input to a random forest classifier. Consistency and explainability metrics were also calculated to assess the coherence of the generated dataset with known gait abnormalities of pwCA. ctGAN significantly improved the classification performance compared with the original dataset and traditional data augmentation methods. ctGAN are effective methods for balancing tabular datasets from populations with rare diseases, owing to their ability to improve diagnostic models with consistent explainability.
Journal Article
A Wearable Inertial Sensor Approach for Locomotion and Localization Recognition on Physical Activity
2024
Advancements in sensing technology have expanded the capabilities of both wearable devices and smartphones, which are now commonly equipped with inertial sensors such as accelerometers and gyroscopes. Initially, these sensors were used for device feature advancement, but now, they can be used for a variety of applications. Human activity recognition (HAR) is an interesting research area that can be used for many applications like health monitoring, sports, fitness, medical purposes, etc. In this research, we designed an advanced system that recognizes different human locomotion and localization activities. The data were collected from raw sensors that contain noise. In the first step, we detail our noise removal process, which employs a Chebyshev type 1 filter to clean the raw sensor data, and then the signal is segmented by utilizing Hamming windows. After that, features were extracted for different sensors. To select the best feature for the system, the recursive feature elimination method was used. We then used SMOTE data augmentation techniques to solve the imbalanced nature of the Extrasensory dataset. Finally, the augmented and balanced data were sent to a long short-term memory (LSTM) deep learning classifier for classification. The datasets used in this research were Real-World Har, Real-Life Har, and Extrasensory. The presented system achieved 89% for Real-Life Har, 85% for Real-World Har, and 95% for the Extrasensory dataset. The proposed system outperforms the available state-of-the-art methods.
Journal Article
AttGRU-HMSI: enhancing heart disease diagnosis using hybrid deep learning approach
2024
Heart disease is a major global cause of mortality and a major public health problem for a large number of individuals. A major issue raised by regular clinical data analysis is the recognition of cardiovascular illnesses, including heart attacks and coronary artery disease, even though early identification of heart disease can save many lives. Accurate forecasting and decision assistance may be achieved in an effective manner with machine learning (ML). Big Data, or the vast amounts of data generated by the health sector, may assist models used to make diagnostic choices by revealing hidden information or intricate patterns. This paper uses a hybrid deep learning algorithm to describe a large data analysis and visualization approach for heart disease detection. The proposed approach is intended for use with big data systems, such as Apache Hadoop. An extensive medical data collection is first subjected to an improved k-means clustering (IKC) method to remove outliers, and the remaining class distribution is then balanced using the synthetic minority over-sampling technique (SMOTE). The next step is to forecast the disease using a bio-inspired hybrid mutation-based swarm intelligence (HMSI) with an attention-based gated recurrent unit network (AttGRU) model after recursive feature elimination (RFE) has determined which features are most important. In our implementation, we compare four machine learning algorithms: SAE + ANN (sparse autoencoder + artificial neural network), LR (logistic regression), KNN (K-nearest neighbour), and naïve Bayes. The experiment results indicate that a 95.42% accuracy rate for the hybrid model's suggested heart disease prediction is attained, which effectively outperforms and overcomes the prescribed research gap in mentioned related work.
Journal Article
Strategy of oversampling geotechnical parameters through geostatistical, SMOTE, and CTGAN methods for assessing susceptibility of landslide
by Min, Dae-Hong; Yoon, Hyung-Koo; Kim, Sewon
in Algorithms, Confidence intervals, Data analysis
2024
The target slope is generally divided into grids to predict landslide susceptibility; however, it is difficult to acquire all geotechnical properties for each grid. The objective of this study is to examine oversampling characterization for each grid using geostatistical method and oversampling algorithms. Kriging, which is widely used in geotechnical engineering, is selected as a geostatistical method, and the synthetic minority oversampling technique (SMOTE) and conditional tabular generative adversarial network (CTGAN) are applied to perform oversampling as deep learning algorithms. The target area is divided into 900, 1800, 3600, 9000, 18,000, and 180,000 grids to determine the oversampling behavior for each grid size. The soil cohesion, slope angle, soil density, soil depth, and friction angle, which are input parameters in an infinite slope stability model, are measured through laboratory and field tests, and then the oversampling is performed. The distributions of oversampled data are analyzed with a comparison of mean and standard deviation, and the SMOTE showed a similar distribution with measured values at both 1800 and 3600 grids. Outlier analysis is also performed to suggest a reasonable confidence level for each input parameter, and the resolution of each geotechnical parameter is increased at the 5% confidence level. Finally, the mean absolute error (MAE) is reduced to around 62–69% and 41–43% for arithmetical mean and standard deviation. This study shows that not only kriging but also deep learning algorithms can be used when oversampling is required in the fields of geotechnical and geological engineering.
Journal Article
Data oversampling and imbalanced datasets: an investigation of performance for machine learning and feature engineering
2024
The classification of imbalanced datasets is a prominent task in text mining and machine learning. The number of samples in each class is not uniformly distributed; one class contains a large number of samples while the other has a small number. Overfitting of the model occurs as a result of imbalanced datasets, resulting in poor performance. In this study, we compare different oversampling techniques like synthetic minority oversampling technique (SMOTE), support vector machine SMOTE (SVM-SMOTE), Border-line SMOTE, K-means SMOTE, and adaptive synthetic (ADASYN) oversampling to address the issue of imbalanced datasets and enhance the performance of machine learning models. Preprocessing significantly enhances the quality of input data by reducing noise, redundant data, and unnecessary data. This enables the machines to identify crucial patterns that facilitate the extraction of significant and pertinent information from the preprocessed data. This study preprocesses the data using various top-level preprocessing steps. Furthermore, two imbalanced Twitter datasets are used to compare the performance of oversampling techniques with six machine learning models including random forest (RF), SVM, K-nearest neighbor (KNN), AdaBoost (ADA), logistic regression (LR), and decision tree (DT). In addition, the bag of words (BoW) and term frequency and inverse document frequency (TF-IDF) feature extraction approaches are used to extract features from the tweets. The experiments indicate that SMOTE and ADASYN perform much better than other techniques, thus providing higher accuracy. Additionally, overall results show that SVM with a 'linear' kernel tends to attain the highest accuracy and recall, 99.67% and 1.00 respectively, on ADASYN oversampled datasets and 99.57% accuracy on the SMOTE oversampled dataset with TF-IDF features. The SVM model using 10-fold cross-validation experiments achieved 97.40 mean accuracy with a 0.008 standard deviation. Our approach achieved 2.62% greater accuracy as compared to other current methods.
Journal Article
Estimation of sexual dimorphism of adult human mandibles of South Indian origin using non-metric parameters and machine learning classification algorithms
2025
The mandible is one of the most reliable bones for sex determination in forensic anthropology. The shape of the mandible provides valuable information regarding the male and female distinctions. Machine learning algorithms are widely used for various applications due to their accuracy and reliability, extending their application in biological profiling. This study aims to estimate sexual dimorphism using various machine-learning algorithms based on non-metric features of the mandible. This study uses four machine-learning algorithms (k-nearest neighbors, decision tree, support vector machines, and random forest) to determine sex based on 12 mandibular non-metric parameters. The data was collected from three medical institutes in Karnataka, India, involving a sample of 156 individuals. Random Forest consistently achieved the highest Jaccard Index (0.86), F1 score (0.92), and accuracy (0.92) across both SMOTE and Random Over-Sampling (ROS) methods, showing stable and robust performance. ROS improved balanced accuracy for KNN, Decision Tree, and SVM by up to 9.7%. Feature importance analysis highlighted N6 Gonial angle and N12 Flexure ramal post border as key predictors. Statistical tests found no significant accuracy differences among models. Female specificity remained lower across all models. This study offers insights into employing machine learning algorithms for sex identification using non-metric observations of the mandible.
Journal Article
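The abstract above compares SMOTE with Random Over-Sampling (ROS). As a hedged illustration of ROS in general, the sketch below duplicates randomly chosen samples of each smaller class until every class matches the largest one. It uses only the standard library; the helper name `random_oversample`, the toy features, and the 0/1 labels are assumptions for illustration and do not reproduce the study's data or code.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly drawn samples (with replacement) from each smaller
    class until all classes have as many samples as the largest class.
    Hypothetical helper for illustration."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - count):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


# Toy data: four samples of class 0 and one sample of class 1 (made up).
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
```

Unlike SMOTE, ROS adds exact copies of existing samples, so it creates no new feature values.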
DRG-Net: Diabetic Retinopathy Grading Network using Graph Learning with Extreme Gradient Boosting Classifier
2024
Diabetic retinopathy (DR) is a leading cause of blindness that occurs in different age groups, so early detection of DR can save millions of people from blindness. Further, the manual analysis of DR requires much processing time and experienced doctors. Hence, computer-aided diagnosis (CAD)-based artificial intelligence models have been developed for early DR prediction. However, the state-of-the-art methodologies fail to extract deep balanced features, which results in poor classification performance. Therefore, this work implements the DR grading network (DRG-Net) using graph learning properties. Initially, the synthetic minority over-sampling technique (SMOTE) is applied to the EyePACS and Messidor datasets to balance the instances of each DR class to a uniform level. Then, a deep graph correlation network (DGCN) is applied to extract the class-specific features by identifying the relationship. Finally, an extreme gradient boosting (XGBoost) classifier is employed to perform the DR classification with the pre-trained balanced features obtained using SMOTE-DGCN. The simulation results obtained on the EyePACS dataset and the Messidor dataset show that the proposed DRG-Net achieves higher performance than state-of-the-art DR grading classification approaches, with accuracy, sensitivity, and specificity of 99.01%, 99.01%, and 98.43% for the EyePACS dataset, respectively, and 99.6%, 99.08%, and 100% for the Messidor dataset.
Journal Article
Detection of Electricity Theft Behavior Based on Improved Synthetic Minority Oversampling Technique and Random Forest Classifier
2020
Effective detection of electricity theft is essential to maintain power system reliability. With the development of smart grids, traditional electricity theft detection technologies have become ineffective at dealing with the increasingly complex data on the users’ side. To improve the auditing efficiency of grid enterprises, a new electricity theft detection method based on an improved synthetic minority oversampling technique (SMOTE) and an improved random forest (RF) method is proposed in this paper. The data of normal and electricity theft users were classified as positive data (PD) and negative data (ND), respectively. In practice, the number of ND was far less than PD, which made the dataset composed of these two types of data unbalanced. An improved SMOTE based on the K-means clustering algorithm (K-SMOTE) was first presented to balance the dataset. The cluster center of ND was determined by the K-means method. Then, the ND were interpolated by SMOTE on the basis of the cluster center to balance the entire data. Finally, the RF classifier was trained with the balanced dataset, and the optimal number of decision trees in RF was decided according to the convergence of out-of-bag data error (OOB error). Electricity theft behaviors on the user side were detected by the trained RF classifier.
Journal Article
A SMOTE PCA HDBSCAN approach for enhancing water quality classification in imbalanced datasets
by Idris, Wan Mohd Razi; Nasaruddin, Norashikin; Masseran, Nurulkamal
in 639/705/531, 704/172, Chemical oxygen demand
2025
Class imbalance poses a significant challenge in water quality classification, often leading to biased predictions and diminished accuracy for minority classes. This study introduces SMOTE-PCA-HDBSCAN, a novel oversampling framework that integrates the Synthetic Minority Oversampling Technique (SMOTE) to generate synthetic samples, Principal Component Analysis (PCA) to enhance data separability, and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) to remove synthetic data noise. The cleaned synthetic data is then merged with the original dataset to form a balanced, noise-reduced training set. Comparative evaluations against SMOTE, SMOTE-DBSCAN, SMOTE-PCA-DBSCAN, SMOTE-ENN, and SMOTE-Tomek Links reveal that SMOTE-PCA-HDBSCAN consistently improves sensitivity for minority classes (Clean: 4.76% to 28.57%; Polluted: 38.09% to 61.90%) while maintaining high accuracy for the majority class. These results demonstrate the robustness of SMOTE-PCA-HDBSCAN in addressing class imbalance, offering a valuable tool for enhancing predictive models in environmental monitoring and other domains with imbalanced datasets.
Journal Article