Catalogue Search | MBRL

Machine learning-based network intrusion detection for big and imbalanced data using oversampling, stacking feature embedding and feature extraction

by Talukder, Md. Alamin , Moni, Mohammad Ali , Uddin, Md Ashraf in Abnormalities , Accuracy , Applied behavior analysis

2024

Cybersecurity has emerged as a critical global concern. Intrusion Detection Systems (IDS) play a critical role in protecting interconnected networks by detecting malicious actors and activities. Machine Learning (ML)-based behavior analysis within the IDS has considerable potential for detecting dynamic cyber threats, identifying abnormalities, and identifying malicious conduct within the network. However, as the volume of data grows, dimension reduction becomes an increasingly difficult task when training ML models. To address this, our paper introduces a novel ML-based network intrusion detection model designed specifically for large and imbalanced datasets. It uses Random Oversampling (RO) to address data imbalance, Stacking Feature Embedding based on clustering results, and Principal Component Analysis (PCA) for dimension reduction. The model's performance is carefully evaluated using three cutting-edge benchmark datasets: UNSW-NB15, CIC-IDS-2017, and CIC-IDS-2018. On the UNSW-NB15 dataset, our trials show that the Random Forest (RF) and Extra Trees (ET) models achieve accuracy rates of 99.59% and 99.95%, respectively. Furthermore, on the CIC-IDS-2017 dataset, the Decision Tree (DT), RF, and ET models reach 99.99% accuracy, while the DT and RF models obtain 99.94% accuracy on CIC-IDS-2018. These results consistently outperform the state of the art, indicating significant progress in the field of network intrusion detection. This achievement demonstrates the efficacy of the suggested methodology, which can be used in practice to accurately monitor and identify network traffic intrusions, thereby blocking possible threats.
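The random oversampling step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `random_oversample`, the toy flow records, and the choice to balance every class up to the size of the largest one are assumptions made for illustration.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    # Assumption: balance each class up to the size of the majority class.
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw the missing rows with replacement from the existing ones.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical toy data: 6 benign flows and 2 attack flows.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [9.0], [9.5]]
labels = ["benign"] * 6 + ["attack"] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

In practice, oversampling is usually applied to the training split only, so that duplicated rows do not leak into the evaluation data.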

Journal Article

Estimation of sexual dimorphism of adult human mandibles of South Indian origin using non-metric parameters and machine learning classification algorithms

by Ramos, Amith , Zuber, Mohammad , Pandey, Akhilesh Kumar in 639/166 , 692/308 , 692/698

2025

The mandible is one of the most reliable bones for sex determination in forensic anthropology. The shape of the mandible provides valuable information regarding the distinctions between males and females. Machine learning algorithms are widely used in various applications because of their accuracy and reliability, and their use now extends to biological profiling. This study aims to estimate sexual dimorphism using various machine learning algorithms based on non-metric features of the mandible. It uses four machine learning algorithms (k-nearest neighbors, decision tree, support vector machines, and random forest) to determine sex based on 12 mandibular non-metric parameters. The data were collected from three medical institutes in Karnataka, India, involving a sample of 156 individuals. Random Forest consistently achieved the highest Jaccard Index (0.86), F1 score (0.92), and accuracy (0.92) across both SMOTE and Random Over-Sampling (ROS) methods, showing stable and robust performance. ROS improved balanced accuracy for KNN, Decision Tree, and SVM by up to 9.7%. Feature importance analysis highlighted N6 Gonial angle and N12 Flexure ramal post border as key predictors. Statistical tests found no significant accuracy differences among models. Female specificity remained lower across all models. This study offers insights into employing machine learning algorithms for sex identification using non-metric observations of the mandible.
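The three scores quoted in this abstract can all be derived from the counts in a binary confusion matrix. The sketch below shows the standard definitions. The counts are hypothetical and are not taken from the study, and the abstract does not say how its scores were averaged across classes.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, F1, Jaccard) for the positive class from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    jaccard = tp / (tp + fp + fn)
    return accuracy, f1, jaccard

# Hypothetical counts for a 100-sample test set (not data from the study).
acc, f1, jac = binary_metrics(tp=40, fp=5, fn=5, tn=50)
print(f"accuracy={acc:.3f} f1={f1:.3f} jaccard={jac:.3f}")
```

For a single class the two overlap scores are linked by J = F1 / (2 - F1), so the Jaccard Index is never higher than the F1 score computed from the same counts.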

Journal Article

Application of Machine Learning to Predict COVID-19 Spread via an Optimized BPSO Model

by Assiri, Sara Ahmad , Elshewey, Ahmed M. , Hadjouni, Myriam in Algorithms , Altitude , binary particle swarm optimization

2023

During the coronavirus disease (COVID-19) pandemic, statistics showed that the number of affected cases differed from one country to another and also from one city to another. Therefore, in this paper, we provide an enhanced model for predicting COVID-19 samples in different regions of Saudi Arabia (high-altitude and sea-level areas). The model was developed in several stages and was successfully trained and tested using two datasets collected from Taif city (high-altitude area) and Jeddah city (sea-level area) in Saudi Arabia. Binary particle swarm optimization (BPSO) is used in this study for feature selection with three different machine learning models: the random forest model, the gradient boosting model, and the naive Bayes model. Several evaluation metrics, including accuracy, training score, testing score, F-measure, recall, precision, and the receiver operating characteristic (ROC) curve, were calculated to verify the performance of the three machine learning models on these datasets. The experimental results demonstrated that the gradient boosting model gives better results than the random forest and naive Bayes models, with an accuracy of 94.6% on the Taif city dataset. For the Jeddah city dataset, the results demonstrated that the random forest model outperforms the gradient boosting and naive Bayes models, with an accuracy of 95.5%. With the enhanced model, the Jeddah city dataset achieved better results than the Taif city dataset in terms of accuracy.
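A rough sketch of how BPSO can drive feature selection is given here. It uses the common sigmoid-transfer variant, in which each velocity is turned into the probability that a feature bit is switched on. The abstract does not state which variant or parameters the authors used, so `bpso_select`, its default coefficients, and `toy_fitness` (a stand-in for a classifier's validation score) are all assumptions.

```python
import math
import random

def bpso_select(fitness, n_features, n_particles=8, n_iters=30, seed=0,
                w=0.7, c1=1.5, c2=1.5):
    """Search for a 0/1 feature mask that maximises fitness(mask)."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                # Standard PSO velocity update on the binary position.
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # Sigmoid transfer: velocity becomes the probability that the bit is 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

def toy_fitness(mask):
    """Hypothetical score: reward three 'useful' features, penalise mask size."""
    useful = (0, 2, 5)
    return sum(mask[i] for i in useful) - 0.1 * sum(mask)

best_mask, best_score = bpso_select(toy_fitness, n_features=6)
print(best_mask, round(best_score, 2))
```

In a real pipeline the fitness function would train a classifier such as random forest on the selected columns and return a validation metric.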

Journal Article

Prognoza: Parkinson’s Disease Prediction Using Classification Algorithms

by Shivakoti, Mithun , Medaramatla, Sai Charan , Godavarthi, Deepthi in Datasets , Forecasting , Machine learning

2024

Parkinson's Disease (PD) is a persistent neurological condition that affects a significant number of individuals worldwide. The timely detection of PD is imperative for effective treatment and control of the condition. In recent years, machine learning (ML) methods have demonstrated significant potential in forecasting PD from diverse data sources. The present research paper outlines a study that employs ML techniques to predict Parkinson's disease. A dataset comprising clinical and demographic characteristics of both patients diagnosed with PD and healthy individuals was taken from Kaggle. This dataset was used to train and assess multiple machine learning models. The experimental findings indicate that the CatBoost model exhibited superior performance compared to the other models, achieving an accuracy rate of 95.1% and a root mean squared error of 0.34. In summary, our research showcases the capabilities of machine learning methodologies in forecasting Parkinson's disease and offers valuable insights into the crucial predictors for PD prognosis. The results of our study could potentially contribute to the advancement of diagnostic methods for the timely identification of PD, with increased precision and efficacy.

Journal Article

Random Oversampling-Based Diabetes Classification via Machine Learning Algorithms

by Eunice, R. Jennifer , Kanaga, E. Grace Mary , Andrew, J. in Artificial Intelligence , Boruta technique , Computational Intelligence

2024

Diabetes mellitus is considered one of the main causes of death worldwide. If diabetes is not diagnosed and treated early, it can cause several other health problems, such as kidney disease, nerve disease, vision problems, and brain issues. Early detection of diabetes reduces healthcare costs and minimizes the chance of serious complications. In this work, we propose an e-diagnostic model for diabetes classification via a machine learning algorithm that can be executed on the Internet of Medical Things (IoMT). The study uses and analyses two benchmark datasets, the PIMA Indian Diabetes Dataset (PIDD) and the Behavioral Risk Factor Surveillance System (BRFSS) diabetes dataset, to classify diabetes. The proposed model consists of the random oversampling method to balance the classes, interquartile range-based outlier detection to eliminate outlier data, and the Boruta algorithm for selecting the optimal features from the datasets. The proposed approach considers ML algorithms such as random forest, gradient boosting models, light gradient boosting classifiers, and decision trees, as they are widely used classification algorithms for diabetes prediction. We evaluated all four ML algorithms via performance indicators such as accuracy, F1 score, recall, precision, and AUC-ROC. Comparative analysis of this model suggests that the random forest algorithm outperforms all the remaining classifiers, with the greatest accuracy of 92% on the BRFSS diabetes dataset and 94% accuracy on the PIDD dataset, which is 3% higher than the accuracy reported in existing research. This research is helpful for assisting diabetologists in developing accurate treatment regimens for patients who are diabetic.
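The interquartile-range outlier step can be sketched as follows. The 1.5 × IQR fence is the common Tukey convention and is an assumption here, since the abstract does not state the multiplier. The function name `iqr_filter` and the sample glucose readings are hypothetical.

```python
import statistics

def iqr_filter(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical glucose readings with one implausible value.
glucose = [85, 90, 95, 100, 105, 110, 115, 400]
print(iqr_filter(glucose))
```

On a multi-column dataset the same fence would typically be computed per feature, and a row dropped if any of its values falls outside.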

Journal Article

An Intrusion Detection System Based on a Simplified Residual Network

by Xiao, Yuelei , Xiao, Xing in Accuracy , Datasets , Decision trees

2019

Residual networks (ResNets) are prone to over-fitting on low-dimensional and small-scale datasets. Moreover, existing intrusion detection systems (IDSs) fail to provide good performance, especially for remote-to-local (R2L) and user-to-root (U2R) attacks. To overcome these problems, a simplified residual network (S-ResNet) is proposed in this paper, which consists of several cascaded, simplified residual blocks. Compared with the original residual block, the simplified residual block deletes a weight layer and two batch normalization (BN) layers, adds a pooling layer, and replaces the rectified linear unit (ReLU) function with the parametric rectified linear unit (PReLU) function. Based on the S-ResNet, a novel IDS is proposed in this paper, which includes a data preprocessing module, a random oversampling module, an S-ResNet layer, a fully connected layer, and a Softmax layer. The experimental results on the NSL-KDD dataset show that the IDS based on the S-ResNet has higher accuracy, recall, and F1-score than the equal-scale ResNet-based IDS, especially for R2L and U2R attacks, and that it converges faster. This indicates that the S-ResNet reduces the complexity of the network and effectively prevents over-fitting; thus, it is more suitable for low-dimensional and small-scale datasets than ResNet. Furthermore, the experimental results on the NSL-KDD dataset also show that the IDS based on the S-ResNet achieves better accuracy and recall than existing IDSs, especially for R2L and U2R attacks.

Journal Article

Determining Resampling Ratios Using BSMOTE and SVM-SMOTE for Identifying Rare Attacks in Imbalanced Cybersecurity Data

by Bagui, Subhash C. , Subramaniam, Sakthivel , Mink, Dustin in Algorithms , BSMOTE , Classification

2023

Machine Learning is widely used in cybersecurity for detecting network intrusions. Though network attacks are increasing steadily, such attacks make up only a very small percentage of actual network traffic. Herein lies the problem in training Machine Learning models to detect malicious attacks and distinguish them from routine traffic: the ratio of benign data to actual attacks is very high, which produces highly imbalanced datasets. In this work, we address this issue using data resampling techniques. Several oversampling and undersampling techniques are available, and this paper addresses how they can be used most effectively. Two oversampling techniques, Borderline SMOTE (BSMOTE) and SVM-SMOTE, are used for oversampling minority data, and random undersampling is used for undersampling majority data. Both oversampling techniques use KNN after selecting a random minority sample point, so the impact of varying KNN values on the performance of the oversampling technique is also analyzed. Random Forest is used for classification of the rare attacks. This work is done on a widely used cybersecurity dataset, UNSW-NB15, and the results show that 10% oversampling gives better results for both BSMOTE and SVM-SMOTE.
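The KNN-based interpolation that SMOTE-style methods share can be sketched as below. This is plain SMOTE-like interpolation only: the borderline selection used by Borderline SMOTE and the support-vector step used by SVM-SMOTE are deliberately omitted, and `smote_like`, its parameters, and the sample points are hypothetical.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority points and their k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (Euclidean distance).
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic

# Hypothetical two-feature minority (attack) samples.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]]
new_points = smote_like(minority, n_new=4, k=3)
print(new_points)
```

Changing `k` changes which neighbours can be paired with each base point, which is the KNN setting whose effect the abstract says was analyzed.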

Journal Article

4Gbaud PS-16QAM D-Band Fiber-Wireless Transmission over 4.6 km by Using Balance Complex-Valued NN Equalizer with Random Oversampling

by Xie, Tangyao , Yu, Jianguo in Algorithms , balanced random oversampling , Classification

2023

D-band (110–170 GHz) is a promising direction for high-speed mobile communication in future 6th generation mobile networks (6G), since it has a large available bandwidth and can provide a peak rate of hundreds of Gbit/s. Compared with the traditional electrical approach, photonic millimeter-wave (mm-wave) generation in D-band is more practical and effectively overcomes the bottleneck of electrical devices. However, long-distance D-band wireless transmission is still limited by key factors such as large absorption loss and nonlinear noise. Deep neural network algorithms are regarded as an important technique for modeling nonlinear wireless behavior, and the study of complex-valued equalization is critical among them, especially in coherent detection systems. Moreover, probabilistic shaping is useful for improving transmission capacity but also causes an imbalanced machine learning problem. In this paper, we propose a novel complex-valued neural network equalizer coupled with balanced random oversampling (ROS). Thanks to the adaptive deep learning method for probabilistic shaping-quadrature amplitude modulation (PS-QAM), we successfully realize a 135 GHz 4Gbaud PS-16QAM wireless transmission over 4.6 km with a shaping entropy of 3.56 bit/symbol. The bit error ratio (BER) of 4Gbaud PS-16QAM can be decreased below the soft-decision forward error correction (SD-FEC) threshold of 2 × 10⁻² with a 25% overhead. Therefore, we can achieve a net rate of 11.4 Gbit/s for D-band radio-over-fiber (ROF) delivery over a 4.6 km free-space wireless distance.
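The quoted net rate is consistent with a simple back-of-the-envelope calculation from the other figures in the abstract. The formula used here (gross rate divided by one plus the FEC overhead) is an assumption about how the number was derived, not something the abstract states.

```python
symbol_rate_gbaud = 4.0          # 4 Gbaud, from the abstract
entropy_bits_per_symbol = 3.56   # shaping entropy, from the abstract
fec_overhead = 0.25              # 25% SD-FEC overhead, from the abstract

# Gross rate: symbols per second times information bits per symbol.
gross_gbit_s = symbol_rate_gbaud * entropy_bits_per_symbol
# Assumed net rate: remove the share of the gross rate spent on FEC overhead.
net_gbit_s = gross_gbit_s / (1 + fec_overhead)

print(f"gross {gross_gbit_s:.2f} Gbit/s, net {net_gbit_s:.1f} Gbit/s")
# prints "gross 14.24 Gbit/s, net 11.4 Gbit/s"
```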

Journal Article

Imbalanced data handling in multiclass distributed denial of service attack detection using deep learning

by Khamis, Nurulaqilla , Gunawan, Rahmad , Fu’adah Amran, Hasanatul in Accuracy , Algorithms , Assaults

2024

In data analysis, imbalanced datasets are a frequent issue: classes in a dataset have an uneven distribution, which can lead to poor performance in machine learning (ML) and predictive modeling. In this study, we analyze distributed denial of service (DDoS) attacks at the application layer. We examine three primary strategies for addressing data imbalance in multiclass settings: random oversampling (ROS), random undersampling (RUS), and the use of class weights. A model using a deep learning (DL) technique is proposed and is trained and tested for DDoS attack detection. The results presented in this paper show that RUS outperforms class weights and ROS at resolving imbalanced data in multiclass settings when implemented with the deep learning-based DDoS attack detection model.
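Two of the three strategies compared in this abstract, class weights and random undersampling, can be sketched in a few lines. The weight formula n_samples / (n_classes × class_count) is the widely used "balanced" heuristic and is an assumption here, since the abstract does not say how the weights were set. The class names and counts are hypothetical.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}

def random_undersample(rows, labels, seed=0):
    """Randomly drop rows until every class matches the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample without replacement so no row is kept twice.
        for row in rng.sample(members, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical multiclass traffic labels.
labels = ["benign"] * 8 + ["http_flood"] * 3 + ["slowloris"] * 1
rows = list(range(len(labels)))

weights = balanced_class_weights(labels)
rus_rows, rus_labels = random_undersample(rows, labels)
print(weights)
print(Counter(rus_labels))
```

Class weights leave the data untouched and rescale each sample's contribution to the loss, whereas undersampling discards majority rows, which shrinks the training set.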

Journal Article

Explainable Machine Learning for Efficient Diabetes Prediction Using Hyperparameter Tuning, SHAP Analysis, Partial Dependency, and LIME

by Shahid, Md. Shamim Bin , Rifat, Habibur Rahman , Uddin, Khandaker Mohammad Mohi in Accuracy , Algorithms , Classification

2025

Diabetes is a chronic metabolic disease characterized by elevated blood glucose levels and poses significant health risks, such as cardiovascular disease and cognitive damage. Understanding the causes of diabetes is crucial to managing it and preventing complications. The clinical community has a large amount of diabetes diagnostic data. Machine learning algorithms may simplify finding hidden patterns, retrieving data from databases, and predicting outcomes. To tackle the challenge of designing a more accurate diabetes classification algorithm, random oversampling and hyperparameter tuning techniques have been used in this study. Whereas most existing methods were built on a single dataset, our proposed model, for broader applicability, has been designed based on two benchmark datasets: the BRFSS dataset, which has multiple classes, and the Diabetes 2019 dataset, which has binary classes. In addition, to improve the comprehensibility of the proposed model, a variety of explainability methods, namely SHapley Additive Explanations (SHAP), Partial Dependency, and Local Interpretable Model-agnostic Explanations (LIME), have been implemented; these are rarely seen in previous work. The detailed explainability charts will enable end users or practitioners to understand the exact factors behind any given diagnostic report. This research focused on classifying type 2 diabetes using machine learning and providing an explanation for the outcomes derived from the model predictions. Random oversampling and quantile transform are used to rectify imbalances in the dataset and guarantee the resilience of model training. By meticulously adjusting parameters with GridSearchCV, we successfully optimized our models to attain exceptional accuracy across binary and multi-class datasets. We evaluate the proposed model using two datasets and performance metrics. The extra trees classifier (ET) performed exceptionally, achieving 97.23% accuracy on the multi-class dataset and 97.45% on the binary dataset. The significance of this study is centered around the creation of a precise and resilient diabetes classification framework that is inherently transparent. The framework performs well consistently, which suggests that it may be more likely to generalize effectively to new, unseen data.

Journal Article
