Catalogue Search | MBRL
Search Results
Explore the vast range of titles available.
1,723 result(s) for "ensemble trees"
Classification and Explanation for Intrusion Detection System Based on Ensemble Trees and SHAP Method
2022
In recent years, many methods for intrusion detection systems (IDS) have been designed and developed in the research community and have achieved perfect detection rates on IDS datasets. Deep neural networks (DNNs) are representative examples that are widely applied in IDS. However, DNN models are becoming increasingly complex in architecture and demand substantial computing resources. In addition, it is difficult for humans to obtain explanations for the decisions these DNN models make on large IoT-based IDS datasets. Many proposed IDS methods have not been applied in practical deployments because they offer no explanations to cybersecurity experts that would support them in optimizing their decisions according to the judgments of the IDS models. This paper aims to enhance the attack detection performance of IDS on big IoT-based IDS datasets and to provide explanations of the machine learning (ML) model predictions. The proposed ML-based IDS method is based on the ensemble trees approach, using decision tree (DT) and random forest (RF) classifiers, which do not require high computing resources for training. Two big datasets with NetFlow-based feature sets, NF-BoT-IoT-v2 and NF-ToN-IoT-v2 (new versions of the original BoT-IoT and ToN-IoT datasets), are used for the experimental evaluation of the proposed method, and the IoTDS20 dataset is used for additional experiments. Furthermore, SHapley Additive exPlanations (SHAP), an eXplainable AI (XAI) methodology, is applied to explain and interpret the classification decisions of the DT and RF models; this is not only effective in interpreting the final decision of the ensemble tree approach but also supports cybersecurity experts in quickly evaluating and optimizing the correctness of their judgments based on the explained results.
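The workflow described in this abstract (a tree-ensemble classifier whose decisions are attributed to input features with SHAP) can be sketched roughly as below. This is an illustrative outline on synthetic data, not the paper's code: the feature count, labels, split, and model settings are assumptions, and the data are stand-ins for the NetFlow-based IDS datasets.

```python
# Hedged sketch: a random forest classifier explained with SHAP's TreeExplainer.
# Feature values and labels are synthetic stand-ins, not the NetFlow-based IDS
# datasets (NF-BoT-IoT-v2, NF-ToN-IoT-v2, IoTDS20) evaluated in the paper.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))                     # six dummy flow features
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)      # dummy benign/attack label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))

# TreeExplainer computes exact Shapley values for tree ensembles.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_te)

# Output layout differs across SHAP versions: older releases return a list of
# per-class arrays, newer ones a single (n_samples, n_features, n_classes) array.
sv = shap_values[1] if isinstance(shap_values, list) else np.asarray(shap_values)
sv = sv[..., 1] if sv.ndim == 3 else sv
print("mean |SHAP| per feature (attack class):", np.abs(sv).mean(axis=0))
```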
Journal Article
Machine Learning-Based Gully Erosion Susceptibility Mapping: A Case Study of Eastern India
by Roy, Jagabandhu; Saha, Sunil; Blaschke, Thomas
in geographical information system (gis); gradient boosted regression tree (gbrt); naïve bayes tree (nbt)
2020
Gully erosion is a natural hazard and one of the land-loss mechanisms causing severe problems worldwide. This study aims to delineate the areas with the most severe gully erosion susceptibility (GES) using the machine learning techniques Random Forest (RF), Gradient Boosted Regression Tree (GBRT), Naïve Bayes Tree (NBT), and Tree Ensemble (TE). The gully inventory map (GIM) consists of 120 gullies, of which 84 (70%) were used for training and 36 (30%) were used to validate the models. Fourteen gully conditioning factors (GCFs) were used for GES modeling, and the relationships between the GCFs and gully erosion were assessed using the weight-of-evidence (WofE) model. The GES maps were prepared using RF, GBRT, NBT, and TE and were validated using the area under the receiver operating characteristic (AUROC) curve, the seed cell area index (SCAI), and five statistical measures: precision (PPV), false discovery rate (FDR), accuracy, mean absolute error (MAE), and root mean squared error (RMSE). Nearly 7% of the basin has high to very high susceptibility to gully erosion. Validation results proved the excellent ability of these models to predict the GES. Of the analyzed models, RF (AUROC = 0.96, PPV = 1.00, FDR = 0.00, accuracy = 0.87, MAE = 0.11, RMSE = 0.19 on the validation dataset) is accurate enough for modeling and better suited for GES modeling than the other models. Therefore, the RF model can be used to model GES not only in this river basin but also in other areas with the same geo-environmental conditions.
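As a rough illustration of the validation metrics reported above, the sketch below fits a random forest to a synthetic 120-sample inventory with fourteen placeholder conditioning factors and computes AUROC, PPV, FDR (= 1 − PPV), MAE, and RMSE on a 30% holdout. None of the data, factors, or settings come from the paper.

```python
# Hedged sketch: validating a susceptibility classifier with AUROC, PPV, FDR,
# MAE, and RMSE. The fourteen "gully conditioning factors" are random placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, precision_score, mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 14))                 # 120 samples x 14 conditioning factors
y = (X[:, :3].sum(axis=1) > 0).astype(int)     # dummy gully / non-gully label

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=1)
rf = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_tr, y_tr)

prob = rf.predict_proba(X_val)[:, 1]
pred = (prob >= 0.5).astype(int)
ppv = precision_score(y_val, pred)

print("AUROC:", roc_auc_score(y_val, prob))
print("PPV :", ppv, " FDR:", 1.0 - ppv)        # FDR = FP / (FP + TP) = 1 - PPV
print("MAE :", mean_absolute_error(y_val, prob))
print("RMSE:", mean_squared_error(y_val, prob) ** 0.5)
```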
Journal Article
Assessing the predictive capability of ensemble tree methods for landslide susceptibility mapping using XGBoost, gradient boosting machine, and random forest
2020
Decision tree-based classifier ensemble methods are a machine learning (ML) technique that combines several tree models to produce an effective or optimal predictive model, and they generally predict better than a single model. Thus, selecting a proper ML algorithm helps us understand possible future occurrences by analyzing the past more accurately. The main purpose of this study is to produce a landslide susceptibility map of the Ayancik district of Sinop province, situated in the Black Sea region of Turkey, using three regression tree-based ensemble methods: gradient boosting machines (GBM), extreme gradient boosting (XGBoost), and random forest (RF). Fifteen landslide causative factors and 105 landslide locations that occurred in the region were used. The landslide inventory map was randomly divided into training (70%) and testing (30%) datasets to construct the RF, XGBoost, and GBM prediction models. The symmetrical uncertainty measure was utilized to determine the most important causative factors, and the selected features were then used to construct the susceptibility prediction models. The performance of the ensemble models was validated using different accuracy metrics, including the area under the curve (AUC), overall accuracy (OA), root mean square error (RMSE), and Kappa coefficient. The Wilcoxon signed-rank test was also used to assess differences between the optimum models. The accuracy results showed that the XgBoost_Opt model (the model created with the optimum factor combination) has the highest prediction capability (OA = 0.8501 and AUC = 0.8976), followed by RF_opt (OA = 0.8336 and AUC = 0.8860) and GBM_Opt (OA = 0.8244 and AUC = 0.8796). The Wilcoxon signed-rank test confirmed that the XgBoost_Opt model, built on the best subset combination, differs statistically significantly from the other models. The results showed that the XGBoost method with the optimum model achieved lower prediction error and higher accuracy than the other ensemble methods.
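Two of the ingredients mentioned here, factor ranking by symmetrical uncertainty and the Wilcoxon signed-rank comparison of models, can be sketched as follows. The SU implementation uses the standard definition 2·I(X;y)/(H(X)+H(y)) on binned features; the data are synthetic and the model settings are assumptions, not the paper's configuration.

```python
# Sketch: symmetrical uncertainty SU(X, y) = 2 * I(X; y) / (H(X) + H(y)) on
# binned features, then a Wilcoxon signed-rank comparison of two boosted models.
import numpy as np
from scipy.stats import entropy, wilcoxon
from sklearn.metrics import mutual_info_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def symmetrical_uncertainty(x, y, bins=10):
    xb = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    i_xy = mutual_info_score(xb, y)
    h_x = entropy(np.bincount(xb) / len(xb))
    h_y = entropy(np.bincount(y) / len(y))
    return 2.0 * i_xy / (h_x + h_y)

rng = np.random.default_rng(2)
X = rng.normal(size=(105, 15))                 # 105 locations x 15 causative factors (dummy)
y = (X[:, 0] - X[:, 4] > 0).astype(int)

su = [symmetrical_uncertainty(X[:, j], y) for j in range(X.shape[1])]
keep = np.argsort(su)[::-1][:8]                # keep the 8 highest-ranked factors
print("selected factor indices:", keep)

X_tr, X_te, y_tr, y_te = train_test_split(X[:, keep], y, test_size=0.3, random_state=2)
xgb = XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss").fit(X_tr, y_tr)
gbm = GradientBoostingClassifier(n_estimators=200, max_depth=3).fit(X_tr, y_tr)

# Wilcoxon signed-rank test on paired per-sample predicted probabilities.
print(wilcoxon(xgb.predict_proba(X_te)[:, 1], gbm.predict_proba(X_te)[:, 1]))
```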
Journal Article
Improved intelligent methods for power transformer fault diagnosis based on tree ensemble learning and multiple feature vector analysis
2024
This paper discusses the impact of the feature input vector on the performance of dissolved gas analysis-based intelligent power transformer fault diagnosis methods. For this purpose, 22 feature vectors derived from traditional diagnostic methods were used as inputs to four tree-based ensemble algorithms, namely random forest, tree ensemble, gradient boosted tree, and extreme gradient tree. To build the proposed diagnostic models, 407 samples were used for training and testing, and 89 samples were used for validation and comparison with existing methods from the literature. Based on the results obtained on the training and testing datasets, the best performance was achieved with feature vector 16, which consists of the gas ratios of Rogers' four ratios method and the three ratios technique. The test accuracies based on this vector are 98.37, 96.75, 95.93, and 97.56% for the random forest, tree ensemble, gradient boosted tree, and extreme gradient tree algorithms, respectively. Furthermore, the performance of the methods based on the best input feature vector was evaluated and compared with other methods from the literature, such as the Duval triangle, the modified Rogers' four ratios method, the combined technique, the three ratios technique, the Gouda triangle, IEC 60599, NBR 7274, the clustering method, and key gases with gas ratios. These methods suffer from unreliability, which is the motivation behind the current work to develop a new technique that enhances the diagnostic accuracy of transformer faults and helps avoid unwanted faults and outages in the network. On the validation dataset, diagnostic accuracies of 92.13, 91.01, 89.89, and 91.01% were achieved by the random forest, tree ensemble, gradient boosted tree, and extreme gradient tree models, respectively. These diagnostic accuracies are higher than 83.15% for the clustering method, 82.02% for the combined technique, 80.90% for the modified IEC 60599, and 79.78% for key gases with gas ratios, which are the best existing methods. Even though the performance of dissolved gas analysis-based intelligent methods depends strongly on the shape of the feature vector used, this study provides scholars with a tool for choosing the feature vector to use when implementing these methods.
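A minimal sketch of the feature-engineering step: computing dissolved-gas ratio features and feeding them to a tree-ensemble classifier. Only Rogers' four classical ratios are shown; the paper's feature vector 16 additionally includes the ratios of the three ratios technique, which are not reproduced here, and all gas concentrations and fault labels below are synthetic.

```python
# Sketch: dissolved-gas ratio features (Rogers' four ratios only, as a stand-in
# for the paper's feature vector 16) fed to a random forest fault classifier.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
gases = pd.DataFrame(
    rng.uniform(1.0, 500.0, size=(407, 5)),
    columns=["H2", "CH4", "C2H6", "C2H4", "C2H2"],   # gas concentrations (ppm), synthetic
)
faults = rng.integers(0, 6, size=len(gases))          # dummy fault classes

eps = 1e-9  # avoid division by zero for gases reported near 0 ppm
features = pd.DataFrame({
    "CH4/H2":    gases.CH4  / (gases.H2   + eps),
    "C2H6/CH4":  gases.C2H6 / (gases.CH4  + eps),
    "C2H4/C2H6": gases.C2H4 / (gases.C2H6 + eps),
    "C2H2/C2H4": gases.C2H2 / (gases.C2H4 + eps),
})

rf = RandomForestClassifier(n_estimators=300, random_state=3)
print("cross-validated accuracy:", cross_val_score(rf, features, faults, cv=5).mean())
```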
Journal Article
Tree-Based Methods of Volatility Prediction for the S&P 500 Index
2025
Predicting asset return volatility is one of the central problems in quantitative finance. These predictions are used for portfolio construction, calculation of value at risk (VaR), and pricing of derivatives such as options. Classical methods of volatility prediction utilize historical returns data and include the exponentially weighted moving average (EWMA) and generalized autoregressive conditional heteroskedasticity (GARCH). These approaches have shown significantly higher rates of predictive accuracy than corresponding methods of return forecasting, but they still have vast room for improvement. In this paper, we propose and test several methods of volatility forecasting on the S&P 500 Index using tree ensembles from machine learning, namely random forest and gradient boosting. We show that these methods generally outperform the classical approaches across a variety of metrics on out-of-sample data. Finally, we use the unique properties of tree-based ensembles to assess what data can be particularly useful in predicting asset return volatility.
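A rough sketch of the comparison the abstract describes: an EWMA variance baseline (RiskMetrics decay λ = 0.94) versus a random-forest forecast built from lagged squared returns. The return series is simulated rather than S&P 500 data, and the 10-lag feature window, split, and metric are assumptions.

```python
# Sketch: EWMA volatility baseline vs. a random-forest forecast of next-day
# squared returns. The return series is simulated, not S&P 500 data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)
returns = rng.normal(0.0, 0.01, size=2000)            # dummy daily returns

# EWMA variance with the RiskMetrics decay lambda = 0.94.
lam, ewma = 0.94, np.zeros_like(returns)
ewma[0] = returns[0] ** 2
for t in range(1, len(returns)):
    ewma[t] = lam * ewma[t - 1] + (1 - lam) * returns[t - 1] ** 2

# Supervised setup: predict tomorrow's squared return from the last 10 squared returns.
lags, sq = 10, returns ** 2
X = np.column_stack([sq[i:len(sq) - lags + i] for i in range(lags)])
y = sq[lags:]
split = int(0.7 * len(y))

rf = RandomForestRegressor(n_estimators=300, random_state=4).fit(X[:split], y[:split])
pred = rf.predict(X[split:])

print("RF   RMSE:", mean_squared_error(y[split:], pred) ** 0.5)
print("EWMA RMSE:", mean_squared_error(y[split:], ewma[lags:][split:]) ** 0.5)
```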
Journal Article
A new method for exploring gene–gene and gene–environment interactions in GWAS with tree ensemble methods and SHAP values
2021
Background
The identification of gene–gene and gene–environment interactions in genome-wide association studies is challenging due to the unknown nature of the interactions and the overwhelmingly large number of possible combinations. Parametric regression models are suitable to look for prespecified interactions. Nonparametric models such as tree ensemble models, with the ability to detect any unspecified interaction, have previously been difficult to interpret. However, with the development of methods for model explainability, it is now possible to interpret tree ensemble models efficiently and with a strong theoretical basis.
Results
We propose a tree ensemble- and SHAP-based method for identifying as well as interpreting potential gene–gene and gene–environment interactions on large-scale biobank data. A set of independent cross-validation runs is used to implicitly investigate the whole genome. We apply and evaluate the method using data from the UK Biobank with obesity as the phenotype. The results are in line with previous research on obesity as we identify top SNPs previously associated with obesity. We further demonstrate how to interpret and visualize interaction candidates.
Conclusions
The new method identifies interaction candidates otherwise not detected with parametric regression models. However, further research is needed to evaluate the uncertainties of these candidates. The method can be applied to large-scale biobanks with high-dimensional data.
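A compact sketch of the interaction-detection idea: fit a gradient-boosted tree model and rank feature pairs by mean absolute SHAP interaction values. XGBoost stands in here for the paper's tree ensemble, the genotype and phenotype data are simulated with one planted SNP×SNP interaction, and the paper's cross-validation scheme is omitted.

```python
# Sketch: ranking pairwise interaction candidates with SHAP interaction values
# from a gradient-boosted tree model. Genotypes (0/1/2) and the phenotype are
# simulated; the paper's actual pipeline on UK Biobank data is far larger.
import numpy as np
import shap
from xgboost import XGBClassifier

rng = np.random.default_rng(5)
n, p = 3000, 20
geno = rng.integers(0, 3, size=(n, p)).astype(float)   # toy SNP dosages
env = rng.normal(size=(n, 1))                           # one toy environmental covariate
X = np.hstack([geno, env])

# Planted SNP0 x SNP1 interaction driving a binary phenotype.
logit = 0.8 * geno[:, 0] * geno[:, 1] - 1.5
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                      eval_metric="logloss").fit(X, y)

# SHAP interaction values for a binary XGBoost model:
# array of shape (n_samples, n_features, n_features).
inter = shap.TreeExplainer(model).shap_interaction_values(X[:500])
strength = np.abs(inter).mean(axis=0)
np.fill_diagonal(strength, 0.0)                         # ignore main effects
i, j = np.unravel_index(np.argmax(strength), strength.shape)
print("strongest interaction candidate: features", i, "and", j)
```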
Journal Article
Drug-target interaction prediction with tree-ensemble learning and output space reconstruction
2020
Background
Computational prediction of drug-target interactions (DTI) is vital for drug discovery. The experimental identification of interactions between drugs and target proteins is very onerous. Modern technologies have mitigated the problem and supported the development of new drugs. However, drug development remains extremely expensive and time-consuming. Therefore, in silico DTI predictions based on machine learning can alleviate the burdensome task of drug development. Many machine learning approaches have been proposed over the years for DTI prediction. Nevertheless, prediction accuracy and efficiency are persisting problems that still need to be tackled. Here, we propose a new learning method which addresses DTI prediction as a multi-output prediction task by learning ensembles of multi-output bi-clustering trees (eBICT) on reconstructed networks. In our setting, the nodes of a DTI network (drugs and proteins) are represented by features (background information). The interactions between the nodes of a DTI network are modeled as an interaction matrix and compose the output space of our problem. The proposed approach integrates background information from both the drug and the target protein space into the same global network framework.
Results
We performed an empirical evaluation, comparing the proposed approach to state of the art DTI prediction methods and demonstrated the effectiveness of the proposed approach in different prediction settings. For evaluation purposes, we used several benchmark datasets that represent drug-protein networks. We show that output space reconstruction can boost the predictive performance of tree-ensemble learning methods, yielding more accurate DTI predictions.
Conclusions
We proposed a new DTI prediction method where bi-clustering trees are built on reconstructed networks. Building tree-ensemble learning models with output space reconstruction leads to superior prediction results, while preserving the advantages of tree-ensembles, such as scalability, interpretability and inductive setting.
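The general multi-output formulation (drug features in, one interaction-matrix column per output) can be sketched with a plain multi-output random forest. This is not the paper's bi-clustering trees (eBICT) or its output space reconstruction step, and all drugs, targets, and interactions below are synthetic.

```python
# Sketch of a multi-output tree-ensemble setup for DTI-style data: drug feature
# vectors as inputs, one interaction-matrix column per output. Plain multi-output
# random forest, not the paper's eBICT; data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
n_drugs, n_feats, n_targets = 400, 50, 30
X = rng.normal(size=(n_drugs, n_feats))                 # toy drug descriptors
W = rng.normal(size=(n_feats, n_targets))
Y = (X @ W + rng.normal(size=(n_drugs, n_targets)) > 0).astype(int)  # toy interaction matrix

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=6)

# scikit-learn tree ensembles handle multi-output classification natively.
model = RandomForestClassifier(n_estimators=200, random_state=6).fit(X_tr, Y_tr)
proba = np.stack([p[:, 1] for p in model.predict_proba(X_te)], axis=1)

print("macro AUROC over targets:",
      np.mean([roc_auc_score(Y_te[:, j], proba[:, j]) for j in range(n_targets)]))
```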
Journal Article
Decision-based evasion attacks on tree ensemble classifiers
2020
Learning-based classifiers have been found to be susceptible to adversarial examples. Recent studies suggested that ensemble classifiers tend to be more robust than single classifiers against evasion attacks. In this paper, we argue that this is not necessarily the case. In particular, we show that a discrete-valued random forest classifier can be easily evaded by adversarial inputs manipulated based only on the model's decision outputs. The proposed evasion algorithm is gradient-free and can be implemented efficiently. Our evaluation results demonstrate that random forests can be even more vulnerable than SVMs, either single or ensemble, to evasion attacks under both white-box and the more realistic black-box settings.
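A toy illustration of the decision-based (label-only, gradient-free) setting: perturb a correctly classified input at random and keep the smallest perturbation that flips the random forest's predicted label. This is a naive random-search stand-in on synthetic data, not the attack algorithm proposed in the paper.

```python
# Sketch: a gradient-free, decision-based evasion loop against a random forest.
# Only the predicted label is queried; the smallest label-flipping perturbation
# found so far is kept. Naive random search, for illustration only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X, y = make_classification(n_samples=2000, n_features=20, random_state=7)
rf = RandomForestClassifier(n_estimators=100, random_state=7).fit(X, y)

x0 = X[0]
true_label = rf.predict(x0.reshape(1, -1))[0]

best, best_dist = None, np.inf
for _ in range(2000):                                   # label-only queries
    candidate = x0 + rng.normal(scale=0.5, size=x0.shape)
    if rf.predict(candidate.reshape(1, -1))[0] != true_label:
        dist = np.linalg.norm(candidate - x0)
        if dist < best_dist:
            best, best_dist = candidate, dist

print("adversarial example found:", best is not None, "L2 distance:", best_dist)
```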
Journal Article
Probabilistic machine learning for predicting desiccation cracks in clayey soils
by Xu, Yongfu; Costa, Susanga; Jamhiri, Babak
in Earth and Environmental Science; Earth Sciences; Foundations
2023
With frequent heatwaves and drought-downpour cycles, climate change gives rise to severe desiccation cracks. In this research, a probabilistic machine learning (ML) framework is developed to improve on deterministic models. A complete set of data-driven soil and environmental parameters, including initial water content (IWC), crack water content (CWC), final water content (FWC), soil layer thickness (SLT), temperature (Temp), and relative humidity (RH), is used as input to predict the crack surface ratio (CSR). A comprehensive set of ML models, including an ensemble of regression trees (random forests [RF] and regression trees [RT]), gradient-boosted trees (GBT and XGBT), support-vector machines (SVM), and an artificial neural network with particle swarm optimization (ANN-PSO), is developed for the predictions. Monte Carlo simulation (MCS) is then employed to introduce uncertainties into the models by shuffling and randomizing samples. Two sensitivity analyses, namely input exclusion and partial dependence-individual conditional expectation plots, are further conducted to assess prediction reliability. Results indicate that the performance ranking of the developed ML models is SVM > GBT > XGBT > ANN-PSO > RF > RT. However, according to the probabilistic modeling based on the MCS, GBTs are highly capable predictors with the lowest errors and uncertainties. The performance order of the models in terms of higher coefficient of determination and lower standard deviation is GBT > SVM > XGBT > RF > ANN-PSO > RT. The sensitivity analyses ranked the parameter importance in the order FWC > CWC > SLT > IWC > Temp > RH. These findings demonstrate the capability of probabilistic ML under uncertainty: measuring the variance of prediction errors improves the precision of the reported performance.
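The Monte Carlo and sensitivity-analysis steps can be sketched as below: repeated random train/test splits give a distribution of scores (an uncertainty band) for a gradient-boosted tree regressor, and partial dependence is queried for one input. The six inputs are random stand-ins for IWC, CWC, FWC, SLT, Temp, and RH, and the CSR target is synthetic, so none of the rankings in the paper are reproduced here.

```python
# Sketch: Monte Carlo resampling of train/test splits to attach an uncertainty
# band to a gradient-boosted tree regressor, plus a partial-dependence query.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
X = rng.uniform(size=(500, 6))                                    # dummy soil/environment inputs
y = 0.6 * X[:, 2] + 0.3 * X[:, 1] + 0.1 * rng.normal(size=500)    # dummy crack surface ratio

scores = []
for seed in range(100):                                           # Monte Carlo repetitions
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    gbt = GradientBoostingRegressor(random_state=seed).fit(X_tr, y_tr)
    scores.append(r2_score(y_te, gbt.predict(X_te)))

print("R2 mean +/- std over 100 runs:", np.mean(scores), np.std(scores))

# Partial dependence of the last fitted model on input 2 (the stand-in for FWC).
pd_result = partial_dependence(gbt, X, features=[2])
print("average partial dependence (first grid points):", pd_result["average"][0][:5])
```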
Journal Article