Catalogue Search | MBRL
Explore the vast range of titles available.
41,514 result(s) for "decision tree"
Fuzzy Hoeffding Decision Tree for Data Stream Classification
by Marcelloni, Francesco; Ducange, Pietro; Pecori, Riccardo
in Fuzzy decision tree; Hoeffding decision tree; Model interpretability
2021
Data stream mining has recently grown in popularity, thanks to an increasing number of applications which need continuous and fast analysis of streaming data. Such data are generally produced in application domains that require immediate reactions with strict temporal constraints. These characteristics make the use of classical machine learning algorithms for mining knowledge from fast data streams problematic and call for appropriate techniques. In this paper, based on the well-known Hoeffding Decision Tree (HDT) for streaming data classification, we introduce FHDT, a fuzzy HDT that extends HDT with fuzziness, thus making HDT more robust to noisy and vague data. We tested FHDT on three synthetic datasets, usually adopted for analyzing concept drift in data stream classification, and two real-world datasets, already exploited in recent research on fuzzy systems for streaming data. We show that FHDT outperforms HDT, especially in the presence of concept drift. Furthermore, FHDT is characterized by a high level of interpretability, thanks to the linguistic rules that can be extracted from it.
Journal Article
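The split test at the heart of the Hoeffding Decision Tree family discussed above can be summarized in a few lines of Python. The sketch below is a minimal, hypothetical illustration of the Hoeffding-bound split decision, not the authors' FHDT implementation: after n examples, the tree splits on the best attribute only when its observed gain advantage over the runner-up exceeds the bound epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    # With probability 1 - delta, the true mean of a variable with range
    # `value_range` lies within this epsilon of the mean seen over n samples.
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain: float, second_best_gain: float, n: int,
                 value_range: float = 1.0, delta: float = 1e-7,
                 tie_threshold: float = 0.05) -> bool:
    # Split when the best attribute's advantage exceeds the Hoeffding bound,
    # or when the bound is so small that the choice no longer matters (tie).
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_best_gain > eps) or (eps < tie_threshold)

# Example: gains of 0.32 vs 0.25 after 5,000 stream examples -> True,
# because the 0.07 advantage exceeds the bound (about 0.04).
print(should_split(0.32, 0.25, n=5000))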
CatBoost for big data: an interdisciplinary review
2020
Gradient Boosted Decision Trees (GBDTs) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDTs in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and to learn best practices from studies that cast CatBoost in a positive light, as well as studies where CatBoost does not outshine other techniques, since lessons can be drawn from both types of scenarios. Furthermore, as a decision-tree-based algorithm, CatBoost is well-suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost's effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in the literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers with an in-depth understanding to help clarify proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.
Journal Article
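As a companion to the review above, here is a minimal sketch of how CatBoost is typically applied to tabular data with categorical features, assuming the catboost and scikit-learn packages are installed; the CSV path and column names are hypothetical. The hyper-parameters shown (iterations, depth, learning_rate) are among those the review flags as worth tuning.

import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("records.csv")                      # hypothetical dataset
X, y = df.drop(columns=["label"]), df["label"]
cat_cols = X.select_dtypes(include="object").columns.tolist()

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# CatBoost handles categorical columns natively via cat_features, which is one
# reason it suits heterogeneous tabular data.
model = CatBoostClassifier(iterations=500, depth=6, learning_rate=0.05,
                           cat_features=cat_cols, verbose=0)
model.fit(X_tr, y_tr, eval_set=(X_te, y_te))
print("test accuracy:", model.score(X_te, y_te))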
A survival guide to Landsat preprocessing
by Chignell, Stephen M.; Vorster, Anthony G.; Evangelista, Paul H.
in Atmospheric correction; change detection; CONCEPTS & SYNTHESIS: EMPHASIZING NEW IDEAS TO STIMULATE RESEARCH IN ECOLOGY
2017
Landsat data are increasingly used for ecological monitoring and research. These data often require preprocessing prior to analysis to account for sensor, solar, atmospheric, and topographic effects. However, ecologists using these data are faced with a literature containing inconsistent terminology, outdated methods, and a vast number of approaches with contradictory recommendations. These issues can, at best, make determining the correct preprocessing workflow a difficult and time-consuming task and, at worst, lead to erroneous results. We address these problems by providing a concise overview of the Landsat missions and sensors and by clarifying frequently conflated terms and methods. Preprocessing steps commonly applied to Landsat data are differentiated and explained, including georeferencing and co-registration, conversion to radiance, solar correction, atmospheric correction, topographic correction, and relative correction. We then synthesize this information by presenting workflows and a decision tree for determining the appropriate level of imagery preprocessing given an ecological research question, while emphasizing the need to tailor each workflow to the study site and question at hand. We recommend a parsimonious approach to Landsat preprocessing that avoids unnecessary steps and recommend approaches and data products that are well tested, easily available, and sufficiently documented. Our focus is specific to ecological applications of Landsat data, yet many of the concepts and recommendations discussed are also appropriate for other disciplines and remote sensing platforms.
Journal Article
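The preprocessing decision tree described above can be pictured as a short rule-based chooser: given a few properties of the study, it returns the steps to apply. The Python sketch below is a loose, hypothetical paraphrase of that idea under assumed rules of thumb, not the published workflow; consult the article's figures for the actual guidance.

def landsat_preprocessing_steps(multi_date: bool, cross_sensor: bool,
                                rugged_terrain: bool,
                                needs_surface_reflectance: bool) -> list:
    # Hypothetical rules of thumb in the spirit of the paper's decision tree.
    steps = ["georeferencing / co-registration", "conversion to radiance"]
    if needs_surface_reflectance or multi_date:
        steps += ["solar correction", "atmospheric correction"]
    if rugged_terrain:
        steps.append("topographic correction")
    if multi_date and not needs_surface_reflectance:
        steps.append("relative (image-to-image) correction")
    if cross_sensor:
        steps.append("cross-sensor harmonization")
    return steps

print(landsat_preprocessing_steps(multi_date=True, cross_sensor=False,
                                  rugged_terrain=True,
                                  needs_surface_reflectance=True))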
Classification and Explanation for Intrusion Detection System Based on Ensemble Trees and SHAP Method
2022
In recent years, many methods for intrusion detection systems (IDS) have been designed and developed in the research community, which have achieved a perfect detection rate using IDS datasets. Deep neural networks (DNNs) are representative examples applied widely in IDS. However, DNN models are becoming increasingly complex in model architecture, with high hardware requirements for computing resources. In addition, it is difficult for humans to obtain explanations behind the decisions made by these DNN models using large IoT-based IDS datasets. Many proposed IDS methods have not been applied in practical deployments because of the lack of explanation given to cybersecurity experts to support them in optimizing their decisions according to the judgments of the IDS models. This paper aims to enhance the attack detection performance of IDS with big IoT-based IDS datasets as well as provide explanations of machine learning (ML) model predictions. The proposed ML-based IDS method is based on the ensemble trees approach, including decision tree (DT) and random forest (RF) classifiers, which do not require high computing resources for training models. Two big datasets are used for the experimental evaluation of the proposed method, NF-BoT-IoT-v2 and NF-ToN-IoT-v2 (new versions of the original BoT-IoT and ToN-IoT datasets), through the NetFlow-based feature set; the IoTDS20 dataset is also used for experiments. Furthermore, SHapley Additive exPlanations (SHAP), an eXplainable AI (XAI) methodology, is applied to explain and interpret the classification decisions of the DT and RF models; this is not only effective in interpreting the final decision of the ensemble tree approach but also supports cybersecurity experts in quickly optimizing and evaluating the correctness of their judgments based on the explanations of the results.
Journal Article
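The ensemble-tree plus SHAP pipeline described above can be sketched in a few lines, assuming the scikit-learn and shap packages; the random feature matrix below stands in for NetFlow-style IDS features and attack labels, so it is illustrative only.

import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))                 # placeholder flow features
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)   # placeholder attack labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# TreeExplainer attributes each prediction to individual input features,
# which is the explanation handed to the security analyst.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_te)
print("test accuracy:", rf.score(X_te, y_te))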
Diabetes prediction using machine learning and explainable AI techniques
2023
Globally, diabetes affects 537 million people, making it the deadliest and the most common non-communicable disease. Many factors can cause a person to develop diabetes, such as excessive body weight, abnormal cholesterol levels, family history, physical inactivity, and poor dietary habits. Increased urination is one of the most common symptoms of this disease. People who have had diabetes for a long time can develop several complications, such as heart disorders, kidney disease, nerve damage, and diabetic retinopathy. However, the risk can be reduced if diabetes is predicted early. In this paper, an automatic diabetes prediction system has been developed using a private dataset of female patients in Bangladesh and various machine learning techniques. The authors used the Pima Indian diabetes dataset and collected additional samples from 203 individuals from a local textile factory in Bangladesh. The mutual information feature selection algorithm has been applied in this work. A semi-supervised model with extreme gradient boosting has been utilized to predict the insulin feature of the private dataset. SMOTE and ADASYN approaches have been employed to manage the class imbalance problem. The authors used machine learning classification methods, that is, decision tree, SVM, random forest, logistic regression, KNN, and various ensemble techniques, to determine which algorithm produces the best prediction results. After training and testing all the classification models, the proposed system provided the best result with the XGBoost classifier and the ADASYN approach: 81% accuracy, an F1 score of 0.81, and an AUC of 0.84. Furthermore, the domain adaptation method has been implemented to demonstrate the versatility of the proposed system. The explainable AI approach with the LIME and SHAP frameworks is implemented to understand how the model predicts the final results. Finally, a website framework and an Android smartphone application have been developed to input various features and predict diabetes instantaneously. The private dataset of female Bangladeshi patients and programming codes are available at the following link: https://github.com/tansin-nabil/Diabetes-Prediction-Using-Machine-Learning. The novelty of this work is the implementation of an automatic diabetes prediction website and Android application for a private dataset of female Bangladeshi patients using machine learning and ensemble techniques.
Journal Article
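The class-balancing and boosting step that gives the best result above can be sketched as follows, assuming the xgboost, imbalanced-learn, and scikit-learn packages; a synthetic imbalanced dataset stands in for the private Bangladeshi patient data.

from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Imbalanced placeholder data (roughly 80/20 class split).
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# ADASYN synthesizes extra minority-class samples in the training set only.
X_bal, y_bal = ADASYN(random_state=0).fit_resample(X_tr, y_tr)

clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                    eval_metric="logloss")
clf.fit(X_bal, y_bal)
pred = clf.predict(X_te)
print("F1:", f1_score(y_te, pred),
      "AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))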
Prediction of Compressive Strength of Fly Ash Based Concrete Using Individual and Ensemble Algorithm
2021
Machine learning techniques are widely used algorithms for predicting the mechanical properties of concrete. This study compares individual algorithms with ensemble approaches such as bagging. The bagging model is optimized by building 20 sub-models and selecting the most accurate one. Variables such as cement content, fine and coarse aggregate, water, binder-to-water ratio, fly ash, and superplasticizer are used for modeling. Model performance is evaluated with statistical indicators such as the mean absolute error (MAE), mean square error (MSE), and root mean square error (RMSE). The individual algorithms show moderately biased results, whereas the ensemble model gives a better result, with R2 = 0.911, compared to the decision tree (DT) and gene expression programming (GEP). K-fold cross-validation, evaluated with R2, MAE, MSE, and RMSE, confirms the model's accuracy. Statistical checks reveal that the decision tree with bagging provides 25%, 121%, and 49% improvement in the MAE, MSE, and RMSE errors, respectively, between the target and predicted responses.
Journal Article
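The bagging setup described above (20 decision-tree sub-models evaluated with R2, MAE, MSE, and RMSE) can be sketched with scikit-learn; the CSV path and column names below are hypothetical stand-ins for the mix-design variables listed in the abstract.

import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("fly_ash_concrete.csv")        # hypothetical mix-design table
X = df[["cement", "fine_agg", "coarse_agg", "water",
        "binder_water_ratio", "fly_ash", "superplasticizer"]]
y = df["compressive_strength"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Bagging averages 20 decision-tree sub-models trained on bootstrap samples.
model = BaggingRegressor(DecisionTreeRegressor(), n_estimators=20,
                         random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)
print("R2  :", r2_score(y_te, pred))
print("MAE :", mean_absolute_error(y_te, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_te, pred)))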
Pipeline Leakage Detection Using Acoustic Emission and Machine Learning Algorithms
by Kim, Jong-Myon; Ahmed, Zahoor; Ullah, Niamat
in acoustic emission; Acoustic emission testing; Acoustics
2023
Pipelines play a significant role in liquid and gas resource distribution. Pipeline leaks, however, result in severe consequences, such as wasted resources, risks to community health, distribution downtime, and economic loss. An efficient autonomous leakage detection system is clearly required. The recent leak diagnosis capability of acoustic emission (AE) technology has been well demonstrated. This article proposes a machine learning-based platform for leakage detection for various pinhole-sized leaks using the AE sensor channel information. Statistical measures, such as kurtosis, skewness, mean value, mean square, root mean square (RMS), peak value, standard deviation, entropy, and frequency spectrum features, were extracted from the AE signal as features to train the machine learning models. An adaptive threshold-based sliding window approach was used to retain the properties of both bursts and continuous-type emissions. First, we collected three AE sensor datasets and extracted 11 time domain and 14 frequency domain features for a one-second window for each AE sensor data category. The measurements and their associated statistics were transformed into feature vectors. Subsequently, these feature data were utilized for training and evaluating supervised machine learning models to detect leaks and pinhole-sized leaks. Several widely known classifiers, such as neural networks, decision trees, random forests, and k-nearest neighbors, were evaluated using the four datasets regarding water and gas leakages at different pressures and pinhole leak sizes. We achieved an exceptional overall classification accuracy of 99%, providing reliable and effective results that are suitable for the implementation of the proposed platform.
Journal Article
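The windowed feature extraction described above can be sketched with numpy and scipy; the synthetic signal and the assumed sampling rate below are placeholders for real acoustic emission recordings, and only a subset of the listed statistics is shown.

import numpy as np
from scipy.stats import entropy, kurtosis, skew

def time_domain_features(window: np.ndarray) -> dict:
    # A few of the statistical measures named in the abstract, for one window.
    hist, _ = np.histogram(window, bins=32, density=True)
    return {
        "mean": window.mean(),
        "rms": np.sqrt(np.mean(window ** 2)),
        "peak": np.max(np.abs(window)),
        "std": window.std(),
        "kurtosis": kurtosis(window),
        "skewness": skew(window),
        "entropy": entropy(hist + 1e-12),
    }

fs = 10_000                                              # assumed sampling rate (Hz)
signal = np.random.default_rng(0).normal(size=fs * 5)    # 5 s of placeholder AE data
windows = signal.reshape(-1, fs)                         # one-second, non-overlapping windows
features = [time_domain_features(w) for w in windows]
print(features[0])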
Comparative Study of Supervised Machine Learning Algorithms for Predicting the Compressive Strength of Concrete at High Temperature
by Farooq, Furqan; Mehmood, Imran; Maślak, Mariusz
in Accuracy; Algorithms; Artificial neural networks
2021
High temperature severely affects the nature of the ingredients used to produce concrete, which in turn reduces the strength properties of the concrete. Achieving the desired compressive strength of concrete is a difficult and time-consuming task. However, the application of supervised machine learning (ML) approaches makes it possible to predict the targeted result in advance with high accuracy. This study presents the use of a decision tree (DT), an artificial neural network (ANN), bagging, and gradient boosting (GB) to forecast the compressive strength of concrete at high temperatures on the basis of 207 data points. Python coding in the Anaconda Navigator software was used to run the selected models. The software requires information regarding both the input variables and the output parameter. A total of nine input parameters (water, cement, coarse aggregate, fine aggregate, fly ash, superplasticizers, silica fume, nano silica, and temperature) were incorporated as the input, while one variable (compressive strength) was selected as the output. The performance of the employed ML algorithms was evaluated with statistical indicators, including the correlation coefficient (R2), mean absolute error (MAE), mean square error (MSE), and root mean square error (RMSE). Individual models using DT and ANN gave R2 values of 0.83 and 0.82, respectively, while the bagging ensemble and gradient boosting algorithms gave R2 values of 0.90 and 0.88, respectively. This indicates a strong correlation between the actual and predicted outcomes. The k-fold cross-validation, correlation coefficient (R2), and lower errors (MAE, MSE, and RMSE) confirmed the better performance of the ensemble algorithms. Sensitivity analyses were also conducted in order to check the contribution of each input variable. It has been shown that the use of ensemble machine learning algorithms enhances the performance level of the model.
Journal Article
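The four-model comparison described above can be sketched with scikit-learn, which stands in for the authors' Python/Anaconda setup; the dataset path and column name are hypothetical, and cross-validated R2 mirrors the study's statistical indicators.

import pandas as pd
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("concrete_high_temperature.csv")   # hypothetical 207-point table
X = df.drop(columns=["compressive_strength"])
y = df["compressive_strength"]

models = {
    "DT": DecisionTreeRegressor(random_state=0),
    "ANN": MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0),
    "Bagging": BaggingRegressor(DecisionTreeRegressor(), n_estimators=20, random_state=0),
    "GB": GradientBoostingRegressor(random_state=0),
}

# 5-fold cross-validated R2 for each of the four models.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R2 = {scores.mean():.3f}")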
Mapping of cropland, cropping patterns and crop types by combining optical remote sensing images with decision tree classifier and random forest
2023
Mapping and monitoring the distribution of croplands and crop types support policymakers and international organizations by reducing risks to food security, notably from climate change, and remote sensing is routinely used for that purpose. However, identifying specific crop types, cropland, and cropping patterns using space-based observations is challenging because different crop types and cropping patterns have similar spectral signatures. This study applied a methodology to identify cropland and specific crop types, including tobacco, wheat, barley, and gram, as well as the following cropping patterns: wheat-tobacco, wheat-gram, wheat-barley, and wheat-maize, which are common in Gujranwala District, Pakistan, the study region. The methodology consists of combining optical remote sensing images from Sentinel-2 and Landsat-8 with Machine Learning (ML) methods, namely a Decision Tree Classifier (DTC) and a Random Forest (RF) algorithm. The best time periods for differentiating cropland from other land cover types were identified, and Sentinel-2 and Landsat-8 NDVI-based time series were then linked to phenological parameters to determine the different crop types and cropping patterns over the study region using their temporal indices and the ML algorithms. The methodology was subsequently evaluated using Landsat images, crop statistical data for 2020 and 2021, and field data on cropping patterns. The results highlight the high accuracy of the presented methodological approach, using Sentinel-2 and Landsat-8 images together with ML techniques, for mapping not only the distribution of cropland but also crop types and cropping patterns, when validated at the county level. These results show that this methodology has benefits for monitoring and evaluating food security in Pakistan, adding to the evidence base of other studies on the use of remote sensing to identify crop types and cropping patterns in other countries.
Journal Article
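The NDVI time-series classification idea described above can be sketched as follows; the numpy arrays stand in for stacked Sentinel-2 / Landsat-8 red and near-infrared bands, and the crop-type labels are random placeholders rather than real field data.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    # Normalized Difference Vegetation Index, with a small constant for safety.
    return (nir - red) / (nir + red + 1e-9)

rng = np.random.default_rng(0)
n_pixels, n_dates = 5000, 12                    # one NDVI value per acquisition date
nir = rng.uniform(0.2, 0.6, size=(n_pixels, n_dates))
red = rng.uniform(0.05, 0.3, size=(n_pixels, n_dates))
X = ndvi(nir, red)                              # per-pixel NDVI time series as features
y = rng.integers(0, 4, size=n_pixels)           # placeholder crop-type / pattern labels

dtc = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X, y)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("DTC training accuracy:", dtc.score(X, y))
print("RF training accuracy: ", rf.score(X, y))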
Decision tree classifiers for automated medical diagnosis
by Azar, Ahmad Taher; El-Metwally, Shereen M.
in Applied sciences; Artificial Intelligence; Biological and medical sciences
2013
Decision support systems help physicians and also play an important role in medical decision-making. They are based on different models, and the best of them provide an explanation together with an accurate, reliable and quick response. This paper presents a decision support tool for the detection of breast cancer based on three types of decision tree classifiers: the single decision tree (SDT), boosted decision tree (BDT) and decision tree forest (DTF). Decision tree classification provides a rapid and effective method of categorizing data sets. Decision-making is performed in two stages: training the classifiers with features from the Wisconsin breast cancer data set, and then testing. The performance of the proposed structure is evaluated in terms of accuracy, sensitivity, specificity, confusion matrix and receiver operating characteristic (ROC) curves. The results showed that the overall accuracies of SDT and BDT in the training phase reached 97.07 %, with 429 correct classifications, and 98.83 %, with 437 correct classifications, respectively. BDT performed better than SDT across all performance indices. The ROC and Matthews correlation coefficient (MCC) values for BDT in the training phase reached 0.99971 and 0.9746, respectively, which was superior to the SDT classifier. During the validation phase, DTF achieved 97.51 %, which was superior to the SDT (95.75 %) and BDT (97.07 %) classifiers. The ROC and MCC values for DTF reached 0.99382 and 0.9462, respectively. BDT showed the best performance in terms of sensitivity, and SDT was the best only in terms of speed.
Journal Article
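A comparable three-classifier experiment can be sketched on the Wisconsin breast cancer data that ships with scikit-learn; note that AdaBoost and RandomForest below merely stand in for the paper's boosted decision tree (BDT) and decision tree forest (DTF), which were built with different software.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "single decision tree (SDT)": DecisionTreeClassifier(random_state=0),
    "boosted decision tree (BDT)": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=200, random_state=0),
    "decision tree forest (DTF)": RandomForestClassifier(n_estimators=200,
                                                         random_state=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: test accuracy = {model.score(X_te, y_te):.3f}")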