Catalogue Search | MBRL

A comparative analysis of gradient boosting algorithms

in Accuracy, Algorithms, Comparative analysis

2021

The family of gradient boosting algorithms has recently been extended with several interesting proposals (i.e. XGBoost, LightGBM and CatBoost) that focus on both speed and accuracy. XGBoost is a scalable ensemble technique that has proven to be a reliable and efficient solver of machine learning challenges. LightGBM is an accurate model focused on providing extremely fast training performance using selective sampling of high gradient instances. CatBoost modifies the computation of gradients to avoid the prediction shift in order to improve the accuracy of the model. This work proposes a practical analysis of how these novel variants of gradient boosting work in terms of training speed, generalization performance and hyper-parameter setup. In addition, a comprehensive comparison between XGBoost, LightGBM, CatBoost, random forests and gradient boosting has been performed using carefully tuned models as well as using their default settings. The results of this comparison indicate that CatBoost obtains the best results in generalization accuracy and AUC in the studied datasets, although the differences are small. LightGBM is the fastest of all methods but not the most accurate. XGBoost places second both in accuracy and in training speed. Finally, an extensive analysis of the effect of hyper-parameter tuning in XGBoost, LightGBM and CatBoost is carried out using two novel proposed tools.

Journal Article

Combination of Feature Selection and CatBoost for Prediction: The First Application to the Estimation of Aboveground Biomass

by Qiao, Jingjing; Zhou, Lai; Sun, Yujun in aboveground biomass, algorithms, coniferous forests

2021

Increasing numbers of explanatory variables tend to result in information redundancy and “dimensional disaster” in the quantitative remote sensing of forest aboveground biomass (AGB). Feature selection of model factors is an effective method for improving the accuracy of AGB estimates. Machine learning algorithms are also widely used in AGB estimation, although little research has addressed the use of the categorical boosting algorithm (CatBoost) for AGB estimation. Both feature selection and regression for AGB estimation models are typically performed with the same machine learning algorithm, but there is no evidence to suggest that this is the best method. Therefore, the present study focuses on evaluating the performance of the CatBoost algorithm for AGB estimation and comparing the performance of different combinations of feature selection methods and machine learning algorithms. AGB estimation models of four forest types were developed based on Landsat OLI data using three feature selection methods (recursive feature elimination (RFE), variable selection using random forests (VSURF), and least absolute shrinkage and selection operator (LASSO)) and three machine learning algorithms (random forest regression (RFR), extreme gradient boosting (XGBoost), and categorical boosting (CatBoost)). Feature selection had a significant influence on AGB estimation. RFE preserved the most informative features for AGB estimation and was superior to VSURF and LASSO. In addition, CatBoost improved the accuracy of the AGB estimation models compared with RFR and XGBoost. AGB estimation models using RFE for feature selection and CatBoost as the regression algorithm achieved the highest accuracy, with root mean square errors (RMSEs) of 26.54 Mg/ha for coniferous forest, 24.67 Mg/ha for broad-leaved forest, 22.62 Mg/ha for mixed forests, and 25.77 Mg/ha for all forests. 
The combination of RFE and CatBoost had better performance than the VSURF–RFR combination in which random forests were used for both feature selection and regression, indicating that feature selection and regression performed by a single machine learning algorithm may not always ensure optimal AGB estimation. Extending the application of new machine learning algorithms and feature selection methods is a promising way to improve the accuracy of AGB estimates.

Journal Article

Forecasting gold price with the XGBoost algorithm and SHAP interaction values

by Jabeur, Sami Ben; Mefteh-Wali, Salma; Viviani, Jean-Laurent in Algorithms, Artificial intelligence, Commodity prices

2024

Financial institutions, investors, mining companies and related firms need an effective and accurate forecasting model to examine gold price fluctuations in order to make correct decisions. This paper proposes an innovative approach to accurately forecast gold price movements and to interpret predictions. First, it compares six machine learning models. These models include two very recent methods: eXtreme Gradient Boosting (XGBoost) and CatBoost. The empirical findings indicate the superiority of XGBoost over other advanced machine learning models. Second, it proposes Shapley additive explanations (SHAP) in order to help policy makers interpret the predictions of complex machine learning models and examine the importance of various features that affect gold prices. Our results illustrate that using XGBoost along with the SHAP approach could provide a significant boost in gold price forecasting performance.

Journal Article

Design of an English oral dialogue generation and interaction system assisted by machine learning

by Liu, Zhihui; Tian, Jiangli in Accuracy, Algorithms, Artificial Intelligence

2026

Oral English proficiency is crucial for effective communication; however, most existing learning systems lack adaptability, interactivity, and personalized feedback. Conventional platforms primarily employ rule-based dialogue structures, which limits their ability to dynamically respond to diverse learner needs and varying speech patterns. To address this constraint, an intelligent English oral dialogue generation and interaction system is established using machine learning (ML). The system incorporates an Invasive Weed Optimized Intelligent CatBoost (IWO-tuned CatBoost) model, which enhances decision-making in dialogue response generation, vocabulary recommendation, and feedback adaptation. A synthetic dataset including annotated learner dialogues with grammatical and phonetic error labels supports model training. Preprocessing involves Automatic Speech Recognition (ASR)-based voice-to-text conversion, noise reduction, lemmatization, and stop word elimination. Feature extraction utilizes Mel-Frequency Cepstral Coefficients (MFCCs) for phonetic representation and Natural Language Processing (NLP)-based syntactic parsing through NLTK and spaCy for textual structure analysis. The IWO algorithm optimizes CatBoost hyperparameters to improve classification accuracy and system adaptability across diverse learner profiles. Model training is conducted in a simulated dialogue environment, enabling progressive refinement of response logic and interaction quality. The system is implemented using Python, and evaluation results indicate high performance across multiple metrics, including a BLEU score (0.82), ROUGE-L (0.79), METEOR (0.76), Engagement Score (0.84), and User Satisfaction Index (0.83), alongside a WER (0.058) and a SER (0.094). This approach demonstrates a robust and scalable framework for delivering personalized, interactive, and efficient oral English learning experiences.

Journal Article

Applications of machine learning in predicting rut depth in off-road environments

by Mardani, Aref; Farhadi, Nashmil; Golanbari, Behzad in 639/166/988, 639/705/1042, Algorithms

2025

The rut depth created by off-road vehicles affects vehicle performance and soil compaction, and its accurate prediction is essential to improve vehicle performance and reduce soil compaction. Due to the complex and nonlinear interactions between variables and rut depth, the error in estimating rut depth with conventional methods is significant. Therefore, the present study aims to predict the rut depth created by off-road vehicles in soil using the Categorical Boosting (CatBoost) machine learning algorithm and combining it with three optimization methods, the Gray Wolf Optimization (GWO) algorithm, Particle Swarm Optimization (PSO), and the Secretary Bird Optimization Algorithm (SBOA). The experimental data included 270 samples with vertical load variables (2, 3, and 4 kN), movement speed (1, 2, and 3 km/h), two traction devices (pneumatic tire and tracked wheel), and the number of passes (15 levels), which were collected under indoor conditions using a soil bin equipped with a single-wheel tester. The model hyperparameters were adjusted using the GWO and SBOA algorithms to increase the prediction accuracy and reduce the model error. The results showed that the SBOA-CatBoost hybrid model, with a Root Mean Square Error of 0.35 mm and a coefficient of determination of 0.97707, performed better than the other models. Furthermore, the SBOA-CatBoost hybrid model outperformed the other models with a Mean Absolute Percentage Error of 1.2%.

Journal Article

Predicting biomarkers from classifier for liver metastasis of colorectal adenocarcinomas using machine learning models

by Xi, Yang; Jing, Zhuang; Shuwen, Han in Adenocarcinoma, Biomarkers, CatBoost algorithm

2020

Background: Early diagnosis of liver metastasis is of great importance for enhancing the survival of colorectal adenocarcinoma (CAD) patients, and the combined use of a single biomarker in a classifier model has shown great improvement in predicting the metastasis of several types of cancers. However, this has rarely been reported for CAD. This study therefore aimed to screen an optimal classifier model of CAD with liver metastasis and explore the metastatic mechanisms of genes when applying this classifier model. Methods: The differentially expressed genes between primary CAD samples and CAD with metastasis samples were screened from the Moffitt Cancer Center (MCC) dataset GSE131418. The classification performances of six selected algorithms, namely, LR, RF, SVM, GBDT, NN, and CatBoost, for classification of CAD with liver metastasis samples were compared using the MCC dataset GSE131418 by detecting their classification test accuracy. In addition, the consortium datasets of GSE131418 and GSE81558 were used as internal and external validation sets to screen the optimal method. Subsequently, functional analyses and a drug-targeted network construction of the feature genes when applying the optimal method were conducted. Results: The CatBoost model, which consisted of 33 feature genes, was identified as optimal, with the highest accuracy (99%) and an area under the curve of 1. A functional analysis showed that the feature genes were closely associated with a "steroid metabolic process" and "lipoprotein particle receptor binding" (e.g., APOB and APOC3). In addition, the feature genes were significantly enriched in the "complement and coagulation cascade" pathways (e.g., FGA, F2, and F9). In a drug-target interaction network, F2 and F9 were predicted as targets of menadione. Conclusion: The CatBoost model constructed using 33 feature genes showed the optimal classification performance for identifying CAD with liver metastasis. APOB, APOC3, FGA, F2, F9, and NKX2-3 were potential biomarkers for classification of CAD with liver metastasis. Menadione might be a promising anti-metastatic drug for CAD cells, acting at F2 and F9.

Journal Article

CatBoost for big data: an interdisciplinary review

by Hancock, John T.; Khoshgoftaar, Taghi M. in Algorithms, Best practice, Big Data

2020

Gradient Boosted Decision Trees (GBDTs) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDTs in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and learn best practices from studies that cast CatBoost in a positive light, as well as studies where CatBoost does not outshine other techniques, since we can learn lessons from both types of scenarios. Furthermore, as a Decision Tree based algorithm, CatBoost is well-suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost’s effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in the literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers with an in-depth understanding to help clarify proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.

Journal Article

Advanced Machine Learning Techniques to Improve Hydrological Prediction: A Comparative Analysis of Streamflow Prediction Models

by Kumar, Vijendra; Sharma, Kul Vaibhav; Mehta, Darshan J. in Accuracy, Aquatic resources, Artificial intelligence

2023

The management of water resources depends heavily on hydrological prediction, and advances in machine learning (ML) present prospects for improving predictive modelling capabilities. This study investigates the use of a variety of widely used machine learning algorithms, such as CatBoost, ElasticNet, k-Nearest Neighbors (KNN), Lasso, Light Gradient Boosting Machine Regressor (LGBM), Linear Regression (LR), Multilayer Perceptron (MLP), Random Forest (RF), Ridge, Stochastic Gradient Descent (SGD), and the Extreme Gradient Boosting Regression Model (XGBoost), to predict the river inflow of the Garudeshwar watershed, a key element in planning for flood control and water supply. The study is distinguished by its substantial feature engineering, which incorporates temporal lags and contextual data based on Indian seasons. The study concludes that the CatBoost method demonstrated remarkable performance across various metrics, including Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R-squared (R2) values, for both training and testing datasets. This was accomplished by an in-depth investigation and model comparison. In contrast to CatBoost, XGBoost and LGBM demonstrated a higher percentage of data points with prediction errors exceeding 35% for moderate inflow numbers above 10,000. CatBoost established itself as a reliable method for hydrological time-series modelling, easily managing both categorical and continuous variables, and thereby greatly enhancing prediction accuracy. The results of this study highlight the value and promise of widely used machine learning algorithms in hydrology and offer valuable insights for academics and industry professionals.

Journal Article

Machine Learning Techniques to Predict the Air Quality Using Meteorological Data in Two Urban Areas in Sri Lanka

by Azamathulla, Hazi Md; Mampitiya, Lakindu; Rathnayake, Namal in Air pollution, Air quality, Air quality measurements

2023

The effect of bad air quality on human health is a well-known risk. Annual health costs have increased significantly in many countries due to adverse air quality. Therefore, forecasting air quality-measuring parameters in highly impacted areas is essential to enhance the quality of life. Although such forecasting is common in many countries, Sri Lanka lags far behind the state of the art. The country has increasingly reported adverse air quality levels with ongoing industrialization in urban areas. Therefore, this research study, for the first time, mainly focuses on forecasting the PM10 values of the air quality for two urbanized areas of Sri Lanka, Battaramulla (an urban area in Colombo) and Kandy. Twelve air quality parameters were used with five models, including extreme gradient boosting (XGBoost), CatBoost, light gradient-boosting machine (LightGBM), long short-term memory (LSTM), and gated recurrent unit (GRU), to forecast the PM10 levels. Several performance indices, including the coefficient of determination (R2), root mean squared error (RMSE), mean absolute error (MAE), mean squared error (MSE), mean absolute relative error (MARE), and the Nash–Sutcliffe efficiency (NSE), were used to test the forecasting models. It was identified that the LightGBM algorithm performed better in forecasting PM10 in Kandy (R2 = 0.99, MSE = 0.02, MAE = 0.002, RMSE = 0.1225, MARE = 1.0, and NSE = 0.99). In contrast, LightGBM achieved a higher performance (R2 = 0.99, MSE = 0.002, MAE = 0.012, RMSE = 1.051, MARE = 0.00, and NSE = 0.99) in forecasting PM10 for the Battaramulla region. As per the results, it can be concluded that there is a necessity to develop forecasting models for different land areas. Moreover, it was concluded that the PM10 in Kandy and Battaramulla increased slightly with existing seasonal changes.

Journal Article

Effects of non-landslide sampling strategies on machine learning models in landslide susceptibility mapping

by Wang, Mingguo; Duan, Ping; Zhang, Yanke in 639/705/258, 704/172/4081, 704/4111

2024

This study aims to explore the effects of different non-landslide sampling strategies on machine learning models in landslide susceptibility mapping. Non-landslide samples are inherently uncertain, and the selection of non-landslide samples may suffer from issues such as noisy or insufficient regional representations, which can affect the accuracy of the results. In this study, a positive-unlabeled (PU) bagging semi-supervised learning method was introduced for non-landslide sample selection. In addition, buffer control sampling (BCS) and K-means (KM) clustering were applied for comparative analysis. Based on landslide data from Qiaojia County, Yunnan Province, China, collected in 2014, three machine learning models, namely, random forest, support vector machine, and CatBoost, were used for landslide susceptibility mapping. The results show that the quality of samples selected using different non-landslide sampling strategies varies significantly. Overall, the quality of non-landslide samples selected using the PU bagging method is superior, and this method performs best when combined with CatBoost for predicting (AUC = 0.897) landslides in very high and high susceptibility zones (82.14%). Additionally, the KM results indicated overfitting, displaying high accuracy for validation but poor statistical outcomes for zoning. The BCS results were the worst.

Journal Article
