Catalogue Search | MBRL
60 results for "Extremely Randomized Trees"
Development of a Prediction Method of Cell Density in Autotrophic/Heterotrophic Microorganism Mixtures by Machine Learning Using Absorbance Spectrum Data
by Akihito Nakanishi, Hiroaki Fukunishi, Fumihito Eguchi
in Absorbance, Algorithms, Artificial intelligence
2022
Microflora is actively used to produce value-added materials in industry, and each cell density should be controlled for stable microflora use. In this study, a simple system for evaluating cell density was constructed with artificial intelligence (AI) using the absorbance spectra of microflora. To set up the system, a machine-learning predictor of cell density was built using spectral data from a mixture of Saccharomyces cerevisiae and Chlamydomonas reinhardtii as the features. When predicting cell density with extremely randomized trees, the coefficient of determination (R2) was 0.8495 when the cell density of S. cerevisiae was shifted and that of C. reinhardtii was fixed; conversely, when the density of S. cerevisiae was fixed and that of C. reinhardtii was shifted, the R2 was 0.9232. To explain the prediction system, the extremely randomized trees regressor, a decision-tree-based ensemble learning method, was used as the machine-learning algorithm, and Shapley additive explanations (SHAP) were used as the explainable AI (XAI) to interpret the features contributing to the predictions. The SHAP analyses showed that not only the optical density but also the absorbance of the Soret and Q bands, derived from the chloroplasts of C. reinhardtii, contributed to the prediction as features. This simple cell-density evaluation system could have an industrial impact.
Journal Article
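The extremely randomized trees regression described in this abstract can be sketched with scikit-learn's ExtraTreesRegressor. The spectra, band indices, and cell-density target below are synthetic stand-ins, not the paper's data:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic stand-in for absorbance spectra: 200 "mixtures", 50 wavelengths.
# Cell density is taken as a function of a few bands plus small noise.
X = rng.uniform(0.0, 2.0, size=(200, 50))
y = 3.0 * X[:, 10] + 1.5 * X[:, 25] + rng.normal(0.0, 0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Extremely randomized trees: split thresholds are drawn at random rather
# than optimized, which reduces variance relative to classic random forests.
model = ExtraTreesRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
```

The R2 reported on held-out data here plays the same role as the paper's 0.85–0.92 figures, though the value itself depends entirely on the synthetic signal.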
Flash Flood Susceptibility Modeling Using New Approaches of Hybrid and Ensemble Tree-Based Machine Learning Algorithms
by Saha, Asish; Melesse, Assefa M.; Chandra Pal, Subodh
in adverse effects, Algorithms, altitude
2020
Flash flooding is considered one of the most dynamic natural disasters, and measures are needed to minimize its economic damage and adverse consequences by mapping flood susceptibility. Identifying areas prone to flash flooding is a crucial step in flash flood hazard management. In the present study, the Kalvan watershed in Markazi Province, Iran, was chosen for flash flood susceptibility modeling. To detect flash flood-prone zones in this study area, five machine learning (ML) algorithms were tested: boosted regression tree (BRT), random forest (RF), parallel random forest (PRF), regularized random forest (RRF), and extremely randomized trees (ERT). Fifteen climatic and geo-environmental variables were used as inputs to the flash flood susceptibility models. The results showed that ERT was the most accurate model, with an area under the curve (AUC) value of 0.82; the AUC values of RRF, PRF, RF, and BRT were 0.80, 0.79, 0.78, and 0.75, respectively. In the ERT model, the areal coverage of the very high to moderate flash flood susceptibility classes was 582.56 km2 (28.33%), and the rest of the area fell into the very low to low susceptibility zones. Topographical and hydrological parameters, e.g., altitude, slope, rainfall, and distance from the river, were the most effective parameters. The results of this study will play a vital role in the planning and implementation of flood mitigation strategies in the region.
Journal Article
A machine learning algorithm to explore the drivers of carbon emissions in Chinese cities
2024
As China is the world's largest energy consumer and carbon emitter, the task of carbon emission reduction is urgent. To realize the dual-carbon goal at an early date, it is necessary to study the key factors affecting China's carbon emissions and their non-linear relationships. This paper compares the performance of six machine learning algorithms to that of traditional econometric models in predicting carbon emissions in China from 2011 to 2020, using panel data from 254 Chinese cities. Specifically, it analyzes the comparative importance of domestic economic, external economic, and policy uncertainty factors, as well as the nonparametric relationship between these factors and carbon emissions, based on the Extra-trees model. Results show that energy consumption (ENC) remains the root cause of increased carbon emissions among domestic economic factors, although government intervention (GOV) and digital finance (DIG) can significantly reduce emissions. Among the external economic and policy uncertainty factors, foreign direct investment (FDI) and economic policy uncertainty (EPU) are important influences on carbon emissions, and the partial dependence plots (PDPs) confirm the pollution haven hypothesis and also reveal the role of EPU in reducing carbon emissions. The heterogeneity of the factors affecting carbon emissions is also analyzed across city sizes; ENC is a common driving factor in cities of all sizes, but some differences remain. Finally, we propose policy recommendations to help China move rapidly towards a green and sustainable development path.
Journal Article
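The Extra-trees importance analysis described above can be illustrated with scikit-learn. The driver names (ENC, GOV, DIG, FDI, EPU) are taken from the abstract, but the panel data here is purely synthetic, with ENC deliberately constructed as the dominant driver:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(1)

# Hypothetical city-level drivers; names mirror the abstract, data is synthetic.
names = ["ENC", "GOV", "DIG", "FDI", "EPU", "noise"]
X = rng.normal(size=(500, 6))
# Make "ENC" the dominant contributor to the synthetic emissions target.
y = 4.0 * X[:, 0] + 1.0 * X[:, 3] + rng.normal(scale=0.5, size=500)

model = ExtraTreesRegressor(n_estimators=300, random_state=1).fit(X, y)
# feature_importances_ sums to 1; sorting it yields the driver ranking.
ranking = sorted(zip(names, model.feature_importances_),
                 key=lambda t: t[1], reverse=True)
```

On real panel data the same `feature_importances_` ranking, combined with partial dependence plots, is what supports conclusions like "ENC remains the root cause".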
Evaluating classifier performance with highly imbalanced Big Data
by Hancock, John T; Khoshgoftaar, Taghi M; Johnson, Justin M
in Big Data, Classification, Classifiers
2023
Using the wrong metrics to gauge the classification of highly imbalanced Big Data may hide important information in experimental results. However, analysis of performance evaluation metrics and what they can hide or reveal is rarely covered in related works. We address that gap by analyzing multiple popular performance metrics on three Big Data classification tasks. To the best of our knowledge, we are the first to utilize three new Medicare insurance claims datasets, which became publicly available in 2021. These datasets are all highly imbalanced and are composed of completely different data. We evaluate the performance of five ensemble learners on the machine learning task of Medicare fraud detection. Random Undersampling (RUS) is applied to induce five class ratios. The classifiers are evaluated with both the Area Under the Receiver Operating Characteristic Curve (AUC) and the Area Under the Precision-Recall Curve (AUPRC). We show that AUPRC provides better insight into classification performance: the AUC metric hides the performance impact of RUS, whereas results in terms of AUPRC show that RUS has a detrimental effect. For highly imbalanced Big Data, the AUC metric fails to capture information about precision scores and false positive counts that the AUPRC metric reveals. Our contribution is to show that AUPRC is a more effective metric for evaluating classifier performance when working with highly imbalanced Big Data.
Journal Article
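The abstract's central point, that AUPRC is a harsher summary than AUC on rare-positive data, is easy to reproduce on a toy imbalanced problem (synthetic data, not the Medicare datasets):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(2)

# Highly imbalanced toy problem: roughly 2% positives.
n = 5000
y = (rng.uniform(size=n) < 0.02).astype(int)
X = rng.normal(size=(n, 5)) + y[:, None] * 1.0  # partial class separation

clf = LogisticRegression(max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]

auc = roc_auc_score(y, scores)            # insensitive to class ratio
auprc = average_precision_score(y, scores)  # penalized by false positives
# With rare positives, AUPRC sits well below AUC because every false
# positive erodes precision, which AUC does not directly measure.
```

This is the mechanism behind the paper's finding: a procedure like RUS can leave AUC flat while visibly degrading AUPRC.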
A Cascade Ensemble Learning Model for Human Activity Recognition with Smartphones
by Pan, Zhigeng; Jin, Linpeng; Xu, Shoujiang
in Artificial intelligence, cascade ensemble learning model, Cellular telephones
2019
Human activity recognition (HAR) has attracted considerable attention in recent years due to high demand across different domains. In this paper, a novel HAR system based on a cascade ensemble learning (CELearning) model is proposed. Each layer of the proposed model comprises eXtreme Gradient Boosting (XGBoost), Random Forest, Extremely Randomized Trees (ExtraTrees), and Softmax Regression, and the model goes deeper layer by layer. The initial input vectors, sampled from the smartphone accelerometer and gyroscope sensors, are trained separately by the four classifiers in the first layer, yielding probability vectors that represent the classes to which each sample may belong. The initial input data and the probability vectors are then concatenated together as input to the next layer's classifiers, and the final prediction is obtained from the classifiers of the last layer. The system achieved satisfactory classification accuracy on two public smartphone accelerometer and gyroscope HAR datasets. The experimental results show that the proposed approach achieves better classification accuracy for HAR than existing state-of-the-art methods, and the training process of the model is simple and efficient.
Journal Article
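A minimal sketch of the cascade idea: each layer's class-probability vectors are concatenated onto the raw features before training the next layer. To keep the example self-contained, XGBoost is replaced by logistic (softmax) regression, and the data is synthetic rather than smartphone sensor readings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=3, random_state=3)

def layer(X_in, y, seed):
    """Fit one cascade layer and return (features + probabilities, classifiers)."""
    clfs = [RandomForestClassifier(n_estimators=50, random_state=seed),
            ExtraTreesClassifier(n_estimators=50, random_state=seed),
            LogisticRegression(max_iter=1000)]
    probs = [c.fit(X_in, y).predict_proba(X_in) for c in clfs]
    return np.hstack([X_in] + probs), clfs

X1, _ = layer(X, y, seed=3)        # layer 1: raw features in, probabilities appended
_, layer2 = layer(X1, y, seed=4)   # layer 2: trains on the augmented input
# Final prediction: average the last layer's probability vectors, then argmax.
final = np.mean([c.predict_proba(X1) for c in layer2], axis=0).argmax(axis=1)
```

The real system goes "deeper layer by layer" by repeating this augmentation until accuracy stops improving; the two-layer version above only shows the wiring.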
Short-Term Solar Irradiance Forecasting Using Random Forest-Based Models with a Focus on Mountain Locations
by Paulescu, Eugenia; Paulescu, Marius; Velimirovici, Lucas
in Accuracy, Artificial intelligence, Decision trees
2026
Photovoltaic (PV) power forecasting has become a key tool for the intelligent management of electrical grids. Since the largest source of error in PV power forecasting originates from uncertainties in solar irradiance prediction, improving the accuracy of solar irradiance forecasts has emerged as an active research topic. This study evaluates multiple random tree-based model versions using a challenging dataset collected at globally distributed stations, spanning elevations from sea level to nearly 4000 m and covering a wide range of climate classes. The originality of the study lies in the synergistic contribution of two elements: the innovative inclusion of diffuse irradiance among the predictors and a comparative analysis of forecast quality across lowland and mountainous locations. In such environments, accurate solar resource forecasting is particularly important for the intelligent management of stand-alone PV systems deployed at high altitudes and in remote, off-grid areas. Overall, the results identify Extremely Randomized Trees (XTRc) as the best-performing model. XTRc achieves Skill Scores ranging from 0.087 to 0.298 across individual stations. The model accuracy remains high even at mountain stations, provided that sky-condition variability is low.
Journal Article
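The Skill Score used to rank the irradiance models is conventionally defined against a reference forecast such as persistence, SS = 1 - RMSE(model)/RMSE(reference). A minimal sketch assuming that standard definition (the paper's exact reference model is not stated in the abstract, and the toy irradiance values below are illustrative):

```python
import numpy as np

def skill_score(y_true, y_model, y_reference):
    """Skill Score: 1 - RMSE(model)/RMSE(reference). 0 means no better than
    the reference forecast; 1 means a perfect forecast."""
    rmse = lambda a, b: np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2))
    return 1.0 - rmse(y_true, y_model) / rmse(y_true, y_reference)

# Toy hourly irradiance (W/m^2); persistence repeats the previous observation.
obs = np.array([100., 300., 500., 650., 600., 400.])
persistence = np.array([80., 100., 300., 500., 650., 600.])
forecast = np.array([120., 280., 520., 630., 610., 420.])
ss = skill_score(obs, forecast, persistence)
```

The paper's reported range of 0.087 to 0.298 per station would be values of this SS quantity, with higher meaning a larger improvement over the reference.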
RELoc: An Enhanced 3D WiFi Fingerprinting Indoor Localization Algorithm with RFECV Feature Selection
2026
The use of Artificial Intelligence (AI) algorithms has enhanced WiFi fingerprinting-based indoor localization. However, most existing approaches are limited to 2D coordinate estimation, which leads to significant performance declines in multi-floor environments due to vertical ambiguity and inadequate spatial modeling. This limitation reduces reliability in real-world applications where accurate indoor localization is essential. This study proposes RELoc, a new 3D indoor localization framework that integrates Recursive Feature Elimination with Cross-Validation (RFECV) for optimal Access Point (AP) selection and Extremely Randomized Trees (ERT) for precise 2D and 3D coordinate regression. The ERT hyperparameters are optimized using Bayesian optimization with Optuna’s Tree-structured Parzen Estimator (TPE) to ensure robust, stable, and accurate localization. Extensive evaluation on the SODIndoorLoc and UTSIndoorLoc datasets demonstrates that RELoc delivers superior performance in both 2D and 3D indoor localization. Specifically, RELoc achieves Mean Absolute Errors (MAEs) of 1.84 m and 4.39 m for 2D coordinate prediction on SODIndoorLoc and UTSIndoorLoc, respectively. When floor information is incorporated, RELoc improves by 33.15% and 26.88% over the 2D version on these datasets. Furthermore, RELoc outperforms state-of-the-art methods by 7.52% over Graph Neural Network (GNN) and 12.77% over Deep Neural Network (DNN) on SODIndoorLoc and 40.22% over Extra Tree (ET) on UTSIndoorLoc, showing consistent improvements across various indoor environments. This enhancement emphasizes the critical role of 3D modeling in achieving robust and spatially discriminative indoor localization.
Journal Article
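The RFECV-plus-ERT pipeline at the core of RELoc can be sketched with scikit-learn; the features below are a synthetic stand-in for per-access-point RSSI fingerprints, not the SODIndoorLoc or UTSIndoorLoc data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import RFECV

# Stand-in for WiFi fingerprints: many access points, few informative ones.
X, y = make_regression(n_samples=300, n_features=30, n_informative=5,
                       noise=0.5, random_state=5)

# RFECV recursively drops the weakest features (by tree importance),
# keeping the feature count that maximizes the cross-validated score.
selector = RFECV(ExtraTreesRegressor(n_estimators=50, random_state=5),
                 step=2, cv=3)
selector.fit(X, y)
kept = int(selector.n_features_)  # number of "access points" retained
```

In RELoc this selection feeds an ERT regressor per coordinate (with Bayesian hyperparameter search on top); the sketch shows only the feature-selection stage.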
Comparison and improvement of the predictability and interpretability with ensemble learning models in QSPR applications
2020
Ensemble learning improves machine learning results by combining several models, producing better predictive performance than a single model. It also benefits and accelerates research in quantitative structure–activity relationships (QSAR) and quantitative structure–property relationships (QSPR). With the growing number of ensemble learning models, such as random forest, the effectiveness of QSAR/QSPR will be limited by the machine's inability to interpret the predictions for researchers. In fact, many implementations of ensemble learning models can quantify the overall magnitude of each feature: feature importance allows us to assess the relative importance of features and to interpret the predictions. However, different ensemble learning methods or implementations may lead to different feature selections for interpretation. In this paper, we compared the predictability and interpretability of four well-established ensemble learning models (random forest, extremely randomized trees, adaptive boosting, and gradient boosting) on regression and binary classification modeling tasks. Blending models were then built by combining the four ensemble learning methods. Blending led to better performance and a unified interpretation by summarizing the individual predictions from the different learning models. The important features of two case studies, which provide valuable information about compound properties, are discussed in detail. QSPR modeling with interpretable machine learning techniques can move chemical design forward, helping researchers work more efficiently, confirm hypotheses, and establish knowledge for better results.
Journal Article
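The blending of the four ensemble learners can be approximated in its simplest form, averaging the four models' predictions (one plausible reading; the authors' exact blending scheme is not detailed in the abstract, and the data below is synthetic rather than chemical descriptors):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              AdaBoostRegressor, GradientBoostingRegressor)
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=400, n_features=15, noise=5.0, random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=6)

# The four ensemble families named in the abstract.
models = [RandomForestRegressor(random_state=6),
          ExtraTreesRegressor(random_state=6),
          AdaBoostRegressor(random_state=6),
          GradientBoostingRegressor(random_state=6)]

preds = np.column_stack([m.fit(X_tr, y_tr).predict(X_te) for m in models])
blend = preds.mean(axis=1)  # simplest blend: average the four predictions
r2_each = [r2_score(y_te, preds[:, i]) for i in range(4)]
r2_blend = r2_score(y_te, blend)
```

By convexity the averaged prediction's squared error is at most the mean of the individual errors, so the blend can never do worse than the weakest member, which is the intuition behind the paper's improved performance.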
Combination of Hyperspectral and Machine Learning to Invert Soil Electrical Conductivity
by Zamanian, Kazem; Zhang, Junhua; Hu, Yi
in Agricultural production, Algorithms, Correlation coefficient
2022
An accurate estimation of soil electrical conductivity (EC) using hyperspectral techniques is of great significance for understanding the spatial distribution of solutes and soil salinization. Although spectral transformation has been widely used in data pre-processing, the performance of different pre-processing techniques (or their combinations) across different models of the same data set is still ambiguous. Moreover, extremely randomized trees (ERT) and light gradient boosting machine (LightGBM) models are new learning algorithms with good generalization performance (e.g., for soil moisture and above-ground biomass), but they are less studied for estimating soil salinity from visible and near-infrared spectra. In this study, 130 soil EC measurements, measured soil hyperspectral data, topographic factors, conventional salinity indices such as Salinity Index 1, and two-band (2D) salinity indices such as ratio indices were used. Five spectral pre-processing methods were applied: standard normal variate (SNV), standard normal variate with detrending (SNV-DT), inverse (1/OR, where OR is the original spectrum), inverse-log (log(1/OR)), and fractional order derivative (FOD; range 0–2, with intervals of 0.25). A gradient boosting machine (GBM) was used to select sensitive spectral parameters. Six models (extreme gradient boosting (XGBoost), LightGBM, random forest (RF), ERT, classification and regression tree (CART), and ridge regression (RR)) were used for soil EC inversion and model validation. The results reveal that the two-dimensional correlation coefficient highlighted EC more effectively than the one-dimensional one: under SNV and the second-order derivative, the two-dimensional correlation coefficient increased by 0.286 and 0.258, respectively, compared to the one-dimensional. The 13 characteristic factors of slope, NDI, SI-T, RI, profile curvature, DOA, plane curvature, SI (conventional), elevation, Int2, aspect, S1, and TWI provided 90% of the cumulative importance for EC using GBM.
Among the six machine learning models, the ERT model performed the best in simulation (R2 = 0.98) and validation (R2 = 0.96), and showed the best performance among the EC estimation models on the reference data. The kriging map based on the ERT simulation showed a close relationship with the measured data. Our study identified effective pre-processing methods (SNV and the second-order derivative) using one- and two-dimensional correlation, 13 important factors, and the ERT model for EC hyperspectral inversion. This provides theoretical support for the quantitative monitoring of soil salinization at larger scales using remote sensing techniques.
Journal Article
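Of the pre-processing methods listed, SNV is simple enough to sketch directly: each spectrum is centered and scaled by its own mean and standard deviation, which removes linear gain and offset (scatter) effects. The toy spectra below are illustrative:

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum row-wise,
    removing multiplicative scatter effects before modeling."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

# Two toy reflectance spectra differing only by a gain and an offset:
base = np.linspace(0.2, 0.8, 6)
spectra = np.vstack([base, 1.5 * base + 0.1])
corrected = snv(spectra)
# After SNV the two rows coincide, since the transform removes the
# linear gain and offset exactly.
```

The other transforms in the study (detrending, 1/OR, log(1/OR), fractional derivatives) are applied in the same row-wise fashion before feature selection.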
SDM6A: A Web-Based Integrative Machine-Learning Framework for Predicting 6mA Sites in the Rice Genome
by Basith, Shaherin; Shin, Tae Hwan; Manavalan, Balachandran
in Accuracy, Adenine, Computer applications
2019
DNA N6-adenine methylation (6mA) is an epigenetic modification in prokaryotes and eukaryotes. Identifying 6mA sites in the rice genome is important for rice epigenetics and breeding, but the non-random distribution and biological functions of these sites remain unclear. Several machine-learning tools can identify 6mA sites but show limited prediction accuracy, which limits their usability in epigenetic research. Here, we developed a novel computational predictor, the Sequence-based DNA N6-methyladenine predictor (SDM6A), a two-layer ensemble approach for identifying 6mA sites in the rice genome. Unlike existing methods, which are based on single models with basic features, SDM6A explores various features, and five encoding methods were identified as appropriate for this problem. An optimal feature set was then identified from the encodings, and the corresponding models were developed individually using a support vector machine and extremely randomized trees. First, all five single models were integrated via an ensemble approach to define the class for each classifier. Second, the two classifiers were integrated to generate the final prediction. SDM6A achieved robust performance in cross-validation and independent evaluation, with an average accuracy and Matthews correlation coefficient (MCC) of 88.2% and 0.764, respectively; these metrics were 4.7%–11.0% and 2.3%–5.5% higher, respectively, than those of existing methods. A user-friendly, publicly accessible web server (http://thegleelab.org/SDM6A) was implemented to predict novel putative 6mA sites in the rice genome.
Journal Article
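SDM6A's final integration of a support vector machine with extremely randomized trees can be approximated by soft voting over class probabilities. This is an illustrative stand-in for the paper's two-layer scheme, run on synthetic data rather than encoded rice sequences:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, VotingClassifier
from sklearn.svm import SVC

# Toy binary task standing in for 6mA vs non-6mA sequence encodings.
X, y = make_classification(n_samples=300, n_features=20, random_state=7)

# Soft voting averages the two models' class probabilities, echoing the
# integration of SVM and extremely randomized trees in the final layer.
vote = VotingClassifier(
    estimators=[("svm", SVC(probability=True, random_state=7)),
                ("ert", ExtraTreesClassifier(n_estimators=100, random_state=7))],
    voting="soft")
vote.fit(X, y)
acc = vote.score(X, y)
```

The full predictor differs in that each of the two base learners is itself an ensemble over five feature encodings; the voting step above corresponds only to the second-layer combination.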