Catalogue Search | MBRL
74 result(s) for "under-sampling"
Business Anomaly Detection Method of Power Dispatching Automation System Based on Clustering Under-Sampling in the Boundary Region
2021
Timely detection of business anomalies in the power dispatching automation system is significant for the steady operation of the power grid. Although imbalanced binary classification in machine learning is an effective way to achieve business anomaly detection for the system, the overlap of boundary samples is an urgent issue affecting classification performance. An under-sampling method that removes the clustering noises of the majority samples in the boundary region is proposed. First, KNN is used to search the adjacent points of each majority-class sample, and the boundary region and the safety region are divided according to the proportion of majority samples among those adjacent points. Second, DBSCAN is used to cluster the majority samples in the boundary region, and noise points are removed. Finally, the method is combined with model dynamic selection driven by data partition hybrid sampling (DPHS-MDS). This reduces the overlap of boundary samples, balances the dataset, and improves classification performance. Experimental results show that the proposed method outperforms the relevant mainstream methods on F-measure and G-mean.
Journal Article
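The boundary/safety split described in the abstract above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the function name, the k value, and the 0.7 majority-ratio threshold are all assumptions, and the subsequent DBSCAN noise-removal step is omitted.

```python
import numpy as np

def split_boundary_region(X_maj, X_min, k=5, maj_ratio_threshold=0.7):
    """Split majority samples into a 'safety' region and a 'boundary' region.

    For each majority sample, inspect its k nearest neighbours among all
    samples; if the fraction of majority-class neighbours falls below the
    threshold, the sample sits near the class boundary.
    """
    X_all = np.vstack([X_maj, X_min])
    is_maj = np.array([True] * len(X_maj) + [False] * len(X_min))

    boundary, safety = [], []
    for i, x in enumerate(X_maj):
        d = np.linalg.norm(X_all - x, axis=1)
        d[i] = np.inf                       # exclude the sample itself
        nn = np.argsort(d)[:k]              # indices of k nearest neighbours
        frac_maj = is_maj[nn].mean()
        (safety if frac_maj >= maj_ratio_threshold else boundary).append(i)
    return np.array(boundary, dtype=int), np.array(safety, dtype=int)
```

A majority sample stranded inside the minority cluster ends up in the boundary region, where the paper's method would then apply DBSCAN and drop cluster noise.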
The effect of network size and sampling completeness in depauperate networks
2019
The accurate estimation of interaction network structure is essential for understanding network stability and function. A growing number of studies evaluate under‐sampling as the degree of sampling completeness (proportional richness observed). How the relationship between network structural metrics and sampling completeness varies across networks of different sizes remains unclear, but this relationship has implications for the within‐ and between‐system comparability of network structure. Here, we test the combined effects of network size and sampling completeness on the structure of spatially distinct networks (i.e., subwebs) in a host–parasitoid model system to better understand the within‐system variability in metric bias. Richness estimates were used to quantify a gradient of sampling completeness of species and interactions across randomly subsampled subwebs. The combined impacts of network size and sampling completeness on the estimated values of twelve unweighted and weighted network metrics were tested. The robustness of network metrics to under‐sampling was strongly related to network size, and sampling completeness of interactions was generally a better predictor of metric bias than sampling completeness of species. Weighted metrics often performed better than unweighted metrics at low sampling completeness; however, this was mainly evident at large rather than small subweb size. These outcomes highlight the significance of under‐sampling for the comparability of both unweighted and weighted network metrics when networks are small and vary in size. This has implications for within‐system comparability of species‐poor networks and, more generally, reveals problems with under‐sampling ecological networks that may otherwise be difficult to detect in species‐rich networks. To mitigate the impacts of under‐sampling, more careful consideration of system‐specific variation in metric bias is needed.
Effects of under‐sampling are often overlooked in ecological network studies. The authors present an approach for evaluating within‐system comparability of network structure when subwebs vary in size. The results highlight the importance of considering within‐system variation in metric bias in species‐poor systems and overall problems with under‐sampling in ecological networks.
Journal Article
Evolutionary under-sampling based bagging ensemble method for imbalanced data classification
2018
In the class imbalanced learning scenario, traditional machine learning algorithms that focus on optimizing overall accuracy tend to achieve poor classification performance, especially for the minority class in which we are most interested. To solve this problem, many effective approaches have been proposed. Among them, bagging ensemble methods integrated with under-sampling techniques have demonstrated better performance than alternatives, including bagging ensembles integrated with over-sampling techniques and cost-sensitive methods. Although these under-sampling techniques promote diversity among the generated base classifiers through random partitioning or sampling of the majority class, they take no measure to ensure the individual classification performance, which limits the achievable ensemble performance. On the other hand, evolutionary under-sampling (EUS), a novel under-sampling technique, has been successfully applied to searching for the best majority-class subset for training a good-performance nearest neighbor classifier. Inspired by EUS, in this paper we introduce it into the under-sampling bagging framework and propose an EUS-based bagging ensemble method (EUS-Bag), designing a new fitness function that considers three factors to make EUS better suited to the framework. With our fitness function, EUS-Bag can generate a set of accurate and diverse base classifiers. To verify the effectiveness of EUS-Bag, we conduct a series of comparison experiments on 22 two-class imbalanced classification problems. Experimental results measured using recall, geometric mean, and AUC all demonstrate its superior performance.
Journal Article
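The under-sampling bagging framework that EUS-Bag builds on can be sketched without the evolutionary part: each base learner is trained on all minority samples plus a fresh random subset of the majority class, and predictions are combined by majority vote. A minimal sketch with a toy nearest-centroid base learner; the class names, function interface, and the vote scheme are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class CentroidClassifier:
    """Toy base learner: predict the class whose centroid is nearest."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None], axis=2)
        return self.classes_[d.argmin(axis=1)]

def under_bagging_predict(X_train, y_train, X_test, n_estimators=9, seed=0):
    """Under-sampling bagging: each base learner sees all minority samples
    plus a fresh random majority subset; predictions are majority-voted."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y_train, return_counts=True)
    maj, mino = labels[counts.argmax()], labels[counts.argmin()]
    maj_idx = np.flatnonzero(y_train == maj)
    min_idx = np.flatnonzero(y_train == mino)
    votes = []
    for _ in range(n_estimators):
        sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([sub, min_idx])
        clf = CentroidClassifier().fit(X_train[idx], y_train[idx])
        votes.append(clf.predict(X_test))
    # majority vote across estimators (assumes non-negative integer labels)
    return np.array([np.bincount(col).argmax() for col in np.stack(votes).T])
```

EUS-Bag replaces the random `rng.choice` subset with an evolutionary search guided by a fitness function balancing accuracy and diversity.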
Modelling the distribution of rare invertebrates by correcting class imbalance and spatial bias
2022
Aim Soil arthropods are important decomposers and nutrient cyclers, but are poorly represented on national and international conservation Red Lists. Opportunistic biological records for soil invertebrates are sparse, and contain few observations of rare species but a relatively large number of non‐detection observations (a problem known as class imbalance). Robinson et al. (Diversity and Distributions, 24, 460) proposed a method for under‐sampling non‐detection data using a spatial grid to improve class balance and spatial bias in bird data. For taxa that are less intensively sampled, datasets are smaller, which poses a challenge because under‐sampling data removes information. We tested whether spatially stratified under‐sampling improved prediction performance of species distribution models for millipedes, for which large datasets are not available. We also tested whether using environmental predictor variables provided additional information beyond what is captured by spatial position for predicting species distributions. Location Island of Ireland. Methods We tested the spatially stratified under‐sampling method of Robinson et al. (Diversity and Distributions, 24, 460) by using biological records to train species distribution models of rare millipedes. Results Using spatially stratified under‐sampled data improved species distribution model sensitivity (true positive rate) but decreased model specificity (true negative rate). The spatial pattern of under‐sampling affected model performance. Training data that was under‐sampled in a spatially stratified way sometimes produced worse models than did data that was under‐sampled in an unstratified way. Geographic coordinates were as good as or better than environmental variables for predicting distributions of one out of six species. Main Conclusions Spatially stratified under‐sampling improved prediction performance of species distribution models for rare millipedes. 
Spatially stratified under‐sampling was most effective for rarer species, although unstratified under‐sampling was sometimes more effective. The good prediction performance of models using geographic coordinates is promising for modelling distributions of poorly studied species for which little is known about ecological or physiological determinants of occurrence.
Journal Article
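The spatially stratified under-sampling of non-detection records described above can be illustrated simply: keep every detection, and cap the number of non-detections retained per grid cell. This is a hedged sketch; the function name, cell size, and per-cell cap are assumptions for illustration, not the parameters of Robinson et al.

```python
from collections import defaultdict
import numpy as np

def stratified_under_sample_absences(coords, is_detection, cell_size, n_per_cell, rng=None):
    """Spatially stratified under-sampling: keep all detections, but at most
    n_per_cell non-detection records per spatial grid cell."""
    rng = np.random.default_rng(rng)
    keep = list(np.flatnonzero(is_detection))          # detections always kept
    cells = np.floor(coords / cell_size).astype(int)   # grid-cell index per record

    groups = defaultdict(list)                         # non-detections by cell
    for idx in np.flatnonzero(~is_detection):
        groups[tuple(cells[idx])].append(idx)

    for members in groups.values():
        members = np.array(members)
        if len(members) > n_per_cell:
            members = rng.choice(members, size=n_per_cell, replace=False)
        keep.extend(members.tolist())
    return np.sort(np.array(keep))
```

Capping per cell both rebalances the classes and flattens spatial clusters of survey effort, which is the paper's motivation for preferring it over unstratified under-sampling.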
A gradient boosting-based mortality prediction model for COVID-19 patients
2023
The COVID-19 pandemic has been a global public health concern since March 11, 2020. Healthcare systems struggled to meet patients’ growing needs for diagnosis, treatment, and care, and as they struggled to cope with the overwhelming demand, advanced intelligence and computing technologies became essential for identifying and triaging patients, predicting disease severity, and detecting outcomes. The aim of the paper is to propose a gradient boosting-based model to predict the mortality of COVID-19 patients and to improve prediction accuracy by incorporating resampling strategies. A real COVID-19 dataset that includes patients’ travel, health, geographical, and demographic information was obtained from a public repository. The dataset has a class imbalance problem, and several approaches are applied to solve it. The proposed approach incorporates resampling strategies, such as the synthetic minority oversampling technique (SMOTE), random under-sampling, and clustering-based under-sampling, to address the imbalanced class distribution in the dataset. Gradient boosting machines (GBMs) such as extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and categorical boosting (CatBoost) are then analyzed in terms of accuracy and computational time. A random search method is used to find the optimal hyper-parameters for the algorithms. A stacking-based hybrid model that combines the XGBoost, LightGBM, and CatBoost algorithms was used for comparison in the experiments, which investigate the factors that can influence the mortality of COVID-19 patients.
The patient’s age, whether the patient was from Wuhan, and the number of days between first noticing symptoms and visiting the hospital are found to affect mortality. By utilizing over- and under-sampling approaches, the class imbalance problem is mitigated. XGBoost, LightGBM, and CatBoost are analyzed in terms of various performance metrics to determine the GBM best suited to the proposed system. The experimental results revealed that the stacking-based hybrid model performs well with the balanced dataset produced by SMOTE, while CatBoost produces superior results for datasets balanced with random under-sampling and clustering-based under-sampling. The study also emphasizes the importance of addressing the imbalanced class distribution in the dataset and incorporates resampling strategies to improve prediction accuracy. The promising results confirm the success of the proposed system in predicting the mortality of COVID-19 patients.
Journal Article
Under Sampling Techniques for Handling Unbalanced Data with Various Imbalance Rates: A Comparative Study
2024
Unbalanced datasets contain an unequal number of examples for different classes. Such datasets pose a problem for machine learning tools: in datasets with high imbalance ratios, false negative rates increase because most classifiers are biased toward the majority class. Choosing the most informative evaluation metrics, together with sampling techniques, is a common way to handle this problem. In this paper, a comparative analysis of four of the most common under-sampling techniques is conducted over datasets with imbalance rates (IR) ranging from low through medium to high. A Decision Tree classifier and twelve imbalanced datasets with various IRs are used to evaluate the effects of each technique on recall, F1-measure, G-mean, minority-class recall, and minority-class F1-measure. Results demonstrate that Cluster Centroids outperformed the Neighborhood Cleaning Rule (NCL) on recall for all low-IR datasets. For both medium- and high-IR datasets, NCL and Random Under Sampling (RUS) outperformed the remaining techniques, while Tomek Links had the worst effect.
Journal Article
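Random Under Sampling (RUS), one of the four techniques compared above, is simple enough to sketch directly: discard majority-class samples at random until every class is reduced to the minority-class size. A minimal numpy version; the function name and interface are assumptions for illustration.

```python
import numpy as np

def random_under_sample(X, y, rng=None):
    """Random Under Sampling (RUS): randomly discard samples so that every
    class retains only as many samples as the smallest class."""
    rng = np.random.default_rng(rng)
    labels, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in labels
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]
```

The comparison papers' other techniques differ only in how they pick which majority samples to drop: Cluster Centroids replaces them with cluster prototypes, NCL and Tomek Links remove samples near the class boundary.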
Ensemble Deep Learning Models for Heart Disease Classification: A Case Study from Mexico
by Baccouche, Asma; Castillo Olea, Cristian; Elmaghraby, Adel
in Accuracy; Algorithms; Artificial intelligence
2020
Heart diseases are highly ranked among the leading causes of mortality in the world. They have various types, including vascular, ischemic, and hypertensive heart disease. A large number of medical features are reported for patients in Electronic Health Records (EHRs), allowing physicians to diagnose and monitor heart disease. We collected a dataset from Medica Norte Hospital in Mexico that includes 800 records and 141 indicators such as age, weight, glucose, blood pressure rate, and clinical symptoms. The distribution of the collected records is very unbalanced across the different types of heart disease: 17% of records have hypertensive heart disease, 16% ischemic heart disease, 7% mixed heart disease, and 8% valvular heart disease. Herein, we propose an ensemble-learning framework of different neural network models and a method of aggregating random under-sampling. To improve the performance of the classification algorithms, we implement a data preprocessing step with feature selection. Experiments were conducted with unidirectional and bidirectional neural network models, and the results showed that an ensemble classifier pairing a BiLSTM or BiGRU model with a CNN model had the best classification performance, with accuracy and F1-score between 91% and 96% for the different types of heart disease. These results are competitive and promising for the heart disease dataset. We showed that an ensemble-learning framework based on deep models can overcome the problem of classifying an unbalanced heart disease dataset. Our proposed framework can lead to highly accurate models that are adapted to real clinical data and diagnostic use.
Journal Article
Spatio-Temporal Agnostic Sampling for Imbalanced Multivariate Seasonal Time Series Data: A Study on Forest Fires
by Zaman, Marzia; Sampalli, Srinivas; Purcell, Richard
in Algorithms; big data analytics; Canada
2025
Natural disasters are mostly seasonal and caused by anthropogenic, climatic, and geological factors that impact human life, economy, ecology, and natural resources. This paper focuses on increasingly widespread forest fires, which have caused greater destruction in recent years. Data obtained from sensors for predicting forest fires and assessing fire severity, i.e., area burned, are multivariate, seasonal, and highly imbalanced, with a ratio of 100,000+ non-fire events to 1 fire event. This paper presents Spatio-Temporal Agnostic Sampling (STAS) to overcome the challenge of highly imbalanced data. It first presents a mathematical understanding of fire and non-fire events and then a thorough complexity analysis of the proposed STAS framework and two existing methods, NearMiss and SMOTE. Further, to investigate the applicability of STAS, binary classification models (to determine the probability of forest fire) and regression models (to assess the severity of forest fire) were built on the data generated from STAS. A total of 432 experiments were conducted to validate the robustness of the STAS parameters, and additional experiments with a temporal data split further validated the results. The results show that 180 of the 216 binary classification models had an F1-score > 0.9 and 150 of the 216 regression models had an R² score > 0.75. These results indicate the applicability of STAS for fire prediction with highly imbalanced multivariate seasonal time series data.
Journal Article
Customised-sampling approach for pipe failure prediction in water distribution networks
2024
This paper presents a new methodology for addressing imbalanced class data in failure prediction for Water Distribution Networks (WDNs). The proposed methodology relies on existing approaches, including under-sampling, over-sampling, and class weighting, as primary strategies. These techniques treat imbalanced datasets by adjusting the representation of minority and majority classes: under-sampling reduces data in the majority class, over-sampling adds data to the minority class, and class weighting assigns unequal weights based on class counts to balance the influence of each class during machine learning (ML) model training. In this paper, the mentioned approaches were applied at levels other than the “balance point” to construct pipe failure prediction models for a WDN with highly imbalanced data. F1-score and AUC–ROC were selected to evaluate model performance. Results revealed that under-sampling above the balance point yields the highest F1-score, while over-sampling achieves optimal results below the balance point. Employing class weights during training and prediction shows that weights lower than the balance-point weights are effective. Combining under-sampling and over-sampling at the same ratio for both majority and minority classes showed limited improvement. However, a more effective predictive model emerged when over-sampling the minority class and under-sampling the majority class at different ratios, followed by applying class weights to balance the data.
Journal Article
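The "balance point" class weighting that the study above varies away from is usually the standard inverse-frequency ("balanced") heuristic, w_c = n / (k · n_c) for n samples, k classes, and n_c samples in class c. A minimal sketch of that baseline (the function name is an assumption; the paper's tuned weights deviate from these values):

```python
import numpy as np

def class_weights(y):
    """Inverse-frequency ('balanced') class weights: w_c = n / (k * n_c),
    so that a rare class contributes as much total weight to the training
    loss as a frequent one."""
    classes, counts = np.unique(y, return_counts=True)
    w = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), w.tolist()))
```

With a 90:10 split this gives the minority class a weight of 5.0 and the majority class about 0.56; the paper's finding is that weights below this balance point can predict pipe failures better.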
Comparison of biopsy under‐sampling and annual progression using hidden Markov models to learn from prostate cancer active surveillance studies
by Denton, Brian T.; Nieboer, Daan; Morgan, Todd M.
in active surveillance; Biopsy; biopsy under‐sampling
2020
This study aimed to estimate the rates of biopsy under‐sampling and progression for four prostate cancer (PCa) active surveillance (AS) cohorts within the Movember Foundation's Global Action Plan Prostate Cancer Active Surveillance (GAP3) consortium. We used a hidden Markov model (HMM) to estimate factors that define PCa dynamics for men on AS, including the biopsy under‐sampling and progression implied by longitudinal data in four large cohorts included in the GAP3 database. The HMM was subsequently used as the basis for a simulation model to evaluate the biopsy strategies previously proposed for each of these cohorts. For the four AS cohorts, the estimated annual progression rate was between 6% and 13%. The estimated probability of a biopsy successfully sampling undiagnosed non‐favorable risk cancer (biopsy sensitivity) was between 71% and 80%. In the simulation study of patients diagnosed with favorable risk cancer at age 50, the mean number of biopsies performed before age 75 was between 4.11 and 12.60, depending on the biopsy strategy, and the mean delay to detection of non‐favorable risk cancer was between 0.38 and 2.17 years. Biopsy under‐sampling and progression varied considerably across study cohorts, and all strategies demonstrated diminishing benefits from additional biopsies. No single biopsy protocol was optimal for all cohorts: the optimal biopsy strategy depends on the biopsy under‐sampling error and the cancer progression rate, which vary significantly across cohorts.
Journal Article