Catalogue Search | MBRL
1,256 result(s) for "Imbalanced data"
DAD-Net: Classification of Alzheimer’s Disease Using ADASYN Oversampling Technique and Optimized Neural Network
by Saqib Mahmood, Mian Muhammad Sadiq Fareed, Meng Joo Er
in Accuracy, Advertising executives, Algorithms
2022
Alzheimer’s Disease (AD) is a neurological brain disorder that causes dementia and neurological dysfunction, affecting memory, behavior, and cognition. Deep Learning (DL), a kind of Artificial Intelligence (AI), has paved the way for new AD detection and automation methods. The DL model’s prediction accuracy depends on the dataset’s size. DL models lose accuracy when the dataset has a class imbalance problem. This study aims to use a deep Convolutional Neural Network (CNN) to develop a reliable and efficient method for identifying Alzheimer’s disease using MRI. In this study, we offer a new CNN architecture for diagnosing Alzheimer’s disease with a modest number of parameters, making it well suited to training on a smaller dataset. The proposed model correctly separates the early stages of Alzheimer’s disease and displays class activation patterns on the brain as a heat map. The proposed Detection of Alzheimer’s Disease Network (DAD-Net) is developed from scratch to correctly classify the phases of Alzheimer’s disease while reducing parameters and computation costs. The Kaggle MRI image dataset has a severe problem with class imbalance. Therefore, we used a synthetic oversampling technique to distribute the images evenly across the classes and avoid the problem. Precision, recall, F1-score, Area Under the Curve (AUC), and loss are all used to compare the proposed DAD-Net against DEMENET and CNN Model. For accuracy, AUC, F1-score, precision, and recall, the DAD-Net achieved the following values: 99.22%, 99.91%, 99.19%, 99.30%, and 99.14%, respectively. According to the simulation results, the presented DAD-Net outperforms other state-of-the-art models in all evaluation metrics.
Journal Article
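The ADASYN oversampling named in this record's title can be illustrated with a small standard-library sketch. It is a simplified version under stated assumptions, not the DAD-Net pipeline: the function name `adasyn_sketch`, the toy 2-D points, and `k=3` are all hypothetical, and the paper applies oversampling to MRI images rather than to points like these. The idea shown is that minority samples surrounded by more majority neighbours receive a larger share of the synthetic samples.

```python
import math
import random


def adasyn_sketch(minority, majority, k=3, seed=0):
    """Generate synthetic minority points, ADASYN-style (simplified sketch)."""
    rng = random.Random(seed)
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    n_needed = len(majority) - len(minority)  # samples needed for full balance
    if n_needed <= 0 or len(minority) < 2:
        return []

    # r_i: share of majority points among the k nearest neighbours of x_i.
    ratios = []
    for x in minority:
        neighbours = sorted((q for q in labelled if q[0] is not x),
                            key=lambda q: math.dist(x, q[0]))[:k]
        ratios.append(sum(1 for _, label in neighbours if label == 0) / k)

    # Normalise the ratios into a distribution over minority samples.
    total = sum(ratios)
    if total == 0:
        weights = [1 / len(minority)] * len(minority)
    else:
        weights = [r / total for r in ratios]

    synthetic = []
    for x, w in zip(minority, weights):
        # Nearest minority neighbours serve as interpolation partners.
        partners = sorted((m for m in minority if m is not x),
                          key=lambda m: math.dist(x, m))[:k]
        for _ in range(round(w * n_needed)):
            partner = rng.choice(partners)
            gap = rng.random()  # random position on the segment x -> partner
            synthetic.append(tuple(a + gap * (b - a)
                                   for a, b in zip(x, partner)))
    return synthetic


# Toy data (made up for illustration): 3 minority and 7 majority points.
minority = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0)]
majority = [(1.1, 1.0), (0.9, 1.2), (1.2, 0.8), (2.0, 2.0),
            (2.1, 1.9), (1.9, 2.2), (2.2, 2.1)]
new_points = adasyn_sketch(minority, majority)
```

Because each synthetic point is an interpolation between two minority points, it always lies on a segment joining existing minority samples. Rounding the per-sample counts means the number generated may differ slightly from the exact balancing target.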
Processing imbalanced medical data at the data level with assisted-reproduction data as an example
2024
Objective
Data imbalance is a pervasive issue in medical data mining, often leading to biased and unreliable predictive models. This study aims to address the urgent need for effective strategies to mitigate the impact of data imbalance on classification models. We focus on quantifying the effects of different imbalance degrees and sample sizes on model performance, identifying optimal cut-off values, and evaluating the efficacy of various methods to enhance model accuracy in highly imbalanced and small sample size scenarios.
Methods
We collected medical records of patients receiving assisted reproductive treatment in a reproductive medicine center. Random forest was used to screen the key variables for the prediction target. Various datasets with different imbalance degrees and sample sizes were constructed to compare the classification performance of logistic regression models. Metrics such as AUC, G-mean, F1-Score, Accuracy, Recall, and Precision were used for evaluation. Four imbalance treatment methods (SMOTE, ADASYN, OSS, and CNN) were applied to datasets with low positive rates and small sample sizes to assess their effectiveness.
Results
The logistic model’s performance was low when the positive rate was below 10% but stabilized beyond this threshold. Similarly, sample sizes below 1200 yielded poor results, with improvement seen above this threshold. For robustness, the optimal cut-offs for positive rate and sample size were identified as 15% and 1500, respectively. SMOTE and ADASYN oversampling significantly improved classification performance in datasets with low positive rates and small sample sizes.
Conclusions
The study identifies a positive rate of 15% and a sample size of 1500 as optimal cut-offs for stable logistic model performance. For datasets with low positive rates and small sample sizes, SMOTE and ADASYN are recommended to improve balance and model accuracy.
Journal Article
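Several of the evaluation metrics listed in this record (Accuracy, Recall, Precision, F1-Score, G-mean) follow directly from a binary confusion matrix. The sketch below uses made-up counts with a 5% positive rate, not data from the study, and the function name `binary_metrics` is hypothetical. It shows why accuracy alone can look good on a low-positive-rate dataset while recall and G-mean stay poor.

```python
import math


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from confusion-matrix counts."""
    recall = safe_div(tp, tp + fn)        # sensitivity, true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    precision = safe_div(tp, tp + fp)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "g_mean": math.sqrt(recall * specificity),
    }


# Hypothetical counts: 1000 samples, 50 positives (5% positive rate).
metrics = binary_metrics(tp=10, fp=5, fn=40, tn=945)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is high because the majority class dominates, while only one in five positives is found, which is what G-mean and F1 expose.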
SORAG: Synthetic Data Over-Sampling Strategy on Multi-Label Graphs
2022
In many real-world networks of interest in the field of remote sensing (e.g., public transport networks), nodes are associated with multiple labels, and node classes are imbalanced; that is, some classes have significantly fewer samples than others. However, the research problem of imbalanced multi-label graph node classification remains unexplored. This non-trivial task challenges the existing graph neural networks (GNNs) because the majority class can dominate the loss functions of GNNs and result in the overfitting of the majority class features and label correlations. On non-graph data, minority over-sampling methods (such as the synthetic minority over-sampling technique and its variants) have been demonstrated to be effective for the imbalanced data classification problem. This study proposes and validates a new hypothesis with unlabeled data over-sampling, which is meaningless for imbalanced non-graph data; however, feature propagation and topological interplay mechanisms between graph nodes can facilitate the representation learning of imbalanced graphs. Furthermore, we determine empirically that ensemble data synthesis through the creation of virtual minority samples in the central region of a minority and generation of virtual unlabeled samples in the boundary region between a minority and majority is the best practice for the imbalanced multi-label graph node classification task. Our proposed novel data over-sampling framework is evaluated using multiple real-world network datasets, and it outperforms diverse, strong benchmark models by a large margin.
Journal Article
Survey on Highly Imbalanced Multi-class Data
by Hamid, Mohd Hakim Abdul; Yusoff, Marina; Mohamed, Azlinah
in Big Data, Classification, Cluster analysis
2022
Machine learning technology has a massive impact on society because it offers solutions to many complicated problems such as classification, clustering analysis, and prediction, especially during the COVID-19 pandemic. Data distribution in machine learning has been an essential aspect in providing unbiased solutions. From the earliest literature published on highly imbalanced data until recently, machine learning research has focused mostly on binary classification data problems. Research on highly imbalanced multi-class data remains largely unexplored, even as the need for better analysis and prediction in handling Big Data grows. This study reviews the models and techniques for handling highly imbalanced multi-class data, along with their strengths, weaknesses, and related domains. Furthermore, the paper uses a statistical method to explore a case study with a severely imbalanced dataset. This article aims to (1) understand the trend of highly imbalanced multi-class data through analysis of related literature; (2) analyze the previous and current methods of handling highly imbalanced multi-class data; (3) construct a framework of highly imbalanced multi-class data. The chosen highly imbalanced multi-class dataset will also be analyzed and adapted to current machine learning methods or techniques, followed by discussions on open challenges and the future direction of highly imbalanced multi-class data. Finally, this paper presents a novel framework for highly imbalanced multi-class data. We hope this research can provide insights on the potential development of better methods or techniques to handle and manipulate highly imbalanced multi-class data.
Journal Article
Long-Tailed Graph Representation Learning via Dual Cost-Sensitive Graph Convolutional Network
2022
Deep learning algorithms have seen a massive rise in popularity for remote sensing over the past few years. Recently, studies on applying deep learning techniques to graph data in remote sensing (e.g., public transport networks) have been conducted. In graph node classification tasks, traditional graph neural network (GNN) models assume that different types of misclassifications have an equal loss and thus seek to maximize the posterior probability of the sample nodes under labeled classes. The graph data used in realistic scenarios tend to follow unbalanced long-tailed class distributions, where a few majority classes contain most of the vertices and the minority classes contain only a small number of nodes, making it difficult for the GNN to accurately predict the minority class samples owing to the classification tendency of the majority classes. In this paper, we propose a dual cost-sensitive graph convolutional network (DCSGCN) model. The DCSGCN is a two-tower model containing two subnetworks that compute the posterior probability and the misclassification cost. The model uses the cost as "complementary information" in a prediction to correct the posterior probability under the perspective of minimal risk. Furthermore, we propose a new method for computing the node cost labels based on topological graph information and the node class distribution. The results of extensive experiments demonstrate that DCSGCN outperformed other competitive baselines on different real-world imbalanced long-tailed graphs.
Journal Article
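The idea of correcting a posterior probability with misclassification costs "under the perspective of minimal risk" can be illustrated with the generic minimum-risk decision rule. This is a sketch of that textbook rule only, not of the DCSGCN architecture: in the paper the costs come from a subnetwork, whereas here the cost matrix, the posterior, and the function names are hand-picked assumptions.

```python
def expected_risk(posterior, cost):
    """Expected cost of predicting each class.

    posterior[i] is the estimated probability of true class i, and
    cost[i][j] is the cost of predicting class j when the true class is i.
    """
    n_classes = len(posterior)
    return [sum(posterior[i] * cost[i][j] for i in range(n_classes))
            for j in range(n_classes)]


def min_risk_class(posterior, cost):
    """Pick the class whose prediction has the lowest expected cost."""
    risks = expected_risk(posterior, cost)
    return min(range(len(risks)), key=risks.__getitem__)


# Hypothetical example: class 0 is the majority, class 1 the minority.
posterior = [0.7, 0.3]
cost = [
    [0.0, 1.0],  # true majority: predicting minority costs 1
    [5.0, 0.0],  # true minority: predicting majority costs 5
]

map_class = max(range(len(posterior)), key=posterior.__getitem__)
risk_class = min_risk_class(posterior, cost)
print("highest-posterior class:", map_class)
print("minimum-risk class:", risk_class)
```

With equal costs the two rules agree. Once missing a minority node is made more expensive, the minimum-risk rule can select the minority class even when its posterior is lower.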
Predictive performance of presence-only species distribution models
by Valavi, Roozbeh; Guillera-Arroita, Gurutzeta; Lahoz-Monfort, José J.
in Algorithms, boosted regression trees, data collection
2022
Species distribution modeling (SDM) is widely used in ecology and conservation. Currently, the most available data for SDM are species presence-only records (available through digital databases). There have been many studies comparing the performance of alternative algorithms for modeling presence-only data. Among these, a 2006 paper from Elith and colleagues has been particularly influential in the field, partly because they used several novel methods (at the time) on a global data set that included independent presence–absence records for model evaluation. Since its publication, some of the algorithms have been further developed and new ones have emerged. In this paper, we explore patterns in predictive performance across methods, by reanalyzing the same data set (225 species from six different regions) using updated modeling knowledge and practices. We apply well-established methods such as generalized additive models and MaxEnt, alongside others that have received attention more recently, including regularized regressions, point-process weighted regressions, random forests, XGBoost, support vector machines, and the ensemble modeling framework biomod. All the methods we use include background samples (a sample of environments in the landscape) for model fitting. We explore impacts of using weights on the presence and background points in model fitting. We introduce new ways of evaluating models fitted to these data, using the area under the precision-recall gain curve, and focusing on the rank of results. We find that the way models are fitted matters. The top method was an ensemble of tuned individual models. In contrast, ensembles built using the biomod framework with default parameters performed no better than single moderate performing models. 
Similarly, the second top performing method was a random forest parameterized to deal with many background samples (contrasted to relatively few presence records), which substantially outperformed other random forest implementations. We find that, in general, nonparametric techniques with the capability of controlling for model complexity outperformed traditional regression methods, with MaxEnt and boosted regression trees still among the top performing models. All the data and code with working examples are provided to make this study fully reproducible.
Journal Article
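One common way to weight presence and background points during model fitting is to give both groups the same total weight. The sketch below illustrates that single scheme as an assumption; the abstract does not state which weighting schemes the study compared, and the function name `balancing_weights` and the sample counts are hypothetical.

```python
import math


def balancing_weights(labels):
    """Return one weight per sample so both classes carry equal total weight.

    labels: iterable of 1 (presence) and 0 (background).
    Presences keep weight 1; each background point is down-weighted.
    """
    labels = list(labels)
    n_presence = sum(1 for y in labels if y == 1)
    n_background = len(labels) - n_presence
    if n_presence == 0 or n_background == 0:
        raise ValueError("both presence and background points are required")
    background_weight = n_presence / n_background
    return [1.0 if y == 1 else background_weight for y in labels]


# Hypothetical data: 20 presence records against 1000 background samples.
labels = [1] * 20 + [0] * 1000
weights = balancing_weights(labels)

presence_total = sum(w for w, y in zip(weights, labels) if y == 1)
background_total = sum(w for w, y in zip(weights, labels) if y == 0)
print(presence_total, background_total)
```

Many model-fitting routines accept per-sample weights like these, so the many background samples do not swamp the relatively few presence records.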
Boosting methods for multi-class imbalanced data classification: an experimental review
by Abdi, Yousef; Asadpour, Mohammad; Razzaghi, Nazila
in Algorithms, Big Data, Boosting algorithms
2020
Since canonical machine learning algorithms assume that the dataset has an equal number of samples in each class, discriminating the minority class samples efficiently in imbalanced datasets has become a very challenging binary classification task. For this reason, researchers have paid attention to this problem and have proposed many methods to deal with it, which can be broadly categorized into data-level and algorithm-level approaches. Moreover, multi-class imbalanced learning is much harder than the binary case and is still an open problem. Boosting algorithms are a class of ensemble learning methods in machine learning that improve the performance of separate base learners by combining them into a composite whole. This paper’s aim is to review the most significant published boosting techniques on multi-class imbalanced datasets. A thorough empirical comparison is conducted to analyze the performance of binary and multi-class boosting algorithms on various multi-class imbalanced datasets. In addition, based on the obtained results for performance evaluation metrics and a recently proposed criterion for comparing metrics, the selected metrics are compared to determine a suitable performance metric for multi-class imbalanced datasets. The experimental studies show that the CatBoost and LogitBoost algorithms are superior to other boosting algorithms on multi-class imbalanced conventional and big datasets, respectively. Furthermore, the MMCC is a better evaluation metric than the MAUC and G-mean in multi-class imbalanced data domains.
Journal Article
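Two of the metrics compared in this record can be computed from a multi-class confusion matrix. The sketch below assumes that MMCC refers to the multi-class generalization of the Matthews correlation coefficient and that G-mean is the geometric mean of per-class recall. The confusion matrix is made up for illustration, and the function names are hypothetical.

```python
import math


def multiclass_mcc(confusion):
    """Multi-class Matthews correlation coefficient.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    true_counts = [sum(confusion[k]) for k in range(n)]           # row sums
    pred_counts = [sum(confusion[i][k] for i in range(n))         # column sums
                   for k in range(n)]
    numerator = correct * total - sum(
        t * p for t, p in zip(true_counts, pred_counts))
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts)))
    return numerator / denominator if denominator else 0.0


def g_mean(confusion):
    """Geometric mean of per-class recall (assumes every class has samples)."""
    recalls = [confusion[k][k] / sum(confusion[k])
               for k in range(len(confusion))]
    return math.prod(recalls) ** (1 / len(recalls))


# Hypothetical 3-class result: one majority class (100) and two minorities (10 each).
confusion = [
    [90, 5, 5],
    [4, 5, 1],
    [3, 1, 6],
]
accuracy = sum(confusion[k][k] for k in range(3)) / 120
mcc = multiclass_mcc(confusion)
gm = g_mean(confusion)
print(f"accuracy={accuracy:.3f}  mcc={mcc:.3f}  g_mean={gm:.3f}")
```

In this example accuracy is dominated by the majority class, while both the MCC and the G-mean are pulled down by the weak recall on the two minority classes.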
Deep reinforcement learning for multi-class imbalanced training: applications in healthcare
by Soltan, Andrew A. S.; Lachapelle, Alexander S.; Eyre, David W.
in Algorithms, Artificial Intelligence, Case studies
2024
With the rapid growth of memory and computing power, datasets are becoming increasingly complex and imbalanced. This is especially severe in the context of clinical data, where there may be one rare event for many cases in the majority class. We introduce an imbalanced classification framework, based on reinforcement learning, for training on extremely imbalanced data sets, and extend it for use in multi-class settings. We combine dueling and double deep Q-learning architectures, and formulate a custom reward function and episode-training procedure designed specifically to handle multi-class imbalanced training. Using real-world clinical case studies, we demonstrate that our proposed framework outperforms current state-of-the-art imbalanced learning methods, achieving fairer and more balanced classification, while also significantly improving the prediction of minority classes.
Journal Article
Imbalanced data preprocessing techniques for machine learning: a systematic mapping study
2023
Machine Learning (ML) algorithms have been increasingly replacing people in several application domains—in which the majority suffer from data imbalance. In order to solve this problem, published studies implement data preprocessing techniques, cost-sensitive and ensemble learning. These solutions reduce the naturally occurring bias towards the majority sample through ML. This study uses a systematic mapping methodology to assess 9927 papers related to sampling techniques for ML in imbalanced data applications from 7 digital libraries. A filtering process selected 35 representative papers from various domains, such as health, finance, and engineering. As a result of a thorough quantitative analysis of these papers, this study proposes two taxonomies—illustrating sampling techniques and ML models. The results indicate that oversampling and classical ML are the most common preprocessing techniques and models, respectively. However, solutions with neural networks and ensemble ML models have the best performance—with potentially better results through hybrid sampling techniques. Finally, none of the 35 works apply simulation-based synthetic oversampling, indicating a path for future preprocessing solutions.
Journal Article
Deep MLP-CNN Model Using Mixed-Data to Distinguish between COVID-19 and Non-COVID-19 Patients
2020
The limitations and high false-negative rates (30%) of COVID-19 test kits have been a prominent challenge during the 2020 coronavirus pandemic. Manufacturing those kits and performing the tests require extensive resources and time. Recent studies show that radiological images like chest X-rays can offer a more efficient solution and faster initial screening of COVID-19 patients. In this study, we develop a COVID-19 diagnosis model using Multilayer Perceptron and Convolutional Neural Network (MLP-CNN) for mixed-data (numerical/categorical and image data). The model predicts and differentiates between COVID-19 and non-COVID-19 patients, such that early diagnosis of the virus can be initiated, leading to timely isolation and treatments to stop further spread of the disease. We also explore the benefits of using numerical/categorical data in association with chest X-ray images for screening COVID-19 patients considering both balanced and imbalanced datasets. Three different optimization algorithms are used and tested: adaptive learning rate optimization algorithm (Adam), stochastic gradient descent (SGD), and root mean square propagation (RMSprop). Preliminary computational results show that, on a balanced dataset, a model trained with Adam can distinguish between COVID-19 and non-COVID-19 patients with a higher accuracy of 96.3%. On the imbalanced dataset, the model trained with RMSprop outperformed all other models by achieving an accuracy of 95.38%. Additionally, our proposed model outperformed selected existing deep learning models (considering only chest X-ray or CT scan images) by producing an overall average accuracy of 94.6% ± 3.42%.
Journal Article