Catalogue Search | MBRL
774 result(s) for "imbalanced data classification"
SORAG: Synthetic Data Over-Sampling Strategy on Multi-Label Graphs
2022
In many real-world networks of interest in the field of remote sensing (e.g., public transport networks), nodes are associated with multiple labels, and node classes are imbalanced; that is, some classes have significantly fewer samples than others. However, the research problem of imbalanced multi-label graph node classification remains unexplored. This non-trivial task challenges the existing graph neural networks (GNNs) because the majority class can dominate the loss functions of GNNs and result in the overfitting of the majority class features and label correlations. On non-graph data, minority over-sampling methods (such as the synthetic minority over-sampling technique and its variants) have been demonstrated to be effective for the imbalanced data classification problem. This study proposes and validates a new hypothesis with unlabeled data over-sampling, which is meaningless for imbalanced non-graph data; however, feature propagation and topological interplay mechanisms between graph nodes can facilitate the representation learning of imbalanced graphs. Furthermore, we determine empirically that ensemble data synthesis through the creation of virtual minority samples in the central region of a minority and generation of virtual unlabeled samples in the boundary region between a minority and majority is the best practice for the imbalanced multi-label graph node classification task. Our proposed novel data over-sampling framework is evaluated using multiple real-world network datasets, and it outperforms diverse, strong benchmark models by a large margin.
Journal Article
Long-Tailed Graph Representation Learning via Dual Cost-Sensitive Graph Convolutional Network
2022
Deep learning algorithms have seen a massive rise in popularity for remote sensing over the past few years. Recently, studies on applying deep learning techniques to graph data in remote sensing (e.g., public transport networks) have been conducted. In graph node classification tasks, traditional graph neural network (GNN) models assume that different types of misclassifications have an equal loss and thus seek to maximize the posterior probability of the sample nodes under labeled classes. The graph data used in realistic scenarios tend to follow unbalanced long-tailed class distributions, where a few majority classes contain most of the vertices and the minority classes contain only a small number of nodes, making it difficult for the GNN to accurately predict the minority class samples owing to the classification tendency of the majority classes. In this paper, we propose a dual cost-sensitive graph convolutional network (DCSGCN) model. The DCSGCN is a two-tower model containing two subnetworks that compute the posterior probability and the misclassification cost. The model uses the cost as “complementary information” in a prediction to correct the posterior probability under the perspective of minimal risk. Furthermore, we propose a new method for computing the node cost labels based on topological graph information and the node class distribution. The results of extensive experiments demonstrate that DCSGCN outperformed other competitive baselines on different real-world imbalanced long-tailed graphs.
Journal Article
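The "minimal risk" correction mentioned in the DCSGCN abstract above can be illustrated with the standard Bayes minimum-risk rule. This is only a sketch under stated assumptions: in the paper both the posterior and the cost come from learned subnetworks, whereas here the posterior vector and the cost matrix are hand-written, hypothetical values, and `min_risk_class` is an illustrative name rather than anything from the paper.

```python
def min_risk_class(posterior, cost):
    """Return (class index with lowest expected cost, list of expected costs).

    cost[i][j] is the (assumed) cost of predicting class i when the true
    class is j; posterior[j] is the estimated probability of class j.
    """
    risks = [sum(c_ij * p_j for c_ij, p_j in zip(row, posterior)) for row in cost]
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks


# Hypothetical two-class example: class 0 is the majority, class 1 the minority.
posterior = [0.7, 0.3]      # the majority class looks more likely
cost = [[0.0, 5.0],         # predicting 0 when the truth is 1 costs 5
        [1.0, 0.0]]         # predicting 1 when the truth is 0 costs 1
predicted, risks = min_risk_class(posterior, cost)
```

With these numbers the expected cost of predicting class 0 is 1.5 and of predicting class 1 is 0.7, so the minority class is chosen even though its posterior is lower. That is the sense in which a cost term can correct a posterior that leans toward the majority.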
SMOTE-WENN: Solving class imbalance and small sample problems by oversampling and distance scaling
2021
Many practical applications suffer from imbalanced data classification, in which the minority class has a degraded recognition rate. The primary causes are the sample scarcity of the minority class and the intrinsically complex distribution characteristics of imbalanced datasets. The imbalanced classification problem is more serious on small-sample datasets. To solve the problems of small sample size and class imbalance, a hybrid resampling method is proposed. The proposed method combines an oversampling approach (synthetic minority oversampling technique, SMOTE) and a novel data cleaning approach (weighted edited nearest neighbor rule, WENN). First, SMOTE generates synthetic minority class examples using linear interpolation. Then, WENN detects and deletes unsafe majority and minority class examples using a weighted distance function and the k-nearest neighbor (kNN) rule. The weighted distance function scales up a commonly used distance by considering local imbalance and spatial sparsity. Extensive experiments over synthetic and real datasets validate the superiority of the proposed SMOTE-WENN compared with three state-of-the-art resampling methods.
Journal Article
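The linear-interpolation step that the SMOTE-WENN abstract above attributes to SMOTE can be sketched as follows. This covers only the oversampling half; the WENN cleaning step and its weighted distance are not reproduced because the abstract does not specify them. The function name `smote_sample`, the parameters `k`, `n_new` and `seed`, and the toy points are all assumptions made for illustration.

```python
import random


def smote_sample(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points, each on the segment between a random
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical 2-D minority class with four samples.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_sample(minority)
```

Every synthetic point lies between two existing minority points, so the new samples stay inside the region the minority class already occupies. In practice a maintained implementation, such as the one in the imbalanced-learn package, would normally be preferred over a hand-rolled version.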
A self-inspected adaptive SMOTE algorithm (SASMOTE) for highly imbalanced data classification in healthcare
by Liu, Chenang; Lin, Ying; Chen, Hua
in Adaptive algorithms, Adaptive nearest neighborhood selection, Algorithms
2023
In many healthcare applications, datasets for classification may be highly imbalanced due to the rare occurrence of target events such as disease onset. The SMOTE (Synthetic Minority Over-sampling Technique) algorithm has been developed as an effective resampling method for imbalanced data classification by oversampling samples from the minority class. However, samples generated by SMOTE may be ambiguous, low-quality and non-separable with the majority class. To enhance the quality of generated samples, we proposed a novel self-inspected adaptive SMOTE (SASMOTE) model that leverages an adaptive nearest neighborhood selection algorithm to identify the “visible” nearest neighbors, which are used to generate samples likely to fall into the minority class. To further enhance the quality of the generated samples, an uncertainty elimination via self-inspection approach is introduced in the proposed SASMOTE model. Its objective is to filter out the generated samples that are highly uncertain and inseparable with the majority class. The effectiveness of the proposed algorithm is compared with existing SMOTE-based algorithms and demonstrated through two real-world case studies in healthcare, including risk gene discovery and fatal congenital heart disease prediction. By generating the higher quality synthetic samples, the proposed algorithm is able to help achieve better prediction performance (in terms of F1 score) on average compared to the other methods, which is promising to enhance the usability of machine learning models on highly imbalanced healthcare data.
Journal Article
DDNet: disaster damage detection for buildings based on dual-temporal joint attention network
2025
Rapid and accurate assessment of building damage is crucial for effective post-disaster emergency response. The use of pre-disaster and post-disaster satellite imagery is a common approach for detecting building damage. This task involves two essential subtasks: building localization and damage classification. In building localization, the imbalance between buildings and background, along with low recall rates, often leads to boundary deviations, which negatively impact the accuracy of subsequent damage classification. In damage classification, features from both pre-disaster and post-disaster images, combined with localization results, are used; however, variations in imaging modalities and insufficient feature extraction from temporal images can introduce interference and reduce classification performance. To address these challenges, we propose a novel two-stage network, referred to as DDNet. In the first stage, the building localization network utilizes differential upsampling connections to enhance detailed feature acquisition and employs a unified focal loss to mitigate class imbalance between buildings and background, thereby balancing precision and recall. In the second stage, a joint attention module is introduced to effectively mine features from pre-disaster and post-disaster images, leading to improved classification accuracy. Finally, a connected component analysis algorithm is applied to convert pixel-level detection results into building-level damage outputs. On the xBD dataset, the proposed framework achieves a total F1 score of 79.56%, an F1 localization score of 86.38%, and an F1 damage classification score of 76.64%.
Journal Article
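The DDNet abstract above uses a unified focal loss to counter the imbalance between building and background pixels. The sketch below shows the plain binary focal loss instead, not the unified variant used in the paper, simply to illustrate how a focal term down-weights easy examples. The defaults `gamma=2.0` and `alpha=0.25` are common choices assumed here, not values taken from the paper.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Plain binary focal loss for one prediction.

    p is the predicted probability of the positive class (0 < p < 1),
    y is the true label (1 = positive, 0 = negative).
    """
    p_t = p if y == 1 else 1.0 - p            # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95, 1)   # confident and correct: loss is heavily down-weighted
hard = focal_loss(0.10, 1)   # confident and wrong: loss stays large
```

Setting `gamma=0` removes the modulating factor and leaves an alpha-weighted cross-entropy, which makes the role of `gamma` easy to check.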
One-class ensemble classifier for data imbalance problems
2022
Imbalanced data classification is an important issue in machine learning. Despite various studies, solving the data imbalance problem is still difficult. Because oversampling methods use fake minority data, they are untrusted and cause security instability. The main objective of this paper is to improve accuracy for imbalanced data classification without generating fake minority data. For this purpose, a reliable strategy is proposed using an ensemble of one-class classifiers. Such a classifier does not suffer from data imbalance problems since the model learns from a single class. In particular, training data is split into minority and majority sets. Then, one-class classifiers are trained separately and applied to compute minority and majority scores for testing data. Finally, classification is made based on the combination of both scores. The proposed method is evaluated using imbalanced-learn datasets. Moreover, the result is compared with sampling methods via Decision Tree and K Nearest Neighbors classifiers. The one-class ensemble classifier outperforms sampling methods on 20 datasets.
Journal Article
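The split-train-score-combine procedure in the one-class ensemble abstract above can be sketched roughly as below. The abstract names neither the one-class models nor the combination rule, so both are stand-ins: `CentroidScorer` is a toy model that scores by negative distance to the class centroid, and `classify` simply picks the class with the higher score. The data points are hypothetical.

```python
import math


class CentroidScorer:
    """Toy one-class model: a higher score means closer to the class centroid."""

    def fit(self, points):
        dim = len(points[0])
        self.centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        return self

    def score(self, x):
        return -math.dist(x, self.centroid)


def classify(x, minority_model, majority_model):
    """Combine the two scores by picking the higher one (1 = minority)."""
    return 1 if minority_model.score(x) > majority_model.score(x) else 0


# Each model is trained on a single class, so neither one sees the imbalance.
majority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (0.2, 0.8)]
minority = [(4.0, 4.0), (5.0, 5.0)]
minority_model = CentroidScorer().fit(minority)
majority_model = CentroidScorer().fit(majority)

near_minority = classify((4.2, 4.8), minority_model, majority_model)
near_majority = classify((0.3, 0.4), minority_model, majority_model)
```

No synthetic minority samples are generated at any point, which is the property the abstract emphasises.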
A Two-Stage Seismic Damage Assessment Method for Small, Dense, and Imbalanced Buildings in Remote Sensing Images
by Xu, Yang; Zhang, Qiangqiang; Chen, Wenli
in Accuracy, Artificial intelligence, Artificial neural networks
2022
Large-scale optical sensing and precise, rapid assessment of seismic building damage in urban communities are increasingly demanded in disaster prevention and reduction. The common method trains a convolutional neural network (CNN) in a pixel-level semantic segmentation approach and does not fully consider the characteristics of the assessment objectives. This study developed a machine-learning-derived two-stage method for post-earthquake building location and damage assessment considering the data characteristics of satellite remote sensing (SRS) optical images with dense distribution, small size, and imbalanced numbers. It included a modified You Only Look Once (YOLOv4) object detection module and a support vector machine (SVM) based classification module. In the primary step, the multiscale features were successfully extracted and fused from SRS images of densely distributed buildings by optimizing the YOLOv4 model with respect to the network structures, training hyperparameters, and anchor boxes. The fusion-improved multi-channel features and the optimization of network structure and hyperparameters significantly enhanced the average location accuracy of post-earthquake buildings. Thereafter, three statistics (i.e., the angular second moment, dissimilarity, and inverse difference moment) based on the gray level co-occurrence matrix were found to effectively extract the characteristic value for earthquake damage from located buildings in SRS optical images. They were used as the texture features to distinguish damage intensities of buildings, using the SVM model. The investigated dataset included 386 pre- and post-earthquake SRS optical images of the 2017 Mexico City earthquake, with a resolution of 1024 × 1024 pixels. Results show that the average location accuracy of post-earthquake buildings exceeds 95.7% and that the binary classification accuracy for damage assessment reaches 97.1%. The proposed two-stage method was validated by its extremely high precision in respect of densely distributed small buildings, indicating the promising potential of computer vision in large-scale disaster prevention and reduction using SRS datasets.
Journal Article
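The three texture statistics named in the abstract above (angular second moment, dissimilarity, and inverse difference moment) have standard definitions on a normalized gray level co-occurrence matrix (GLCM). The sketch below computes them on a tiny hypothetical image. The pixel offset, the number of gray levels, and the non-symmetric counting are assumptions, since the abstract does not give the settings used in the study.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized, non-symmetric co-occurrence matrix for offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def texture_features(p):
    """Return (angular second moment, dissimilarity, inverse difference moment)."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    asm = sum(v * v for _, _, v in cells)
    dissimilarity = sum(v * abs(i - j) for i, j, v in cells)
    idm = sum(v / (1 + (i - j) ** 2) for i, j, v in cells)
    return asm, dissimilarity, idm


# Hypothetical 2x2 patches quantized to two gray levels.
smooth = texture_features(glcm([[1, 1], [1, 1]], levels=2))
checker = texture_features(glcm([[0, 1], [1, 0]], levels=2))
```

The uniform patch gives the maximum angular second moment and zero dissimilarity, while the checkerboard patch gives the opposite pattern. Feature vectors of this kind are what a classifier such as an SVM would receive as input.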
Selecting the Suitable Resampling Strategy for Imbalanced Data Classification Regarding Dataset Properties. An Approach Based on Association Models
by Sánchez-Hernández, Fernando; Kraiem, Mohamed S.; Moreno-García, María N.
in Algorithms, Cancer, Classification
2021
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class. This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples. Thus, the prediction model is unreliable although the overall model accuracy can be acceptable. Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class. However, their effectiveness depends on several factors mainly related to data intrinsic characteristics, such as imbalance ratio, dataset size and dimensionality, overlapping between classes or borderline examples. In this work, the impact of these factors is analyzed through a comprehensive comparative study involving 40 datasets from different application areas. The objective is to obtain models for automatic selection of the best resampling strategy for any dataset based on its characteristics. These models allow us to check several factors simultaneously considering a wide range of values since they are induced from very varied datasets that cover a broad spectrum of conditions. This differs from most studies that focus on the individual analysis of the characteristics or cover a small range of values. In addition, the study encompasses both basic and advanced resampling strategies that are evaluated by means of eight different performance metrics, including new measures specifically designed for imbalanced data classification. The general nature of the proposal allows the choice of the most appropriate method regardless of the domain, avoiding the search for special purpose techniques that could be valid for the target data.
Journal Article
A synergistic enhancement of the Ivy algorithm for GAN-based imbalanced classification
2025
The Ivy Algorithm (IVYA), a swarm intelligence algorithm inspired by plant growth, presents a novel framework for optimization. To unlock its full potential in complex, high-dimensional problems, it is crucial to address the fundamental challenge of balancing exploration and exploitation, which can impact overall search efficiency and solution quality. To this end, this paper proposes an Enhanced Ivy Algorithm (E-IVYA) that integrates three synergistic mechanisms. First, a dynamic perturbation framework combining symmetric and asymmetric exploration is introduced to maintain population diversity. Second, a dynamic escape mechanism based on elite differential mutation is employed to prevent search stagnation and effectively escape from local optima. Third, an adaptive movement strategy inspired by the Sine-Cosine Algorithm is integrated to achieve a more adaptive balance between global exploration and local exploitation. The performance of the proposed E-IVYA was rigorously evaluated through two distinct phases. Initially, its optimization capabilities were benchmarked against a wide range of classic and advanced algorithms on the challenging IEEE CEC 2014 and 2017 test suites. Subsequently, its practical utility was validated by applying it to the complex task of automating the hyperparameter optimization of Generative Adversarial Networks (GANs) for imbalanced data classification. The experimental results demonstrate E-IVYA’s superior performance. On the standard benchmarks, E-IVYA consistently ranked as a top-performing algorithm. In the practical application, the E-IVYA-optimized GAN model achieved a minority class F1-Score of 0.87 on the highly imbalanced Credit-Card Fraud dataset, significantly outperforming models augmented with standard techniques like SMOTE (0.71). These findings confirm that E-IVYA is a robust and efficient tool for tackling complex optimization problems, particularly in the domain of automated machine learning.
Journal Article
OUBoost: boosting based over and under sampling technique for handling imbalanced data
2023
Most real-world datasets contain imbalanced data. Learning from datasets where the number of samples in one class (minority) is much smaller than in another class (majority) creates classifiers biased toward the majority class. The overall prediction accuracy in imbalanced datasets is higher than 90%, while this accuracy is relatively lower for minority classes. In this paper, we first propose a new technique for under-sampling the majority class on imbalanced datasets, based on the Peak clustering method. We then propose a novel boosting-based algorithm for learning from imbalanced datasets, named OUBoost, which combines the proposed Peak under-sampling algorithm and an over-sampling technique (SMOTE) in the boosting procedure. In the proposed OUBoost algorithm, misclassified examples are not given equal weights. OUBoost selects useful examples from the majority class and creates synthetic examples for the minority class. In effect, it indirectly updates the weights of samples. We designed experiments using several evaluation metrics, such as Recall, MCC, Gmean, and F-score, on 30 real-world imbalanced datasets. The results show improved prediction performance in the minority class on most of the datasets used with OUBoost. We further report time comparisons and statistical tests to analyze our proposed algorithm in more detail.
Journal Article
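The evaluation metrics listed in the OUBoost abstract above (Recall, MCC, Gmean, F-score) can all be computed from a binary confusion matrix. The sketch below uses a hypothetical confusion matrix, not results from the paper. G-mean is taken here as the square root of recall times specificity, a common definition that the abstract does not spell out. The function has no zero-division guards, so it assumes every count combination in a denominator is non-zero.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Minority-class metrics from a binary confusion matrix (minority = positive)."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "recall": recall,
        "f_score": 2 * precision * recall / (precision + recall),
        "g_mean": math.sqrt(recall * specificity),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Hypothetical test set: 10 minority and 90 majority samples.
metrics = imbalance_metrics(tp=5, fn=5, fp=5, tn=85)
```

In this example the accuracy is 0.90 while minority recall is only 0.50 and MCC is about 0.44, which shows why accuracy alone can hide poor minority-class performance.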