Catalogue Search | MBRL

Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy

by Varotto, Giulia , Susi, Gianluca , Panzica, Ferruccio in Classification , Convulsions & seizures , Datasets

2021

Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to address this problem, and a lot of work has been done to compare their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested across a wide variety of different datasets, without considering the performance on each specific dataset. In this study, we compare the performance of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of patients with focal epilepsies who underwent surgery. Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered. Results: Both types of resampling procedure showed improved performance with respect to the original dataset. The oversampling procedures were found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performance. All the undersampling approaches were more robust than the oversampling approaches across the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic resampling method.
Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
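As an illustration of the ideas named in this abstract, the following is a minimal sketch of random undersampling together with two of the listed metrics (Balanced Accuracy and Geometric Mean). It is not the authors' pipeline: the function names, the toy labels, and the fixed seed are assumptions made for the example, and the metrics assume a binary problem in which both classes are present.

```python
import math
import random


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    indices_by_class = {}
    for i, label in enumerate(y):
        indices_by_class.setdefault(label, []).append(i)
    n_min = min(len(ix) for ix in indices_by_class.values())
    kept = sorted(i for ix in indices_by_class.values()
                  for i in rng.sample(ix, n_min))
    return [X[i] for i in kept], [y[i] for i in kept]


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Recall on the positive class and recall on the negative class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


def balanced_accuracy(y_true, y_pred):
    sens, spec = sensitivity_specificity(y_true, y_pred)
    return (sens + spec) / 2


def geometric_mean(y_true, y_pred):
    sens, spec = sensitivity_specificity(y_true, y_pred)
    return math.sqrt(sens * spec)


# Hypothetical toy labels: 8 majority-class (0) and 2 minority-class (1) samples.
y = [0] * 8 + [1] * 2
X = list(range(10))
X_bal, y_bal = random_undersample(X, y)

# A degenerate classifier that always predicts the majority class.
always_majority = [0] * len(y)
print(balanced_accuracy(y, always_majority), geometric_mean(y, always_majority))
```

On the toy labels, the always-majority classifier would score 0.8 on plain accuracy, which is why imbalance-aware metrics such as these are preferred when classes are skewed.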

Journal Article

Improved Viola-Jones face detection algorithm based on HoloLens

by Shang, Yunyi , Huang, Jing , Chen, Hai in Algorithms , Business administration , Convolution

2019

Face detection in Microsoft HoloLens can currently be achieved only by remotely calling a face detection interface, which depends on the network, so detection is slow and fails to meet real-time demands. This paper proposes an improved Viola-Jones face detection algorithm for HoloLens. It upgrades the classical Viola-Jones algorithm by expanding the Haar-like rectangle features to enhance detection efficiency, and accelerates detection using two-dimensional convolution separation and an image re-sampling technique. The detection efficiency of the improved algorithm is on average 12% higher than that of the existing face detection interface, and its detection speed is increased fourfold. Moreover, the HoloLens depth camera enables 3D face detection and localization, and its gaze, voice, and gesture interaction techniques free the hands, allowing easier and less burdensome human-computer interaction. HoloLens equipped with the real-time video face detection algorithm detailed in this paper can be applied in fields such as social contact, public security, and business management.

Journal Article

Risk factor analysis of device-related infections: value of re-sampling method on the real-world imbalanced dataset

by Tan, Li-Zhuang , Feng, Xiang-Fei , Yang, Ling-Chao in Aged , Aged, 80 and over , Antibiotics

2019

Background: The incidence of cardiac implantable electronic device infection (CIEDI) is low, so CIEDI data typically form an imbalanced dataset. We sought to describe our experience with the management of the imbalanced CIEDI dataset. Methods: Databases from two centers of patients undergoing device implantation from 2001 to 2016 were reviewed retrospectively. A re-sampling technique was used to improve the classifier accuracy. Results: CIEDI was identified in 28 out of 4959 procedures (0.56%); a high imbalance existed in the sizes of the patient profiles. In univariate analyses, replacement procedure and male sex were significantly associated with an increase in CIEDI (53.6% vs. 23.4%, 0.8% vs. 0.3%, P < 0.01). Multivariate logistic regression analysis showed that gender (odds ratio, OR = 3.503), age (OR = 1.032), replacement procedure (OR = 3.503), and use of antibiotics (OR = 0.250) remained independent predictors of CIEDI (all P < 0.05) after adjustment for diabetes, post-operation fever, device style, and device company. There were 616 under-sampled cases and 123 over-sampled cases in the analyzed cohort after re-sampling. The re-sampling and bootstrap results were robust and largely similar to the results obtained before re-sampling, although use of antibiotics lost its predictive capacity for CIEDI after re-sampling (P > 0.05). Conclusion: The application of re-sampling techniques can generate useful synthetic samples for the classification of imbalanced data and improve the accuracy of predicting CIEDI. The peri-operative assessment should be intensified in male and aged patients as well as patients receiving replacement procedures for the risk of CIEDI.

Journal Article

Characterization of Non-Gaussian Geologic Facies Distribution Using Ensemble Kalman Filter with Probability Weighted Re-Sampling

by Leung, Juliana , Nejadi, Siavash , Trivedi, Japan in Boundaries , Chemistry and Earth Sciences , Computer Science

2015

The Ensemble Kalman Filter (EnKF) is a Monte Carlo-based technique for assisted history matching and real-time updating of reservoir models. However, it often fails to detect precise locations of distinct facies boundaries and their proportions, as the facies distributions are non-Gaussian, while geologic data for reservoir modeling is usually insufficient. In this paper, a new re-sampling step is introduced to the conventional EnKF formulation; after a certain number of assimilation steps, the updated ensemble is used to generate a new ensemble with a novel probability weighted re-sampling scheme. The new ensemble samples from a probability density function that is conditional on both the geological information and the early production data. After the re-sampling step, the forecast model is applied to the new ensemble from the beginning up to the last update step (without any intermediate Kalman updates). Full EnKF is again applied to the ensemble members to assimilate the remaining production history. The combination of EnKF and regenerating new members using the re-sampling method demonstrates reasonable improvement and reduction of uncertainty in history matching of reservoir models with multiple facies. The histogram and the experimental variogram of the updated ensemble members are more consistent with the static geologic information. Moreover, the technique helps maintain ensemble variance, which is essential for uncertainty estimation in the posterior probability distribution of facies proportions.

Journal Article

Inference for current leukemia free survival

by Klein, John P. , Logan, Brent , Liu, Leiyan in Blood & organ donations , Bone marrow , Chemotherapy

2008

Donor lymphocyte infusion (DLI) for patients who relapse following an allogeneic stem cell transplant has proved remarkably durable. Because of the potential for second remissions with DLI, the current leukemia free survival (CLFS), which is the probability that a patient has not failed the entire course of the treatment, is becoming of interest to clinical investigators. Based on either a multistate Markov model or a linear combination of Kaplan-Meier estimators, we explore regression models for the CLFS. We focus on the two sample problem and we develop confidence bands for the CLFS or for differences in CLFS as well as a Kolmogorov type hypothesis test using a re-sampling technique. We also examine the use of pseudo-values to make inference on the direct effects of covariates on the CLFS function and we develop a score test for the equality of two CLFS. We illustrate these inference methods on a bone marrow transplant dataset.

Journal Article

A review on over-sampling techniques in classification of multi-class imbalanced datasets: insights for medical problems

by Aickelin, Uwe , Khorshidi, Hadi Akbarzadeh , Yang, Yuxuan in Algorithms , Cancer , Cardiovascular disease

2024

There has been growing attention to multi-class classification problems, particularly those challenged by imbalanced class distributions. To address these challenges, various strategies, including data-level re-sampling treatment and ensemble methods, have been introduced to bolster the performance of predictive models and Artificial Intelligence (AI) algorithms in scenarios where an excessive level of imbalance is present. While most research and algorithm development have focused on binary classification problems, there is increased interest in health informatics in addressing the problem of multi-class classification in imbalanced datasets. Multi-class imbalance problems bring forth more complex challenges, as a delicate approach is required to generate synthetic data and simultaneously maintain the relationship between the multiple classes. The aim of this review paper is to examine over-sampling methods tailored for medical and other datasets with multi-class imbalance. Out of 2,076 peer-reviewed papers identified through searches, 197 eligible papers were chosen and thoroughly reviewed for inclusion, narrowing to 37 studies selected for in-depth analysis. These studies are grouped into four categories: metric, adaptive, structure-based, and hybrid approaches. The most significant finding is the emerging trend toward hybrid resampling methods that combine the strengths of various techniques to effectively address the problem of imbalanced data. This paper provides an extensive analysis of each selected study, discusses their findings, and outlines directions for future research.

Journal Article

Modeling mobile apps user behavior using Bayesian networks

by Dharmasena, Isuru , Domaratzki, Mike , Muthukumarana, Saman in Algorithms , Applications programs , Bayesian analysis

2021

Modern app-based businesses are increasingly interested in data-driven decision making to achieve business goals as well as to retain their customer base. In this paper, we propose a Bayesian network approach to assess mobile app user behavior. We propose a strategy to build Bayesian networks and further improve the causal networks using re-sampling methods to best represent the causal relationship between app user retention and in-app features. Structural Hamming distances (SHD) are then used for assessing similar Bayesian network structures learned using data available from a local mobile app development company. We also conduct a simulation study to assess the effect of re-sampling techniques on Bayesian network performance with various learning algorithms.
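The structural Hamming distance mentioned in this abstract can be sketched as follows. This is a simplified variant that compares two directed graphs edge by edge and counts a reversed edge once; other definitions count a reversal twice or compare equivalence classes instead, and the abstract does not say which variant the authors used. The function name and the node names are hypothetical.

```python
def structural_hamming_distance(edges_a, edges_b):
    """Number of node pairs whose edge differs between two directed graphs.

    An edge that is missing, extra, or reversed each counts as one
    difference. Graphs are given as collections of (parent, child) tuples
    and are assumed to contain no self-loops.
    """
    a, b = set(edges_a), set(edges_b)
    distance = 0
    # Visit every unordered node pair that is connected in either graph.
    for pair in {frozenset(edge) for edge in a | b}:
        u, v = tuple(pair)
        both_directions = {(u, v), (v, u)}
        if both_directions & a != both_directions & b:
            distance += 1
    return distance


# Hypothetical network structures over three illustrative variables.
learned = {("feature_use", "sessions"), ("sessions", "retention")}
reference = {("sessions", "feature_use"),      # reversed edge
             ("sessions", "retention"),        # identical edge
             ("feature_use", "retention")}     # edge missing from `learned`

print(structural_hamming_distance(learned, reference))
```

A distance of zero means the two structures are identical under this definition; larger values indicate more edge-level disagreements.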

Journal Article

Biogeography-based optimization in noisy environments

by Chen, Zixiang , Simon, Dan , Ma, Haiping in Benchmarking , Biogeography , Evolution

2015

Biogeography-based optimization (BBO) is a new evolutionary optimization algorithm that is based on the science of biogeography. In this paper, BBO is applied to the optimization of problems in which the fitness function is corrupted by random noise. Noise interferes with the BBO immigration rate and emigration rate, and adversely affects optimization performance. We analyse the effect of noise on BBO using a Markov model. We also incorporate re-sampling in BBO, which samples the fitness of each candidate solution several times and calculates the average to alleviate the effects of noise. BBO performance on noisy benchmark functions is compared with particle swarm optimization (PSO), differential evolution (DE), self-adaptive DE (SaDE) and PSO with constriction (CPSO). The results show that SaDE performs best and BBO performs second best. In addition, BBO with re-sampling is compared with Kalman filter-based BBO (KBBO). The results show that BBO with re-sampling achieves almost the same performance as KBBO but consumes less computational time.
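The re-sampling step described in this abstract (evaluating the noisy fitness of a candidate several times and averaging) can be sketched as below. The sphere benchmark, the additive Gaussian noise model, and all function and parameter names are assumptions for illustration; the sketch covers only the fitness averaging, not BBO migration itself.

```python
import random


def sphere(x):
    """Noise-free benchmark fitness (assumed here for illustration)."""
    return sum(v * v for v in x)


def noisy_fitness(x, rng, sigma=1.0):
    """One fitness evaluation corrupted by additive Gaussian noise."""
    return sphere(x) + rng.gauss(0.0, sigma)


def resampled_fitness(x, rng, n_samples=10, sigma=1.0):
    """Average several noisy evaluations of the same candidate."""
    total = sum(noisy_fitness(x, rng, sigma) for _ in range(n_samples))
    return total / n_samples


rng = random.Random(0)
candidate = [1.0, 2.0]  # true fitness is 5.0
single = noisy_fitness(candidate, rng)
averaged = resampled_fitness(candidate, rng, n_samples=100)
print(single, averaged)
```

For independent noise, averaging n evaluations reduces the standard deviation of the fitness estimate by a factor of the square root of n, at the cost of n times as many fitness evaluations per candidate.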

Journal Article

The interpolation accuracy for seven soil properties at various sampling scales on the Loess Plateau, China

by Shao, Mingan , Gao, Lei in Coefficient of variation , Earth and Environmental Science , Environment

2012

Purpose: Knowledge of the changes in interpolation accuracy with changing sampling scales is important when designing an appropriate sampling strategy. The objectives of this study were (1) to analyze the changes in interpolation accuracy with changing sampling scales for seven soil properties and (2) to find a suitable index that could predict the interpolation accuracy well. Materials and methods: Nine hundred sixty-one samples were collected from a 30 × 30-m area. Seven soil properties were measured for each sample. Using a re-sampling analysis method, we grouped the samples under 16 subscales. Then, we divided the 16 subscales into two subsets, the first consisting of eight scales used as training sets and the second having the other eight scales as validation sets. Using the training sets, the interpolation accuracy and the contribution rate (CR) for the seven soil properties were compared, and the relations of the interpolation accuracy to the coefficient of variation (CV), to the ratio of sampling spacing to correlated range (S/R), and to the extent and spacing (E & S) were determined; the prediction accuracy of these relations was then tested using the validation sets. Results and discussion: The results showed that the mean interpolation accuracies varied greatly for different soil properties, with mean G values of training sets ranging from 2.4% for soil organic carbon to 62.1% for sand content. With increasing sampling spacing or decreasing sampling extent, the interpolation accuracy decreased for all soil properties. The scales with the largest CR were not consistent with those with the highest interpolation accuracies. The interpolation accuracy was predicted better by E & S than by CV or by S/R.
Conclusions: The measurement and analysis gave insight into the changes of interpolation accuracy and CR at various sampling scales. Predicting interpolation accuracy based on the scale parameters of sampling spacing and sampling extent was feasible, which provided a useful means by which to determine appropriate sample size and sampling strategy.

Journal Article

Nonlinear process modeling of fructosyltransferase (FTase) using bootstrap re-sampling neural network model

by Ahmad, Zainal , Mat Noor, Rabiatul Adawiah , Mat Don, Mashitah in Biological and medical sciences , Biotechnology , Bootstrap method

2010

Recently, the increased demand for fructooligosaccharides (FOS) as a functional food has prompted researchers to screen and identify new strains capable of producing fructosyltransferase (FTase). FTase is the enzyme that converts the substrate (sucrose) to glucose and fructose. Complex sugars such as table sugar, brown sugar, and molasses will be characterized, and the sugar that contains the highest sucrose concentration will be selected as a substrate. Eight species of macro-fungi will be screened for their ability to produce FTase, and only the strain with the highest FTase activity will be selected for further studies. In this work, neural networks (NN) have been chosen to model the process based on their excellent ‘resume’ in coping with nonlinear processes. The bootstrap re-sampling method has been used to re-sample the data in this work. This method has successfully modeled the process, as shown in the results.
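The bootstrap re-sampling named in this abstract draws new datasets of the same size from the original data with replacement. The sketch below shows that step and, as one common use, a percentile interval for the mean. It is illustrative only: the abstract does not describe how the re-sampled sets were fed to the neural networks, and the function names and the measurement values are made up for the example.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


def bootstrap_mean_interval(data, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval for the mean from n_boot bootstrap replicates."""
    rng = random.Random(seed)
    means = sorted(
        sum(sample) / len(sample)
        for sample in (bootstrap_sample(data, rng) for _ in range(n_boot))
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical measurements, not taken from the paper.
activity = [12.1, 13.4, 11.8, 14.0, 12.9, 13.1, 12.5, 13.8]
print(bootstrap_mean_interval(activity))
```

In a modeling setting, each bootstrap replicate could be used to train one model, with predictions combined across models; whether the paper does this is not stated in the abstract.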

Journal Article
