Catalogue Search | MBRL
42,558 result(s) for "Data sampling"
Hyperparameter Optimization and Combined Data Sampling Techniques in Machine Learning for Customer Churn Prediction: A Comparative Analysis
2023
This paper explores the application of various machine learning techniques for predicting customer churn in the telecommunications sector. We utilized a publicly accessible dataset and implemented several models, including Artificial Neural Networks, Decision Trees, Support Vector Machines, Random Forests, Logistic Regression, and gradient boosting techniques (XGBoost, LightGBM, and CatBoost). To mitigate the challenges posed by imbalanced datasets, we adopted different data sampling strategies, namely SMOTE, SMOTE combined with Tomek Links, and SMOTE combined with Edited Nearest Neighbors. Moreover, hyperparameter tuning was employed to enhance model performance. Our evaluation employed standard metrics, such as Precision, Recall, F1-score, and the Receiver Operating Characteristic Area Under Curve (ROC AUC). In terms of the F1-score metric, CatBoost demonstrates superior performance compared to other machine learning models, achieving an outstanding 93% following the application of Optuna hyperparameter optimization. In the context of the ROC AUC metric, both XGBoost and CatBoost exhibit exceptional performance, recording remarkable scores of 91%. This achievement for XGBoost is attained after implementing a combination of SMOTE with Tomek Links, while CatBoost reaches this level of performance after the application of Optuna hyperparameter optimization.
Journal Article
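The abstract above relies on SMOTE to rebalance the churn classes. As a rough illustration of the core idea only, the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The function name `smote_sketch`, its parameters, and the toy data are assumptions made for this example; this is not the paper's pipeline, and it does not include the Tomek Links or Edited Nearest Neighbors cleaning steps.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Hypothetical helper: draw n_new synthetic minority samples by
    interpolating between a sample and one of its k nearest minority
    neighbours. Needs at least two minority samples."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    # Pick a base sample and one of its neighbours for each new point.
    base = rng.integers(0, n, size=n_new)
    nb = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the new point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Toy minority class (made-up data for illustration).
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote_sketch(X_min, n_new=10, k=3)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already occupied by the minority class.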
Unrestricted mixed data sampling (MIDAS): MIDAS regressions with unrestricted lag polynomials
by Foroni, Claudia; Schumacher, Christian; Marcellino, Massimiliano
in Aggregation, Data sampling, Distributed lag polynomials
2015
Mixed data sampling (MIDAS) regressions allow us to estimate dynamic equations that explain a low frequency variable by high frequency variables and their lags. When the difference in sampling frequencies between the regressand and the regressors is large, distributed lag functions are typically employed to model dynamics avoiding parameter proliferation. In macroeconomic applications, however, differences in sampling frequencies are often small. In such a case, it might not be necessary to employ distributed lag functions. We discuss the pros and cons of unrestricted lag polynomials in MIDAS regressions. We derive unrestricted-MIDAS (U-MIDAS) regressions from linear high frequency models, discuss identification issues and show that their parameters can be estimated by ordinary least squares. In Monte Carlo experiments, we compare U-MIDAS with MIDAS with functional distributed lags estimated by non-linear least squares. We show that U-MIDAS performs better than MIDAS for small differences in sampling frequencies. However, with large differing sampling frequencies, distributed lag functions outperform unrestricted polynomials. The good performance of U-MIDAS for small differences in frequency is confirmed in empirical applications on nowcasting and short-term forecasting euro area and US gross domestic product growth by using monthly indicators.
Journal Article
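The abstract above states that U-MIDAS parameters can be estimated by ordinary least squares. A minimal sketch of that idea, on simulated data, is to give each high-frequency lag its own unrestricted coefficient and fit the stacked regression with OLS. The helper `umidas_design`, the lag count, the coefficient values, and the simulated series are all assumptions for illustration, not the authors' specification.

```python
import numpy as np

def umidas_design(x_high, m, n_lags):
    """Hypothetical helper: for each low-frequency period, stack the n_lags
    most recent high-frequency observations (m per period), plus an intercept.
    Early periods without a full lag window are dropped."""
    x_high = np.asarray(x_high, dtype=float)
    n_low = len(x_high) // m
    first = -(-n_lags // m)  # ceil(n_lags / m): first period with a full window
    lags = np.arange(n_lags)
    # Period t (1-based) ends at high-frequency index m*t - 1.
    rows = [x_high[m * t - 1 - lags] for t in range(first, n_low + 1)]
    return np.column_stack([np.ones(len(rows)), np.array(rows)])

# Simulated example: 200 quarters, 3 months per quarter, 6 monthly lags.
rng = np.random.default_rng(0)
m, n_lags, n_low = 3, 6, 200
x_monthly = rng.normal(size=m * n_low)
X = umidas_design(x_monthly, m, n_lags)

# Made-up intercept and lag coefficients used to generate the quarterly series.
beta_true = np.array([1.0, 0.5, 0.3, 0.2, 0.1, -0.1, 0.05])
y_quarterly = X @ beta_true + 0.01 * rng.normal(size=len(X))

# Unrestricted lag coefficients estimated by ordinary least squares.
beta_hat, *_ = np.linalg.lstsq(X, y_quarterly, rcond=None)
```

With a small frequency mismatch such as monthly to quarterly, the number of unrestricted coefficients stays modest, which is the setting where the abstract reports U-MIDAS doing well.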
Impact of Data Processing Techniques on AI Models for Attack-Based Imbalanced and Encrypted Traffic within IoT Environments
by Kim, Hwankuk; Won, Chaeeun; Kim, Yeasul
in Artificial intelligence, Data processing, Data sampling
2026
With the increasing emphasis on personal information protection, encryption through security protocols has emerged as a critical requirement in data transmission and reception processes. Nevertheless, IoT ecosystems comprise heterogeneous networks where outdated systems coexist with the latest devices, spanning a range of devices from non-encrypted ones to fully encrypted ones. Given the limited visibility into payloads in this context, this study investigates AI-based attack detection methods that leverage encrypted traffic metadata, eliminating the need for decryption and minimizing system performance degradation, especially in light of these heterogeneous devices. Using the UNSW-NB15 and CICIoT-2023 datasets, encrypted and unencrypted traffic were categorized according to security protocol, and AI-based intrusion detection experiments were conducted for each traffic type based on metadata. To mitigate the problem of class imbalance, eight different data sampling techniques were applied. The effectiveness of these sampling techniques was then comparatively analyzed using two ensemble models and three Deep Learning (DL) models from various perspectives. The experimental results confirmed that metadata-based attack detection is feasible using only encrypted traffic. In the UNSW-NB15 dataset, the f1-score of encrypted traffic was approximately 0.98, which is 4.3% higher than that of unencrypted traffic (approximately 0.94). In addition, analysis of the encrypted traffic in the CICIoT-2023 dataset using the same method showed a significantly lower f1-score of roughly 0.43, indicating that the quality of the dataset and the preprocessing approach have a substantial impact on detection performance. Furthermore, when data sampling techniques were applied to encrypted traffic, the recall in the UNSW-NB15 (Encrypted) dataset improved by up to 23.0%, and in the CICIoT-2023 (Encrypted) dataset by 20.26%, showing a similar level of improvement. Notably, in CICIoT-2023, f1-score and Receiver Operating Characteristic-Area Under the Curve (ROC-AUC) increased by 59.0% and 55.94%, respectively. These results suggest that data sampling can have a positive effect even in encrypted environments. However, the extent of the improvement may vary depending on data quality, model architecture, and sampling strategy.
Journal Article
Class-Difficulty Based Methods for Long-Tailed Visual Recognition
by Ohashi, Hiroki; Nakamura, Katsuyuki; Sinha, Saptarshi
in Artificial neural networks, Data sampling, Datasets
2022
Long-tailed datasets are very frequently encountered in real-world use cases, where a few classes or categories (known as majority or head classes) have a higher number of data samples than the other classes (known as minority or tail classes). Training deep neural networks on such datasets gives results biased towards the head classes. So far, researchers have come up with multiple weighted loss and data re-sampling techniques in efforts to reduce the bias. However, most such techniques assume that the tail classes are always the most difficult classes to learn and therefore need more weight or attention. Here, we argue that this assumption might not always hold true. Therefore, we propose a novel approach to dynamically measure the instantaneous difficulty of each class during the training phase of the model. Further, we use the difficulty measures of each class to design a novel weighted loss technique called ‘class-wise difficulty based weighted (CDB-W) loss’ and a novel data sampling technique called ‘class-wise difficulty based sampling (CDB-S)’. To verify the wide-scale usability of our CDB methods, we conducted extensive experiments on multiple tasks such as image classification, object detection, instance segmentation and video-action classification. Results verified that CDB-W loss and CDB-S could achieve state-of-the-art results on many class-imbalanced datasets, such as ImageNet-LT, LVIS and EGTEA, which resemble real-world use cases.
Journal Article
An Efficient and Accurate Convolution-Based Similarity Measure for Uncertain Trajectories
2023
With the rapid development of localization techniques and the prevalence of mobile devices, massive amounts of trajectory data have been generated, playing essential roles in areas of user analytics, smart transportation, and public safety. Measuring trajectory similarity is one of the fundamental tasks in trajectory analytics. Although considerable research has been conducted on trajectory similarity, the majority of existing approaches measure the similarity between two trajectories by calculating the distance between aligned locations, leading to challenges related to uncertain trajectories (e.g., low and heterogeneous data sampling rates, as well as location noise). To address these challenges, we propose Contra, a convolution-based similarity measure designed specifically for uncertain trajectories. The main focus of Contra is to identify the similarity of trajectory shapes while disregarding the time/order relevance of each record within the trajectory. To this end, it leverages a series of convolution and pooling operations to extract high-level geo-information from trajectories, and subsequently compares their similarities based on these extracted features. Moreover, we introduce efficient trajectory index strategies to enhance the computational efficiency of our proposed measure. We conduct comprehensive experiments on two trajectory datasets to evaluate the performance of our proposed approach. The experiments on both datasets show the effectiveness and efficiency of our approach. Specifically, the mean rank of Contra is 3 times better than the state-of-the-art approaches, and the precision of Contra surpasses baseline approaches by 20–40%.
Journal Article
Association between ambient air pollutants and upper respiratory tract infection and pneumonia disease burden in Thailand from 2000 to 2022: a high frequency ecological analysis
by Koo, Joel Ruihan; Janhavi, A.; Lim, Jue Tao
in Aerosols, Air Pollutants - adverse effects, Air Pollutants - analysis
2023
Background
A pertinent risk factor of upper respiratory tract infections (URTIs) and pneumonia is the exposure to major ambient air pollutants, with short term exposures to different air pollutants being shown to exacerbate several respiratory conditions.
Methods
Here, using disease surveillance data comprising reported disease case counts at the province level, together with high-frequency ambient air pollutant and climate data in Thailand, we delineated the association between ambient air pollution and URTI/pneumonia burden in Thailand from 2000 to 2022. We developed mixed-data sampling methods and estimation strategies to account for the high-frequency nature of ambient air pollutant concentration data. These were used to evaluate the effects of past concentrations of fine particulate matter (PM2.5), sulphur dioxide (SO2), and carbon monoxide (CO) on disease case counts, after controlling for confounding meteorological and disease factors.
Results
Across provinces, we found that past increases in CO, SO2, and PM2.5 concentrations were associated with changes in URTI and pneumonia case counts, but the direction of the association was mixed. The contributive burden of past ambient air pollutants on contemporaneous disease burden was also found to be larger than that of meteorological factors, and comparable to that of disease-related factors.
Conclusions
By developing a novel statistical methodology, we prevented subjective variable selection and discretization bias to detect associations, and provided a robust estimate on the effect of ambient air pollutants on URTI and pneumonia burden over a large spatial scale.
Journal Article
Forecasting carbon dioxide emissions based on a hybrid of mixed data sampling regression model and back propagation neural network in the USA
by Calin, Adrian Cantemir; Han, Meng; Zhao, Xin
in Algorithms, Aquatic Pollution, Atmospheric Protection/Air Quality Control/Air Pollution
2018
The accurate forecast of carbon dioxide emissions is critical for policy makers to take proper measures to establish a low carbon society. This paper discusses a hybrid of the mixed data sampling (MIDAS) regression model and BP (back propagation) neural network (MIDAS-BP model) to forecast carbon dioxide emissions. The analysis uses mixed frequency data to study the effects of quarterly economic growth on annual carbon dioxide emissions. The forecasting ability of MIDAS-BP is remarkably better than that of MIDAS, ordinary least squares (OLS), polynomial distributed lags (PDL), autoregressive distributed lags (ADL), and autoregressive moving average (ARMA) models. The MIDAS-BP model is suitable for forecasting carbon dioxide emissions for both the short and longer term. This research is expected to influence the methodology for forecasting carbon dioxide emissions by improving the forecast accuracy. Empirical results show that economic growth has both negative and positive effects on carbon dioxide emissions that last 15 quarters. Carbon dioxide emissions are also affected by their own change within 3 years. Therefore, there is a need for policy makers to explore an alternative way to develop the economy, especially applying new energy policies to establish a low carbon society.
Journal Article
Impact of Data Corruption and Operating Temperature on Performance of Model-Based SoC Estimation
2024
Electric vehicles (EVs) are becoming popular around the world. Making a lithium battery (LIB) pack with a robust battery management system (BMS) for an EV to operate under different complex environments is both a challenge and a requirement for engineers. A BMS can intelligently manage LIB systems by estimating the battery state of charge (SoC). Due to the nonlinear characteristics of LIB, influenced by factors such as the harsh environment and data corruption caused by electromagnetic interference (EMI) inside electric vehicles, SoC estimation should consider available capacity, model parameters, operating temperature and reductions in data sampling time. The widely used model-based algorithms, such as the extended Kalman filter (EKF) have limitations. Therefore, a detailed review of the balance between temperature, data sampling time, and different model-based algorithms is necessary. Firstly, a state of charge—open-circuit voltage (SoC-OCV) curve of LIB is obtained by the polynomial curve fitting (PCF) method. Secondly, a first-order RC (1-RC) equivalent circuit model (ECM) is applied to identify the battery parameters using a forgetting factor-based recursive least squares algorithm (FF-RLS), ensuring accurate internal battery parameters for the next step of SoC estimation. Thirdly, different model-based algorithms are utilized to estimate the SoC of LIB under various operating temperatures and data sampling times. Finally, the experimental data by dynamic stress test (DST) is collected at temperatures of 10 °C, 25 °C, and 40 °C, respectively, to verify and analyze the impact of operating temperature and data sampling time to provide a practical reference for the SoC estimation.
Journal Article
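The abstract above names a forgetting factor-based recursive least squares algorithm (FF-RLS) for identifying battery model parameters. The sketch below shows the generic FF-RLS update on a synthetic linear regression, as a rough illustration of how older samples are progressively down-weighted. The function name `ff_rls`, the forgetting factor, the initial covariance scale, and the synthetic data are assumptions; the regressors for the 1-RC equivalent circuit model are not given in the abstract and are not reproduced here.

```python
import numpy as np

def ff_rls(Phi, y, lam=0.98, delta=1e3):
    """Hypothetical helper: recursive least squares with forgetting factor lam.
    Phi has one regressor row per time step; y holds the measured outputs."""
    n_params = Phi.shape[1]
    theta = np.zeros(n_params)       # parameter estimate
    P = delta * np.eye(n_params)     # large initial covariance: weak prior
    for phi, y_k in zip(Phi, y):
        # Gain: how strongly the new prediction error corrects the estimate.
        gain = P @ phi / (lam + phi @ P @ phi)
        theta = theta + gain * (y_k - phi @ theta)
        # Dividing by lam inflates the covariance, discounting old data.
        P = (P - np.outer(gain, phi) @ P) / lam
    return theta

# Synthetic regression with made-up parameters (not battery data).
rng = np.random.default_rng(0)
theta_true = np.array([0.8, -0.3, 1.5])
Phi = rng.normal(size=(500, 3))
y = Phi @ theta_true + 0.01 * rng.normal(size=500)
theta_hat = ff_rls(Phi, y)
```

A forgetting factor closer to 1 averages over more history and gives smoother estimates; a smaller value tracks parameter drift faster at the cost of more noise.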
SORAG: Synthetic Data Over-Sampling Strategy on Multi-Label Graphs
2022
In many real-world networks of interest in the field of remote sensing (e.g., public transport networks), nodes are associated with multiple labels, and node classes are imbalanced; that is, some classes have significantly fewer samples than others. However, the research problem of imbalanced multi-label graph node classification remains unexplored. This non-trivial task challenges the existing graph neural networks (GNNs) because the majority class can dominate the loss functions of GNNs and result in the overfitting of the majority class features and label correlations. On non-graph data, minority over-sampling methods (such as the synthetic minority over-sampling technique and its variants) have been demonstrated to be effective for the imbalanced data classification problem. This study proposes and validates a new hypothesis with unlabeled data over-sampling, which is meaningless for imbalanced non-graph data; however, feature propagation and topological interplay mechanisms between graph nodes can facilitate the representation learning of imbalanced graphs. Furthermore, we determine empirically that ensemble data synthesis through the creation of virtual minority samples in the central region of a minority and generation of virtual unlabeled samples in the boundary region between a minority and majority is the best practice for the imbalanced multi-label graph node classification task. Our proposed novel data over-sampling framework is evaluated using multiple real-world network datasets, and it outperforms diverse, strong benchmark models by a large margin.
Journal Article