41,760 result(s) for "Data sampling"
Hyperparameter Optimization and Combined Data Sampling Techniques in Machine Learning for Customer Churn Prediction: A Comparative Analysis
This paper explores the application of various machine learning techniques for predicting customer churn in the telecommunications sector. We utilized a publicly accessible dataset and implemented several models, including Artificial Neural Networks, Decision Trees, Support Vector Machines, Random Forests, Logistic Regression, and gradient boosting techniques (XGBoost, LightGBM, and CatBoost). To mitigate the challenges posed by imbalanced datasets, we adopted different data sampling strategies, namely SMOTE, SMOTE combined with Tomek Links, and SMOTE combined with Edited Nearest Neighbors. Moreover, hyperparameter tuning was employed to enhance model performance. Our evaluation employed standard metrics, such as Precision, Recall, F1-score, and the Receiver Operating Characteristic Area Under Curve (ROC AUC). In terms of the F1-score metric, CatBoost demonstrates superior performance compared to other machine learning models, achieving an outstanding 93% following the application of Optuna hyperparameter optimization. In the context of the ROC AUC metric, both XGBoost and CatBoost exhibit exceptional performance, recording remarkable scores of 91%. This achievement for XGBoost is attained after implementing a combination of SMOTE with Tomek Links, while CatBoost reaches this level of performance after the application of Optuna hyperparameter optimization.
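As a rough illustration of the SMOTE idea mentioned in the abstract above, the sketch below creates synthetic minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. It is a minimal textbook-style version written with NumPy only; the function name `smote` and its parameters are assumptions made for this example and are not taken from the paper, which does not describe its implementation.

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority samples X_min (needs >= 2 rows).

    Hypothetical helper for illustration; not the paper's code.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other samples

    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a base sample and one of its k nearest neighbours at random.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random position on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])
```

Because each synthetic point lies on a segment between two real minority samples, the output stays inside the bounding box of the minority class. The Tomek Links and Edited Nearest Neighbors cleaning steps named in the abstract are not covered here.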
Unrestricted mixed data sampling (MIDAS): MIDAS regressions with unrestricted lag polynomials
Mixed data sampling (MIDAS) regressions allow us to estimate dynamic equations that explain a low frequency variable by high frequency variables and their lags. When the difference in sampling frequencies between the regressand and the regressors is large, distributed lag functions are typically employed to model dynamics avoiding parameter proliferation. In macroeconomic applications, however, differences in sampling frequencies are often small. In such a case, it might not be necessary to employ distributed lag functions. We discuss the pros and cons of unrestricted lag polynomials in MIDAS regressions. We derive unrestricted-MIDAS (U-MIDAS) regressions from linear high frequency models, discuss identification issues and show that their parameters can be estimated by ordinary least squares. In Monte Carlo experiments, we compare U-MIDAS with MIDAS with functional distributed lags estimated by non-linear least squares. We show that U-MIDAS performs better than MIDAS for small differences in sampling frequencies. However, with large differing sampling frequencies, distributed lag functions outperform unrestricted polynomials. The good performance of U-MIDAS for small differences in frequency is confirmed in empirical applications on nowcasting and short-term forecasting euro area and US gross domestic product growth by using monthly indicators.
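The abstract above notes that U-MIDAS parameters can be estimated by ordinary least squares. A minimal sketch of that idea follows, assuming one high-frequency regressor observed `m` times per low-frequency period (for example, m = 3 for monthly data and a quarterly target). The helper names `umidas_design` and `fit_umidas` are hypothetical, and the sketch leaves out the autoregressive terms and lag selection that a full application would need.

```python
import numpy as np

def umidas_design(x_high, m, n_lags):
    """Stack the n_lags most recent high-frequency values for each low-frequency period."""
    x_high = np.asarray(x_high, dtype=float)
    n_low = len(x_high) // m
    rows = []
    for t in range(n_low):
        end = (t + 1) * m  # one past the last high-frequency observation in period t
        if end - n_lags < 0:
            continue  # not enough history yet for this period
        rows.append(x_high[end - n_lags:end][::-1])  # most recent value first
    return np.array(rows)

def fit_umidas(y_low, x_high, m, n_lags):
    """Estimate intercept and one free coefficient per high-frequency lag by OLS."""
    X = umidas_design(x_high, m, n_lags)
    y = np.asarray(y_low, dtype=float)[-len(X):]  # align with periods that have full history
    X = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # [intercept, lag 0, lag 1, ...]
```

Each lag gets its own coefficient, so the number of parameters grows with `n_lags`. This is the parameter proliferation that motivates distributed lag functions when the frequency gap is large.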
Class-Difficulty Based Methods for Long-Tailed Visual Recognition
Long-tailed datasets are frequently encountered in real-world use cases, where a few classes or categories (known as majority or head classes) have a higher number of data samples than the other classes (known as minority or tail classes). Training deep neural networks on such datasets gives results biased towards the head classes. So far, researchers have come up with multiple weighted loss and data re-sampling techniques in efforts to reduce the bias. However, most such techniques assume that the tail classes are always the most difficult classes to learn and therefore need more weight or attention. Here, we argue that this assumption might not always hold true. Therefore, we propose a novel approach to dynamically measure the instantaneous difficulty of each class during the training phase of the model. Further, we use the difficulty measures of each class to design a novel weighted loss technique called ‘class-wise difficulty based weighted (CDB-W) loss’ and a novel data sampling technique called ‘class-wise difficulty based sampling (CDB-S)’. To verify the wide-scale usability of our CDB methods, we conducted extensive experiments on multiple tasks such as image classification, object detection, instance segmentation and video-action classification. Results verified that CDB-W loss and CDB-S could achieve state-of-the-art results on many class-imbalanced datasets that resemble real-world use cases, such as ImageNet-LT, LVIS and EGTEA.
An Efficient and Accurate Convolution-Based Similarity Measure for Uncertain Trajectories
With the rapid development of localization techniques and the prevalence of mobile devices, massive amounts of trajectory data have been generated, playing essential roles in areas of user analytics, smart transportation, and public safety. Measuring trajectory similarity is one of the fundamental tasks in trajectory analytics. Although considerable research has been conducted on trajectory similarity, the majority of existing approaches measure the similarity between two trajectories by calculating the distance between aligned locations, leading to challenges related to uncertain trajectories (e.g., low and heterogeneous data sampling rates, as well as location noise). To address these challenges, we propose Contra, a convolution-based similarity measure designed specifically for uncertain trajectories. The main focus of Contra is to identify the similarity of trajectory shapes while disregarding the time/order relevance of each record within the trajectory. To this end, it leverages a series of convolution and pooling operations to extract high-level geo-information from trajectories, and subsequently compares their similarities based on these extracted features. Moreover, we introduce efficient trajectory index strategies to enhance the computational efficiency of our proposed measure. We conduct comprehensive experiments on two trajectory datasets to evaluate the performance of our proposed approach. The experiments on both datasets show the effectiveness and efficiency of our approach. Specifically, the mean rank of Contra is 3 times better than the state-of-the-art approaches, and the precision of Contra surpasses baseline approaches by 20–40%.
Association between ambient air pollutants and upper respiratory tract infection and pneumonia disease burden in Thailand from 2000 to 2022: a high frequency ecological analysis
Background: A pertinent risk factor for upper respiratory tract infections (URTIs) and pneumonia is exposure to major ambient air pollutants, with short-term exposures to different air pollutants shown to exacerbate several respiratory conditions. Methods: Using disease surveillance data comprising reported disease case counts at the province level, together with high-frequency ambient air pollutant and climate data, we delineated the association between ambient air pollution and URTI/pneumonia burden in Thailand from 2000 to 2022. We developed mixed-data sampling methods and estimation strategies to account for the high-frequency nature of ambient air pollutant concentration data. These were used to evaluate the effects of past concentrations of fine particulate matter (PM2.5), sulphur dioxide (SO2), and carbon monoxide (CO) on disease case counts, after controlling for confounding meteorological and disease factors. Results: Across provinces, we found that past increases in CO, SO2, and PM2.5 concentrations were associated with changes in URTI and pneumonia case counts, but the direction of the association was mixed. The contributive burden of past ambient air pollutants on contemporaneous disease burden was also found to be larger than that of meteorological factors, and comparable to that of disease-related factors. Conclusions: By developing a novel statistical methodology, we avoided subjective variable selection and discretization bias in detecting associations, and provided a robust estimate of the effect of ambient air pollutants on URTI and pneumonia burden over a large spatial scale.
Impact of Data Corruption and Operating Temperature on Performance of Model-Based SoC Estimation
Electric vehicles (EVs) are becoming popular around the world. Making a lithium battery (LIB) pack with a robust battery management system (BMS) for an EV to operate under different complex environments is both a challenge and a requirement for engineers. A BMS can intelligently manage LIB systems by estimating the battery state of charge (SoC). Due to the nonlinear characteristics of LIB, influenced by factors such as the harsh environment and data corruption caused by electromagnetic interference (EMI) inside electric vehicles, SoC estimation should consider available capacity, model parameters, operating temperature and reductions in data sampling time. The widely used model-based algorithms, such as the extended Kalman filter (EKF) have limitations. Therefore, a detailed review of the balance between temperature, data sampling time, and different model-based algorithms is necessary. Firstly, a state of charge—open-circuit voltage (SoC-OCV) curve of LIB is obtained by the polynomial curve fitting (PCF) method. Secondly, a first-order RC (1-RC) equivalent circuit model (ECM) is applied to identify the battery parameters using a forgetting factor-based recursive least squares algorithm (FF-RLS), ensuring accurate internal battery parameters for the next step of SoC estimation. Thirdly, different model-based algorithms are utilized to estimate the SoC of LIB under various operating temperatures and data sampling times. Finally, the experimental data by dynamic stress test (DST) is collected at temperatures of 10 °C, 25 °C, and 40 °C, respectively, to verify and analyze the impact of operating temperature and data sampling time to provide a practical reference for the SoC estimation.
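The parameter-identification step in the abstract above relies on recursive least squares with a forgetting factor (FF-RLS). The sketch below shows the generic textbook recursion for a linear-in-parameters model y = phi · theta. It is not the paper's implementation: the function name `ff_rls`, the default forgetting factor `lam`, and the initial covariance scale `delta` are assumptions, and the abstract does not say how the 1-RC model is mapped to regressors.

```python
import numpy as np

def ff_rls(Phi, y, lam=0.98, delta=1e3):
    """Recursive least squares with forgetting factor lam (0 < lam <= 1).

    Phi: (n_samples, n_params) regressor rows; y: (n_samples,) measurements.
    Hypothetical helper for illustration only.
    """
    Phi = np.asarray(Phi, dtype=float)
    y = np.asarray(y, dtype=float)
    n_params = Phi.shape[1]
    theta = np.zeros(n_params)      # parameter estimate
    P = delta * np.eye(n_params)    # large initial covariance = weak prior

    for phi, y_k in zip(Phi, y):
        err = y_k - phi @ theta                   # prediction error before update
        gain = P @ phi / (lam + phi @ P @ phi)    # gain vector
        theta = theta + gain * err                # correct the estimate
        P = (P - np.outer(gain, phi) @ P) / lam   # discount old information
    return theta
```

A forgetting factor below 1 down-weights older samples geometrically, which lets the estimate track parameters that drift, for instance with temperature. Setting `lam=1.0` recovers ordinary recursive least squares.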
Forecasting carbon dioxide emissions based on a hybrid of mixed data sampling regression model and back propagation neural network in the USA
The accurate forecast of carbon dioxide emissions is critical for policy makers to take proper measures to establish a low carbon society. This paper discusses a hybrid of the mixed data sampling (MIDAS) regression model and the back propagation (BP) neural network (the MIDAS-BP model) to forecast carbon dioxide emissions. The analysis uses mixed frequency data to study the effects of quarterly economic growth on annual carbon dioxide emissions. The forecasting ability of MIDAS-BP is remarkably better than that of MIDAS, ordinary least squares (OLS), polynomial distributed lags (PDL), autoregressive distributed lags (ADL), and auto-regressive moving average (ARMA) models. The MIDAS-BP model is suitable for forecasting carbon dioxide emissions for both the short and longer term. This research is expected to influence the methodology for forecasting carbon dioxide emissions by improving the forecast accuracy. Empirical results show that economic growth has both negative and positive effects on carbon dioxide emissions that last 15 quarters. Carbon dioxide emissions are also affected by their own change within 3 years. Therefore, there is a need for policy makers to explore an alternative way to develop the economy, especially applying new energy policies to establish a low carbon society.
Ensemble-Based Machine Learning Algorithms Combined with Near Miss Method for Software Bug Prediction
Software bug prediction (SBP) involves identifying or categorizing software modules likely to contain defects, utilizing underlying system properties such as software metrics. SBP plays a crucial role in enhancing software project quality and mitigating maintenance risks. Numerous machine learning (ML) algorithms have been developed to predict software bugs. Class imbalance poses a significant challenge for these algorithms, impeding their effectiveness and resulting in imbalanced false-positive and false-negative outcomes. However, limited research has been conducted to specifically tackle the issue of class imbalance in the context of SBP. This study investigates the prediction performance of homogeneous ensembles (bagging, boosting, and voting classifier (VC) methods) combined with under-sampling methods to address the class imbalance problem and improve the accuracy of SBP. Two ensembles are classified as bagging ensembles: decision tree (DT) and random forest (RF); two ensembles are classified as boosting ensembles: AdaBoost (AB) and gradient boosting (GB), while the DT, RF, K-Nearest Neighbours (K-NN), and support vector machine (SVM) are considered as VC. To establish the effectiveness of the proposed models, the experiments were conducted on the available benchmark datasets, which comprise five public datasets based on both class and file-level metrics. We compared and evaluated the performance of the proposed models according to several performance measures, namely accuracy, precision, recall, f-measure, Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic curve (AUROC). The experimental findings demonstrated that the proposed models exhibit superior efficiency in predicting software bugs on balanced datasets compared to the original datasets, with an improvement of up to 11% accuracy for the class-level metrics and 10% for the file-level metrics. The results indicate that the use of data sampling techniques had a positive impact on the prediction accuracy of the presented models. We compared our proposed method with existing SBP methods based on several standard performance measures. The comparison outcomes revealed a significant superiority of our method over the prevailing state-of-the-art SBP methods across most datasets.
Nowcasting Vietnam's Export Growth with Mixed Frequency Data
Purpose: The primary objective of this study is to investigate and employ a practical and meaningful nowcasting model to predict Vietnam's export growth based on factors of export supply and demand alongside relevant financial indicators. Theoretical Framework: This study employs the concepts and theories of nowcasting models with mixed frequency data to create the conceptual framework. Methodology: This study employs four commonly used models in nowcasting: the bridge equation model (BEQ), Bayesian VAR model (BVAR), mixed frequency vector autoregressive model (MFVAR), and mixed data sampling regression (MIDAS). Findings: According to the experimental findings, the mixed frequency data models outperformed the models utilizing the same frequency data in nowcasting Vietnam's export growth. Additionally, these models demonstrated effectiveness in instantaneous and short-term forecasting. MIDAS emerged as the most suitable choice for nowcasting Vietnam's export growth among the models examined. Implication of Research: Using mixed frequency data along with corresponding methods is a good way to nowcast. Originality/Value: This study used macroeconomic factors to nowcast export growth in Vietnam. It applied four different models: BEQ, BVAR, MFVAR, and MIDAS. The study reveals the roles of data and the potential nowcasting capability of the MIDAS model.