Catalogue Search | MBRL

The performance of prognostic models depended on the choice of missing value imputation algorithm: a simulation study

by Deforth, Manja , Heinze, Georg , Held, Ulrike in Accuracy , Algorithms , Amputation

2024

The development of clinical prediction models is often impeded by the occurrence of missing values in the predictors. Various methods for imputing missing values before modeling have been proposed. Some of them are based on variants of multiple imputations by chained equations, while others are based on single imputation. These methods may include elements of flexible modeling or machine learning algorithms, and for some of them user-friendly software packages are available. The aim of this study was to investigate by simulation if some of these methods consistently outperform others in performance measures of clinical prediction models. We simulated development and validation cohorts by mimicking observed distributions of predictors and outcome variable of a real data set. In the development cohorts, missing predictor values were created in 36 scenarios defined by the missingness mechanism and proportion of noncomplete cases. We applied three imputation algorithms that were available in R software (R Foundation for Statistical Computing, Vienna, Austria): mice, aregImpute, and missForest. These algorithms differed in their use of linear or flexible models, or random forests, the way of sampling from the predictive posterior distribution, and the generation of a single or multiple imputed data set. For multiple imputation, we also investigated the impact of the number of imputations. Logistic regression models were fitted with the simulated development cohorts before (full data analysis) and after missing value generation (complete case analysis), and with the imputed data. Prognostic model performance was measured by the scaled Brier score, c-statistic, calibration intercept and slope, and by the mean absolute prediction error evaluated in validation cohorts without missing values. Performance of full data analysis was considered as ideal. None of the imputation methods achieved the model's predictive accuracy that would be obtained in case of no missingness. 
In general, complete case analysis yielded the worst performance, and deviation from ideal performance increased with increasing percentage of missingness and decreasing sample size. Across all scenarios and performance measures, aregImpute and mice, both with 100 imputations, resulted in highest predictive accuracy. Surprisingly, aregImpute outperformed full data analysis in achieving calibration slopes very close to one across all scenarios and outcome models. The increase of mice's performance with 100 compared to five imputations was only marginal. The differences between the imputation methods decreased with increasing sample sizes and decreasing proportion of noncomplete cases. In our simulation study, model calibration was more affected by the choice of the imputation method than model discrimination. While differences in model performance after using imputation methods were generally small, multiple imputation methods such as mice and aregImpute that can handle linear or nonlinear associations between predictors and outcome are an attractive and reliable choice in most situations.
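Two of the performance measures named in this abstract, the scaled Brier score and the c-statistic, can be illustrated with a minimal sketch. It assumes the common textbook definitions (a reference model that predicts the observed prevalence for everyone, and ties counted as one half); the function names and the toy data are hypothetical and are not taken from the study.

```python
from typing import Sequence


def brier_score(y: Sequence[int], p: Sequence[float]) -> float:
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def scaled_brier_score(y: Sequence[int], p: Sequence[float]) -> float:
    """1 - Brier / Brier_ref; the reference model predicts the prevalence (assumption)."""
    prevalence = sum(y) / len(y)
    reference = brier_score(y, [prevalence] * len(y))
    return 1.0 - brier_score(y, p) / reference


def c_statistic(y: Sequence[int], p: Sequence[float]) -> float:
    """Share of event/non-event pairs ranked correctly; ties count one half."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    non_events = [pi for yi, pi in zip(y, p) if yi == 0]
    concordant = 0.0
    for pe in events:
        for pn in non_events:
            if pe > pn:
                concordant += 1.0
            elif pe == pn:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Made-up validation data: observed outcomes and predicted risks.
y_obs = [0, 0, 1, 1]
p_hat = [0.1, 0.4, 0.35, 0.8]
print(round(scaled_brier_score(y_obs, p_hat), 4), c_statistic(y_obs, p_hat))
```

A scaled Brier score of 1 would indicate perfect predictions and 0 a model no better than predicting the prevalence, while a c-statistic of 0.5 corresponds to random ranking.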

Journal Article

Imputing Missing Data in Hourly Traffic Counts

by Shafique, Muhammad Awais in AADT , Accuracy , Australia

2022

Hourly traffic volumes, collected by automatic traffic recorders (ATRs), are of paramount importance since they are used to calculate average annual daily traffic (AADT) and design hourly volume (DHV). Hence, it is necessary to ensure the quality of the collected data. Unfortunately, ATRs malfunction occasionally, resulting in missing data, as well as unreliable counts. This naturally has an impact on the accuracy of the key parameters derived from the hourly counts. This study aims to solve this problem. ATR data from New South Wales, Australia was screened for irregularities and invalid entries. A total of 25% of the reliable data was randomly selected to test thirteen different imputation methods. Two scenarios for data omission, i.e., 25% and 100%, were analyzed. Results indicated that missForest outperformed other imputation methods; hence, it was used to impute the actual missing data to complete the dataset. AADT values were calculated from both original counts before imputation and completed counts after imputation. AADT values from imputed data were slightly higher. The average daily volumes when plotted validated the quality of imputed data, as the annual trends demonstrated a relatively better fit.

Journal Article

Research on Oil Well Production Prediction Based on GRU-KAN Model Optimized by PSO

by Qiu, Bo , Yang, Yun , Zhou, Zhongyi in Accuracy , Algorithms , Artificial intelligence

2024

Accurately predicting oil well production volume is of great significance in oilfield production. To overcome the shortcomings of current oil well production prediction studies, we propose a hybrid model (GRU-KAN) that combines the gated recurrent unit (GRU) and the Kolmogorov–Arnold network (KAN). The GRU-KAN model utilizes GRU to extract temporal features and KAN to capture complex nonlinear relationships. First, the MissForest algorithm is employed to handle anomalous data, improving data quality. The Pearson correlation coefficient is used to select the most significant features. These selected features are used as input to the GRU-KAN model to establish the oil well production prediction model. Then, the Particle Swarm Optimization (PSO) algorithm is used to enhance the predictive performance. Finally, the model is evaluated on the test set. The validity of the model was verified on two oil wells, and the results on well F14 show that the proposed GRU-KAN model achieves Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and Coefficient of Determination (R²) values of 11.90, 9.18, 6.0% and 0.95, respectively. Compared to popular single and hybrid models, the GRU-KAN model achieves higher production-prediction accuracy and higher computational efficiency. The model can be applied to the formulation of oilfield-development plans, which is of great theoretical and practical significance to the advancement of oilfield technology levels.
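The four error measures reported in this abstract can be sketched as follows. This is only an illustration under the standard definitions; the function name and the production figures are hypothetical and unrelated to well F14.

```python
import math
from typing import Dict, Sequence


def regression_metrics(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Return RMSE, MAE, MAPE (in percent) and R^2 for paired observations."""
    n = len(actual)
    errors = [a - f for a, f in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # MAPE is undefined when an actual value is zero; not handled here.
        "mape": 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Made-up daily production values and forecasts.
observed = [100.0, 200.0, 300.0, 400.0]
forecast = [110.0, 190.0, 310.0, 390.0]
metrics = regression_metrics(observed, forecast)
print({name: round(value, 3) for name, value in metrics.items()})
```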

Journal Article

Imputation of GPS Coordinate Time Series Using missForest

by Zhang, Shengkai , Lei, Jintao , Zeng, Qi in Algorithms , Geodesy , Geophysical studies

2021

The global positioning system (GPS) can provide daily coordinate time series to support geodesy and geophysical studies. However, due to logistics and malfunctioning, missing values are often seen in GPS time series, especially in polar regions. Acquiring a consistent and complete time series is the prerequisite for accurate and reliable statistical analysis. Previous imputation studies focused on the temporal relationship of time series, and only a few studies used spatial relationships and/or were based on machine learning methods. In this study, we impute 20 Greenland GPS time series using missForest, which is a new machine learning method for data imputation. The imputation performance of missForest and that of four traditional methods are assessed, and the methods’ impacts on principal component analysis (PCA) are investigated. Results show that missForest can impute more than a 30-day gap, and its imputed time series has the least influence on PCA. When the gap size is 30 days, the mean absolute difference between the imputed and true values for missForest is 2.71 mm. The normalized root mean squared error is 0.065, and the distance of the first principal component is 0.013. missForest outperforms the other compared methods. missForest can effectively restore the information of GPS time series and improve the results of related statistical processes, such as PCA.

Journal Article

A novel MissForest-based missing values imputation approach with recursive feature elimination in medical applications

by Wu, Ruei-Yan , Lin, Yen-Cheng , Lin, Ting-Yin in Accuracy , Algorithms , Analysis

2024

Background: Missing values in datasets present significant challenges for data analysis, particularly in the medical field where data accuracy is crucial for patient diagnosis and treatment. Although MissForest (MF) has demonstrated efficacy in imputation research and recursive feature elimination (RFE) has proven effective in feature selection, the potential for enhancing MF through RFE integration remains unexplored. Methods: This study introduces a novel imputation method, “recursive feature elimination-MissForest” (RFE-MF), designed to enhance imputation quality by reducing the impact of irrelevant features. A comparative analysis is conducted between RFE-MF and four classical imputation methods: mean/mode, k-nearest neighbors (kNN), multiple imputation by chained equations (MICE), and MF. The comparison is carried out across ten medical datasets containing both numerical and mixed data types. Different missing data rates, ranging from 10 to 50%, are evaluated under the missing completely at random (MCAR) mechanism. The performance of each method is assessed using two evaluation metrics: normalized root mean squared error (NRMSE) and predictive fidelity criterion (PFC). Additionally, paired samples t-tests are employed to analyze the statistical significance of differences among the outcomes. Results: The findings indicate that RFE-MF demonstrates superior performance across the majority of datasets when compared to four classical imputation methods (mean/mode, kNN, MICE, and MF). Notably, RFE-MF consistently outperforms the original MF, irrespective of variable type (numerical or categorical). Mean/mode imputation exhibits consistent performance across various scenarios. Conversely, the efficacy of kNN imputation fluctuates in relation to varying missing data rates. Conclusion: This study demonstrates that RFE-MF holds promise as an effective imputation method for medical datasets, providing a novel approach to addressing missing data challenges in medical applications.
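The two evaluation metrics mentioned in this abstract can be sketched as below. The abstract does not give formulas, so this assumes the definitions commonly used with missForest: NRMSE as the root of the mean squared error divided by the variance of the true values (population variance here, which is an assumption), and PFC as the share of imputed categorical entries that differ from the truth. The function names and data are hypothetical.

```python
import math
from typing import Hashable, Sequence


def nrmse(true_values: Sequence[float], imputed_values: Sequence[float]) -> float:
    """sqrt(mean squared error / variance of the true values) over the imputed numeric entries."""
    n = len(true_values)
    mse = sum((t - i) ** 2 for t, i in zip(true_values, imputed_values)) / n
    mean_true = sum(true_values) / n
    variance = sum((t - mean_true) ** 2 for t in true_values) / n  # population variance (assumption)
    return math.sqrt(mse / variance)


def pfc(true_labels: Sequence[Hashable], imputed_labels: Sequence[Hashable]) -> float:
    """Share of imputed categorical entries that do not match the true category."""
    wrong = sum(1 for t, i in zip(true_labels, imputed_labels) if t != i)
    return wrong / len(true_labels)


# Made-up entries that were masked and then imputed.
numeric_error = nrmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
categorical_error = pfc(["a", "b", "a", "c"], ["a", "b", "b", "c"])
print(round(numeric_error, 4), categorical_error)
```

For both measures lower is better, and a value of 0 means every masked entry was recovered exactly.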

Journal Article

An Improved Air Quality Index Machine Learning-Based Forecasting with Multivariate Data Imputation Approach

by Zhu, Qinqin , Alkabbani, Hanin , Elkamel, Ali in Air pollution , Air pollution control , Air pollution measurements

2022

Accurate, timely air quality index (AQI) forecasting helps industries in selecting the most suitable air pollution control measures and the public in reducing harmful exposure to pollution. This article proposes a comprehensive method to forecast AQIs. Initially, the work focused on predicting hourly ambient concentrations of PM2.5 and PM10 using artificial neural networks. Once the method was developed, the work was extended to the prediction of other criteria pollutants, i.e., O3, SO2, NO2, and CO, which fed into the process of estimating AQI. The prediction of the AQI not only requires the selection of a robust forecasting model but also relies heavily on a sequence of pre-processing steps to select predictors and handle different issues in the data, including gaps. The presented method dealt with this by imputing missing entries using missForest, a machine learning-based imputation technique that employs the random forest (RF) algorithm. Unlike the usual practice of using RF at the final forecasting stage, we utilized RF at the data pre-processing stage, i.e., missing data imputation and feature selection, and we obtained promising results. The effectiveness of this imputation method was examined against a linear imputation method for the six criteria pollutants and the AQI. The proposed approach was validated against ambient air quality observations for Al-Jahra, a major city in Kuwait. Results showed that models trained using missForest-imputed data could generalize AQI forecasting, with a prediction accuracy of 92.41% when tested on new unseen data, which is better than earlier findings.

Journal Article

Analyzing the Effect of Imputation on Classification Performance under MCAR and MAR Missing Mechanisms

by Pauly, Markus , Buczak, Philip , Chen, Jian-Jia in Algorithms , Analysis , Classification

2023

Many datasets in statistical analyses contain missing values. As omitting observations containing missing entries may lead to information loss or greatly reduce the sample size, imputation is usually preferable. However, imputation can also introduce bias and impact the quality and validity of subsequent analysis. Focusing on binary classification problems, we analyzed how missing value imputation under MCAR as well as MAR missingness with different missing patterns affects the predictive performance of subsequent classification. To this end, we compared imputation methods such as several MICE variants, missForest, Hot Deck as well as mean imputation with regard to the classification performance achieved with commonly used classifiers such as Random Forest, Extreme Gradient Boosting, Support Vector Machine and regularized logistic regression. Our simulation results showed that Random Forest based imputation (i.e., MICE Random Forest and missForest) performed particularly well in most scenarios studied. In addition to these two methods, simple mean imputation also proved to be useful, especially when many features (covariates) contained missing values.

Journal Article

Development of a novel imputation framework for PM2.5 particle data in Pakistani cities using machine learning and statistical techniques

by Gray, Alison , Alshatti, Amani , Pan, Jiazhu in air quality monitoring , machine learning , missForest

2026

Introduction: Missing PM2.5 observations in environmental monitoring systems, caused by sensor malfunctions, communication failures, maintenance issues, and coverage gaps, compromise public health assessments and evidence-based air quality policymaking. Reliable imputation strategies are therefore essential to preserve data integrity and analytical validity. Methods: This study evaluated five imputation techniques: Bayesian Regression (BR), K-Nearest Neighbors (KNN), missForest, Predictive Mean Matching (PMM), and Random Forest (RF), using daily PM2.5 measurements collected between May 2019 and December 2024 from monitoring stations in Islamabad, Karachi, Lahore, and Peshawar, Pakistan. Three missing data mechanisms, MCAR, MAR, and MNAR, were simulated at missing rates ranging from 5% to 25%. Model performance was assessed using Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). Results: Imputation under the MAR mechanism consistently yielded lower error values as missingness increased. Across all mechanisms and missing rates, missForest and KNN demonstrated superior performance. Notably, missForest achieved the lowest RMSE and MAE values overall and effectively preserved the temporal structure, range, and variability of the PM2.5 series. Discussion: The findings suggest that machine-learning-based approaches, particularly missForest, provide robust and reliable imputation for PM2.5 datasets with varying missingness patterns. These results support the use of missForest as a preferred method for handling incomplete air quality data in similar monitoring contexts, thereby strengthening the reliability of environmental health analyses and air quality policy development.
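The three missingness mechanisms simulated in this study can be illustrated with a small sketch. The abstract does not describe how the masks were generated, so everything here is an assumption: MCAR masks entries uniformly at random, while MAR and MNAR are approximated by rank-weighted sampling without replacement, driven by an observed covariate (MAR) or by the values themselves (MNAR). The function names, the humidity covariate, and the data are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def mask_mcar(values: Sequence[float], rate: float, rng: random.Random) -> List[Optional[float]]:
    """MCAR: every entry has the same chance of being masked."""
    n_missing = round(rate * len(values))
    missing = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing else v for i, v in enumerate(values)]


def mask_by_driver(values: Sequence[float], driver: Sequence[float],
                   rate: float, rng: random.Random) -> List[Optional[float]]:
    """Mask entries with a chance that grows with the rank of `driver`.

    MAR:  `driver` is another, fully observed variable.
    MNAR: `driver` is `values` itself.
    """
    n = len(values)
    n_missing = round(rate * n)
    order = sorted(range(n), key=lambda i: driver[i])
    weight = [0] * n
    for rank, i in enumerate(order, start=1):
        weight[i] = rank  # larger driver value -> larger weight
    # Weighted sampling without replacement: keep the largest u ** (1 / w) keys.
    keys = [rng.random() ** (1.0 / weight[i]) for i in range(n)]
    chosen = sorted(range(n), key=lambda i: keys[i], reverse=True)[:n_missing]
    missing = set(chosen)
    return [None if i in missing else v for i, v in enumerate(values)]


rng = random.Random(42)
pm25 = [35.0 + i for i in range(20)]                 # made-up daily PM2.5 values
humidity = [40 + (7 * i) % 20 for i in range(20)]    # hypothetical observed covariate

mcar = mask_mcar(pm25, 0.25, rng)
mar = mask_by_driver(pm25, humidity, 0.25, rng)
mnar = mask_by_driver(pm25, pm25, 0.25, rng)
print(sum(v is None for v in mcar), sum(v is None for v in mar), sum(v is None for v in mnar))
```

Each masked series can then be passed to an imputation method and compared with the original values using RMSE and MAE.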

Journal Article

Ensemble Methods for Jump-Diffusion Models of Power Prices

by Baldassari, Cristiano , Mari, Carlo in jump-diffusion dynamics , machine learning , mean-reversion

2021

We propose a machine learning-based methodology which makes use of ensemble methods with the aims (i) of treating missing data in time series with irregular observation times and detecting anomalies in the observed time behavior; (ii) of defining suitable models of the system dynamics. We applied this methodology to US wholesale electricity price time series that are characterized by missing data, high and stochastic volatility, jumps and pronounced spikes. For missing data, we provide a repair approach based on the missForest algorithm, an imputation algorithm which is completely agnostic about the data distribution. To identify anomalies, i.e., turbulent movements of power prices in which jumps and spikes are observed, we took into account the no-gap reconstructed electricity price time series, and then we detected anomalous regions using the isolation forest algorithm, an anomaly detection method that isolates anomalies instead of profiling normal data points as in the most common techniques. After removing anomalies, the additional gaps will be newly filled by the missForest imputation algorithm. In this way, a complete and clean time series describing the stable dynamics of power prices can be obtained. The decoupling between the stable motion and the turbulent motion allows us to define suitable jump-diffusion models of power prices and to provide an estimation procedure that uses the full information contained in both the stable and the turbulent dynamics.

Journal Article

Deep learning models for the analysis of high-dimensional survival data with time-varying covariates while handling missing data

by Mwambi, Henry , Mohammed, Mohanad , Ogutu, Sarah in Artificial Intelligence , Computer Science , Cytokine profiles

2025

Recent advances in deep learning have expanded the potential for predictive modeling in survival analysis, particularly in high-dimensional datasets with time-varying covariates. This paper applies deep learning approaches, DeepSurv, DeepHit, and Dynamic DeepHit, to model HIV incidence (time-to-event outcome) using high-dimensional longitudinal data, incorporating time-varying cytokine profiles alongside baseline covariates. We employ the time-dependent concordance index (C-index) and Brier scores to assess the models’ predictive accuracy. We also address missing data using missForest, evaluating model performance on imputed and complete-case datasets. Different strategies for integrating cytokine profiles were explored: DeepSurv and DeepHit utilized derived variables, mean, and difference between the first and last measurements, while Dynamic DeepHit preserved the original time-varying nature of the cytokine data. Our findings demonstrate that retaining the dynamic nature of cytokine covariates, rather than relying on derived summary measures, underscores the robustness and suitability of Dynamic DeepHit as a clinical prediction model, particularly in scenarios where key variables evolve over time.

Journal Article
