Catalogue Search | MBRL

Review on data imputation methods in machine learning

by Xue, Jianing in data imputation methods, Machine learning, Missing data

2023

Data is an important element of machine learning analysis. It is usually obtained from observations and is indispensable for training a model. Careful data preparation improves the performance of an analysis and helps deliver reliable final results. However, many factors affect a dataset, and some of them cause data to be lost. When part of the data is missing, it causes bias in the final prediction outcomes. To minimize the consequences of missing data, several data imputation methods have been developed. This paper first introduces some basic concepts of missing data. The following sections present several popular approaches to handling missing data, including complete case analysis, single imputation, and multiple imputation. Applications of some methods are presented to show how they can be used in real analysis situations. Finally, the paper discusses the limitations of current data imputation methods.
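As a rough illustration of two of the approaches named in this abstract, the sketch below contrasts complete case analysis with single (mean) imputation. The toy records, the column names `age` and `income`, and both helper functions are hypothetical and are not taken from the paper; multiple imputation is not shown.

```python
from statistics import mean

# Hypothetical toy data: None marks a missing value.
rows = [
    {"age": 25, "income": 40.0},
    {"age": 32, "income": None},
    {"age": None, "income": 55.0},
    {"age": 41, "income": 61.0},
]

def complete_cases(rows):
    """Complete case analysis: keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def mean_impute(rows):
    """Single imputation: replace each missing value with its column mean."""
    columns = list(rows[0].keys())
    # Column means are computed from the observed values only.
    col_means = {
        c: mean(r[c] for r in rows if r[c] is not None) for c in columns
    }
    return [
        {c: (r[c] if r[c] is not None else col_means[c]) for c in columns}
        for r in rows
    ]

kept = complete_cases(rows)
filled = mean_impute(rows)
print(len(kept), "of", len(rows), "rows survive complete case analysis")
print(filled)
```

Complete case analysis discards half of this toy dataset, whereas mean imputation keeps every row but inserts the same value for every gap in a column, which understates the variability of that column.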

Journal Article

Missing traffic data: comparison of imputation methods

by Li, Yuebiao; Li, Li; Li, Zhiheng in data imputation methods, Detectors, Failure

2014

Many traffic management and control applications require highly complete and accurate data of traffic flow. However, because of various reasons such as sensor failure or transmission error, it is common that some traffic flow data are lost. As a result, various methods were proposed by using a wide spectrum of techniques to estimate missing traffic data in the last two decades. Generally, these missing data imputation methods can be categorised into three kinds: prediction methods, interpolation methods and statistical learning methods. To assess their performance, these methods are compared from different aspects in this paper, including reconstruction errors, statistical behaviours and running speeds. Results show that statistical learning methods are more effective than the other two kinds of imputation methods when data of a single detector is utilised. Among various methods, the probabilistic principal component analysis (PPCA) yields best performance in all aspects. Numerical tests demonstrate that PPCA can be used to impute data online before making further analysis (e.g. make traffic prediction) and is robust to weather changes.
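Of the three families named in this abstract, interpolation is the simplest to illustrate. The sketch below fills gaps in a single detector's flow series by linear interpolation between the nearest observed neighbours. The function name `interpolate_gaps` and the sample readings are hypothetical; this is not the authors' implementation and it does not attempt PPCA.

```python
def interpolate_gaps(series):
    """Fill None gaps by linear interpolation between observed neighbours.

    Gaps at the very start or end of the series have only one neighbour
    and are left as None.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    # Walk over consecutive pairs of observed positions.
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled

# Hypothetical flow readings (vehicles per interval); None marks a lost sample.
flow = [120.0, None, None, 150.0, 140.0, None, 100.0]
print(interpolate_gaps(flow))
```

Interpolation needs observations on both sides of a gap, so it cannot fill the most recent sample in a live feed. That limitation is one reason a method that can impute online is of interest.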

Journal Article

A case study: Exploratory analysis and proposal for the problem of quality in the meteorological observations of the north of Chile

by Contreras Aguilar, David; Inostrosa-Psijas, Alonso; Cerda Lozano, Sergio in Case studies, Climate change, Error correction

2024

Data quality problems in meteorological variables are something the scientific community constantly faces, mainly because they materialize as missing data within the time series, which prevents compliance with the established requirements for analyzing climate change in a given geographical area. Based on this problem, this article presents an exploratory analysis of the main meteorological variables (Temperature and Precipitation) observed by the meteorological stations distributed in Northern Chile to assess the quality of their data. Data imputation methods are also proposed to address this problem by completing the missing data. In particular, the experiments, developed following an adapted version of the phases of the CRISP-DM methodology, consider five different imputation methods, seeking the one with the residual error closest to zero and the highest positive correction. In the results, CLP, IDC, and RN stand out as the best techniques, which allows us to conclude that these methods can be recommended and proposed as an alternative solution according to the meteorological variable.

Journal Article

Imputing Missing Data: A Comparison of Methods for Social Work Researchers

by Doré, Peter; Spitznagel, Edward; Saunders, Jeanne A. in Comparative Analysis, Data, Data Analysis

2006

Choosing the most appropriate method to handle missing data during analyses is one of the most challenging decisions confronting researchers. Often, missing values are just ignored rather than replaced with a reliable imputation method. Six methods of data imputation were used to replace missing data from two data sets of varying sizes; this article examines the results. Each imputation method is defined, and the pros and cons of its use in social science research are identified. The authors discuss comparisons of descriptive measures and multivariate analyses with the imputed variables and the results of a timed study to determine how long it took to use each imputation method on first and subsequent use. Implications for social work research are suggested.

Journal Article

A comparison of multiple imputation methods for missing data in longitudinal studies

by Carlin, John B.; Lee, Katherine J.; Simpson, Julie A. in Adolescent, Algorithms, Analysis

2018

Background: Multiple imputation (MI) is now widely used to handle missing data in longitudinal studies. Several MI techniques have been proposed to impute incomplete longitudinal covariates, including standard fully conditional specification (FCS-Standard) and joint multivariate normal imputation (JM-MVN), which treat repeated measurements as distinct variables, and various extensions based on generalized linear mixed models. Although these MI approaches have been implemented in various software packages, there has not been a comprehensive evaluation of the relative performance of these methods in the context of longitudinal data. Method: Using both empirical data and a simulation study based on data from the six waves of the Longitudinal Study of Australian Children (N = 4661), we investigated the performance of a wide range of MI methods available in standard software packages for investigating the association between child body mass index (BMI) and quality of life using both a linear regression and a linear mixed-effects model. Results: In this paper, we have identified and compared 12 different MI methods for imputing missing data in longitudinal studies. Analysis of simulated data under missing at random (MAR) mechanisms showed that the generally available MI methods provided less biased estimates with better coverage for the linear regression model and around half of these methods performed well for the estimation of regression parameters for a linear mixed model with random intercept. With the observed data, we observed an inverse association between child BMI and quality of life, with available data as well as multiple imputation. Conclusion: Both FCS-Standard and JM-MVN performed well for the estimation of regression parameters in both analysis models. More complex methods that explicitly reflect the longitudinal structure for these analysis models may only be needed in specific circumstances such as irregularly spaced data.

Journal Article

Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction

by Lynn, Henry S.; Hong, Shangzhi in Accuracy, Algorithms, Amputation

2020

Background: Missing data are common in statistical analyses, and imputation methods based on random forests (RF) are becoming popular for handling missing data, especially in biomedical research. Unlike standard imputation approaches, RF-based imputation methods do not assume normality or require specification of parametric models. However, it is still inconclusive how they perform for non-normally distributed data or when there are non-linear relationships or interactions. Methods: To examine the effects of these three factors, a variety of datasets were simulated with outcome-dependent missing at random (MAR) covariates, and the performances of the RF-based imputation methods missForest and CALIBERrfimpute were evaluated in comparison with predictive mean matching (PMM). Results: Both missForest and CALIBERrfimpute have high predictive accuracy but missForest can produce severely biased regression coefficient estimates and downward biased confidence interval coverages, especially for highly skewed variables in nonlinear models. CALIBERrfimpute typically outperforms missForest when estimating regression coefficients, although its biases are still substantial and can be worse than PMM for logistic regression relationships with interaction. Conclusions: RF-based imputation, in particular missForest, should not be indiscriminately recommended as a panacea for imputing missing data, especially when data are highly skewed and/or outcome-dependent MAR. A correct analysis requires a careful critique of the missing data mechanism and the inter-relationships between the variables in the data.

Journal Article

SICE: an improved missing data imputation technique

by Khan, Shahidul Islam; Hoque, Abu Sayed Md Latiful in Algorithms, Animals, Big Data

2020

In data analytics, missing data is a factor that degrades performance. Incorrect imputation of missing values can lead to wrong predictions. In this era of big data, when a massive volume of data is generated every second and the utilization of these data is a major concern to stakeholders, handling missing values efficiently becomes more important. In this paper, we propose a new technique for missing data imputation, which is a hybrid of single and multiple imputation techniques. We propose an extension of the popular Multivariate Imputation by Chained Equations (MICE) algorithm in two variations to impute categorical and numeric data. We have also implemented twelve existing algorithms to impute binary, ordinal, and numeric missing values. We collected sixty-five thousand real health records from different hospitals and diagnostic centers of Bangladesh, maintaining the privacy of the data. We also collected three public datasets from the UCI Machine Learning Repository, ETH Zurich, and Kaggle. We compared the performance of our proposed algorithms with existing algorithms using these datasets. Experimental results show that our proposed algorithm achieves a 20% higher F-measure for binary data imputation and 11% less error for numeric data imputation than its competitors, with similar execution time.

Journal Article

The ability of different imputation methods for missing values in mental measurement questionnaires

by Wu, Shaoning; Zhang, Qimeng; Xu, Xueying in Activities of Daily Living, Clinical trials, Computer Simulation

2020

Background: Incomplete data have a particularly important influence on mental measurement questionnaires. Most experts, however, focus on clinical trials and cohort studies and generally pay less attention to this deficiency. We aim to compare the accuracy of four common methods for handling missing items in different psychology questionnaires according to the item non-response rates. Method: All data were drawn from previous studies, including the self-acceptance scale (SAQ), the activities of daily living scale (ADL) and the self-esteem scale (RSES). The SAQ and ADL datasets, the simulation group, were used to compare and assess four imputation methods, namely direct deletion, mode imputation, hot-deck (HD) imputation and multiple imputation (MI), by absolute deviation, root mean square error and average relative error at missing proportions of 5, 10, 15 and 20%. The RSES dataset, the validation group, was used to test the application of the imputation methods. All analyses were performed in SAS 9.4. Results: The biases obtained by MI were the smallest under the various missing proportions. The HD imputation approach gave the lowest absolute deviation of standard deviation values. The two methods nevertheless produced similar results, and both performed clearly better than direct deletion and mode imputation. In a real-world situation, the respondents’ average score in the complete dataset was 28.22 ± 4.63, which is not much different from the imputed datasets. The direction of the influence of the five factors on self-esteem was consistent, although there were some differences in the size and range of the OR values in the logistic regression model. Conclusion: MI shows the best performance, although it demands slightly more data-analytic capacity and programming skill. HD could be considered for imputing missing values in psychological investigations when MI cannot be performed due to limited circumstances.
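The evaluation design described here (hide known answers at a fixed missing proportion, impute them, then score the imputed values against the truth) can be sketched as follows. The study itself used SAS 9.4; the Python below is only an illustration under stated assumptions. The item responses, the helper names `mask_values`, `mode_impute` and `rmse`, and the fixed random seed are all hypothetical, and only mode imputation and root mean square error are shown.

```python
import math
import random
from statistics import mode

def mask_values(values, proportion, seed=0):
    """Hide a given proportion of values at random; return masked data and hidden indices."""
    rng = random.Random(seed)
    n_missing = round(len(values) * proportion)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in missing_idx else v for i, v in enumerate(values)]
    return masked, sorted(missing_idx)

def mode_impute(masked):
    """Mode imputation: replace each missing answer with the most frequent observed answer."""
    fill = mode(v for v in masked if v is not None)
    return [fill if v is None else v for v in masked]

def rmse(truth, imputed, idx):
    """Root mean square error over the positions that were hidden."""
    return math.sqrt(sum((truth[i] - imputed[i]) ** 2 for i in idx) / len(idx))

# Hypothetical answers to one five-point questionnaire item.
responses = [3, 4, 4, 2, 5, 4, 3, 4, 1, 4, 3, 4, 2, 4, 5, 4, 3, 4, 4, 2]

for p in (0.05, 0.10, 0.15, 0.20):
    masked, idx = mask_values(responses, p)
    imputed = mode_impute(masked)
    print(f"{p:.0%} missing: RMSE = {rmse(responses, imputed, idx):.3f}")
```

Scoring only the hidden positions keeps the error measure from being diluted by the values that were never missing.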

Journal Article

HIOC: a hybrid imputation method to predict missing values in medical datasets

by Rani, Pooja; Jain, Anurag; Kumar, Rajneesh in Breast cancer, Cardiovascular disease, Classifiers

2021

Purpose: Decision support systems developed using machine learning classifiers have become a valuable tool in predicting various diseases. However, the performance of these systems is adversely affected by the missing values in medical datasets. Imputation methods are used to predict these missing values. In this paper, a new imputation method called hybrid imputation optimized by the classifier (HIOC) is proposed to predict missing values efficiently. Design/methodology/approach: The proposed HIOC is developed by using a classifier to combine multivariate imputation by chained equations (MICE), K nearest neighbor (KNN), mean and mode imputation methods in an optimum way. The performance of HIOC has been compared to MICE, KNN, and the mean and mode methods. Four classifiers, support vector machine (SVM), naive Bayes (NB), random forest (RF) and decision tree (DT), have been used to evaluate the performance of the imputation methods. Findings: The results show that HIOC performed efficiently even with a high rate of missing values. It reduced root mean square error (RMSE) by up to 17.32% in the heart disease dataset and 34.73% in the breast cancer dataset. Correct prediction of missing values improved the accuracy of the classifiers in predicting diseases. It increased classification accuracy by up to 18.61% in the heart disease dataset and 6.20% in the breast cancer dataset. Originality/value: The proposed HIOC is a new hybrid imputation method that can efficiently predict missing values in any medical dataset.

Journal Article

Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls

by Carlin, John B; Spratt, Michael; Kenward, Michael G in Bias, Biomedical Research - standards, Biomedical Research - statistics & numerical data

2009

Most studies have some missing data. Jonathan Sterne and colleagues describe the appropriate use and reporting of the multiple imputation approach to dealing with them.

Journal Article
