Explore chapters and articles related to this topic
Big Data Analytics and Machine Learning
Published in Debabrata Samanta, SK Hafizul Islam, Naveen Chilamkurti, Mohammad Hammoudeh, Data Analytics, Computational Statistics, and Operations Research for Engineers, 2022
Francis Alex Kuzhippallil, Adith Kumar Menon, S Ramani, Marimuthu Karuppiah
The presence of outliers and missing values disrupts the extraction of useful information from the data and indicates poor data quality. Outliers are data points that are extreme relative to the normally distributed data. Detecting and removing outliers, followed by eliminating or imputing missing values, is a pivotal task. The most prominent outlier detection techniques are DBSCAN, isolation forest, Z-score, local outlier factor, and numeric outlier detection. For imputing missing values, the useful techniques include mean imputation, hot-deck imputation, cold-deck imputation, regression imputation, stochastic regression imputation, interpolation, and extrapolation. Together, these steps cleanse the data (Chong et al. 2015).
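As a minimal sketch of two of the techniques named above, the following Python uses a Z-score rule to flag outliers and then applies mean imputation. The function names, the sample readings, and the threshold of 2.5 are illustrative assumptions, not taken from the chapter. In very small samples the largest attainable Z-score is bounded, so a threshold of 3 may flag nothing.

```python
import statistics

def remove_outliers_zscore(values, threshold=3.0):
    """Return a copy where values with |z-score| > threshold become None."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)
    if sd == 0:
        return list(values)  # no spread, so nothing can be an outlier
    return [
        None if v is not None and abs(v - mean) / sd > threshold else v
        for v in values
    ]

def impute_mean(values):
    """Replace every None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical sensor readings: one missing value and one extreme value.
readings = [10.0, 11.0, 9.0, 10.5, None, 9.5, 10.0, 95.0, 10.2, 9.8, 10.1, 9.9]

cleaned = remove_outliers_zscore(readings, threshold=2.5)  # 95.0 becomes None
filled = impute_mean(cleaned)                              # both gaps get the mean
```

Removing the outlier before imputing matters here: had the mean been computed with 95.0 still present, the imputed value would have been pulled far above the typical reading.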
Missing Data
Published in Julian J. Faraway, Linear Models with Python, 2021
The single imputation methods described above cause bias, while deletion causes a loss of information from the data. Multiple imputation is a way to reduce the bias caused by single imputation. The problem with single imputation is that the imputed value, be it a mean or a regression-predicted value, tends to be less variable than the value we would have seen because the imputed value does not include the variation that would normally be seen in observed data. The idea behind multiple imputation is to reinclude that variation — we add back on a perturbation to the imputed value. Hot deck imputation is an old idea where a randomly sampled value from the complete values for that predictor is used as the imputation. With more computing power, we can do better than this.
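The contrast between a regression-predicted imputation and one with the variation added back can be sketched as follows. This is a simplified illustration under stated assumptions, not the book's code: the data are simulated, the names `fit_line` and `impute` are hypothetical, and the perturbation is drawn from a normal distribution with the residual standard deviation. A full multiple imputation procedure would typically also account for uncertainty in the fitted coefficients, which this sketch omits.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares line; returns (intercept, slope, residual sd)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    resid_sd = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    return intercept, slope, resid_sd

def impute(xs, ys, rng=None):
    """Fill missing y values from the fitted line.

    Without rng this is plain regression imputation; with rng a random
    perturbation is added back to each imputed value.
    """
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b, s = fit_line([x for x, _ in observed], [y for _, y in observed])
    completed = []
    for x, y in zip(xs, ys):
        if y is None:
            y = a + b * x + (rng.gauss(0, s) if rng else 0.0)
        completed.append(y)
    return completed

# Simulated data (an assumption for illustration) with three responses removed.
rng = random.Random(42)
xs = [float(i) for i in range(20)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1.0) for x in xs]
for i in (3, 8, 15):
    ys[i] = None

deterministic = impute(xs, ys)                     # imputed points sit on the line
completions = [impute(xs, ys, rng) for _ in range(5)]  # five perturbed data sets
pooled_mean = statistics.mean([statistics.mean(c) for c in completions])
```

Each of the five completed data sets would be analysed separately and the estimates combined, as in `pooled_mean`; the spread between the completions reflects the uncertainty that a single imputation hides.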
Data Analysis
Published in Shyama Prasad Mukherjee, A Guide to Research Methodology, 2019
Three kinds of imputation can be distinguished: deductive imputations, deterministic imputations and stochastic imputations. Deductive imputations are imputations where the imputed value is deduced from known information. For example, the age of a person can be deduced from the given date of birth, and the total income can be found by adding the different income components. In a panel survey the year of birth of a person is constant, so if this variable is missing in a wave, the score on this variable from another wave could be copied to the missing value. Though it is a correct and often applied imputation technique, from a methodological point of view deductive imputation is less interesting than the other two kinds of imputation. Sometimes deductive imputation is considered part of the editing process.
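The deductive examples in this excerpt (age from date of birth, total income from its components, and year of birth copied across panel waves) can be sketched in Python. The record layout and field names such as `birth_date` and `income_components` are hypothetical, chosen only for illustration.

```python
from datetime import date

def deduce_age(birth_date, on_date):
    """Age in whole years on a given date."""
    years = on_date.year - birth_date.year
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1  # birthday has not yet occurred this year
    return years

def deductive_impute(record, survey_date):
    """Fill fields that follow logically from other known fields."""
    filled = dict(record)
    if filled.get("age") is None and filled.get("birth_date") is not None:
        filled["age"] = deduce_age(filled["birth_date"], survey_date)
    parts = filled.get("income_components")
    if (filled.get("total_income") is None and parts
            and all(p is not None for p in parts)):
        filled["total_income"] = sum(parts)
    return filled

def fill_from_other_waves(waves, key):
    """Copy a time-constant variable from a wave where it is known."""
    known = next((w[key] for w in waves if w.get(key) is not None), None)
    return [dict(w, **{key: known}) if w.get(key) is None else dict(w)
            for w in waves]

# Hypothetical respondent record and panel waves.
record = {"birth_date": date(1980, 6, 15), "age": None,
          "income_components": [30000, 5000, 1200], "total_income": None}
completed = deductive_impute(record, survey_date=date(2019, 3, 1))

waves = [{"wave": 1, "birth_year": 1980}, {"wave": 2, "birth_year": None}]
waves_filled = fill_from_other_waves(waves, "birth_year")
```

No randomness or model is involved: each filled value is fully determined by information already in the data, which is why this step is sometimes treated as part of editing.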
Socioeconomic factors and bacillary dysentery risk in Jiangsu Province, China: a spatial investigation using Bayesian hierarchical models
Published in International Journal of Environmental Health Research, 2022
Sabrina Li, Alexandra M. Schmidt, Susan J. Elliott
Missing values were processed using a multiple imputation strategy. An advantage of multiple imputation over single imputation methods is that it retains a level of uncertainty, which helps to preserve the integrity and accuracy of the standard errors and of the model fit and coefficient estimates. We applied this method to all variables with missing values. For each missing case, five random draws were taken from the group of valid cases in the data set, creating a set of five random values. The average of these five values was adopted as the value for the missing observation. Three counties had missing BD data between 2011 and 2014, and one of them had missing data for both 2012 and 2014. In total, this accounted for 1.8% (4/228) of all BD data reported at the county level during 2011–2014. For socioeconomic covariates, only two counties had missing data on rural income in 2013, which accounted for 3.5% (2/57) of all data on county-level rural income reported in 2013. Maps illustrating counties with missing data are presented in Figure S1 in the Supplementary Information.
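The procedure described in this excerpt (five random draws from the valid cases, averaged to give the imputed value) might look like the following Python sketch. The function name and the sample values are hypothetical, and drawing with replacement is an assumption, since the excerpt does not say how the draws were made.

```python
import random
import statistics

def impute_by_averaged_draws(values, n_draws=5, rng=None):
    """Fill each missing value with the mean of n_draws random valid cases."""
    rng = rng or random.Random()
    valid = [v for v in values if v is not None]
    completed = []
    for v in values:
        if v is None:
            # Assumption: draws are made with replacement from the valid cases.
            draws = [rng.choice(valid) for _ in range(n_draws)]
            v = statistics.mean(draws)
        completed.append(v)
    return completed

# Hypothetical county-level values with two gaps.
rates = [12.4, 8.1, None, 15.0, 9.7, None, 11.2]
filled_rates = impute_by_averaged_draws(rates, rng=random.Random(0))
```

Because each imputed value is an average of observed values, it always lies between the smallest and largest valid case.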
Comparison of Performance of Data Imputation Methods for Numeric Dataset
Published in Applied Artificial Intelligence, 2019
Anil Jadhav, Dhanya Pramod, Krishnan Ramanathan
Data quality is a main concern of data scientists. Although data quality depends on several factors, one of the main factors is data incompleteness. Data scientists must therefore deal rigorously with missing data before the data are analyzed and before end users of data mining projects make decisions based on them. Data imputation is a technique for handling missing values that makes data complete and ready for analysis by replacing missing values with the most plausible values. In this paper, we discuss the concept of data imputation, data missingness mechanisms, the handling of missing values, and single and multiple imputation methods, and then analyze the performance of different imputation methods, namely mean imputation, median imputation, kNN imputation, predictive mean matching (PMM), Bayesian linear regression (norm), non-Bayesian linear regression (norm.nob), and random sample methods.
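One common way to compare imputation methods is to hide known values, impute them, and score the error. The sketch below does this for three of the simpler methods listed (mean, median, and random sample) using root-mean-square error. It is not the paper's protocol: the simulated data, the 20% missing rate, values hidden completely at random, and the `evaluate` helper are all assumptions made for illustration.

```python
import math
import random
import statistics

def evaluate(complete, missing_rate=0.2, seed=0):
    """Hide a random subset of values, impute them, and return RMSE per method."""
    rng = random.Random(seed)
    n_hidden = max(1, int(len(complete) * missing_rate))
    hidden = set(rng.sample(range(len(complete)), n_hidden))
    observed = [v for i, v in enumerate(complete) if i not in hidden]

    # Each imputer returns one replacement value per call.
    imputers = {
        "mean": lambda: statistics.mean(observed),
        "median": lambda: statistics.median(observed),
        "random sample": lambda: rng.choice(observed),
    }
    scores = {}
    for name, draw in imputers.items():
        squared_errors = [(complete[i] - draw()) ** 2 for i in sorted(hidden)]
        scores[name] = math.sqrt(statistics.mean(squared_errors))
    return scores

# Simulated numeric column (an assumption for illustration).
data_rng = random.Random(1)
complete = [data_rng.gauss(50, 10) for _ in range(200)]
scores = evaluate(complete)
```

A lower RMSE means the imputed values are closer to the hidden truth, although point accuracy is only one criterion: a method that adds random variation may score worse on RMSE while better preserving the spread of the data.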
Missing data imputation for traffic flow based on combination of fuzzy neural network and rough set theory
Published in Journal of Intelligent Transportation Systems, 2021
Jinjun Tang, Xinshao Zhang, Weiqi Yin, Yajie Zou, Yinhai Wang
However, the research above on imputing missing traffic flow data shows that current studies still face several limitations or challenges: 1) most imputation methods are used for modeling and data recovery under the assumption of complete data; 2) although traditional methods such as historical average and ARIMA have a simple calculation structure and high calculation speed, their imputation performance is poor when the traffic flow data exhibit irregular variation; 3) each individual model has its own advantages and its own specific scope of application. Combining several models with different characteristics could improve imputation performance.
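To illustrate the idea of combining models with different characteristics, the following Python sketch blends a historical average (the same time slot on other days) with an average of the nearest observed neighbours within the same day. It is a toy illustration only and is not the fuzzy neural network and rough set method of the paper; the function names, the equal weighting, and the flow values are assumptions.

```python
def historical_average(days, day_index, slot):
    """Mean flow at the same time slot on the other days, or None."""
    values = [d[slot] for i, d in enumerate(days)
              if i != day_index and d[slot] is not None]
    return sum(values) / len(values) if values else None

def neighbour_average(series, slot):
    """Mean of the nearest observed values before and after the gap, or None."""
    left = next((series[i] for i in range(slot - 1, -1, -1)
                 if series[i] is not None), None)
    right = next((series[i] for i in range(slot + 1, len(series))
                  if series[i] is not None), None)
    available = [v for v in (left, right) if v is not None]
    return sum(available) / len(available) if available else None

def combined_impute(days, weight=0.5):
    """Fill gaps with a weighted blend of the two estimators above."""
    completed = [list(d) for d in days]
    for di, day in enumerate(days):
        for slot, value in enumerate(day):
            if value is not None:
                continue
            estimates = [(weight, historical_average(days, di, slot)),
                         (1 - weight, neighbour_average(day, slot))]
            usable = [(w, e) for w, e in estimates if e is not None]
            total = sum(w for w, _ in usable)
            if usable and total > 0:
                completed[di][slot] = sum(w * e for w, e in usable) / total
    return completed

# Hypothetical flows (vehicles per interval): three days, four time slots.
flows = [[100, 120, 150, 130],
         [110, None, 160, 120],
         [90, 110, 140, None]]
filled_flows = combined_impute(flows)
```

The historical average captures the recurring daily pattern, while the neighbour average follows the conditions of the day in question. Blending them is one simple way to draw on both, and if one estimator has no data for a gap the other is used alone.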