Explore chapters and articles related to this topic
Anomaly Detection Enables Cybersecurity with Machine Learning Techniques
Published in Kim Phuc Tran, Machine Learning and Probabilistic Graphical Models for Decision Support Systems, 2023
Truong Thu Huong, Nguyen Minh Dan, Le Anh Quang, Nguyen Xuan Hoang, Le Thanh Cong, Kieu-Ha Phung, Kim Phuc Tran
According to [66], missing data mechanisms fall into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR means that the missingness of the data is unrelated to any values, observed or missing. MAR indicates that the tendency of a value to be missing may depend on the observed data, but not on the missing data. In contrast, MNAR denotes that the missingness is related to the missing value itself. Considering the missing values shown in Figure 21 of [65], the SCADA dataset is likely to have the characteristic of MAR. As a result, Last Observation Carried Forward (LOCF) [67], a popular method for handling MAR, is applied to process the missing values in the SCADA dataset. In LOCF, the immediately preceding value in the same feature is used to fill in a missing value. If the dataset begins with missing values, the first observed value is used to replace them.
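The LOCF rule described above, including the special case of leading gaps, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation; the function name `locf`, the use of `None` as the missing-value marker, and the sample readings are assumptions made for the example.

```python
from typing import List, Optional, Sequence


def locf(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Fill gaps in one feature with Last Observation Carried Forward.

    Missing entries are marked with None (an assumption of this sketch).
    Leading gaps are filled with the first observed value, as in the text.
    """
    filled = list(values)
    # First observed value, used only for gaps at the start of the series.
    first_observed = next((v for v in filled if v is not None), None)
    last = first_observed
    for i, v in enumerate(filled):
        if v is None:
            filled[i] = last      # carry the previous observation forward
        else:
            last = v              # remember the most recent observation
    return filled


# Hypothetical sensor readings, not taken from the SCADA dataset.
readings = [None, None, 3.0, None, 5.0, None, None]
print(locf(readings))  # [3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0]
```

A series with no observed value at all is returned unchanged, since there is nothing to carry forward.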
Out-of-Sequence Measurements
Published in Jitendra R. Raol, Ajith K. Gopal, Mobile Intelligent Autonomous Systems, 2016
Imputation is the substitution of some value for a missing data point or a missing component of a data point [44–46]. Multiple imputation (MI) is one of the most attractive methods for general-purpose handling of missing data in multivariate analysis, and it is a three-step process. First, sets of M plausible values (M = 5 in Figure 11.1) for the missing instances are created using an appropriate model that reflects the uncertainty due to the missing data. Each of these sets of plausible values is used to ‘fill-in’ the missing values and create M ‘complete’ datasets (imputation). Second, each of these M datasets can be analysed using complete data methods (analysis). Finally, the results from the M complete datasets are combined, which also allows the uncertainty regarding the imputation to be taken into account (pooling or combining).
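The three steps (imputation, analysis, pooling) can be illustrated with a small Python sketch that estimates a mean. Several things here are assumptions rather than part of the source: the imputation model simply draws from a normal distribution fitted to the observed values, the analysis is the sample mean, and the pooling uses Rubin's rules (total variance = within-imputation variance + (1 + 1/M) × between-imputation variance). A full MI procedure would also draw the model parameters from their posterior, which this sketch omits. The function name `multiple_imputation_mean` is hypothetical.

```python
import random
import statistics
from typing import Optional, Sequence, Tuple


def multiple_imputation_mean(
    values: Sequence[Optional[float]], m: int = 5, seed: int = 0
) -> Tuple[float, float]:
    """Return (pooled mean, total variance of that mean) from m imputations."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)

    estimates, variances = [], []
    for _ in range(m):
        # Step 1 (imputation): fill each gap with a random plausible value.
        completed = [rng.gauss(mu, sigma) if v is None else v for v in values]
        # Step 2 (analysis): a complete-data method, here the sample mean
        # and the estimated variance of that mean.
        estimates.append(statistics.mean(completed))
        variances.append(statistics.variance(completed) / len(completed))

    # Step 3 (pooling): combine the m results with Rubin's rules.
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical data with two missing entries.
estimate, variance = multiple_imputation_mean([1.0, 2.0, None, 4.0, 5.0, None])
print(f"pooled mean = {estimate:.3f}, total variance = {variance:.3f}")
```

The between-imputation term is what carries the extra uncertainty caused by the missing data: with no missing values, all M datasets are identical and the total variance reduces to the ordinary variance of the mean.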
CHAID as a Method for Filling in Missing Values
Published in Bruce Ratner, Statistical and Machine-Learning Data Mining, 2017
An imputation method is any process that fills in missing data to produce a complete dataset. The simplest and most popular imputation method is mean value imputation, in which the mean of the nonmissing values of the variable of interest is used to fill in the missing data. Consider Individuals 2 and 8 in Table 22.1. The mean AGE of the file, namely, 40 years (rounded from 39.6), replaces the missing ages for Individuals 2 and 8. The advantage of this method is undoubtedly its ease of use. When required, the means are calculated within classes predefined by other variables related to the study at hand.
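Both variants, the overall mean and the mean within predefined classes, can be sketched as follows. The function names, the `None` marker for missing values, and the sample ages and class labels are hypothetical; they are not the contents of Table 22.1.

```python
from collections import defaultdict
from statistics import mean
from typing import Hashable, List, Optional, Sequence


def mean_impute(values: Sequence[Optional[float]]) -> List[float]:
    """Replace each missing value with the mean of the nonmissing values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def classwise_mean_impute(
    values: Sequence[Optional[float]], classes: Sequence[Hashable]
) -> List[float]:
    """Replace each missing value with the mean of its own class.

    Falls back to the overall mean when a class has no observed value
    (a design choice of this sketch).
    """
    overall = mean(v for v in values if v is not None)
    by_class = defaultdict(list)
    for v, c in zip(values, classes):
        if v is not None:
            by_class[c].append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [
        class_means.get(c, overall) if v is None else v
        for v, c in zip(values, classes)
    ]


# Hypothetical ages and class labels.
ages = [30.0, None, 50.0, 20.0, None, 60.0]
groups = ["a", "a", "a", "b", "b", "b"]
print(mean_impute(ages))                    # [30.0, 40.0, 50.0, 20.0, 40.0, 60.0]
print(classwise_mean_impute(ages, groups))  # [30.0, 40.0, 50.0, 20.0, 40.0, 60.0]
```

In this small example the two class means happen to equal the overall mean; with classes that differ in level, the class-wise version fills each gap with a value closer to its own group.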
ST-FVGAN: filling series traffic missing values with generative adversarial network
Published in Transportation Letters, 2022
Bing Yang, Yan Kang, Yaoyao Yuan, Hao Li, Fei Wang
Nowadays, although many kinds of advanced equipment are used for data collection, missing data are still unavoidable and widespread (Qu et al. 2009) owing to the complexity of traffic flow data and the influence of uncertain factors. Dealing with missing values is thus the primary issue to consider in data mining. At present, methods for handling missing data include imputation and deletion. The deletion method directly deletes the incomplete data, which is simple but may discard valuable information and cause errors in the experiments (Kaiser 2014), so imputation attracts more and more attention. The available imputation approaches can be roughly categorized into three types: statistical learning-based, machine learning-based, and generative adversarial network approaches (Li et al. 2018).
Big Data Analytics in Cyber Security: Network Traffic and Attacks
Published in Journal of Computer Information Systems, 2021
Many statistical analysis methods require complete data (without missing values) to achieve a good analysis result. Supervised methods in machine learning generally need a dataset without missing values for training before a model is created [17]. Many algorithms take significantly longer to process datasets with missing values, whereas they converge very quickly when complete data are provided [18]. If missing data are not handled appropriately, problems such as low efficiency and erroneous analytics results arise. Correctly handling missing data is therefore necessary for system robustness and efficiency. Estimating missing data is an important step in the data cleansing process of big data [19, 20]. There are two kinds of methods for handling missing data: imputation-based methods and deletion methods. An imputation-based method fills in missing values using possible relationships among the data in a dataset. A deletion method discards instances that contain missing data. The more data are lost, the less accurate the analytics become [21]. A deletion method can be used without losing statistical strength only if the instances with missing values are far fewer than the total instances in the dataset [22].
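A deletion method, together with a check of how many instances it would discard, can be sketched as below. The function names and the sample records are hypothetical, and the source gives no specific cutoff for "far fewer", so the sketch only reports the fraction.

```python
from typing import List, Optional, Sequence, Tuple

Row = Tuple[Optional[float], ...]


def listwise_delete(rows: Sequence[Row]) -> List[Row]:
    """Deletion method: keep only instances with no missing (None) field."""
    return [row for row in rows if all(v is not None for v in row)]


def fraction_discarded(rows: Sequence[Row]) -> float:
    """Share of instances that the deletion method would throw away."""
    if not rows:
        return 0.0
    return 1 - len(listwise_delete(rows)) / len(rows)


# Hypothetical records with two features each.
records = [(1.0, 2.0), (None, 3.0), (4.0, 5.0), (6.0, None)]
print(listwise_delete(records))     # [(1.0, 2.0), (4.0, 5.0)]
print(fraction_discarded(records))  # 0.5
```

Here half of the instances would be lost, a case in which deletion is hard to justify and an imputation-based method is the more natural choice.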
Comparison of Performance of Data Imputation Methods for Numeric Dataset
Published in Applied Artificial Intelligence, 2019
Anil Jadhav, Dhanya Pramod, Krishnan Ramanathan
Imputation of missing values: Missing data imputation is a procedure that replaces missing values with plausible values (Rubin 1976). The various imputation techniques aim to provide accurate estimates of population parameters so that the power of data mining and data analysis techniques is not reduced. The optimal treatment of missing data depends on the amount of missing data. Although there is no rule of thumb for what percentage of missing data is too much, it is always better to compare results before and after imputation if more than 25% of the data is missing.
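One simple way to carry out such a before-and-after comparison is to compute the missing rate and a few summary statistics on the observed values and on the imputed series. The sketch below is only an illustration: the function names, the choice of mean and standard deviation as the compared results, and the sample data (filled here by mean imputation) are assumptions.

```python
from statistics import mean, stdev
from typing import Dict, Optional, Sequence


def missing_rate(values: Sequence[Optional[float]]) -> float:
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def compare_before_after(
    values: Sequence[Optional[float]], imputed: Sequence[float]
) -> Dict[str, float]:
    """Summary statistics on observed values versus the imputed series."""
    observed = [v for v in values if v is not None]
    return {
        "missing_rate": missing_rate(values),
        "mean_before": mean(observed),
        "mean_after": mean(imputed),
        "stdev_before": stdev(observed),
        "stdev_after": stdev(imputed),
    }


# Hypothetical series with 40% missing, filled with the observed mean (12.0).
raw = [10.0, None, 14.0, None, 12.0]
filled = [10.0, 12.0, 14.0, 12.0, 12.0]
report = compare_before_after(raw, filled)
if report["missing_rate"] > 0.25:
    # Above the 25% level mentioned in the text, so inspect the comparison.
    print(report)
```

In this example the mean is unchanged by the imputation while the standard deviation shrinks, which is the kind of side effect the comparison is meant to expose.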