Data Mining
Published in Bogdan M. Wilamowski, J. David Irwin, Intelligent Systems, 2018
The typical knowledge discovery process (Figure 30.2) can be described in three main phases: data integration (DI), OLAP, and front-end knowledge presentation tools. The first phase, data integration, entails data preprocessing such as data cleaning, integration, selection, and transformation. Data cleaning pertains to the removal of noisy and inconsistent data, while data integration refers to the merging of disparate data sources (from Oracle databases to flat ASCII files). Data selection relates to retrieving the data applicable to the task, and transformation involves converting the data into a format adequate for the next step, data mining. This phase is popularly known as ETL (extraction, transformation, loading), and it results in the creation of data marts (DM) and data warehouses (DW); data warehouses are large data repositories composed of data marts. The final phase comprises the front-end tools, often referred to as business intelligence (BI) portal-type tools.
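A minimal Python sketch of the ETL flow described above, using pandas and a SQLite file standing in for the warehouse; the source data, column names, and table name are hypothetical stand-ins, not part of the original chapter:

```python
# Minimal ETL sketch: extract records from two disparate sources, clean and
# transform them, then load the result into a SQLite table standing in for a
# data mart. The column names and table name are hypothetical.
import sqlite3
import pandas as pd

# Extract: two disparate sources, e.g. a database export and a flat ASCII file.
orders_db = pd.DataFrame({"order_id": [1, 2, 2],
                          "order_date": ["2018-01-03"] * 3,
                          "amount": [10.0, 5.5, 5.5]})
orders_flat = pd.DataFrame({"order_id": [3, None],
                            "order_date": ["2018-01-04", "2018-01-05"],
                            "amount": [7.25, 1.0]})
raw = pd.concat([orders_db, orders_flat], ignore_index=True)

# Clean: remove duplicate rows and records with a missing key.
clean = raw.drop_duplicates().dropna(subset=["order_id"]).copy()

# Transform: convert types and select the task-relevant columns.
clean["order_date"] = pd.to_datetime(clean["order_date"])
mart = clean[["order_id", "order_date", "amount"]]

# Load: write the data mart into the warehouse file.
with sqlite3.connect("warehouse.db") as conn:
    mart.to_sql("orders_mart", conn, if_exists="replace", index=False)
```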
Data generation, collection, analysis, and preprocessing
Published in Madhusree Kundu, Palash Kumar Kundu, Seshu Kumar Damarla, Chemometric Monitoring: Product Quality Assessment, Process Fault Detection, and Applications, 2017
Madhusree Kundu, Palash Kumar Kundu, Seshu Kumar Damarla
This chapter presents computer-based data acquisition, design of experiments for data generation with illustrations, and data preprocessing. Outlier detection, data reconciliation, and data transforms are included as components of data preprocessing. Basic statistical measures, hypothesis testing, and regression are also covered. This book deploys data-based techniques in designing miscellaneous applications. It is essential to determine the nature of the underlying process, whether it is stochastic or deterministic, and whether the collected data are stationary or nonstationary, so that the appropriate data preprocessing and application algorithms can be developed with confidence. The chapter demonstrates and exploits the power of simple EXCEL and MATLAB® functions through suitably designed problems. It serves as a handy, self-contained guide on data and can be read as an elaborate taxonomy that eases the transition to the core chapters of this book.
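As an illustration of one preprocessing component named in the chapter, the following is a small Python sketch of z-score-based outlier detection (the chapter itself works with EXCEL and MATLAB® functions); the readings and the 2.5-sigma threshold are assumptions made for the example:

```python
# Python sketch of z-score outlier detection, one of the preprocessing components
# named above. The readings and the 2.5-sigma threshold are assumptions.
import numpy as np

def zscore_outliers(x, threshold=2.5):
    """Boolean mask marking points more than `threshold` standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    return np.abs(z) > threshold

readings = np.array([20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 19.7, 20.1, 20.0, 19.9, 35.7])
print(zscore_outliers(readings))  # only the 35.7 reading is flagged
```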
Analysis and Prediction of Crime Rate against Women Using Classification and Regression Trees
Published in Pallavi Vijay Chavan, Parikshit N Mahalle, Ramchandra Mangrulkar, Idongesit Williams, Data Science, 2022
Data preprocessing is a data analytics technique mainly used to convert data from a raw format into a meaningful, machine-understandable format. Data cleaning is one of the main parts of data preprocessing; it handles missing and noisy information in the dataset. Missing data can be handled in several ways: the affected tuples can be ignored, rows can be eliminated from the dataset when repeated values are present, or the missing values can be filled in manually, with the mean, or with other probable values. In this chapter, the missing values are filled by using mean values.
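A minimal pandas sketch of the strategies described above, dropping incomplete tuples versus filling missing values with the column mean; the toy dataframe and its column names are hypothetical:

```python
# Sketch of two ways to handle missing data: drop incomplete tuples, or impute
# with the column mean (the strategy used in the chapter). Columns are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":        [23, 35, np.nan, 41, 29],
    "complaints": [2, np.nan, 1, 3, np.nan],
})

# Option 1: ignore (drop) tuples that contain missing values.
dropped = df.dropna()

# Option 2: fill missing values with the column mean.
imputed = df.fillna(df.mean(numeric_only=True))
print(imputed)
```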
Food products pricing theory with application of machine learning and game theory approach
Published in International Journal of Production Research, 2022
Mobina Mousapour Mamoudan, Zahra Mohammadnazari, Ali Ostadi, Ali Esfahbodi
Data preprocessing is a significant step in achieving better performance and lower forecast error in machine learning models and deep learning-based models. Data preprocessing is about dealing with inconsistent, missing, and noisy data; the database used in this article does not contain this type of data. In addition, data preprocessing included data cleansing, normalisation, and restructuring. Because the machine learning models and CNN-LSTM-GA are sensitive to the scale of the inputs, the data are normalised using feature scaling in the range [0, 1]. The normalisation method is shown in Equation (13), x_scaled = (x - x_min) / (x_max - x_min), where x_min and x_max are the minimal and maximal values of each input data series. We use the first 70% of the dataset to train the model (training set), the next 10% for hyper-parameter tuning and model optimisation (validation set), and the remaining 20% to test and evaluate the model (test set).
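A hedged sketch of this preprocessing pipeline, min-max feature scaling into [0, 1] followed by the chronological 70/10/20 split; the synthetic price series stands in for the article's data:

```python
# Min-max feature scaling into [0, 1] (Equation 13) and a 70/10/20 chronological
# train/validation/test split. The price series is synthetic.
import numpy as np

prices = np.random.default_rng(0).uniform(50, 150, size=1000)  # synthetic series

# Feature scaling: x_scaled = (x - x_min) / (x_max - x_min)
x_min, x_max = prices.min(), prices.max()
scaled = (prices - x_min) / (x_max - x_min)

# Chronological split: first 70% train, next 10% validation, last 20% test.
n = len(scaled)
train = scaled[: int(0.7 * n)]
val   = scaled[int(0.7 * n): int(0.8 * n)]
test  = scaled[int(0.8 * n):]
print(len(train), len(val), len(test))  # 700 100 200
```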
Low cost network traffic measurement and fast recovery via redundant row subspace-based matrix completion
Published in Connection Science, 2023
Kai Jin, Kun Xie, Jiazheng Tian, Wei Liang, Jigang Wen
Table 2 shows the data sets used in this study. Data normalization (Aksoy & Haralick, 2001) is often applied in data preprocessing to scale the features of the data into the range [0, 1]. We normalise the Abilene and GÉANT data sets in the following manner: x_scaled = (x - x_min) / (x_max - x_min), where x_min and x_max are the minimum and maximum values of the data set. This normalization process does not affect our recovery results: after we obtain the recovery results, we first apply the inverse of the normalization operation to them and then calculate the recovered error ratio.
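The normalisation and its inverse can be sketched in Python as follows; the random traffic matrix and the identity "recovery" step are placeholders, not the paper's matrix-completion method:

```python
# Sketch of min-max normalisation and its inverse; the matrix and the identity
# "recovery" step are placeholders for the actual traffic data and recovery method.
import numpy as np

def normalise(x):
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min), x_min, x_max

def denormalise(x_scaled, x_min, x_max):
    return x_scaled * (x_max - x_min) + x_min

traffic = np.random.default_rng(1).uniform(0.0, 1e6, size=(144, 500))  # placeholder matrix
scaled, t_min, t_max = normalise(traffic)

recovered_scaled = scaled.copy()  # stand-in for the matrix-completion output
recovered = denormalise(recovered_scaled, t_min, t_max)

# The error ratio is computed on the de-normalised values.
error_ratio = np.linalg.norm(recovered - traffic) / np.linalg.norm(traffic)
print(error_ratio)  # ~0 for this identity placeholder
```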
Feature Selection and Instance Selection from Clinical Datasets Using Co-operative Co-evolution and Classification Using Random Forest
Published in IETE Journal of Research, 2022
V. R. Elgin Christo, H. Khanna Nehemiah, J. Brighty, Arputharaj Kannan
Clinical Decision Support Systems (CDSS) rely on knowledge extraction from clinical datasets for decision-making. Generally, physicians use CDSSs as a source of opinion for disease diagnosis [1]. Knowledge elicitation from clinical databases plays a major role in building up a CDSS. The major steps involved in knowledge discovery from databases include data selection, data cleaning, data integration, data reduction, data mining, pattern evaluation, and knowledge representation [2]. Data preprocessing involves data cleaning, data integration, data transformation, and data reduction. Data mining is a vital step in the knowledge discovery process and it helps in extracting patterns, models, and rules from the dataset [2].
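As a hedged illustration of the final mining step named above, the following sketch trains a random forest classifier on a preprocessed dataset with scikit-learn; the bundled breast-cancer data and the 80/20 split are stand-ins, not the clinical datasets or the co-operative co-evolution method of the paper:

```python
# Random forest classification on a preprocessed dataset; the bundled
# breast-cancer data is only a stand-in for a clinical dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```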