Explore chapters and articles related to this topic
Exploratory Data Analysis and Data Visualization
Published in Chong Ho Alex Yu, Data Mining and Exploration, 2022
Exploratory data analysis (EDA) is a strategy of data analysis that emphasizes exploring the data and maintaining an open mind to alternative possibilities. EDA is an attitude or philosophy about how data analysis should be carried out, rather than a fixed set of techniques. This research tradition was founded by John Tukey, who often related EDA to detective work. In EDA, the role of the researcher is to explore the data in as many ways as possible until a plausible “story” of the data emerges. A detective does not collect just any information, but prioritizes clues related to the central question of the case. By the same token, EDA is not about “fishing” or “torturing” the data set until it confesses. Rather, it is a systematic way to investigate relevant information from multiple perspectives (Behrens 1996; 2000; Behrens et al. 2013; De Mast and Trip 2007; De Mast and Kemper 2009; Jebb et al. 2017; Tukey 1977; Yu 2017).
Published in Jacques Buffle, Herman P. van Leeuwen, Environmental Particles, 2019
Besides its direct role in exploratory data analysis and multivariate data display, PCA is the first step in a collection of qualitative and quantitative data analytical techniques collectively labeled Factor Analysis (FA). As with cluster analysis, there is a large literature on the topic. An excellent introduction is given in Massart et al. [93]; the primary text for chemical applications is that by Malinowski [103]; and an introduction to aerosol apportionment applications is found in Hopke [26].
Exploratory Analysis of Run-Off-Road Crash Patterns
Published in Amir H. Alavi, William G. Buttlar, Data Analytics for Smart Cities, 2018
Mohammad Jalayer, Huaguo Zhou, Subasish Das
To determine the most significant contributing factors, and then develop effective safety countermeasures, these numbers require further analysis. A major challenge for state and local agencies is to find patterns in these huge databases. Exploratory data analysis (EDA) is an approach for identifying patterns, changes, and anomalies in large datasets, going beyond hypothesis testing and formal modeling (Cook and Swayne 2007; Chatfield 1995). Using a variety of mostly graphical techniques (e.g., box plot, scatter plot, multiple correspondence analysis, and principal component analysis), EDA can extract specific information from datasets and transform it into an understandable structure. Since run-off-road (ROR) crashes accounted for the majority of roadway departure (RwD) events (about 80 percent), this study uses multiple correspondence analysis (MCA) to identify the key factors contributing to ROR collisions related to the roadway and roadside geometric design features of rural two-lane roads. The MCA method identifies patterns in complex datasets and measures significant contributing factors and their degree of association. To employ this method, datasets from the United States Road Assessment Program (usRAP), a program of the American Automobile Association Foundation for Traffic Safety, were obtained and 5 years (2009–2013) of ROR crash data in Illinois were gathered. To achieve the program’s Toward Zero Deaths vision, agencies are working to decrease the frequency and severity of RwD crashes. The results of this study can help researchers and transportation agencies gain a better understanding of the major contributing factors to ROR crashes and prioritize the locations where safety countermeasures should be implemented (e.g., signage, pavement safety measures, and roadside design improvements).
Exploratory Data Analytics and PCA-Based Dimensionality Reduction for Improvement in Smart Meter Data Clustering
Published in IETE Journal of Research, 2023
Analysis of high-frequency smart meter data is required to find the key insights necessary for the reliable operation of smart grids, especially for load prediction and demand response management. In this work, a technique is suggested for clustering similar load profiles so that the results can be used to target the users eligible for demand response management. The “Irish Smart Meter Database” is used for analyzing and implementing the machine learning (ML) feature extraction and clustering techniques. Because it is essential to understand the distribution of the data before implementing any ML model, exploratory data analysis (EDA) is carried out. The EDA justifies taking an epoch of 16 samples at a time for feature extraction, rather than the data of the entire day, as the better approach. It is also shown that the Time of Use and Day of Use characteristics should be considered while clustering the load profiles. For feature extraction, PCA is used, taking 6 principal components as features (identified by cumulative variance). The dataset is reduced by around 64% using the PCA feature extraction technique. k-means clustering is observed to give better and faster results with this method than clustering of the raw dataset. The optimal number of clusters (k = 5 for residential users and k = 3 for SME users) is found using the “Elbow method” based on the within-cluster sum of squares (WCSS) and the average silhouette coefficient.
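The pipeline described above (choose the number of principal components by cumulative variance, then scan k with the WCSS elbow criterion) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the data are synthetic stand-ins for load-profile features, the 0.90 variance threshold is an assumption (the abstract does not state one), and `n_components_for_variance` and `kmeans_wcss` are hypothetical helper names. The silhouette step is omitted.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.90):
    """Smallest number of PCs whose cumulative variance share reaches threshold."""
    Xc = X - X.mean(axis=0)                      # centre each feature
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values only
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    n = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n, len(cumulative)), cumulative

def kmeans_wcss(X, k, n_iter=50, seed=0):
    """Bare-bones Lloyd's k-means; returns the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centre; keep the old one if its cluster is empty
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).sum())

# Synthetic data (assumption): 3 groups of 50 samples, 6 features each
rng = np.random.default_rng(42)
true_centers = rng.normal(scale=5.0, size=(3, 6))
X = np.vstack([c + rng.normal(size=(50, 6)) for c in true_centers])

n_keep, cumulative = n_components_for_variance(X, threshold=0.90)
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
features = Xc @ Vt[:n_keep].T                    # reduced feature matrix

# WCSS for k = 1..6; the "elbow" is where the decrease flattens out
wcss = {k: kmeans_wcss(features, k) for k in range(1, 7)}
```

Plotting `wcss` against k and looking for the bend gives the elbow estimate. In practice it would be cross-checked with the average silhouette coefficient, as the abstract describes.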
Exploratory framework for analysing road traffic accident data with validation on Gauteng province data
Published in Cogent Engineering, 2020
Tebogo Makaba, Wesley Doorsamy, Babu Sena Paul
Exploratory data analysis (EDA) has been widely used in research, with the literature employing different graphical representations and statistical analyses to perform preliminary investigations on datasets. EDA is well known as an approach that can be used to examine datasets to identify and uncover hidden patterns and answer some important questions (Martinez et al., 2010). The idea behind EDA is to obtain a background context of the dataset to be able to develop an appropriate prediction model. The EDA approach can be employed to identify important variables, detect outliers and spot anomalies in the dataset (Martinez et al., 2010). EDA methods can be classified into four groups (Chambers, 2018; DuToit et al., 2012): the non-graphical univariate and multivariate methods mainly involve the calculation of summary statistics, while the graphical univariate and multivariate methods use graphical means to summarise, analyse and present the dataset. Furthermore, the univariate methods focus on one variable at a time, whereas the multivariate methods focus on two or more variables together to discover their relationships. EDA is a good practice that can be applied in different domains such as anomaly detection, speech recognition and fraud detection.
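As a rough illustration of the two non-graphical groups, the sketch below computes summary statistics and flags outliers for a single variable (univariate), then measures the linear association between two variables (multivariate). The data are made up for the example and are not taken from the study. The 1.5 × IQR fence is the conventional box-plot rule, used here as an assumption, and `univariate_summary` and `pearson_r` are hypothetical helper names.

```python
import statistics

def univariate_summary(values):
    """Non-graphical univariate EDA: summary statistics and fence-based outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # conventional box-plot fences
    return {
        "mean": statistics.fmean(values),
        "median": median,
        "q1": q1,
        "q3": q3,
        "outliers": [v for v in values if v < lower or v > upper],
    }

def pearson_r(xs, ys):
    """Non-graphical multivariate EDA: Pearson correlation of two variables."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical toy data (not from the Gauteng dataset)
speed = [40, 45, 50, 55, 60, 65, 70, 160]
severity = [1, 1, 2, 2, 3, 3, 4, 5]

summary = univariate_summary(speed)   # one variable at a time
r = pearson_r(speed, severity)        # two variables together
```

The graphical counterparts would be a box plot or histogram of `speed` for the univariate case and a scatter plot of `speed` against `severity` for the multivariate case.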
Multivariate comparison of photocatalytic properties of thirteen nanostructured metal oxides for water purification
Published in Journal of Environmental Science and Health, Part A, 2019
Jakub Trawiński, Robert Skibiński
PCA is a chemometric technique widely used for exploratory data analysis, visualization of relationships between samples (e.g. detection of clusters and outliers), and also regression and classification. Its principle relies on converting the original variables into an equal number of latent variables (principal components, PCs), which are orthogonal and are ordered so that each successive PC explains the largest possible share of the remaining data variance. PCs are constructed as linear combinations of the original variables. This process can be seen as a projection of the data structure (matrix X) onto two subspaces: the score matrix T and the loading matrix P (Eq. (1); E denotes the residual matrix).[66]
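Eq. (1) is not reproduced in this excerpt. The sketch below assumes the standard PCA model, in which the column-centred data matrix is written as X = T Pᵀ + E, and obtains the scores and loadings from a singular value decomposition. The random data, the choice of two retained components, and the helper name `pca_decompose` are illustrative assumptions, not details from the paper.

```python
import numpy as np

def pca_decompose(X, n_components):
    """Return scores T, loadings P, residual E and per-PC variance shares."""
    Xc = X - X.mean(axis=0)                      # column-centre the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                      # loadings (orthonormal columns)
    T = Xc @ P                                   # scores: projection onto loadings
    E = Xc - T @ P.T                             # residual not captured by the PCs
    explained = s ** 2 / np.sum(s ** 2)          # variance share of every PC
    return T, P, E, explained

# Hypothetical data: 20 samples described by 5 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

T, P, E, explained = pca_decompose(X, n_components=2)
Xc = X - X.mean(axis=0)

# The model reproduces the centred data exactly once the residual is included
reconstruction_ok = np.allclose(T @ P.T + E, Xc)
```

Retaining all five components would drive E to zero (up to rounding). Keeping only the first few trades reconstruction error for a lower-dimensional view, typically a 2-D score plot in which clusters and outliers can be inspected.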