Exploratory Visualizations
Published in Max Kuhn, Kjell Johnson, Feature Engineering and Selection, 2019
A drawback of the box plot is that it is not effective at identifying distributions that have multiple peaks or modes. As an example, consider the distribution of ridership at the Clark/Lake station (Figure 4.3). Part (a) of this figure is a histogram of the data. To create a histogram, the data are binned into equal-width regions of the variable’s value. The number of samples in each region is counted, and a bar is drawn whose height is the frequency (or percentage) of samples in that region. Like box plots, histograms are simple to create, and they reveal additional distributional characteristics. In the ridership distribution, there are two peaks, which could represent two different mechanisms that affect ridership. The box plot (b) is unable to capture this important nuance. To achieve a compact visualization of the distribution that retains histogram-like characteristics, Hintze and Nelson (1998) developed the violin plot. This plot is created by generating a density estimate of the data and its mirror image. Figure 4.3 (c) is the violin plot, where we can now see the two distinct peaks in the ridership distribution. The lower quartile, median, and upper quartile can be added to a violin plot so that this information is also considered in the overall assessment of the distribution.
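The binning procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the helper name `histogram_counts`, the bin count of 12, and the synthetic two-peaked sample are all assumptions made for the example, and the sample is not the Clark/Lake ridership data.

```python
import random
import statistics


def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would index one past the end, so clamp it
        # into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return lo, width, counts


# Synthetic bimodal sample (hypothetical, for illustration only).
rng = random.Random(0)
sample = [rng.gauss(5, 1) for _ in range(500)]
sample += [rng.gauss(20, 1.5) for _ in range(500)]

lo, width, counts = histogram_counts(sample, n_bins=12)
q1, median, q3 = statistics.quantiles(sample, n=4)

# The five numbers a box plot is built from give no hint of two peaks.
print(f"min={min(sample):.1f} q1={q1:.1f} median={median:.1f} "
      f"q3={q3:.1f} max={max(sample):.1f}")

# The bin counts, drawn as a text histogram, show the gap between the modes.
for i, c in enumerate(counts):
    left = lo + i * width
    print(f"{left:6.1f} to {left + width:6.1f} | {'#' * (c // 10)}")
```

The five-number summary compresses the sample into quartiles and extremes, so the nearly empty bins between the two clusters only become visible once the counts per bin are inspected.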
Exploratory Data Analysis and Data Visualization
Published in Chong Ho Alex Yu, Data Mining and Exploration, 2022
However, data reduction is not always the best choice. On some occasions it is advisable to add details to the graph in order to obtain a more comprehensive perspective of the data. Superimposing violin plots on box plots is one such example. While box plots can both condense the information into five summary points and detect outliers, violin plots supplement them with distributional information. Therefore, the two can complement each other. In short, the data visualizer must balance the need to present noise (details) against the need for smoothness (reduced data). Some users who are accustomed to static graphs tend to stay with the default view, but modern data science necessitates exploratory graphics.
Distribution Shapes
Published in Wendy L. Martinez, Angel R. Martinez, Jeffrey L. Solka, Exploratory Data Analysis with MATLAB®, 2017
The probability density function can be estimated using any suitable approach, for instance kernel density estimation or histograms. Based on this, we see that there is some similarity between violin plots and the variations of the boxplot described previously.
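As a rough sketch of the kernel density route, the following Python code estimates a density with Gaussian kernels and mirrors it to obtain the two edges of a violin. The function names `gaussian_kde` and `violin_outline`, the fixed bandwidth of 0.4, and the sample values are assumptions chosen for illustration; they are not taken from the book, and in practice the bandwidth would usually be selected by a rule or by the plotting library.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density of `values`
    as an average of Gaussian kernels centred on each observation."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


def violin_outline(values, bandwidth, n_points=50):
    """Evaluate the density on an even grid and mirror it.

    Each returned tuple is (position on the value axis, left edge, right edge).
    """
    density = gaussian_kde(values, bandwidth)
    lo, hi = min(values), max(values)
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    return [(x, -density(x), density(x)) for x in grid]


# Hypothetical sample with two clusters.
data = [1.0, 1.2, 1.1, 0.9, 1.3, 4.0, 4.2, 3.9, 4.1, 3.8]
outline = violin_outline(data, bandwidth=0.4)

for x, left, right in outline[::7]:
    print(f"x={x:4.2f}  half-width={right:.3f}")
```

Replacing `gaussian_kde` with a normalized histogram would give a stepped outline instead of a smooth one, which is the sense in which either estimator can be used.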
Estimation and interpretation of equilibrium scour depth around circular bridge piers by using optimized XGBoost and SHAP
Published in Engineering Applications of Computational Fluid Mechanics, 2023
Nasrin Eini, Sayed M. Bateni, Changhyun Jun, Essam Heggy, Shahab S. Band
The violin plot in Fig. 5a shows the distribution of scour depth estimates from RPSO–XGBoost (the best proposed model) in comparison with those obtained using two well-known models (i.e. HEC-18 and Sheppard et al. (2014)) and the best method among the 28 considered existing methods (i.e. Shamshirband et al. (2020)). A similar comparison is made for dse/Y in Fig. 5b. In general, violin plots combine a kernel density plot with a box plot. Unlike box plots, violin plots show both the summary statistics and the density of the data, thereby conveying the variability of the data. The small white dot in the violin plots in Fig. 5 represents the median of the data. The interquartile range is represented by a thick black line. The thin black line covers the data beyond the interquartile range, excluding outliers. The shape of the data distribution is shown by the kernel density approximation on the two sides of the black line. A wider section of the violin plot around the median denotes a high concentration of data in that region. The tapered ends of the violin plot indicate a lower concentration of data in those areas.
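The elements drawn inside such a violin can be computed directly from the data. The sketch below is illustrative only: the function name `violin_summary` and the input values are hypothetical, and the 1.5 × IQR outlier rule is an assumed convention (the common Tukey rule), since the excerpt does not say how outliers were defined.

```python
import statistics


def violin_summary(values):
    """Compute the summary elements typically drawn inside a violin plot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumption: points more than 1.5 * IQR beyond the quartiles are outliers.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inliers = [v for v in data if low_fence <= v <= high_fence]
    return {
        "median": median,                       # white dot
        "box": (q1, q3),                        # thick black line
        "whiskers": (inliers[0], inliers[-1]),  # thin black line
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical measurements with one extreme value.
summary = violin_summary([2, 3, 4, 5, 6, 7, 8, 9, 10, 40])
for name, value in summary.items():
    print(f"{name}: {value}")
```

The density outline on either side of these markers is a separate computation (a kernel density estimate), so the same summary can be paired with any smoothing choice.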
A novel pedestrian road crossing simulator for dynamic traffic light scheduling systems
Published in Journal of Intelligent Transportation Systems, 2023
Dayuan Tan, Mohamed Younis, Wassila Lalouani, Shuyao Fan, Guozhi Song
Figure 18 shows the violin plots of the distributions of vehicle and pedestrian waiting time; their maximum waiting times are annotated as well. In a violin plot, a wider area means higher data density. For example, the leftmost bar in the bottom plot shows the distribution of pedestrian waiting time in our first experiment. The numbers of pedestrians who waited for about 48 s and about 80 s are the largest, as indicated by the width of the bar. The bar above 100 s is very narrow, meaning that very few pedestrians waited for longer than 100 s. The TLS with PCS (the third experiment) significantly reduces the maximum waiting time for both vehicles and pedestrians; its bar is the shortest among the three experiments. By factoring in pedestrians, enough time is allotted for road crossing while calculating the duration of the next phase. This improves intersection crossing safety; yet, the maximum vehicle waiting time is reduced by not overestimating what constitutes safe crossing time. These improvements illustrate that PCS plays an important role in improving the performance of the TLS.
Forecast of rainfall distribution based on fixed sliding window long short-term memory
Published in Engineering Applications of Computational Fluid Mechanics, 2022
Chengcheng Chen, Qian Zhang, Mahsa H. Kashani, Changhyun Jun, Sayed M. Bateni, Shahab S. Band, Sonam Sandeep Dash, Kwok-Wing Chau
Figure 10(a) shows the evaluation metrics (correlation, RMSE and standard deviation) in the form of a Taylor diagram for the LSTM and RF models in the testing period. The point of the LSTM model is closer to the observed point (in blue) than the RF model’s point (2.248 and 2.512, respectively), which indicates that the LSTM model (with a high R and a low RMSE) performs better than the RF technique. Figure 10(b) and (c) show violin and box plots of the models, respectively. A box plot displays the variation in a data set (such as the minimum and maximum of the data). A violin plot is like a box plot; however, it also presents the kernel probability density for different values of the modeled and actual data. Figure 10(b) and (c) show that the violin and box plots of the LSTM model are more similar to those of the actual data than the plots of the RF model are. This means that the statistical characteristics of the rainfall values forecast by the LSTM model are closer to those of the actual data. In other words, the LSTM model is more successful in rainfall modeling than the RF model at Rize Station. It should be noted that this conclusion is drawn by comparing the results of Figures 8 and 9.