Decoding Common Machine Learning Methods
Published in Himansu Das, Jitendra Kumar Rout, Suresh Chandra Moharana, Nilanjan Dey, Applied Intelligent Decision Making in Machine Learning, 2020
Srinivasagan N. Subhashree, S. Sunoj, Oveis Hassanijalilian, C. Igathinathane
Among the selected ML methods, LDA guarantees the minimum classification error when each class in the dataset is normally distributed. The accuracy achieved by the LDA model therefore depends on whether the data are normally distributed. A Shapiro–Wilk normality test, performed with the R function shapiro.test(), revealed that the selected features in the soybean aphid dataset were normally distributed (H0: normal; p > 0.23, not significant), whereas the weed species dataset was not normally distributed (H0: normal; p < 3.45 × 10⁻⁵, significant); LDA is therefore unsuitable for the latter, while kNN handles such datasets. Hence, to yield good model performance, LDA was employed for the soybean aphid dataset and kNN for the weed species dataset in the demonstration.
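As a rough illustration of this selection logic, the following minimal Python sketch tests each feature with a Shapiro–Wilk test (the Python analogue of R's shapiro.test()) and picks LDA or kNN accordingly. The feature matrix, labels, and the 0.05 cut-off are illustrative assumptions, not the authors' data.

```python
# Minimal sketch: Shapiro-Wilk normality check per feature, then LDA vs kNN.
# Data below are hypothetical placeholders, not the soybean aphid / weed datasets.
import numpy as np
from scipy.stats import shapiro
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # placeholder feature matrix
y = rng.integers(0, 2, size=100)     # placeholder class labels

# Test each selected feature for normality (H0: normally distributed).
p_values = [shapiro(X[:, j]).pvalue for j in range(X.shape[1])]

# If no feature rejects normality at alpha = 0.05, prefer LDA; otherwise kNN.
model = (LinearDiscriminantAnalysis() if min(p_values) > 0.05
         else KNeighborsClassifier(n_neighbors=5))
model.fit(X, y)
print(type(model).__name__, [round(p, 3) for p in p_values])
```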
Empirical assessment of the forms of corruption in infrastructure project procurement
Published in Emmanuel Kingsford Owusu, Albert P. C. Chan, Corruption in Infrastructure Procurement, 2020
Emmanuel Kingsford Owusu, Albert P. C. Chan
Regarding the normality test, Kim (2015) opined that almost every statistical test requires the data to be normally distributed. However, this is not always the case, as researchers have no control over how the data turn out. This is the rationale for determining data normality: the distribution of the data influences which tools can be employed for further analysis. As commonly done in other studies (e.g., Gel et al. 2007; Shan et al. 2017), the Shapiro–Wilk test is employed to determine the normality of the data (Olawumi and Chan 2018). The null hypothesis states that the data are normally distributed, tested at a significance level of 0.05. The hypothesis is therefore rejected for any variable whose p-value falls below the 0.05 significance level, in which case the data are concluded to be non-normally distributed. p-values of 0.000 were obtained for all the variables, leading to the conclusion that the data are non-normally distributed.
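A minimal sketch of this decision rule is shown below, assuming a Shapiro–Wilk test at a 0.05 significance level; the variable names and response values are hypothetical placeholders, not the study's corruption-form data.

```python
# Minimal sketch: reject H0 (normality) for any variable with p < 0.05.
# Variable names and data are hypothetical placeholders.
import numpy as np
from scipy.stats import shapiro

alpha = 0.05
rng = np.random.default_rng(1)
variables = {
    "variable_1": rng.exponential(size=200),   # skewed placeholder responses
    "variable_2": rng.exponential(size=200),
}

for name, values in variables.items():
    stat, p = shapiro(values)
    verdict = "non-normal (reject H0)" if p < alpha else "normal (fail to reject H0)"
    print(f"{name}: W={stat:.3f}, p={p:.3f} -> {verdict}")
```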
Impacts and policy implications of heavy metals effluent discharges into rivers within industrial Zones: A sub-Saharan perspective from Ethiopia
Published in Eskinder Zinabu Belachew, Estimating Combined Loads of Diffuse and Point-Source Pollutants into the Borkena River, Ethiopia, 2019
All water quality data analyses were performed with R statistical packages (R Core Team, 2015). Normality of the data was first tested using a Shapiro–Wilk normality test (Degens and Donohue, 2002; Shapiro and Wilk, 1965) in order to choose the appropriate statistical methods for further analysis. Descriptive statistics were then computed for the results of the sample analyses. The data set for each station was found to be asymmetrically distributed, with the mean values affected by a few high or low values (Table 2.1). To best summarize these data sets, median values were selected as a better representation of the central tendency of concentrations at each station (Bartley et al., 2012). These median values were compared with environmental guidelines.
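The workflow of testing normality first and then choosing the median as the summary statistic for skewed station data can be sketched as follows. The original analysis was done in R; this Python sketch with placeholder concentration values is only an illustration of the logic.

```python
# Minimal sketch: Shapiro-Wilk per station, then median (if non-normal) or mean.
# Concentration values are placeholders, not the Borkena River measurements.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(2)
stations = {
    "station_A": rng.lognormal(mean=0.0, sigma=1.0, size=30),  # skewed by a few high values
    "station_B": rng.normal(loc=5.0, scale=1.0, size=30),
}

for name, conc in stations.items():
    p = shapiro(conc).pvalue
    use_median = p < 0.05
    central = np.median(conc) if use_median else np.mean(conc)
    label = "median" if use_median else "mean"
    print(f"{name}: Shapiro-Wilk p={p:.3f}, {label}={central:.2f}")
```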
Data-driven assessment on the corporate credit scoring mechanism for Chinese construction supervision companies
Published in Construction Management and Economics, 2023
Jun Wang, Xiaodong Li, Ashkan Memari, Martin Skitmore, Yuying Zhong, Baabak Ashuri
Descriptive statistics are first summarized to understand the frequency, mean, standard deviation, range, and distribution of CSC credit scores for each evaluation period since 2014. The total frequency represents the total number of CSCs participating in the credit evaluation during each period. The Shapiro–Wilk test (Shapiro and Wilk 1965) is conducted to check the data for normality, as it is considered the most powerful of the common normality tests (Razali and Wah 2011). To answer the first research question, the average credit scores of the 20% highest, the 20% lowest, and all CSCs are calculated. The non-parametric (distribution-free) Mann–Kendall (MK) test (Mann 1945, Kendall 1975), run with an existing Python package (Hussain and Mahmud 2019), is used to assess whether to accept or reject the null hypothesis that the average credit score has no monotonic trend, where the trend is measured by cumulatively comparing each later-observed average score with all average scores observed earlier. The MK test is also applied to the total frequency to assess the null hypothesis that the number of CSCs participating in the credit evaluation has no monotonic trend, measured in the same cumulative manner.
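A minimal sketch of this two-step check is given below, assuming the cited Python package is pymannkendall (Hussain and Mahmud 2019); the score series is a made-up placeholder, not the CSC credit data.

```python
# Minimal sketch: Shapiro-Wilk normality check, then Mann-Kendall trend test.
# Scores below are hypothetical, one value per evaluation period.
import pymannkendall as mk
from scipy.stats import shapiro

avg_scores = [86.2, 86.9, 87.4, 88.1, 88.0, 88.7, 89.3, 89.9]

# Normality check on the period-average scores.
print("Shapiro-Wilk p =", round(shapiro(avg_scores).pvalue, 3))

# Non-parametric Mann-Kendall test: H0 = no monotonic trend.
result = mk.original_test(avg_scores)
print(result.trend, "p =", round(result.p, 4), "Kendall's tau =", round(result.Tau, 3))
```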
A shift to green cybersecurity sustainability development: Using triple bottom-line sustainability assessment in Qatar transportation sector
Published in International Journal of Sustainable Transportation, 2023
Khalifa AL-Dosari, Noora Fetais, Murat Kucukvar
For assessing a structural equation model (SEM), the normality test assumes that the constructs and their items are normally distributed (Swiatkowski et al., 2020). Choosing the most appropriate normality test is an important responsibility in research. To address this, the study employed skewness and kurtosis tests of normality, which are useful for both large and small samples. The normality check examined the data before the structural equation modeling test was conducted (Kline, 2005). Normality tests are used to determine whether the gathered data are normal. Depending on the composition of the values on either side of the distribution, the skewness value is positive (+) or negative (−): a negative sign (−) indicates left-side skewness, while a positive sign (+) indicates right-side skewness. According to Kline (2005), skewness should lie between −2 and +2, and kurtosis between −7 and +7. Because the dimensions are unprecedented in the Qatar transportation context, it was necessary to evaluate whether the data are normally distributed.
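The screening described above can be sketched as follows, applying Kline's (2005) cut-offs of |skewness| < 2 and |kurtosis| < 7; the item responses are hypothetical Likert-style placeholders, not the survey data.

```python
# Minimal sketch: skewness/kurtosis screening against Kline's (2005) thresholds.
# Item responses are hypothetical placeholders on a 5-point scale.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(3)
item_responses = rng.integers(1, 6, size=300)  # placeholder item (values 1..5)

sk = skew(item_responses)
ku = kurtosis(item_responses)  # excess kurtosis (0 for a normal distribution)

acceptable = (-2 <= sk <= 2) and (-7 <= ku <= 7)
print(f"skewness={sk:.2f}, kurtosis={ku:.2f}, acceptable for SEM: {acceptable}")
```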
Quantifying the non-normality of shear strength of geomaterials
Published in European Journal of Environmental and Civil Engineering, 2020
Obviously, fitting the random model to a relatively small number of data points will produce a large estimation error. Although a small number of values (such as fewer than 10 points, as specified by Hallam, 1990) might provide a rough estimate of the centre of a distribution, the apparent spread must be treated with caution. As noted by Öztuna et al. (2006), with small sample sizes normality tests have little power to reject the null hypothesis that the data follow a normal distribution. The identified PDCs can likewise be meaningless if a very small number of observed samples is used. A similar phenomenon occurs in the univariate case, where the shape of a histogram is not meaningful for small sample sizes. The limitations of the technique proposed here pertain to the size of the problems for which scenarios are generated. Accurate shear strength characterisation requires large data sets, and a sample size of around 20 to 50 is desirable (Gill, Corthésy, & Leite, 2005; Öztuna et al., 2006; Wu, 2013a). With the limited amount of data available for the GCL materials, no general conclusion regarding normality can be drawn. Future research is encouraged to carry out tests focused on exploring these uncertainties.
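The low power of normality tests at small sample sizes can be illustrated with a short simulation sketch, assuming a clearly skewed (lognormal) population and the sample sizes shown; the numbers are illustrative, not results from the shear strength data.

```python
# Minimal sketch: with few observations, Shapiro-Wilk rarely rejects normality
# even for clearly skewed (lognormal) data. Sample sizes are illustrative.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(4)
for n in (5, 10, 20, 50):
    rejections = sum(
        shapiro(rng.lognormal(size=n)).pvalue < 0.05 for _ in range(1000)
    )
    print(f"n={n:2d}: rejected H0 (normality) in {rejections / 1000:.0%} of 1000 trials")
```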