Greedy Search Methods
Published in Max Kuhn, Kjell Johnson, Feature Engineering and Selection, 2019
When the predictor is numeric, the following options exist:
When the outcome is categorical, the same tests can be used in the case above where the predictor is categorical and the outcome is numeric. The roles are simply reversed in the t-test, curve calculations, F-test, and so on. When there are a large number of tests or if the predictors have substantial multicollinearity, the correlation-adjusted t-scores of Opgen-Rhein and Strimmer (2007) and Zuber and Strimmer (2009) are a good alternative to simple ANOVA statistics.
When the outcome is numeric, a simple pairwise correlation (or rank correlation) statistic can be calculated. If the relationship is nonlinear, then the maximal information coefficient (MIC) values (Reshef et al., 2011) or A statistics (Murrell et al., 2016), which measure strength of the association between numeric outcomes and predictors, can be used.
Alternatively, a generalized additive model (GAM) (Wood, 2006) can fit nonlinear smooth terms to a set of predictors simultaneously and measure their importance using a p-value that tests against a null hypothesis that there is no trend of each predictor. An example of such a model with a categorical outcome was shown in Figure 4.15(b).
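The pairwise correlation and rank correlation statistics mentioned for a numeric outcome can be sketched as follows. This is a minimal illustration, not the book's code: the helper names and the toy data (a monotone but nonlinear outcome) are assumptions chosen to show why a rank correlation can be preferable when the relationship is not linear.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Rank correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical predictor and a monotone, nonlinear numeric outcome.
predictor = [i / 10 for i in range(1, 51)]
outcome = [math.exp(v) for v in predictor]

r_linear = pearson(predictor, outcome)
r_rank = spearman(predictor, outcome)
print(f"Pearson: {r_linear:.3f}, Spearman: {r_rank:.3f}")
```

Because the outcome increases strictly with the predictor, the rank correlation reaches its maximum while the linear correlation is lower. Neither statistic detects non-monotone relationships, which is where measures such as MIC are suggested.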
Knowledge-infused process monitoring for quality improvement in solar cell manufacturing processes
Published in Journal of Quality Technology, 2022
This section aims to identify, among all the extracted features, the key features that are highly dependent on SCE by using variable selection. Many techniques have been developed to address variable selection problems, and the current methods can be classified into two categories. The first category selects key variables based on linear correlation. Examples include the least absolute shrinkage and selection operator (lasso) (Tibshirani 2011), adaptive lasso (Zou 2006), Huber lasso (Owen 2007; Lambert-Lacroix and Zwald 2011), and Tukey lasso (Chang, Roberts, and Welsh 2018). However, these lasso-based methods can only identify features with linear correlations. Many real cases demonstrate that quality variables may depend on features in other ways, such as through nonlinear correlation or by being dependent but uncorrelated. The second category of methods addresses this issue. It includes mutual information (Brown 2009; Vergara and Estévez 2014), maximal information coefficient (MIC) (Reshef et al. 2011), and distance correlation-based key variable selection methods (Szekely, Rizzo, and Bakirov 2007; Yenigün and Rizzo 2015; Kong, Wang, and Wahba 2015). Simon et al. (2014) performed various simulations to compare MIC with standard Pearson correlation and distance correlation, and the results showed that distance correlation is the most powerful dependency measure.
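The "dependent but uncorrelated" scenario can be illustrated with a minimal sketch of the sample distance correlation of Szekely, Rizzo, and Bakirov (2007), computed here from double-centered pairwise distance matrices. The function names and the toy data (a symmetric grid with a quadratic response) are assumptions for illustration and are not taken from the article.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def centered_distances(values):
    """Pairwise |v_i - v_j| matrix, double-centered by row, column, and grand mean."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]

def distance_correlation(x, y):
    """Sample distance correlation in [0, 1] for two univariate samples."""
    n = len(x)
    a, b = centered_distances(x), centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvar_x = sum(v * v for row in a for v in row) / n**2
    dvar_y = sum(v * v for row in b for v in row) / n**2
    if dvar_x == 0 or dvar_y == 0:
        return 0.0  # a constant sample carries no dependence information
    # max() guards against a tiny negative value caused by rounding.
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))

# Hypothetical feature on a symmetric grid; the response depends on it
# only through its square, so the linear correlation vanishes.
feature = [i / 10 for i in range(-20, 21)]
response = [v * v for v in feature]

linear = pearson(feature, response)
dcor = distance_correlation(feature, response)
print(f"Pearson: {linear:.3f}, distance correlation: {dcor:.3f}")
```

Here the Pearson correlation is essentially zero even though the response is a deterministic function of the feature, whereas the distance correlation is clearly positive. A lasso-type linear screen would therefore miss this feature.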
A data-driven prediction model for aircraft taxi time by considering time series about gate and real-time factors
Published in Transportmetrica A: Transport Science, 2023
Fujun Wang, Jun Bi, Dongfan Xie, Xiaomei Zhao
In our study, information divergence in RFR is equivalent to the mutual information (MI) used in the maximal information coefficient. According to the theory of entropy, MI was calculated by where is the set of and .
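For reference, the standard entropy-based mutual information between two discrete variables is I(X;Y) = sum over (x, y) of p(x, y) log[p(x, y) / (p(x) p(y))]. The sketch below estimates it from observed frequencies. The symbols, the function name, and the toy data are assumptions for illustration and are not the article's notation.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats from paired discrete samples."""
    n = len(xs)
    joint_counts = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    total = 0.0
    for (x, y), count in joint_counts.items():
        # p(x, y) / (p(x) p(y)) simplifies to count * n / (count_x * count_y).
        ratio = count * n / (x_counts[x] * y_counts[y])
        total += (count / n) * math.log(ratio)
    return total

# Hypothetical binary samples.
x = [0, 0, 1, 1, 0, 0, 1, 1]
y_copy = list(x)                      # fully determined by x
y_indep = [0, 1, 0, 1, 0, 1, 0, 1]    # empirically independent of x

mi_copy = mutual_information(x, y_copy)
mi_indep = mutual_information(x, y_indep)
print(mi_copy, mi_indep)
```

When one variable is a copy of the other, the estimate equals the entropy of that variable (log 2 nats for a balanced binary variable). When the empirical joint distribution factorizes, the estimate is zero.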
Development of an efficient input selection method for NN based streamflow model
Published in Journal of Applied Water Engineering and Research, 2023
Alireza B. Dariane, Mohamadreza M. Behbahani
Neural network models are used to solve various problems in hydrology and water resources (Govindaraju 2000; Tayfur and Singh 2011; Zhang et al. 2016; Ahani et al. 2018; Hadi and Tombul 2018; Turner et al. 2020). These include applications such as rainfall forecasting (Nourani et al. 2009a, 2009b; Ramana et al. 2013; Hosseini et al. 2020; Ni et al. 2020), streamflow modeling (Maier and Dandy 2000; Cannas et al. 2006; Adamowski 2008; Chang and Chen 2018; Zakhrouf et al. 2018; Zhang et al. 2018; Huang et al. 2019; Ghaith et al. 2020; Wu et al. 2020), sediment transport (Kisi et al. 2012; Zounemat-Kermani et al. 2016; Adnan et al. 2019; Chen and Chau 2019), rainfall-runoff modeling (Abrahart et al. 2012; Young et al. 2017; Bartoletti et al. 2018), and reservoir operation (Dariane and Karami 2014; Dariane and Moradi 2016; Khadr and Schlenkhoff 2018; Ahmad and Hossain 2019). An important step before using an NNSSM is the selection of suitable inputs (Maier and Dandy 2000; Maier et al. 2010; Devi and Sabrigiriraj 2018; Mohamad et al. 2020). Different types of IVS techniques address this problem (May et al. 2011; Galelli and Castelletti 2013; Remesan et al. 2018). If the problem at hand is small, it might be possible to select the inputs by inspection and by trial and error. However, in large-scale problems, this would be very time consuming, and it is hard to determine the suitable variables from among a huge number of possibilities. According to the principle of parsimony, the number of inputs should be kept as small as possible while still obtaining the best results. Moreover, unnecessary additional inputs may not noticeably improve the accuracy of the model output, but they will certainly increase the time required and the risk of errors (Jain and Kumar 2009). Overall, input selection methods can be divided into two categories: Model-Free and Model-Based techniques (Maier et al. 2010; May et al. 2011; Tran et al. 2015; Snieder et al. 2020).
Model-Free input selection methods select inputs independently of any model and act as an initial filter (Kohavi and John 1997). Traditional Model-Free methods such as mutual information (MI) and cross-correlation, and more recently proposed ones such as the Proportional Rough Feature Selector (PRFS) and Orthogonal Maximal Information Coefficient Feature Selection (OMICFS), select inputs based on probabilistic parameters (Coulibaly et al. 2000; Imrie et al. 2000; Bhattacharya and Solomatine 2005; Lyu et al. 2017; Cekik and Uysal 2020).
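A Model-Free filter based on cross-correlation can be sketched as a ranking of candidate lagged inputs by the absolute correlation between each input at time t - lag and the target at time t. This is only an illustration under stated assumptions: the function names, the lag range, and the synthetic series (a target that copies a hypothetical rainfall series with a two-step delay) are invented for the example and are not the authors' method or data.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rank_lagged_inputs(candidates, target, max_lag):
    """Return (score, name, lag) tuples sorted from strongest to weakest.

    The score is |corr(candidate[t - lag], target[t])|; no model is fitted.
    """
    scores = []
    for name, values in candidates.items():
        for lag in range(1, max_lag + 1):
            lagged = values[:-lag]    # candidate observed at time t - lag
            aligned = target[lag:]    # target observed at time t
            scores.append((abs(pearson(lagged, aligned)), name, lag))
    return sorted(scores, reverse=True)

# Synthetic, deterministic series (hypothetical, for illustration only).
steps = 60
rain = [float((7 * t) % 11) for t in range(steps)]
temperature = [math.sin(t / 5) for t in range(steps)]
# The target simply repeats the rainfall value from two steps earlier.
flow = [0.0, 0.0] + rain[:-2]

ranking = rank_lagged_inputs(
    {"rain": rain, "temperature": temperature}, flow, max_lag=3
)
best_score, best_name, best_lag = ranking[0]
print(best_name, best_lag, round(best_score, 3))
```

The filter places rainfall at lag 2 first because the target was constructed from it. On real data the scores would only serve as an initial screen before any Model-Based selection, and a purely linear score would miss nonlinear dependence, which is the motivation for MI-based alternatives.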