Signature Generation Algorithms for Polymorphic Worms
Published in Mohssen Mohammed, Al-Sakib Khan Pathan, Automatic Defense Against Zero-day Polymorphic Worms in Communication Networks, 2016
Mohssen Mohammed, Al-Sakib Khan Pathan
After collecting the dataset, the second step is data preparation and preprocessing. Depending on the circumstances, researchers can choose from a number of methods to handle missing data [15]. Hodge and Austin [16] have presented a survey of contemporary techniques for outlier (noise) detection, identifying the advantages and disadvantages of each technique. Instance selection is used not only to handle noise but also to cope with the infeasibility of learning from very large datasets. Instance selection in these datasets is an optimization problem that attempts to maintain the mining quality while minimizing the sample size [17]. It reduces the data and enables a data-mining algorithm to work effectively with very large datasets. A variety of procedures exist for sampling instances from a large dataset [18].
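The trade-off described above, maintaining the mining quality of the data while minimizing the sample size, can be illustrated with a class-stratified random sample, which shrinks the dataset while preserving per-class proportions. This is a minimal sketch, not any of the cited methods; the function name and parameters are illustrative assumptions:

```python
import random
from collections import defaultdict

def stratified_instance_sample(instances, labels, fraction, seed=0):
    """Select a class-stratified random subset of instances.

    Keeping per-class proportions is one simple way to preserve the
    original data distribution while reducing the sample size.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    selected = []
    for label, idxs in by_class.items():
        # Keep at least one instance per class, even for tiny classes.
        k = max(1, round(len(idxs) * fraction))
        selected.extend(rng.sample(idxs, k))
    selected.sort()
    return [instances[i] for i in selected], [labels[i] for i in selected]

# Example: reduce a toy 70/30 dataset to 30% while keeping both classes.
X = [[i] for i in range(100)]
y = [0] * 70 + [1] * 30
Xs, ys = stratified_instance_sample(X, y, fraction=0.3)
print(len(Xs), ys.count(0), ys.count(1))  # 30 21 9
```

The class ratio in the reduced set (21:9) matches the original 70:30 split, which is the "maintain the mining quality" half of the optimization in miniature.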
Introduction
Published in Sankar K. Pal, Pabitra Mitra, Pattern Recognition Algorithms for Data Mining, 2004
• Sampling/instance selection: Various random, deterministic and density biased sampling strategies exist in statistics literature. Their use in machine learning and data mining tasks has also been widely studied [37, 114, 142]. Note that merely generating a random sample from a large database stored on disk may itself be a non-trivial task from a computational viewpoint. Several aspects of instance selection, e.g., instance representation, selection of interior/boundary points, and instance pruning strategies, have also been investigated in instance-based and nearest neighbor classification frameworks [279]. Challenges in designing an instance selection algorithm include accurate representation of the original data distribution, making fine distinctions at different scales and noticing rare events and anomalies.
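The computational point above, that even drawing a random sample from a large database on disk is non-trivial, arises because the number of records may be unknown in advance and the data cannot be shuffled in memory. Reservoir sampling addresses this in a single sequential pass; a minimal sketch:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Draw a uniform random sample of k items in one pass over a stream
    of unknown length (classic "Algorithm R") -- useful when the data
    lives on disk and cannot be loaded or shuffled in memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)           # inclusive upper bound
            if j < k:
                reservoir[j] = item         # replace with prob. k/(i+1)
    return reservoir

# One pass over a million records, constant memory for the sample.
sample = reservoir_sample(range(1_000_000), k=5)
print(sample)
```

Each item ends up in the sample with equal probability k/n, yet the stream is read only once, which is exactly the property needed for disk-resident datasets.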
A Quasi-Oppositional Based Flamingo Search Algorithm Integrated with Generalized Ring Crossover for Effective Feature Selection
Published in IETE Journal of Research, 2023
Revathi Durgam, Nagaraju Devarakonda
The Binary Chemical Reaction Optimization (BCRO) approach has been introduced for picking a subset of features from the dataset. To enhance the performance of BCRO in picking the subset, Tabu Search (TS) was integrated with it. A K-nearest neighbours (KNN) classifier computes and analyzes the fitness of the selected features. Nine public biomedical datasets were used to evaluate the performance of BCROTS-KNN. Finally, support vector machine (SVM) and naïve Bayes classifiers were also used by Yan and Luo [8] for measuring classification accuracy. G. I. Sayed and G. Khoriba [9] used a dataset instance selection approach to reduce dimensionality. Instance selection integrated with feature extraction diminishes the data and lessens the computation time during training of the classifier. These classifiers still lack robustness, so enhancements are needed before they can make correct predictions on noisy data, data with missing values, and datasets of different categories.
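The role of KNN as the fitness function in a wrapper method such as BCROTS-KNN can be sketched with leave-one-out nearest-neighbour accuracy restricted to a candidate feature subset. This is a simplified illustration under our own assumptions, not the published method:

```python
def knn_fitness(X, y, feature_subset, k=1):
    """Leave-one-out k-NN accuracy using only the chosen features.

    A wrapper feature-selection search (e.g. BCRO + Tabu Search) would
    call a score like this on each candidate subset and keep the subset
    with the highest fitness.  Simplified sketch only.
    """
    def dist(a, b):
        # Squared Euclidean distance in the reduced feature space.
        return sum((a[f] - b[f]) ** 2 for f in feature_subset)

    correct = 0
    for i in range(len(X)):
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: dist(X[i], X[j]),
        )[:k]
        votes = [y[j] for j in neighbours]
        if max(set(votes), key=votes.count) == y[i]:
            correct += 1
    return correct / len(X)

# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0, 5], [0, 1], [1, 9], [1, 2], [0, 7], [1, 0]]
y = [0, 0, 1, 1, 0, 1]
print(knn_fitness(X, y, [0]))  # 1.0 -- the informative feature alone
```

On this toy data the subset containing only the informative feature scores higher than the noise feature, which is the signal the metaheuristic search exploits.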
Instances selection algorithm by ensemble margin
Published in Journal of Experimental & Theoretical Artificial Intelligence, 2018
Meryem Saidi, Mohammed El Amine Bechar, Nesma Settouti, Mohamed Amine Chikh
In data mining, the execution time of the learning process matters: an excessively long runtime increases computational cost. The instance selection step reduces the dataset size by eliminating noisy and redundant instances. In this paper, we propose an instance selection approach based on the ensemble margin, a fundamental concept of ensemble learning, to address the problem of dimensionality reduction without significant performance degradation.
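The ensemble-margin idea can be sketched as follows: given each instance's vote counts from an ensemble (for example a bagging classifier), the margin is the normalized gap between the votes for the true class and the strongest other class, and low-margin instances are treated as noisy or borderline and removed. This is a hedged illustration of the general concept, not the authors' exact algorithm:

```python
def ensemble_margin(votes, true_label):
    """Supervised ensemble margin:
    (votes for true class - max votes for any other class) / total votes.
    Low or negative margins flag instances the ensemble finds hard --
    typically noisy or mislabelled examples."""
    total = sum(votes.values())
    v_true = votes.get(true_label, 0)
    v_other = max((v for c, v in votes.items() if c != true_label), default=0)
    return (v_true - v_other) / total

def select_by_margin(dataset, all_votes, labels, threshold=0.0):
    """Keep only instances whose ensemble margin exceeds the threshold."""
    return [x for x, votes, y in zip(dataset, all_votes, labels)
            if ensemble_margin(votes, y) > threshold]

# Votes of a 10-member ensemble for three instances (class -> count).
votes = [{"a": 9, "b": 1},   # confident: margin 0.8
         {"a": 5, "b": 5},   # ambiguous: margin 0.0
         {"a": 2, "b": 8}]   # likely mislabelled 'a': margin -0.6
labels = ["a", "a", "a"]
kept = select_by_margin([0, 1, 2], votes, labels, threshold=0.2)
print(kept)  # [0]
```

Only the confidently classified instance survives; the ambiguous and apparently mislabelled ones are discarded, which is how margin-based filtering removes noise while shrinking the training set.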
The interpretive model of manufacturing: a theoretical framework and research agenda for machine learning in manufacturing
Published in International Journal of Production Research, 2021
Ajit Sharma, Zhibo Zhang, Rahul Rai
Data collection can be a loosely controlled process, which can result in poor data quality: missing values, incomplete data, and out-of-range values. In addition, data can be unusable because of unmanageable sizes in dimensions, instances, and features. Data preprocessing is the task of taking such raw data and processing it through a sequence of steps to make it suitable for use in ML algorithms. Data preprocessing consists of two broad types of tasks: data quality improvement tasks and data reduction tasks. Tasks for improving data quality include data cleaning, data imputation, data normalisation, noise identification, and data integration. Data reduction tasks comprise a set of techniques that help obtain a reduced representation of the original data. Data reduction techniques include feature selection, instance selection, and discretisation. Feature selection reduces the dimensionality of data. Instance selection reduces the number of instances by selecting a subset of the total available data; this is done by removing redundant or conflicting instances. Discretisation simplifies the domain of an attribute by transforming quantitative data into qualitative data. Different categories of data preprocessing techniques are illustrated in Figure 4. The result expected after taking data through preprocessing tasks is a final dataset that is usable by interpretive algorithms. Of the papers reviewed, 181 mentioned data preprocessing steps. Data preprocessing serves several functions in the ML pipeline. First, it is used to improve the quality of the data through null-value removal, standardisation, denoising, data cleaning, and padding. Second, preprocessing is used for feature extraction. For instance, from time series data one can extract time domain features (Tsai, Chen, and Lou 1999; Xie et al. 2015), frequency domain features (Abouelatta and Madl 2001), and time-frequency domain features (Wang et al. 2016).
For image data one can use methods like normalisation, whitening, scaling, padding, filtering, and rotating (Malhotra et al. 2015; Chien, Hsu, and Chen 2013; Weiss et al. 2016; Wu et al. 2018). For CAD models, one can use techniques like mesh repairing (Balu et al. 2016). Table 6b provides a tabulation of the preprocessing techniques used.
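The instance-reduction step described above, removing redundant or conflicting instances, can be illustrated with a minimal sketch; the function name and the specific duplicate/conflict rules (exact duplicates are redundant, identical feature vectors with different labels are conflicting) are simplifying assumptions of this example:

```python
def deduplicate_instances(X, y):
    """Instance-level data reduction: drop exact duplicates (redundant)
    and drop all copies of any feature vector that appears with more
    than one label (conflicting), keeping first occurrences otherwise."""
    label_sets = {}
    for xi, yi in zip(X, y):
        label_sets.setdefault(tuple(xi), set()).add(yi)
    seen = set()
    Xr, yr = [], []
    for xi, yi in zip(X, y):
        key = tuple(xi)
        if len(label_sets[key]) > 1:   # conflicting labels: discard all
            continue
        if key in seen:                # redundant duplicate: discard
            continue
        seen.add(key)
        Xr.append(xi)
        yr.append(yi)
    return Xr, yr

X = [[1, 2], [1, 2], [3, 4], [5, 6], [5, 6]]
y = [ 0,      0,      1,      0,      1    ]
Xr, yr = deduplicate_instances(X, y)
print(Xr, yr)  # [[1, 2], [3, 4]] [0, 1]
```

The duplicated `[1, 2]` row is kept once, while both copies of `[5, 6]` are dropped because they carry contradictory labels and would confuse a learner either way.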