Decoding Common Machine Learning Methods
Published in Himansu Das, Jitendra Kumar Rout, Suresh Chandra Moharana, Nilanjan Dey, Applied Intelligent Decision Making in Machine Learning, 2020
Srinivasagan N. Subhashree, S. Sunoj, Oveis Hassanijalilian, C. Igathinathane
The calculated Euclidean distance values are sorted in ascending order, and the “k” closest data points are selected from the sorted results. The class that occurs most frequently among these “k” points is assigned as the class of the test data instance. The “k” value plays a crucial role in determining the predictive capability of the kNN algorithm. A rule-of-thumb approach for fixing the “k” value is to take the square root of the total number of training observations. It is important to note that the predictive performance of the kNN model also depends on the scale of the features. Feature scaling procedures such as normalization or standardization should be performed before employing the kNN algorithm when features exist at different scales (units); this step, however, can be disregarded if the features already have the same units. The pseudocode decoding the kNN algorithm is presented (Algorithm 2.3).
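A minimal Python sketch of the steps just described (not the book's Algorithm 2.3): compute Euclidean distances on standardized features, sort them, select the k nearest training points, and assign the majority class. The toy data and function name are illustrative only.

```python
# Sketch of the kNN classification steps described above (illustrative only).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=None):
    # Rule of thumb: k = square root of the number of training observations
    if k is None:
        k = max(1, int(round(np.sqrt(len(X_train)))))

    # Standardize features so differing units do not dominate the distance
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)      # guard against constant features
    X_scaled = (X_train - mu) / sigma
    x_scaled = (x_test - mu) / sigma

    # Euclidean distance from the test instance to every training instance
    distances = np.sqrt(((X_scaled - x_scaled) ** 2).sum(axis=1))

    # Sort ascending and keep the k closest points
    nearest = np.argsort(distances)[:k]

    # Assign the class with the highest frequency among the k neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Example usage with a toy two-class dataset
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([1.2, 1.9])))  # -> 0
```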
Handling Missing Data
Published in Max Kuhn, Kjell Johnson, Feature Engineering and Selection, 2019
When the training set is small or moderate in size, K-nearest neighbors can be a quick and effective method for imputing missing values (Eskelson et al., 2009; Tutz and Ramzan, 2015). This procedure identifies a sample with one or more missing values. Then it identifies the K most similar samples in the training data that are complete (i.e., have no missing values in the relevant columns). Similarity of samples for this method is defined by a distance metric. When all of the predictors are numeric, standard Euclidean distance is commonly used as the similarity metric. After computing the distances, the K closest samples to the sample with the missing value are identified and the average value of the predictor of interest is calculated. This value is then used to replace the missing value of the sample.
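The following Python sketch illustrates this imputation procedure under the assumption that all predictors are numeric; the data frame, column names, and helper function are hypothetical, not from the book.

```python
# Sketch of K-nearest-neighbors imputation for a single missing numeric value.
import numpy as np
import pandas as pd

def knn_impute_value(df, row_idx, target_col, k=5):
    """Impute df.loc[row_idx, target_col] from the K most similar complete rows."""
    # Candidate donors: rows with no missing values at all
    complete = df.dropna()

    # Compare on the predictors that the incomplete row actually has
    incomplete_row = df.loc[row_idx]
    obs_cols = [c for c in df.columns
                if c != target_col and pd.notna(incomplete_row[c])]

    # Standard Euclidean distance over the observed predictors
    diffs = complete[obs_cols].to_numpy() - incomplete_row[obs_cols].to_numpy(dtype=float)
    dists = np.sqrt((diffs ** 2).sum(axis=1))

    # Average the predictor of interest over the K closest complete samples
    nearest = complete.iloc[np.argsort(dists)[:k]]
    return nearest[target_col].mean()

# Example: fill a single missing 'x2' value in a toy data frame
df = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0, 10.0],
                   "x2": [1.1, 2.1, np.nan, 4.2, 9.8]})
df.loc[2, "x2"] = knn_impute_value(df, row_idx=2, target_col="x2", k=2)
print(df)
```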
Network anomaly detection using a fuzzy rule-based classifier
Published in Debatosh Guha, Badal Chakraborty, Himadri Sekhar Dutta, Computer, Communication and Electrical Technology, 2017
Soumadip Ghosh, Arindrajit Pal, Amitava Nag, Shayak Sadhu, Ramsekher Pati
In this work, we use a fuzzy rule-based classifier (RIPPER) as the primary classifier and compare its performance with two other classifiers, namely K-Nearest Neighbor (KNN) (Altman 1992) and Support Vector Machine (SVM) (Cortes 1995), which have been chosen as secondary classifiers. KNN uses Euclidean distance to calculate the distance between neighbors and classifies data according to the nearest ones. SVM uses the dot product as the kernel function and sequential minimal optimization to find the separating hyperplane. These two classifiers are chosen to compare their performance with the fuzzy rule-based classifier. KNN is a lazy classifier that produces good results when the available dataset is large. The dataset that we have used is fairly large (containing more than 25,000 tuples), and thus KNN is selected. SVM is a supervised classification model that works well with large datasets. SVM is built on the theory of separating hyperplanes in an n-dimensional space, where n is the number of attributes in the dataset. As we have a small number of attributes, the use of this classifier should produce good results.
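As a rough illustration only, the two secondary classifiers could be configured along these lines with scikit-learn; the synthetic data, split, and parameter values are assumptions, not the paper's actual experimental setup.

```python
# Illustrative setup of the two secondary classifiers (KNN and SVM).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for a large dataset with a small number of attributes
X, y = make_classification(n_samples=25000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# KNN: Euclidean distance between neighbors, majority vote among the nearest
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X_tr, y_tr)

# SVM: linear (dot-product) kernel; scikit-learn's SVC relies on libsvm,
# which uses an SMO-style decomposition to solve the underlying optimization
svm = SVC(kernel="linear").fit(X_tr, y_tr)

print("KNN accuracy:", knn.score(X_te, y_te))
print("SVM accuracy:", svm.score(X_te, y_te))
```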
Inverse Estimation of Breast Tumor Size and Location with Numerical Thermal Images of Breast Model Using Machine Learning Models
Published in Heat Transfer Engineering, 2023
Gonuguntla Venkatapathy, Anuj Mittal, Nagarajan Gnanasekaran, Vijay H. Desai
The K-nearest neighbor method is used for both classification and regression; here, its regression form, K nearest neighbor regression (KNNR), is employed. KNN works by finding neighboring points using a similarity metric. The most common similarity metric used by KNN is the Euclidean distance between the unknown point and the other points in the dataset. Eq. (6) provides a general formula for calculating the Euclidean distance, d(x, y) = √(Σᵢ (xᵢ − yᵢ)²), where y1 to yn are the attribute values of one observation and x1 to xn are the attribute values of the other observation. KNN assumes that similar things exist in close proximity. This means that the output of a new input data point is calculated as some average of the outputs of the data points in proximity to the new point. Of these points, the K that most closely resemble the new point are considered. In the case of thermography, the KNNR looks for temperature distributions in the known dataset that closely resemble the temperature distribution of a breast surface that is to be diagnosed [44].
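A minimal sketch of KNN regression as described: the prediction for a new point is the average of the outputs of its K nearest neighbours. The "temperature profile" features and depth outputs below are invented for illustration and are not data from the study.

```python
# Sketch of KNN regression: average the outputs of the K nearest neighbours.
import numpy as np

def knnr_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every known observation
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Average the outputs of the K most similar observations
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

# Hypothetical surface temperature profiles (features) mapped to a tumor depth (output)
X = np.array([[36.9, 37.1, 37.0],
              [37.4, 37.8, 37.5],
              [36.8, 36.9, 36.8],
              [37.5, 37.9, 37.6]])
y = np.array([2.0, 0.8, 2.5, 0.7])  # hypothetical depths in cm
print(knnr_predict(X, y, np.array([37.3, 37.7, 37.4]), k=2))  # ~0.75
```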
Analysis of Suitable Machine Learning Imputation Techniques for Arthritis Profile Data
Published in IETE Journal of Research, 2022
Uma Ramasamy, Sundar Santhoshkumar
Each machine learning-based imputation technique has its own merits and demerits. The kNN imputation replaces the missing value using the k-Nearest Neighbors (kNN) approach. A Euclidean distance metric adapted to handle missing values is used to find the nearest neighbours: the distance to each neighbour is calculated as the square root of a weight multiplied by the sum of squared differences over the features present in both instances, where the weight compensates for the missing coordinates. The ‘k’ instances with the smallest distances are selected, and each missing feature value is imputed as the uniformly weighted average of that feature’s values from the ‘k’ nearest neighbours. The kNN imputation works well for small datasets. However, it is sensitive to outliers because it uses Euclidean distance. Moreover, it is computationally expensive and cannot be applied to categorical data.
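A brief sketch using scikit-learn's KNNImputer, whose missing-value-aware Euclidean distance and uniform averaging correspond to the behaviour described above; the toy matrix is illustrative and unrelated to the arthritis profile data.

```python
# Sketch of kNN imputation with scikit-learn's KNNImputer (illustrative data).
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Distances are computed over the coordinates present in both rows and
# rescaled to compensate for missing ones; each missing entry is replaced
# by the uniform average of that feature over the 2 nearest neighbours.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
print(imputer.fit_transform(X))
```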
Instance-based learning of marker proteins of carcinoma cells for cell death/ survival
Published in Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 2020
k-NN is also known as Lazy Learning and Instance/Case/Memory/Example-Based Reasoning. As the size of the training dataset increases, the computational complexity of kNN increases. If the prediction depends on the mean or median of the k most similar instances, then kNN is used for regression problems, whereas if the output is calculated as the class with the highest frequency among the k most similar instances, then k-NN is used for classification. Let us define a distance function d(u, v), where u and v are two elements or case samples from a set D. The value of the distance function is a non-negative real number defined on the Cartesian product D × D. The different approaches for calculating distance are the Euclidean distance, city block/Manhattan distance, Chebyshev distance, and Minkowski distance function.
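The four distance functions mentioned can be written compactly as follows; this is a generic sketch, not code from the article.

```python
# Sketch of the distance functions listed above for two points u and v.
import numpy as np

def euclidean(u, v):
    return np.sqrt(((u - v) ** 2).sum())   # straight-line distance

def manhattan(u, v):
    return np.abs(u - v).sum()              # city block distance

def chebyshev(u, v):
    return np.abs(u - v).max()              # largest coordinate difference

def minkowski(u, v, p=3):
    # Generalizes Manhattan (p = 1) and Euclidean (p = 2)
    return (np.abs(u - v) ** p).sum() ** (1.0 / p)

u, v = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean(u, v), manhattan(u, v), chebyshev(u, v), minkowski(u, v))
```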