Decoding Common Machine Learning Methods
Published in Himansu Das, Jitendra Kumar Rout, Suresh Chandra Moharana, Nilanjan Dey, Applied Intelligent Decision Making in Machine Learning, 2020
Srinivasagan N. Subhashree, S. Sunoj, Oveis Hassanijalilian, C. Igathinathane
The calculated Euclidean distance values are sorted in ascending order, and the “k” closest data points are selected from the sorted results. The class that occurs most frequently among these “k” points is assigned as the class of the test data instance. The “k” value plays a crucial role in determining the predictive capability of the kNN algorithm. A rule-of-thumb approach for fixing the “k” value is to take the square root of the total number of training observations. It is important to note that the predictive performance of the kNN model also depends on the scale of the features. Feature scaling procedures such as normalization or standardization should be performed before employing the kNN algorithm when features exist at different scales (units); this step, however, can be disregarded if the features already have the same units. The pseudocode decoding the kNN algorithm is presented (Algorithm 2.3).
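A minimal Python sketch of the steps just described (not the book's Algorithm 2.3): compute Euclidean distances on standardized features, sort them, select the k nearest training points, and assign the majority class. The toy data and function name are illustrative only.

```python
# Sketch of the kNN classification steps described above (illustrative only).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=None):
    # Rule of thumb: k = square root of the number of training observations
    if k is None:
        k = max(1, int(round(np.sqrt(len(X_train)))))

    # Standardize features so differing units do not dominate the distance
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)      # guard against constant features
    X_scaled = (X_train - mu) / sigma
    x_scaled = (x_test - mu) / sigma

    # Euclidean distance from the test instance to every training instance
    distances = np.sqrt(((X_scaled - x_scaled) ** 2).sum(axis=1))

    # Sort ascending and keep the k closest points
    nearest = np.argsort(distances)[:k]

    # Assign the class with the highest frequency among the k neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Example usage with a toy two-class dataset
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([1.2, 1.9])))  # -> 0
```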
Handling Missing Data
Published in Max Kuhn, Kjell Johnson, Feature Engineering and Selection, 2019
When the training set is small or moderate in size, K-nearest neighbors can be a quick and effective method for imputing missing values (Eskelson et al., 2009; Tutz and Ramzan, 2015). This procedure identifies a sample with one or more missing values. Then it identifies the K most similar samples in the training data that are complete (i.e., have no missing values in the relevant columns). Similarity of samples for this method is defined by a distance metric. When all of the predictors are numeric, standard Euclidean distance is commonly used as the similarity metric. After computing the distances, the K closest samples to the sample with the missing value are identified and the average value of the predictor of interest is calculated. This value is then used to replace the missing value of the sample.
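The following Python sketch illustrates this imputation procedure under the assumption that all predictors are numeric; the data frame, column names, and helper function are hypothetical, not from the book.

```python
# Sketch of K-nearest-neighbors imputation for a single missing numeric value.
import numpy as np
import pandas as pd

def knn_impute_value(df, row_idx, target_col, k=5):
    """Impute df.loc[row_idx, target_col] from the K most similar complete rows."""
    # Candidate donors: rows with no missing values at all
    complete = df.dropna()

    # Compare on the predictors that the incomplete row actually has
    incomplete_row = df.loc[row_idx]
    obs_cols = [c for c in df.columns
                if c != target_col and pd.notna(incomplete_row[c])]

    # Standard Euclidean distance over the observed predictors
    diffs = complete[obs_cols].to_numpy() - incomplete_row[obs_cols].to_numpy(dtype=float)
    dists = np.sqrt((diffs ** 2).sum(axis=1))

    # Average the predictor of interest over the K closest complete samples
    nearest = complete.iloc[np.argsort(dists)[:k]]
    return nearest[target_col].mean()

# Example: fill a single missing 'x2' value in a toy data frame
df = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0, 10.0],
                   "x2": [1.1, 2.1, np.nan, 4.2, 9.8]})
df.loc[2, "x2"] = knn_impute_value(df, row_idx=2, target_col="x2", k=2)
print(df)
```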
Network anomaly detection using a fuzzy rule-based classifier
Published in Debatosh Guha, Badal Chakraborty, Himadri Sekhar Dutta, Computer, Communication and Electrical Technology, 2017
Soumadip Ghosh, Arindrajit Pal, Amitava Nag, Shayak Sadhu, Ramsekher Pati
In this work, we use a fuzzy rule-based classifier (RIPPER) as the primary classifier and compare its performance with two other classifiers, namely K-Nearest Neighbor (KNN) (Altman 1992) and Support Vector Machine (SVM) (Cortes 1995), which have been chosen as secondary classifiers. KNN uses Euclidean distance to calculate the distance between neighbors and classifies data according to the nearest ones. SVM uses the dot product as the kernel function and sequential minimal optimization to find the separating hyperplane. These two classifiers are chosen to compare their performance with the fuzzy rule-based classifier. KNN is a lazy classifier that produces good results when the available dataset is large. The dataset that we have used is fairly large (containing more than 25,000 tuples), and thus KNN is selected. SVM is a supervised classification model that works well with large datasets. SVM is built on the theory of separating hyperplanes in an n-dimensional space, where n is the number of attributes in the dataset. As we have a small number of attributes, the use of this classifier should produce good results.
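As a rough illustration only, the two secondary classifiers could be configured along these lines with scikit-learn; the synthetic data, split, and parameter values are assumptions, not the paper's actual experimental setup.

```python
# Illustrative setup of the two secondary classifiers (KNN and SVM).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for a large dataset with a small number of attributes
X, y = make_classification(n_samples=25000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# KNN: Euclidean distance between neighbors, majority vote among the nearest
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X_tr, y_tr)

# SVM: linear (dot-product) kernel; scikit-learn's SVC relies on libsvm,
# which uses an SMO-style decomposition to solve the underlying optimization
svm = SVC(kernel="linear").fit(X_tr, y_tr)

print("KNN accuracy:", knn.score(X_te, y_te))
print("SVM accuracy:", svm.score(X_te, y_te))
```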
Inverse Estimation of Breast Tumor Size and Location with Numerical Thermal Images of Breast Model Using Machine Learning Models
Published in Heat Transfer Engineering, 2023
Gonuguntla Venkatapathy, Anuj Mittal, Nagarajan Gnanasekaran, Vijay H. Desai
The K-nearest neighbor method is used for both classification and regression; here, its regression form, K nearest neighbor regression (KNNR), is employed. KNN works by finding neighboring points using a similarity metric. The most common similarity metric used by KNN is the Euclidean distance between the unknown point and the other points in the dataset. Eq. (6) provides a general formula for calculating the Euclidean distance, d(x, y) = √(Σᵢ (xᵢ − yᵢ)²), where y1 to yn are the attribute values of one observation and x1 to xn are the attribute values of the other observation. KNN assumes that similar things exist in close proximity. This means that the output of a new input data point is calculated as some average of the outputs of the data points in proximity to the new point. Of these points, the K that most closely resemble the new point are considered. In the case of thermography, the KNNR looks for temperature distributions in the known dataset that closely resemble the temperature distribution of a breast surface that is to be diagnosed [44].
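A minimal sketch of KNN regression as described: the prediction for a new point is the average of the outputs of its K nearest neighbours. The "temperature profile" features and depth outputs below are invented for illustration and are not data from the study.

```python
# Sketch of KNN regression: average the outputs of the K nearest neighbours.
import numpy as np

def knnr_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every known observation
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Average the outputs of the K most similar observations
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

# Hypothetical surface temperature profiles (features) mapped to a tumor depth (output)
X = np.array([[36.9, 37.1, 37.0],
              [37.4, 37.8, 37.5],
              [36.8, 36.9, 36.8],
              [37.5, 37.9, 37.6]])
y = np.array([2.0, 0.8, 2.5, 0.7])  # hypothetical depths in cm
print(knnr_predict(X, y, np.array([37.3, 37.7, 37.4]), k=2))  # ~0.75
```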
Analysis of Suitable Machine Learning Imputation Techniques for Arthritis Profile Data
Published in IETE Journal of Research, 2022
Uma Ramasamy, Sundar Santhoshkumar
Each machine learning-based imputation technique has its own merits and demerits. The kNN imputation replaces the missing value using the k-Nearest Neighbors (kNN) approach. A Euclidean distance metric adapted to handle missing values is used to find the nearest neighbours: the distance to each neighbour is calculated as the square root of a weight multiplied by the sum of squared differences over the features present in both instances, where the weight compensates for the missing coordinates. The ‘k’ instances with the smallest distances are selected, and each missing feature value is imputed as the uniformly weighted average of that feature’s values from the ‘k’ nearest neighbours. The kNN imputation works well for small datasets. However, it is sensitive to outliers because it uses Euclidean distance. Moreover, it is computationally expensive and cannot be applied to categorical data.
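A brief sketch using scikit-learn's KNNImputer, whose missing-value-aware Euclidean distance and uniform averaging correspond to the behaviour described above; the toy matrix is illustrative and unrelated to the arthritis profile data.

```python
# Sketch of kNN imputation with scikit-learn's KNNImputer (illustrative data).
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Distances are computed over the coordinates present in both rows and
# rescaled to compensate for missing ones; each missing entry is replaced
# by the uniform average of that feature over the 2 nearest neighbours.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
print(imputer.fit_transform(X))
```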
Instance-based learning of marker proteins of carcinoma cells for cell death/ survival
Published in Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 2020
k-NN is also known as Lazy Learning and Instance/Case/Memory/Example-Based Reasoning. As the size of the training dataset increases, the computational complexity of kNN increases. If the prediction depends on the mean or median of the k most similar instances, then kNN is used for regression problems, whereas if the output is calculated as the class with the highest frequency among the k most similar instances, then k-NN is used for classification. Let us define a distance function d(u, v), where u and v are two elements or case samples from a set D. The value of the distance function is a non-negative real number defined on the Cartesian product D × D. The different approaches for calculating distance are the Euclidean distance, city block/Manhattan distance, Chebyshev distance, and Minkowski distance function.
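The four distance functions mentioned can be written compactly as follows; this is a generic sketch, not code from the article.

```python
# Sketch of the distance functions listed above for two points u and v.
import numpy as np

def euclidean(u, v):
    return np.sqrt(((u - v) ** 2).sum())   # straight-line distance

def manhattan(u, v):
    return np.abs(u - v).sum()              # city block distance

def chebyshev(u, v):
    return np.abs(u - v).max()              # largest coordinate difference

def minkowski(u, v, p=3):
    # Generalizes Manhattan (p = 1) and Euclidean (p = 2)
    return (np.abs(u - v) ** p).sum() ** (1.0 / p)

u, v = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean(u, v), manhattan(u, v), chebyshev(u, v), minkowski(u, v))
```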