Similarity Principle—The Fundamental Principle of All Sciences
Published in Mark Chang, Artificial Intelligence for Drug Development, Precision Medicine, and Healthcare, 2020
A similarity measure or similarity function is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity measure exists, such measures are usually in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects. For example, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ. In other words, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other. In the context of cluster analysis, Frey and Dueck suggest defining a similarity measure s(x, y) = −‖x − y‖².
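Both definitions above can be sketched in a few lines of Python; the example strings and vectors are made up for illustration:

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which corresponding symbols differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def similarity(x, y):
    """Frey and Dueck's similarity: negative squared Euclidean distance."""
    return -sum((xi - yi) ** 2 for xi, yi in zip(x, y))

print(hamming_distance("karolin", "kathrin"))  # 3 substitutions needed
print(similarity([1.0, 2.0], [4.0, 6.0]))      # -25.0
```

Note the sign convention: identical vectors get the maximum similarity of 0, and increasingly dissimilar vectors get increasingly negative values, matching the "inverse of a distance metric" intuition.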
Nearest Neighbors
Published in Jan Žižka, František Dařena, Arnošt Svoboda, Text Mining with Machine Learning, 2019
Hamming distance: A very simple distance between two strings (or vectors) of equal length, defined as the number of positions at which the corresponding symbols differ. The more positions that differ, the less similar (and more distant) the strings. The Hamming distance, named after the mathematician Richard Wesley Hamming (11th February 1915 - 7th January 1998), can also be used to determine the similarity or difference between two text documents when the representation of words is binary (for example, using the symbol 1 when a word occurs and 0 when it does not), provided that all such represented documents employ the same joint vocabulary (which guarantees the required equal vector length).
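The document-comparison idea can be illustrated with a toy joint vocabulary (the vocabulary and documents below are invented for the sketch):

```python
# Joint vocabulary shared by all documents (assumption for this toy example).
vocabulary = ["cat", "dog", "fish", "bird"]

def to_binary_vector(document: str) -> list:
    """Binary representation: 1 if the vocabulary word occurs, else 0."""
    words = set(document.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

doc1 = "the cat chased the dog"
doc2 = "the dog chased the bird"
v1 = to_binary_vector(doc1)   # [1, 1, 0, 0]
v2 = to_binary_vector(doc2)   # [0, 1, 0, 1]

# Hamming distance between the equal-length binary vectors.
distance = sum(a != b for a, b in zip(v1, v2))
print(distance)  # 2
```

Because every document is projected onto the same vocabulary, all vectors have the same length, which is exactly the precondition the Hamming distance requires.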
Applications in Parallel Computing, AI, etc
Published in Khodakhast Bibak, Restricted Congruences in Computing, 2020
Measuring the similarity between objects is an important problem with many applications in various areas. Much of natural language processing (NLP) is concerned with measuring how similar two strings are [97]. The most important measure of similarity of two strings (words) is the edit distance (also known as the Levenshtein distance; see [121]), which is defined as the minimum number of character deletions, insertions, or substitutions required to transform one string (word) into the other. (Note that the Hamming distance is a variant of the edit distance where only substitutions are allowed.) The edit distance and its generalizations/variants are widely used, for example, in approximate string matching and natural language processing [97, 131, 149] (e.g., spell checking/correction, speech recognition, spam filtering) and in computational biology [5, 63, 96] (e.g., to quantify the similarity of DNA sequences). Levenshtein introduced the edit distance (and observed that the edit distance function is a metric; in particular, it is symmetric and satisfies the triangle inequality) in his seminal paper [121], where he introduced a generalization of VT codes, namely, the Levenshtein code. The focus of Levenshtein's paper [121], as the title of his paper says, was actually on constructing the class of codes related to the edit distance, namely, codes capable of correcting deletions, insertions, and substitutions. Levenshtein [121] also observed that a code C can correct s deletions, insertions, or substitutions if and only if the edit distance between every two distinct codewords in C is greater than 2s.
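The edit distance defined above is commonly computed with dynamic programming; a minimal sketch (one of several standard formulations, not taken from the cited paper):

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum number of character deletions,
    insertions, or substitutions to transform s into t."""
    m, n = len(s), len(t)
    # prev[j] holds the distance between the current prefix of s and t[:j].
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution (or match)
        prev = curr
    return prev[n]

print(edit_distance("kitten", "sitting"))  # 3
```

Restricting the recurrence to the substitution case alone (and requiring equal lengths) recovers the Hamming distance, which is the "variant where only substitutions are allowed" mentioned above.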
Processing Social Media Images by Combining Human and Machine Computing during Crises
Published in International Journal of Human–Computer Interaction, 2018
Firoj Alam, Ferda Ofli, Muhammad Imran
State-of-the-art studies on duplicate image detection rely on Bag-of-Words (Wu, Ke, Isard, & Sun, 2009), entropy-based approaches (Dong, Wang, Charikar, & Li, 2012), perceptual hashing (Zauner, 2010), and deep features (An, Huang, Chen, & Weng, 2017). After feature extraction, most of these studies use Hamming distance to compute the similarity between a pair of images. This requires defining a threshold to detect duplicate to near-duplicate images. However, there has been less effort in the literature on how to define this threshold. In this study, we have focused on finding a good approach to define this threshold while we also explored perceptual hashing and deep features.
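The study does not publish its threshold here; the sketch below shows only the mechanics of thresholding the Hamming distance between two binary perceptual hashes, with made-up hash values and a hypothetical threshold:

```python
def hash_hamming(h1: int, h2: int) -> int:
    """Hamming distance between two equal-width binary hashes:
    XOR the hashes, then count the differing bits."""
    return bin(h1 ^ h2).count("1")

# Hypothetical threshold; in practice it must be tuned on labeled data.
THRESHOLD = 10

# Two made-up 64-bit perceptual hashes differing in 2 bit positions.
h_a = 0b1011001011110000101100101111000010110010111100001011001011110000
h_b = 0b1011001011110100101100101111000010110010111100011011001011110000

d = hash_hamming(h_a, h_b)
print(d, d <= THRESHOLD)  # 2 True -> flagged as (near-)duplicate
```

The whole difficulty the excerpt points at lives in the `THRESHOLD` constant: too low and near-duplicates are missed, too high and distinct images are merged.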
Maximum Exact Matches for High Throughput Genome Subsequence Assembly
Published in IETE Journal of Research, 2022
Hamming Distance provides the number of base flips required to convert one genome so that it perfectly matches the other genome. This measure provides an exact view of the differences existing between the two genomes. The main objective of the proposed method is to construct a genome with minimal Hamming Distance. Reducing the Hamming Distance to zero is generally not possible; however, a minimal Hamming Distance indicates that the proposed approach has performed to its maximum efficiency by aligning the sequences as exactly as possible.
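For aligned genome sequences, the base-flip count described above is the ordinary Hamming distance over the nucleotide alphabet; a minimal illustration with made-up sequences:

```python
def base_flips(genome_a: str, genome_b: str) -> int:
    """Hamming distance between two aligned, equal-length genome sequences:
    the number of base substitutions needed to make them match exactly."""
    if len(genome_a) != len(genome_b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(a != b for a, b in zip(genome_a, genome_b))

print(base_flips("ACGTACGT", "ACGTTCGA"))  # 2 (positions 5 and 8 differ)
```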
Extracting process hierarchies by multi-sequence alignment adaptations
Published in Enterprise Information Systems, 2022
Hamming distance and edit distance are two common syntactic methods for measuring dissimilarity between two sequences of characters (Jagadeesh Chandra Bose and Van Der Aalst 2009b). Hamming distance, which is valid only for two sequences of equal length, counts the number of character positions in which the two input sequences differ. Edit distance between two sequences is the minimum number of edit operations required to transform one sequence into the other (Sung 2010).