Edit distance – Knowledge and References

Natural Language Processing (NLP) Methods for Cognitive IoT Systems

Published in Pethuru Raj, Anupama C. Raman, Harihara Subramanian, Cognitive Internet of Things, 2022

Pethuru Raj, Anupama C. Raman, Harihara Subramanian

One simple and easily usable metric is edit distance (also known as Levenshtein distance). Levenshtein distance measures the similarity of two string values (a word, word forms, word compositions) as the minimum number of operations needed to convert one value into the other. Popular NLP applications of edit distance include:

- automatic spell-checking (correction) systems;
- bioinformatics, for quantifying the similarity of DNA sequences (viewed as strings of letters);
- text processing, for finding all words that are close to given text objects.

Cosine similarity is a metric used to measure text similarity across documents. It is calculated from the similarity of the documents' vectors using the well-known cosine formula for vectors.
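
The cosine formula is commonly written as cos(θ) = (A · B) / (‖A‖ ‖B‖). As a minimal sketch, assuming each text is represented by a simple word-count vector (the function name and the whitespace tokenization are illustrative choices, not taken from the chapter), it could be computed like this:

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    # Assumption: lowercase, whitespace-separated tokens form the vector space.
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())

    # Dot product over the words the two texts share.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))

    # An empty text has no direction; treat its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Unlike edit distance, which counts character-level operations, this measure ignores word order and spelling variation and only compares which words occur and how often.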

Record Linkage

Published in Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter, Julia Lane, Big Data and Social Science, 2020

Joshua Tokle, Stefan Bender

Comparing fields whose values are continuous is straightforward: often one can simply take the absolute difference as the comparison value. Comparing character fields in a rigorous way is more complicated. For this purpose, several mathematical definitions of the distance between two character fields have been proposed. Edit distance, for example, is defined as the minimum number of edit operations, chosen from a set of allowed operations, needed to convert one string to another. When the set of allowed edit operations is single-character insertions, deletions, and substitutions, the corresponding edit distance is also known as the Levenshtein distance. When transposition of adjacent characters is allowed in addition to those operations, the corresponding edit distance is called the Levenshtein–Damerau distance (more commonly written Damerau–Levenshtein distance).
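
The two definitions above can be sketched with the standard dynamic-programming table. This is an illustrative implementation, not code from the chapter. The `transpositions` flag is a name chosen here, and when it is set the function computes the restricted "optimal string alignment" variant of the Damerau–Levenshtein distance, in which no substring is edited more than once.

```python
def edit_distance(s: str, t: str, transpositions: bool = False) -> int:
    """Minimum number of single-character edits turning s into t.

    With transpositions=False this is the Levenshtein distance.
    With transpositions=True, swapping two adjacent characters also
    counts as one edit (optimal string alignment variant).
    """
    m, n = len(s), len(t)
    # d[i][j] = distance between the first i chars of s and first j chars of t.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if (transpositions and i > 1 and j > 1
                    and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]):
                # transposition of two adjacent characters
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

For example, `edit_distance("abcd", "acbd")` is 2 (two substitutions), while `edit_distance("abcd", "acbd", transpositions=True)` is 1 (one adjacent swap).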

Smartphone-Based Human Activity Recognition

Published in Yufeng Wang, Athanasios V. Vasilakos, Qun Jin, Hongbo Zhu, Device-to-Device based Proximity Service, 2017

Yufeng Wang, Athanasios V. Vasilakos, Qun Jin, Hongbo Zhu

Once signals have been mapped to strings, exact or approximate matching and edit distances are key techniques used to evaluate string similarity and thus either find known patterns or classify the user activity. Some typical metrics for evaluating string similarity are as follows:

- Euclidean-related distances between symbols are defined by the corresponding numeric distance between the signal values that correspond to each symbol in the string representation.
- Levenshtein edit distance determines the minimum number of symbol insertions, deletions, and substitutions needed to transform one string into the other.
- Dynamic time warping (DTW) is a metric for measuring similarity between two sequences that may vary in length and can thus correspond to different time bases. It can capture similarities of strings with distinct sampling periods, but has a relatively high computational cost.
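
To make the DTW entry more concrete, here is a minimal sketch of the classic dynamic-programming formulation, assuming numeric sequences and the absolute difference as the local cost (both are illustrative assumptions, not details from the chapter). The nested loop over both sequences is where the relatively high computational cost comes from.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])  # local distance (assumed)
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b[j-1] is reused
                cost[i][j - 1],      # b advances, a[i-1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Because one element may be aligned with several elements of the other sequence, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0 even though the sequences differ in length.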

What do riders say and where? The detection and analysis of eyewitness transit tweets

Published in Journal of Intelligent Transportation Systems, 2023

O. Kabbani, W. Klumpenhouwer, T. El-Diraby, A. Shalaby

The noun phrases are matched against a database of station names and coordinates. The dataset was expanded with colloquial names, as tweets often contain informal terms. For example, in Calgary these colloquialisms include the use of Vic park for Victoria Park/Stampede, landmarks served by light rail such as the Saddledome arena for Victoria Park/Stampede, or variations in spellings of station names such as Saddletown and Saddletowne. Typing errors and other variations were accounted for by matching the terms using the Levenshtein distance instead of exact matches, as shown in Figure 6. The Levenshtein distance is a common method to evaluate edit distance and is defined as the number of single-character edits needed to change one word into another (Levenshtein, 1966). The Levenshtein distance was calculated using the edit_distance method from NLTK. The threshold was set to 1 for noun phrases longer than 5 letters, whereas short terms require an exact match. Additionally, terms with numbers require an exact match because a single edit could change the term entirely; for example, “1 street” would otherwise match “3 street.” For that reason, the database included different variants of station names with numbers. If a match is obtained, the database returns the name and coordinates of that location.
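
The matching rule described above might look roughly like the following sketch. It is not the authors' code: the article uses NLTK's edit_distance, whereas this version includes a small pure-Python Levenshtein function so that it runs without dependencies. The `match_station` function, the dictionary layout, and the coordinates are hypothetical placeholders, and "longer than 5 letters" is interpreted here as total character length.

```python
def levenshtein(s: str, t: str) -> int:
    """Levenshtein distance using two rolling rows of the DP table."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]


def match_station(phrase: str, stations: dict):
    """Return (name, coords) for a matching station, or None."""
    term = phrase.lower()
    # Exact matches always win.
    for name, coords in stations.items():
        if term == name.lower():
            return name, coords
    # Short terms and terms containing digits require an exact match.
    if len(term) <= 5 or any(ch.isdigit() for ch in term):
        return None
    # Otherwise tolerate a single edit (threshold of 1).
    for name, coords in stations.items():
        if levenshtein(term, name.lower()) <= 1:
            return name, coords
    return None


# Hypothetical database; the coordinates are placeholders, not real locations.
stations = {"Saddletowne": (0.0, 0.0), "1 street": (1.0, 1.0)}
```

With this placeholder database, `match_station("Saddletown", stations)` resolves to the Saddletowne entry, while `match_station("3 street", stations)` returns `None` because the term contains a digit.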

Silent failure detection in partial automation as a function of visual attentiveness

Published in Traffic Injury Prevention, 2023

Chris Schwarz, John Gaspar, Cher Carney, Pujitha Gunaratne

Scan-path sequences display almost infinite variability, even when windowed down, yet differences and commonalities can be used to cluster gaze behavior into a small number of distinct types. Hierarchical agglomerative clustering is an attractive method in that the number of clusters can easily be selected and changed after the computation is performed. However, clustering requires a distance (or dissimilarity) metric which is not obviously available for scan-path sequences. Considering AOI gaze zones as letters in an alphabet suggests the use of an edit distance as such a metric. The Levenshtein distance, also known as optimal matching edit distance (Navarro 2001), represents the minimal cost of transforming one sequence into another by insertion, deletion, and substitution operations.

Toward Trustworthy and Comfortable Lane Keeping Assistance System: An Empirical Study of the Level of Haptic Authority

Published in International Journal of Human–Computer Interaction, 2021

Kyudong Park, Sung H. Han, Jiyoung Kwahk

As objective measures of driving performance and behavior, five measures were analyzed using the trajectory log data: standard deviation of lane position (SDLP), steering reversal rate (SRR), root mean square of lateral speed (RMSLS), typing error (edit distance), and stress. SDLP is estimated as the standard deviation of the lateral position from the lane center. A high SDLP is interpreted as unstable lateral driving. SRR is measured by the frequency of steering wheel reversals (corrections) larger than a pre-defined angle (Östlund et al., 2005). Steering angle changes of more than 1 degree per second were measured. RMSLS is estimated as the root mean square of lateral speed. The lower this value, the better the driving performance. To analyze the performance of the secondary task, we measured typing error using the edit distance of two strings (He et al., 2014; Levenshtein, 1966). The edit distance is the number of discrete steps required to make the strings identical, including insertions, deletions, switching, and substitutions. Larger edit distance values indicate a greater difference between the two strings. To analyze the driver’s workload when performing the secondary task, the stress level was collected on a 100-point scale through the EEG device and Cortex API (n.d.), which provide real-time detection of cognitive and emotional states.
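
As a rough illustration of two of the measures defined above, SDLP and RMSLS could be computed from logged samples as follows. The function names and input format are assumptions made for this sketch, and the population standard deviation is used because the article does not say which estimator was applied.

```python
import math
import statistics


def sdlp(lateral_positions):
    """Standard deviation of lateral position (offsets from lane center).

    Assumption: population standard deviation over the logged samples.
    """
    return statistics.pstdev(lateral_positions)


def rmsls(lateral_speeds):
    """Root mean square of the lateral speed samples."""
    return math.sqrt(sum(v * v for v in lateral_speeds) / len(lateral_speeds))
```

Both functions expect a non-empty list of numeric samples; a perfectly centered drive, for instance `sdlp([0.0, 0.0, 0.0])`, gives 0.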