Levenshtein distance – Knowledge and References

Natural Language Processing (NLP) Methods for Cognitive IoT Systems

Published in Pethuru Raj, Anupama C. Raman, Harihara Subramanian, Cognitive Internet of Things, 2022

Pethuru Raj, Anupama C. Raman, Harihara Subramanian

One simple and easy-to-use metric is edit distance (also known as Levenshtein distance). Levenshtein distance measures how different two strings (words, word forms, word compositions) are, as the minimum number of single-character operations needed to convert one into the other. Popular NLP applications of edit distance include:
• automatic spell-checking (correction) systems;
• bioinformatics, for quantifying the similarity of DNA sequences (viewed as strings of letters);
• text processing, for finding the words that lie close to given text objects.
Cosine similarity is a separate metric, used for measuring text similarity across documents. It is calculated from the vector representations of the texts, using the well-known formula for the cosine of the angle between two vectors.
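The excerpt breaks off before the formula itself. For reference, the standard definition of cosine similarity is given below; the symbols A and B (the two text vectors) and n (their dimension) are names assumed here, not taken from the chapter.

```latex
\cos(\theta)
  = \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert \mathbf{A} \rVert \, \lVert \mathbf{B} \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

Unlike Levenshtein distance, which counts character edits, this measure depends only on the direction of the two vectors, so it is typically applied after the text has been converted to term-count or embedding vectors.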

Natural Language Processing

Published in Vishal Jain, Akash Tayal, Jaspreet Singh, Arun Solanki, Cognitive Computing Systems, 2021

V. Vishnuprabha, Lino Murali, Daleesha M. Viswanathan

The typical applications of NLP include the following.
• Question answering: A system that answers questions automatically. It is an information retrieval system.
• Machine translation: One of the oldest but most beneficial applications of NLP. It automatically translates text from one language to another by considering the syntax, semantics, etc., of both languages.
• Text summarization: Takes a document or piece of text as input and generates a compressed form that packs in the essential content without any change in meaning.
• Optical character recognition: Extracts text from an image that contains embedded text.
• Text similarity and clustering of documents: Finding similar texts helps in building relationships quickly. Consider the pairs (Man, Woman) and (Boy, Girl); these words are not the same, but they have some similarities. Such relationships are identified by measuring the text similarity between two words. Common measures include the following.
  • Levenshtein distance: The similarity of two strings is calculated by counting the total number of editing operations (insertion, deletion, substitution) required to convert one string into another. The Levenshtein distance of the pair (Hook, Hack) is 2: the first “o” is replaced with “a” and the second “o” is replaced with “c” to get the string “Hack.” Two substitution operations are required to convert “Hook” to “Hack,” so the Levenshtein distance is 2.
  • Cosine similarity: After converting the text to vectors, the similarity of the vectors can be found using the cosine similarity measure.
  • Phonetic similarity: Voice-to-text converter applications use a phonetic matching concept, which tries to find a word in the dictionary that is phonetically similar to the input.
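The Hook/Hack example can be checked with a short dynamic-programming sketch. The function name `levenshtein` and the two-row table layout are illustrative choices made here, not something prescribed by the chapter; each operation is assumed to cost 1.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn `source` into `target` (unit cost per edit)."""
    # previous[j] is the distance between the prefix of `source`
    # processed so far and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # Turning i source characters into an empty target takes i deletions.
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(levenshtein("Hook", "Hack"))  # 2
```

Keeping only two rows of the table is enough when just the distance is needed; the full table would be required to recover which edits were made.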

Seeing through a new lens: exploring the potential of city walking tour videos for urban analytics

Published in International Journal of Digital Earth, 2023

Maximilian C. Hartmann, Ross S. Purves

YouTube videos are not geolocated at fine granularities. Rather, YouTube allows coarse location tagging of one tag per video at roughly city-level. To extract more detailed spatial information for our purpose we used two complementary approaches in parallel. The first approach used the text artefacts and their bounding boxes recorded in the OCR output log. We first clustered artefacts based on their bounding boxes using HDBSCAN (McInnes, Healy, and Astels 2017) to combine text spanning multiple lines, as is often the case on street signs. To reduce the number of false positives, we then only considered artefacts with ten or more characters. These were then matched to our OSM gazetteer using the Levenshtein distance metric (step 3.1). Levenshtein distance (Miller, Vandome, and McBrewster 2009) is a fuzzy string matching metric which allows the calculation of string similarity (or difference). We retained all strings with a Levenshtein distance of two or less. Our second approach to geolocation leveraged the video timestamps and associated placenames found in the video metadata (step 2.3). Similarly to the first approach we matched them to our gazetteer using Levenshtein distances, but this time with a lower threshold of one or less which mostly accounts for the lack of use of, for example, accents (step 3.2). As a result, these two approaches generated for each video a list of locations with linkage to our OSM gazetteer.
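The thresholded matching described above might look roughly like the following sketch. It is not the authors' pipeline: the function names, the sample gazetteer entries, the sample inputs, and the case-folding step are all assumptions made for illustration. Only the two thresholds (two or less for OCR text, one or less for metadata placenames) and the ten-character minimum come from the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, a_char in enumerate(a, start=1):
        current = [i]
        for j, b_char in enumerate(b, start=1):
            current.append(min(previous[j - 1] + (a_char != b_char),
                               previous[j] + 1,
                               current[j - 1] + 1))
        previous = current
    return previous[-1]


def match_to_gazetteer(text, gazetteer, max_distance, min_length=0):
    """Return gazetteer names within `max_distance` edits of `text`.

    Texts shorter than `min_length` are skipped entirely.
    Case folding is an assumption of this sketch.
    """
    if len(text) < min_length:
        return []
    return [name for name in gazetteer
            if levenshtein(text.lower(), name.lower()) <= max_distance]


# Hypothetical gazetteer entries, invented for this example.
gazetteer = ["Bahnhofstrasse", "Limmatquai", "Zürich"]

# OCR artefact: ten or more characters, distance of two or less.
print(match_to_gazetteer("Bahnhofstrase", gazetteer, max_distance=2, min_length=10))
# ['Bahnhofstrasse']

# Metadata placename: distance of one or less, e.g. a missing accent.
print(match_to_gazetteer("Zurich", gazetteer, max_distance=1))
# ['Zürich']
```

A linear scan over the gazetteer is the simplest option; for a large gazetteer an index or a dedicated fuzzy-matching library would likely be needed.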

An Enhanced RBMT: When RBMT Outperforms Modern Data-Driven Translators

Published in IETE Technical Review, 2022

Md. Adnanul Islam, Md. Saidul Hoque Anik, A. B. M. Alim Al Islam

Therefore, we propose an approach for translating such polymorphic verbs efficiently and with optimized memory use. In this approach, a verb is translated using a hash table. Each key-value pair in the hash table comprises the standard form of a verb in the source language and in the target language. To translate effectively, we should detect these standard forms of verbs from their non-standard forms. To accomplish this, we employ an altered version of a widely used string similarity measurement algorithm [28]: the “Levenshtein distance” algorithm [8,29]. Levenshtein distance is a measurement to determine the minimum edits required for converting a source string into a target string. Here, the most common edits include:
• Insertion (of letters in the source string)
• Deletion (of letters from the source string)
• Substitution (of letters with others in the source string)
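The hash-table lookup described above can be sketched as follows. This uses the plain, unaltered Levenshtein distance, since the excerpt does not say how the authors modified it. The English-to-Spanish table, the function names, and the `max_distance` cutoff are hypothetical and chosen only to make the example runnable.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, a_char in enumerate(a, start=1):
        current = [i]
        for j, b_char in enumerate(b, start=1):
            current.append(min(previous[j - 1] + (a_char != b_char),
                               previous[j] + 1,
                               current[j - 1] + 1))
        previous = current
    return previous[-1]


# Hypothetical table: standard source-language verb -> standard target verb.
verb_table = {"walk": "caminar", "talk": "hablar", "write": "escribir"}


def translate_verb(verb, table, max_distance=2):
    """Map a possibly non-standard verb form to its target-language verb.

    The closest standard form (by edit distance) is used as the hash key.
    Returns None when no key is within `max_distance` edits; this cutoff
    is an assumption of the sketch.
    """
    if verb in table:  # already in standard form
        return table[verb]
    closest = min(table, key=lambda key: levenshtein(verb, key))
    if levenshtein(verb, closest) <= max_distance:
        return table[closest]
    return None


print(translate_verb("walked", verb_table))  # caminar
print(translate_verb("writes", verb_table))  # escribir
```

Storing only standard forms keeps the table small, at the cost of a distance computation against every key whenever an inflected form is looked up.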

Integration of sketch maps in community mapping activities

Published in Spatial Cognition & Computation, 2021

Ali Zare Zardiny, Farshad Hakimpour

Each route in the graph derived from the sketch maps consists of a set of junctions. Accordingly, if the similarity of two junctions in two graphs in a qualitative space can be measured, then the similarity of two routes can be calculated from the similarities between their related junctions. For this reason, the parameters examined here for matching are all defined on the basis of the junction specification. These parameters are: Descriptive similarity: The first parameter used to compare two junctions is their names. In the graph, a junction is either formed by two intersecting routes, in which case a combination of the related route names can be considered as the junction name, or it is a virtual vertex equivalent to a POI, in which case it takes the POI name. Here, the Levenshtein distance is used to measure the similarity of the names of two junctions (Levenshtein, 1966). The Levenshtein distance is the number of changes (insertions, deletions, or substitutions) required to transform one string into the other. Using this distance, the descriptive similarity of two junctions in two sketch maps can be calculated in accordance with Equation 1 (Zare Zardiny et al., 2020):