Approximate string matching – Knowledge and References

Explore chapters and articles related to this topic

Signature Generation Algorithms for Polymorphic Worms

Published in Mohssen Mohammed, Al-Sakib Khan Pathan, Automatic Defense Against Zero-day Polymorphic Worms in Communication Networks, 2016

Mohssen Mohammed, Al-Sakib Khan Pathan

The problem of string matching is simply stated in Harris [7]. Given a body of text T[1 ... n], we try to find a pattern P[1 ... m], where m ≥ n. This can be used to search bodies of texts for specific patterns or, for example, in biology can be used to search strands of DNA (deoxyribonucleic acid) for specific sequences of genes. The issue of exact string matching has been extensively studied. However, approximate string matching is a much more complicated problem to solve that has many more real-world applications. The truth is that in real-world applications, the issue is not so systematic. This is where approximate string matching is needed. Instead of searching for the exact string, approximate string matching searches for patterns that are close to P. In other words, approximate string matching allows for a certain amount of error between the two strings being compared. One of the earliest applications of approximate string matching was in text searching. The approximate string-matching algorithms can be applied to account for errors in typing. Internet searching is particularly difficult because there is so much information, and much of it has errors in it. Also, since the Internet spans many different languages, errors frequently arise in comparing words across language barriers. Also, text editors have to use approximate string matching when performing spell-checks. In addition, spell-checkers have to generate a list of “suggested words” that are close in spelling to the misspelled word. Exact string matching is efficient to generate signatures for polymorphic worms.

Finding Experts in Community Question Answering System Using Trie String Matching Algorithm with Domain Knowledge

View Article

Journal Information

Published in IETE Journal of Research, 2023

R. Menaha, V. E. Jayanthi

In this work, the domain and the tags of the question are identified by using an exact string-matching algorithm. Usually, the string matching algorithm is categorized into two types: (i) Exact string matching and (ii) Approximate string matching. The exact string matching algorithm is utilized to find the occurrence of a defined string in a large string and the approximate string matching algorithm is used for finding the substring of a string. In this work, one of the exact string matching algorithms called trie data structure is utilized to find the occurrence of a question keyword in the domain-wise tags. The Trie data structure is used to store the data dictionary and algorithms for searching for the words from the dictionary and provide the list of valid words for suggestions that can be constructed. The trie data structure stores the key values in the form of a balanced BST (Binary Search Tree) and it is efficient for retrieving information. Here, all the domain tags are represented in the form of a trie data structure. As an example, a trie data structure for the language domain tags is illustrated in Figure 3. The characters of tags are represented as nodes in the trie tree as shown in Figure 3.