Stemming – Knowledge and References

Natural Language Processing for Information Retrieval

Published in Anuradha D. Thakare, Shilpa Laddha, Ambika Pawar, Hybrid Intelligent Systems for Information Retrieval, 2023

Anuradha D. Thakare, Shilpa Laddha, Ambika Pawar

Stemming is an essential early step in text preprocessing, which is itself necessary in information retrieval (IR), text data mining (DM), and natural language processing (NLP). The purpose of stemming is to reduce each word to a base or root form for standardization. Removing suffixes can introduce errors: under-stemming, in which related forms are left with different stems, and over-stemming, in which unrelated words are reduced to the same stem. The input to a stemmer is a stream of tokens (words), and the output is the root or base form of each. Stemming can also be described as removing affixes from words to recover the root. For example, removing the suffix “ing” from “Singing” yields the root “sing.” Conversely, suffixes can be added to a root to create new word forms. Stemming is widely used in DM, IR, and NLP to reduce the various forms of a word (noun, adjective, verb, adverb, and so on) to their root, which reduces the size of the index file. The stemming process has a substantial impact on retrieval results for both rule-based and statistical approaches. Stemming algorithms are available for many languages, and most of them are rule-based. These stemmers outperform other popular techniques such as brute force.
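
The suffix-removal idea and its two error types can be illustrated with a minimal sketch. The rule list and the `min_stem` threshold below are assumptions made for illustration; this is a toy stripper, not any of the published stemming algorithms the excerpt refers to.

```python
# Toy rule-based suffix stripper (illustrative only; not a published algorithm).
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rule list, checked in order


def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["Singing", "walked", "walks", "news", "new", "running", "run"]:
    print(w, "->", strip_suffix(w))
```

With these rules, “news” and “new” end up with the same stem although they are different words (over-stemming), while “running” and “run” end up with different stems although they are related (under-stemming).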

Text Mining

Published in Rakesh M. Verma, David J. Marchette, Cybersecurity Analytics, 2019

Rakesh M. Verma, David J. Marchette

Finally, the decision must be made whether to perform stemming. This is the process of reducing a word to its root: “walking” and “walked” both become “walk”. The upside of stemming is the reduction of the size of the lexicon, and the ability to use a single token for all cases of a given word. The downside is that no stemming algorithm is perfect, and errors will occur. See [282, 341, 350] and the many articles about stemmers for specific languages. Whether to use stemming for a particular task is, again, dependent on the corpus and the inference task. Generally speaking, the errors introduced through stemming are outweighed by the extra information resulting from mapping of versions of the same word to the root, but this is corpus/problem dependent.

Ideation

Published in Walter R. Paczkowski, Deep Data Analytics for New Product Development, 2020

Walter R. Paczkowski

Stemming is sometimes described as the crude process of deleting typical endings (e.g., “ed”, “es”, “ies”, “tion”) with the goal of returning the stem. Another process, lemmatization, is more sophisticated: it relies on dictionaries and part-of-speech analysis to reduce words to their core, which is called a lemma. Stemming, however, is the more commonly used of the two, and Porter’s Algorithm is the most common stemming method. Stemming may be overdone: it would reduce the words “operator”, “operating”, “operates”, “operation”, “operative”, and “operatives” to “oper-”. What would happen to “opera”? See Manning et al. [2008, p. 34] for a discussion.
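
The contrast between crude ending deletion and dictionary-based lemmatization can be sketched as follows. This is not Porter’s Algorithm: `crude_stem` simply deletes the four endings named above, and the `LEMMAS` table is a hypothetical mini-lexicon standing in for a real dictionary with part-of-speech tags.

```python
# Crude ending deletion versus a dictionary lookup keyed by part of speech.
ENDINGS = ("tion", "ies", "ed", "es")  # endings named in the text, longest first


def crude_stem(word):
    """Delete the first matching ending, leaving at least three letters."""
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) > len(ending) + 2:
            return word[: -len(ending)]
    return word


# Hypothetical mini-lexicon: (word, part of speech) -> lemma.
LEMMAS = {
    ("studies", "NOUN"): "study",
    ("studies", "VERB"): "study",
    ("left", "VERB"): "leave",
    ("left", "ADJ"): "left",
}


def lemmatize(word, pos):
    """Return the dictionary lemma, or the word itself if it is not listed."""
    return LEMMAS.get((word, pos), word)


print(crude_stem("studies"), lemmatize("studies", "NOUN"))
print(crude_stem("operation"), crude_stem("opera"))
print(lemmatize("left", "VERB"), lemmatize("left", "ADJ"))
```

In this toy version, “operation” is cut down to “opera”, so it collides with the unrelated word “opera”. This is the kind of overdone stemming the excerpt warns about, although a real stemmer may behave differently on these words.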

KreolStem: A hybrid language-dependent stemmer for Kreol Morisien

Published in Journal of Experimental & Theoretical Artificial Intelligence, 2023

Baby Gobin-Rahimbux, Ishwaree Maudhoo, Nuzhah Gooda Sahib

Stemming is a technique used to conflate the grammatical forms of a word to its correct root word (Puri et al., 2015). Stemmers can be used to improve the performance of retrieval-based systems such as chatbots and spellcheckers (Islam et al., 2007). Without the stemming process, it is hard for information retrieval systems to retrieve relevant information (Moral et al., 2014). This is because documents are usually written in natural language and, therefore, the index file will contain many morphological variants of a word. This may cause a mismatch in vocabulary between the user’s query and document terms (Karaa, 2013). Stemming increases the chance of matching query and document vocabulary (Das & Mitra, 2011). Apart from information retrieval, stemmers are also useful in Natural Language Processing (NLP) applications such as chatbots. AI-based chatbots use stemmers to increase their accuracy: just as in information retrieval systems, stemming increases the rate at which an intent in the intent file is recognized. Stemming is also used during the preprocessing step for document classification (Alhaj et al., 2019; Almuzaini & Azmi, 2020) and is known as a data reduction method (Boban et al., 2020).
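
The vocabulary mismatch between query and document terms can be sketched with a small inverted index built twice, once on raw tokens and once on stems. The `toy_stem` function and the three sample documents are assumptions for illustration; a real system would substitute a proper stemmer for the target language.

```python
from collections import defaultdict


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a few English suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs, normalize):
    """Map each normalized term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


docs = {1: "the user walked home", 2: "walking helps", 3: "she walks daily"}

raw = build_index(docs, lambda t: t)   # index on surface forms
stemmed = build_index(docs, toy_stem)  # index on stems

query = "walk"
print(sorted(raw.get(query, set())))                 # surface-form lookup
print(sorted(stemmed.get(toy_stem(query), set())))   # stem lookup
print(len(raw), len(stemmed))                        # vocabulary sizes
```

On this sample, the raw index has no entry for “walk” because the documents contain only “walked”, “walking”, and “walks”, while the stemmed index maps all three to one term and is also smaller.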

Constructing automatic domain-specific sentiment lexicon using KNN search via terms discrimination vectors

Published in International Journal of Computers and Applications, 2019

Fahd Alqasemi, Amira Abdelwahab, Hatem Abdelkader

Seed-selection challenges range from whether the seeds exist in the target corpus, to how frequently each chosen seed occurs in the corpus, to whether the seeds adequately reflect the semantics of the domain being analyzed. In Arabic, the main practical problem we met is the different morphological forms of each seed term. Beyond these limits, there was an important NLP problem: spelling errors, which corrupt the matching of seeds in the corpus. The same is true of multiple morphological forms of the same stem. The latter problem is usually handled by stemming, which helps in many situations but is of no help for sentiment seeds. To our knowledge, owing to the nature of the stem (or even the root), the implied polarity sometimes differs across the inflections of a word. Stemming also may not help with the misspelling problem. Furthermore, the position of terms inside corpus documents is important in some term-discrimination techniques, whereas stemming loses term positions by merging all terms that share a stem into a single stem [27].

Text-document clustering-based cause and effect analysis methodology for steel plant incident data

Published in International Journal of Injury Control and Safety Promotion, 2018

A. Verma, J. Maiti

Data pre-processing is needed to transform the text data into a format to which text-mining algorithms can be applied. Pre-processing starts with tokenization, which breaks an incident document into words (or terms). The most frequent words, which carry very little value, are then removed; these are called stop words. After stop words are removed, stemming is applied to find the base word. Stemming refers to the crude heuristic process in which the ends of words are chopped off to remove inflectional endings, here with the use of a vocabulary, so that the words are connected to their stem or root. Thus the words ‘slip’, ‘slipping’, and ‘slipped’ are all treated as ‘slip’. Finally, the corpus of words/terms is generated. A dictionary-based stemmer is used in the study; its benefit is that all morphological changes are handled by comparison with a reference dictionary. When a corpus term is unrecognizable, the stemmer applies a few standard decision rules to provide the correct stem (Coussement, 2008). The corpus is then transformed into a term-document matrix for further analysis.
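
The pre-processing steps described above (tokenization, stop-word removal, dictionary-based stemming with fallback rules, term-document matrix) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors’ implementation: the stop-word list, the `STEM_DICT` dictionary, the fallback rules, and the sample incident texts are all hypothetical.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "on", "and", "was", "while"}  # hypothetical list
STEM_DICT = {"slip": "slip", "slipping": "slip", "slipped": "slip"}  # hypothetical


def fallback_stem(word):
    # Assumed decision rules for terms missing from the dictionary.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, then stem by dictionary with a rule fallback."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [STEM_DICT[t] if t in STEM_DICT else fallback_stem(t) for t in kept]


def term_document_matrix(docs):
    """Return sorted terms and a matrix with one row per term, one column per document."""
    counts = [Counter(preprocess(d)) for d in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


docs = [
    "The worker slipped on the wet floor",
    "Slipping and falling while working",
    "A worker was slipping",
]
terms, matrix = term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(f"{term:8s} {row}")
```

Here ‘slipped’ and ‘slipping’ both count toward the single row for ‘slip’, while ‘work’ and ‘worker’ remain separate rows because the assumed fallback rules do not conflate them.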