Big data text mining in the financial sector
Published in Noura Metawa, Mohamed Elhoseny, Aboul Ella Hassanien, M. Kabir Hassan, Expert Systems in Finance, 2019
Mirjana Pejić Bach, Živko Krstić, Sanja Seljan
According to Eler et al. (2018), pre-processing steps, carried out through various methods, have a strong impact on text-mining techniques. Lowercasing converts all tokens in the text to lowercase, which can introduce mistakes (e.g. the abbreviation “US” becomes the pronoun “us”). To reduce noise in the text, there are various techniques, such as deletion of double spaces, numbers, names (if needed), punctuation, rare words, stop-words and so forth. The next step to reduce dimensionality is to apply stemming or lemmatization to keywords in order to gather all variations of a specific keyword (example: “bank”, “banking”, “banks” → “bank”). Lemmatization uses PoS (Part-of-Speech) tagging to identify grammatical categories. This feature can be useful in parsing algorithms to detect a word's correct PoS or to extract sequences of words (n-grams). Many text-mining tools instead use stemming, which cuts off affixes: “banking” → “bank” + “ing”.
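A minimal sketch of lowercasing followed by affix cutting (the two-suffix list here is a made-up toy, not the Porter rules used by real tools):

```python
def simple_stem(token: str) -> str:
    """Toy stemmer: blindly cuts a few common English suffixes."""
    for suffix in ("ing", "s"):
        # Keep at least three characters so short words survive intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Lowercasing first, then stemming; note the "US" -> "us" pitfall.
tokens = "The US banks are banking on growth".lower().split()
print([simple_stem(t) for t in tokens])
# ['the', 'us', 'bank', 'are', 'bank', 'on', 'growth']
```

Both “banks” and “banking” collapse onto “bank”, illustrating the dimensionality reduction described above.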
Natural Language Processing (NLP) Methods for Cognitive IoT Systems
Published in Pethuru Raj, Anupama C. Raman, Harihara Subramanian, Cognitive Internet of Things, 2022
Pethuru Raj, Anupama C. Raman, Harihara Subramanian
This is for reducing a word to its base form and grouping its different forms together. For example, verbs in the past tense are changed to the present tense (e.g. “stood” is changed to “stand”), and synonyms are unified (e.g. “best” is changed to “good”). Standardizing words with the same meaning to their root is challenging. Although it seems closely related to stemming, lemmatization uses a different approach to reach the root forms of words. Lemmatization resolves words to their original dictionary form (known as the lemma). It therefore requires detailed dictionaries that the algorithm can look into to link words to their corresponding lemmas. For example, “running”, “runs”, and “ran” are all forms of the word “run”, so “run” is the lemma of all these words.
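The dictionary-lookup idea can be sketched in a few lines; the table below is a tiny hand-made sample, whereas a real lemmatizer relies on a full lexicon such as WordNet:

```python
# Tiny hypothetical lemma table; a real lemmatizer uses a detailed dictionary.
LEMMA_TABLE = {
    "running": "run", "runs": "run", "ran": "run",
    "stood": "stand", "better": "good", "best": "good",
}

def lemmatize(token: str) -> str:
    # Fall back to the token itself when it is not in the dictionary.
    return LEMMA_TABLE.get(token, token)

print([lemmatize(w) for w in ["running", "runs", "ran"]])
# ['run', 'run', 'run']
```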
Machine Learning Based Hospital-Acquired Infection Control System
Published in Shampa Sen, Leonid Datta, Sayak Mitra, Machine Learning and IoT, 2018
Sehaj Sharma, Prajit Kumar Datta, Gaurav Bansal
Since the data are unstructured (medical notes), we first have to introduce some new terms here:

Term frequency (TF). In this method (TF 1000), the 1000 most frequent terms were selected based on their TF. TF refers to the simplest weighting scheme, where the weight of a term equals the number of times the term occurs in a document [22,23].

Lemmatization and stemming. Lemmatization describes the process of reducing a word to a common base form, usually its dictionary form (lemma). For instance, hospitals, hospital's → hospital. Lemmatizer: [in] having → [out] have. In a simplified view, stemming can be considered a process that removes the ending part of some words, such as “ing”. Stemmer: [in] having → [out] hav.

Stop word removal. Stop words are terms regarded as not conveying any significant semantics to texts or phrases. For example, words like “the”, “a” and “an” can be removed during stop word removal.

Infection-specific terms. As done in research [29], medical experts' help is normally sought to create a bag of words associated with the presence of NI in patient records. For example, catheter, ultrasound, surgery and fever have a high chance of being present in a patient's record in a case of NI. An automatic synonym extractor can be used, but it requires supervision from a domain expert. In this case, a final list of 374 terms (words) was developed.

Term frequency–inverse document frequency (TF-IDF). TF-IDF is a method to understand the importance of a word in a document relative to a corpus. The TF-IDF value increases proportionally with the number of times a word appears in a document, but is offset by the frequency of the word in the corpus. This correction is important because some words appear very frequently in written or spoken communication regardless of topic.
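The TF and TF-IDF weights above can be computed directly; a minimal sketch, where the three toy “documents” are invented stand-ins for pre-processed medical notes:

```python
import math

# Toy tokenized documents standing in for pre-processed medical notes.
docs = [
    ["fever", "catheter", "fever"],
    ["surgery", "ultrasound"],
    ["fever", "surgery"],
]

def tf(term: str, doc: list) -> int:
    # Raw term frequency: number of occurrences in the document.
    return doc.count(term)

def tf_idf(term: str, doc: list, corpus: list) -> float:
    df = sum(1 for d in corpus if term in d)   # document frequency
    idf = math.log(len(corpus) / df)           # rarer terms get a higher idf
    return tf(term, doc) * idf

print(tf_idf("fever", docs[0], docs))     # frequent locally, but common globally
print(tf_idf("catheter", docs[0], docs))  # appears in only one document
```

“fever” occurs twice in the first note but also in two of the three notes, so its weight (2 × ln 1.5) ends up below that of the rarer “catheter” (1 × ln 3).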
Newsgroup topic extraction using term-cluster weighting and Pillar K-Means clustering
Published in International Journal of Computers and Applications, 2022
Sigit Adinugroho, Randy C. Wihandika, Putra P. Adikara
The same word in a document may take different forms due to inflection. It is crucial to extract the root of each word, since different representations of a word are likely to have similar meanings. There are two ways to retrieve the root of a word: stemming and lemmatization. Stemming is a rule-based method that chops prefixes or suffixes off a word, while lemmatization involves morphological analysis with the help of a dictionary. As evaluated in the previous study [15], both stemming and lemmatization improve the performance of clustering. They also reduce the number of words extracted from a text. Although a stemmer reduces more words and achieves better clustering performance than a lemmatizer, one property of a stemmer makes it inappropriate for topic extraction: a stemmer produces a stem, which may not be a valid word. In contrast, a lemmatizer always produces a valid word called a lemma. Since a topic requires an accurate word, a lemmatizer is more appropriate for topic extraction. In this work, the lemmatization process is handled by the built-in morphy function of WordNet [16].
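The invalid-stem problem can be seen in a few lines; the chopping rule and the one-entry lemma dictionary below are hypothetical illustrations, not WordNet's morphy:

```python
def toy_stem(word: str) -> str:
    # Chops a suffix blindly, so the result may not be a dictionary word.
    if word.endswith("ies"):
        return word[:-2]              # "studies" -> "studi"
    return word

LEMMAS = {"studies": "study"}         # hypothetical dictionary lookup

print(toy_stem("studies"))            # "studi": not a valid English word
print(LEMMAS.get("studies"))          # "study": a valid lemma
```

The stem “studi” would make a poor topic label, while the lemma “study” is immediately readable, which is the argument made above.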
Identifying disaster related social media for rapid response: a visual-textual fused CNN architecture
Published in International Journal of Digital Earth, 2020
Xiao Huang, Zhenlong Li, Cuizhen Wang, Huan Ning
Tweeted texts are noisy and messy; therefore, textual pre-processing is necessary to trim and formalize the inputs before feeding them to the Word2Vec model and the word-embedded CNN. During pre-processing, we removed punctuation marks, emoticons, and numbers from the text. Stemming and lemmatization techniques were also applied. Stemming identifies the common root form of a word by removing or replacing word suffixes (e.g. ‘flooding’ is stemmed as ‘flood’), while lemmatization identifies the inflected forms of a word and returns its base form (e.g. ‘better’ is lemmatized as ‘good’). For tweets that contain URLs, regular expressions are used to match and remove the URLs from their texts. Stopwords are the most common words in a language, hardly contributing to the meaning of a sentence. In this pre-processing step, a list of stopwords was retrieved from the Natural Language Toolkit (NLTK) library (http://www.nltk.org/) and words on the list were removed. We also applied some basic transformations, such as ‘’ve’ to ‘have’, ‘’ll’ to ‘will’, ‘n’t’ to ‘not’, and ‘’re’ to ‘are’, to enhance the comprehension of the algorithm.
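The pipeline described (URL removal, contraction expansion, punctuation/number stripping, stop-word filtering) can be sketched as follows; the stop-word and contraction tables here are tiny samples rather than NLTK's full lists:

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "to"}   # tiny sample list
CONTRACTIONS = {"'ve": " have", "'ll": " will", "n't": " not", "'re": " are"}

def clean_tweet(text: str) -> list:
    text = re.sub(r"https?://\S+", "", text)        # strip URLs first
    for short, full in CONTRACTIONS.items():        # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # drop punctuation/numbers
    return [t for t in text.split() if t not in STOPWORDS]

print(clean_tweet("We've seen flooding! http://t.co/abc123"))
# ['we', 'have', 'seen', 'flooding']
```

Ordering matters: URLs are removed before the lowercase/punctuation pass so that their slashes and digits never reach the tokenizer.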
A review of approaches for topic detection in Twitter
Published in Journal of Experimental & Theoretical Artificial Intelligence, 2021
Zeynab Mottaghinia, Mohammad-Reza Feizi-Derakhshi, Leili Farzinvash, Pedram Salehpour
Stemming is the computational process of reducing all words to their root (or stem), usually by stripping each word of its suffix and derivation (Lovins, 1968). Well-known stemming algorithms include the Porter stemmer (Porter, 2006) and the Lancaster algorithm (Hooper & Paice, 2005). Lemmatization is the process of finding the lemma, or the normalization of a word, such as reducing ‘running’ to its base form ‘run’ (Korenius et al., 2004). Stemming is an algorithmic method, while lemmatization is based on lexical analysis.
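Lovins-style stemming strips the longest matching ending from a suffix list; a minimal sketch with a made-up five-entry ending list (the real Lovins algorithm uses a much longer ending list plus recoding rules):

```python
# Hypothetical mini ending list; the longest applicable match wins.
ENDINGS = sorted(["ational", "ation", "ing", "ness", "s"], key=len, reverse=True)

def longest_match_stem(word: str) -> str:
    for suffix in ENDINGS:
        # Require a stem of at least two characters after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

print(longest_match_stem("normalization"))  # "ation" stripped -> "normaliz"
print(longest_match_stem("running"))        # "ing" stripped   -> "runn"
```

The outputs “normaliz” and “runn” again show why such stems are purely algorithmic artifacts rather than dictionary forms.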