Reviews Analysis of Apple Store Applications Using Supervised Machine Learning
Published in Rashmi Agrawal, Marcin Paprzycki, Neha Gupta, Big Data, IoT, and Machine Learning, 2020
Sarah Al Dakhil, Sahar Bayoumi
First, for all categories we take Rating 3 and Text Review 3 (the last review by the third user) and exclude every other attribute, removing links, emojis, numbers, punctuation marks, commas and stop words. Stop words are common English words such as “the,” “am,” “their”, which do not influence the semantics of the review; removing them can reduce noise and improve the accuracy of machine learning classifiers. We applied text processing and text normalisation (stop-word removal, stemming and lemmatisation) using the Natural Language Toolkit (NLTK). NLTK is used for word tokenisation, POS (part-of-speech) tagging, lemmatisation and stemming. NLTK is a toolkit designed for symbolic and statistical natural language processing, created in 2001 in the Department of Computer and Information Science at the University of Pennsylvania.
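The cleaning pipeline described above can be sketched with only the standard library. The authors used NLTK; the tiny stop-word list and the crude suffix-stripping “stemmer” below are simplified stand-ins for NLTK’s stop-word corpus and PorterStemmer, used purely for illustration.

```python
import re

# Illustrative sample stop-word list (NLTK ships a much larger one).
STOP_WORDS = {"the", "am", "their", "is", "a", "an", "and", "of"}

def clean_review(text: str) -> list[str]:
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"[^\x00-\x7f]", " ", text)   # strip emojis / non-ASCII
    text = re.sub(r"\d+", " ", text)            # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)        # remove punctuation and commas
    tokens = text.lower().split()               # simple word tokenisation
    return [t for t in tokens if t not in STOP_WORDS]

def stem(word: str) -> str:
    # Crude suffix stripping, standing in for NLTK's PorterStemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = clean_review("Loved the app!!! 5 stars https://example.com 😀")
print([stem(t) for t in tokens])  # → ['lov', 'app', 'star']
```

A real pipeline would swap `STOP_WORDS` for `nltk.corpus.stopwords.words('english')` and `stem` for `nltk.stem.PorterStemmer().stem`, keeping the same structure.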
Personality and emotion based cyberbullying detection on YouTube using ensemble classifiers
Published in Behaviour & Information Technology, 2022
Vimala Balakrishnan, See Kiat Ng
Common Natural Language Processing (NLP) steps were performed prior to feature extraction, a stage deemed necessary to prepare the unstructured data for classification. This is especially important when dealing with social media data, which tend to contain a lot of ‘noise’, including emoji, emoticons, abbreviations and slang/dialects, among others. Processes performed in this stage include transformation (i.e. conversion of uppercase to lowercase, and removal of hashtags, hyperlinks, punctuation and stop words). The conversion to lowercase was done to ease interpretation by the machine, whereas the removal of hashtags, hyperlinks, etc. was done to reduce the ‘noise’. As described later, the text is tokenised into individual tokens (i.e. words), and thus punctuation carries no meaning. Also, stop words such as ‘a’, ‘and’, etc. appear frequently in English text and do not provide valuable information. This step is then followed by lemmatisation, which reduces a word to its root (e.g. crying, cries and cried are all represented by cry), and tokenisation (extraction of each individual word). The pre-processing steps above are consistent with those adopted by Dadvar et al. (2014) and Dadvar and Eckert (2020), who used the same dataset, and with other NLP studies on cyberbullying (Al-Garadi, Varathan, and Ravana 2016; Bozyiğit, Utku, and Nasibov 2021) and other issues including fake news (Elhadad, Li, and Gebali 2020; Khan et al. 2021). All of these NLP tasks were accomplished through the use of the Python-based Natural Language Toolkit (NLTK).1
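The transformation, stop-word removal, lemmatisation and tokenisation steps above can be sketched as follows (stdlib only; the study used NLTK). The lemma table is a toy lookup built from the paper’s own crying/cries/cried example; NLTK’s WordNetLemmatizer would cover the full vocabulary.

```python
import re

STOP_WORDS = {"a", "and", "the", "is", "was"}                    # small sample
LEMMAS = {"crying": "cry", "cries": "cry", "cried": "cry"}       # toy lookup

def preprocess(post: str) -> list[str]:
    post = post.lower()                          # uppercase -> lowercase
    post = re.sub(r"#\w+", " ", post)            # remove hashtags
    post = re.sub(r"https?://\S+", " ", post)    # remove hyperlinks
    post = re.sub(r"[^a-z\s]", " ", post)        # remove punctuation etc.
    tokens = post.split()                        # tokenisation
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]    # lemmatisation

print(preprocess("She CRIED and was crying #sad https://youtu.be/x"))
# → ['she', 'cry', 'cry']
```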
Creating research topic map for NIMS SAMURAI database using natural language processing approach
Published in Science and Technology of Advanced Materials: Methods, 2021
Sae Dieb, Kou Amano, Kosuke Tanabe, Daitetsu Sato, Masashi Ishii, Mikiko Tanifuji
We used the XML tag name for each section provided by the publisher to extract the section. Because the main body of the publication is not collected, the effect of including the ‘Keywords’ section is balanced. The text of each section is segmented into tokens using the NLTK word tokenizer. The Natural Language Toolkit (NLTK) is an open-source Python package with data sets, supporting research and development in Natural Language Processing [23]. The following processing steps are then conducted on the tokenized texts to remove noisy data:
□ Removing numeric values and punctuation marks (for example, “23.5”, “!”, “?”). Such data are not related to the topics discussed in the publications. Numerical values are important in some “grey area” topics, such as reports of catalysts with “high yield”, but tackling these issues is beyond the scope of this study.
□ Filtering general English stop-words such as “but”, “an”, “he”. These stop-words occur frequently in English but do not carry a thematic component or significance themselves [24].
□ Removing physical units such as “m” (metre) for length measurement and “K” (Kelvin) for temperature measurement, among others, which are frequently found in materials science research publications but, for our objective in this study, are not informative regarding the research output. The list was compiled from SI base and derived units [25].
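The three filtering steps above can be sketched as a single token filter (stdlib only). The stop-word and unit lists below are small illustrative samples, not the full English stop-word list or the SI-derived unit list [25] used in the study.

```python
import re

STOP_WORDS = {"but", "an", "he", "the", "of", "at", "a"}     # sample only
SI_UNITS = {"m", "k", "s", "kg", "mol", "pa", "j", "hz"}     # sample only

def is_numeric(tok: str) -> bool:
    return bool(re.fullmatch(r"[\d.,]+", tok))    # e.g. "23.5"

def is_punct(tok: str) -> bool:
    return bool(re.fullmatch(r"\W+", tok))        # e.g. "!", "?"

def filter_tokens(tokens: list[str]) -> list[str]:
    kept = []
    for tok in tokens:
        if is_numeric(tok) or is_punct(tok):
            continue                              # step 1: numbers, punctuation
        if tok.lower() in STOP_WORDS:
            continue                              # step 2: stop-words
        if tok.lower() in SI_UNITS:
            continue                              # step 3: physical units
        kept.append(tok)
    return kept

print(filter_tokens(["annealed", "at", "1200", "K", "but", "the", "yield", "!"]))
# → ['annealed', 'yield']
```

In the study the token stream would come from `nltk.word_tokenize` rather than a hand-written list.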
A multi-objective PSO approach of mining association rules for affective design based on online customer reviews
Published in Journal of Engineering Design, 2018
Huimin Jiang, C. K. Kwong, W. Y. Park, K. M. Yu
A number of methods and tools are available for sentiment analysis, such as Python’s NLTK (Natural Language Toolkit), R (text mining package), RapidMiner, Semantria, LingPipe, and LIWC 2007 (Linguistic Inquiry and Word Count). In this study, Semantria was chosen because it is a popular and well-known text analysis tool. Semantria provides an Excel add-in that enables spreadsheets to be analysed for positive, neutral, and negative sentiment. The add-in conducts an automated sentiment analysis that extracts sentiment from online reviews in a manner similar to human processing, comprising the five processes described above.
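Semantria is a commercial tool, so as an illustrative stand-in, the positive/neutral/negative classification it performs can be sketched with a minimal lexicon-based polarity scorer in plain Python. The word lists here are tiny made-up samples, not Semantria’s actual lexicon or process.

```python
# Toy polarity lexicons (illustrative samples only).
POSITIVE = {"good", "great", "love", "excellent", "comfortable"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "uncomfortable"}

def sentiment(review: str) -> str:
    words = review.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("love the comfortable grip"))   # → positive
print(sentiment("terrible and uncomfortable"))  # → negative
```

Production tools such as Semantria or NLTK’s sentiment modules add tokenisation, negation handling and weighted lexicons on top of this basic counting idea.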