Natural Language Processing
Published in Vishal Jain, Akash Tayal, Jaspreet Singh, Arun Solanki, Cognitive Computing Systems, 2021
V. Vishnuprabha, Lino Murali, Daleesha M. Viswanathan
Most NLP tasks rely on large bodies of text, known as text corpora, which are typically stored in a structured format. For social media analytics, a text corpus is created from text gathered on social media and the web, and for new problems one can build a custom corpus. Some of the available text corpora include the following: Reuters, a collection of news documents; Gutenberg, a collection of books; IMDB Movie Review, a collection of movie reviews from the website imdb.com; the Brown corpus, a large sample of English text; and the Stanford Question Answering Dataset (SQuAD).
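As an illustration of how such corpora can be accessed, the minimal sketch below loads a few of them through NLTK's corpus readers; it assumes the corresponding NLTK data packages can be downloaded and is not a procedure described in the chapter itself.

```python
# Minimal sketch: loading some of the corpora listed above via NLTK.
# Assumes network access for the one-time nltk.download() calls.
import nltk

for pkg in ("reuters", "gutenberg", "brown"):
    nltk.download(pkg)

from nltk.corpus import reuters, gutenberg, brown

print(len(reuters.fileids()), "Reuters news documents")
print(gutenberg.fileids()[:3], "sample Project Gutenberg books")
print(brown.words()[:10], "first words of the Brown corpus")
```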
Text Mining
Published in Rakesh M. Verma, David J. Marchette, Cybersecurity Analytics, 2019
Rakesh M. Verma, David J. Marchette
Any text corpus requires preprocessing to clean out errors and unwanted characters and words. The specifics of the preprocessing are highly corpus-dependent and depend on the desired inference. For example, if the inference is classification, such as determining whether a document is on the topic of biology or physics, words such as “the”, “and”, “therefore”, etc. are “content free”; these are called stopwords in the text-processing literature and would be removed. However, if the topics were logic versus differential geometry, “and” may no longer be considered content free. In a corpus on elephants, the word “elephant” is essentially content free since it is ubiquitous, but the words “African” and “Asian”, although appearing in most documents, may be quite important. Further, for the purposes of determining authorship of a document, it is often precisely the standard stopwords that are useful.
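To make this corpus-dependence concrete, the following sketch adapts a standard stopword list for two of the situations mentioned above; it assumes NLTK's English stopword list, and the example sentence is invented for illustration.

```python
# Minimal sketch: adapting a stopword list to the corpus and the inference task.
# Assumes NLTK's English stopword list; the example sentence is invented.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")

stops = set(stopwords.words("english"))
stops.discard("and")   # keep "and" when the topics (e.g. logic) make it content-bearing
stops.add("elephant")  # a ubiquitous domain word is content-free in a corpus on elephants

text = "The African and Asian elephant populations differ."
tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]
print([w for w in tokens if w not in stops])
# -> ['african', 'and', 'asian', 'populations', 'differ']
```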
Towards POI-based large-scale land use modeling: spatial scale, semantic granularity, and geographic context
Published in International Journal of Digital Earth, 2023
Table 3 lists the top-5 most similar terms for 12 example geographic features based on the embedding learning results for 3 different geographic regions. These geographic features serve different functions in people’s daily lives, including food, education, healthcare, entertainment, transportation, and public services. As can be seen from Table 3, the spatially explicit POI embeddings capture both the unique semantics and the spatial context of geographic features for the specific geographic region. In comparison with word embeddings generated from a regular text corpus such as Google News, the spatially explicit embedding results reflect the unique geospatial semantics of geographic features. For instance, the most semantically similar terms for mall are shopping mall and shopping plaza based on embeddings trained from the Google News corpus, whereas the spatially explicit embedding for mall has different semantics in different geographic regions and contains information about the different types of stores and merchandise that share the same geographic contexts with a mall.
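A rough sketch of the kind of nearest-neighbour query behind such a comparison is shown below. It assumes Gensim's KeyedVectors and the publicly distributed Google News Word2Vec file; the POI embedding file name is a placeholder for illustration, not the authors' actual artifact.

```python
# Rough sketch: querying the top-5 most similar terms, as in Table 3.
# Assumes Gensim; the file names below are placeholders, not the authors' data.
from gensim.models import KeyedVectors

# Embeddings trained on a regular text corpus (public Google News vectors).
news_vecs = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)
print(news_vecs.most_similar("mall", topn=5))

# Spatially explicit POI embeddings for one geographic region (hypothetical file).
poi_vecs = KeyedVectors.load("poi_embeddings_region_A.kv")
print(poi_vecs.most_similar("mall", topn=5))
```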
Probing the Past to Guide the Future IT Regulation Research: Topic Modeling and Co-word Analysis of SOX-IS Research
Published in Information Systems Management, 2022
George Mangalaraj, Anil Singh, Aakash Taneja
Dantu et al. (2020) performed LDA analysis on articles’ abstracts to reveal themes in healthcare Internet of Things research. We follow the same approach and perform LDA analysis to find themes in SOX-IS research. The abstracts were initially pre-processed in multiple steps using Python’s Natural Language Toolkit (NLTK) package. First, we eliminated commonly occurring words (e.g., “the,” “are,” etc.) and words that were part of structured abstracts (e.g., “purpose,” “methodology,” etc.) using a stop-word list. Second, we used part-of-speech tags to retain only verbs, adjectives, nouns, and adverbs. Third, words were lemmatized to arrive at their root (lemma) form. Finally, we eliminated infrequently occurring words (appearing in fewer than three documents) and highly occurring words (appearing in 90% of the documents) to remove the effects of these extreme-frequency words (Wang & Taylor, 2019) and created a text corpus for analysis. This text corpus was analyzed using the Gensim package, a Python library designed to process raw, unstructured texts, and MALLET (MAchine Learning for LanguagE Toolkit), a Java-based package for natural language processing and topic modeling.
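A minimal sketch of this preprocessing and modeling pipeline, using NLTK and Gensim rather than MALLET, might look as follows. The `abstracts` input (a list of abstract strings), the structured-abstract word list, and the number of topics are illustrative assumptions, not values reported by the authors.

```python
# Minimal sketch of the described pipeline with NLTK + Gensim.
# `abstracts`, the structured-abstract words, and num_topics are assumptions.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim import corpora
from gensim.models import LdaModel

for pkg in ("stopwords", "punkt", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(pkg)

stops = set(stopwords.words("english")) | {"purpose", "methodology", "findings"}
keep_tags = ("NN", "JJ", "VB", "RB")  # nouns, adjectives, verbs, adverbs
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = [w for w in nltk.word_tokenize(text.lower())
              if w.isalpha() and w not in stops]          # step 1: stopwords
    kept = [w for w, tag in nltk.pos_tag(tokens)
            if tag.startswith(keep_tags)]                 # step 2: POS filter
    return [lemmatizer.lemmatize(w) for w in kept]        # step 3: lemmatize

docs = [preprocess(a) for a in abstracts]

dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=3, no_above=0.9)      # step 4: drop extreme-frequency words
bow = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(bow, num_topics=10, id2word=dictionary, passes=10)
print(lda.print_topics())
```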
Prediction of user loyalty in mobile applications using deep contextualized word representations
Published in Journal of Information and Telecommunication, 2022
Bidirectional encoder representations from transformers (BERT). BERT, which stands for bidirectional encoder representations from transformers, marks the frontier of a new kind of language representation. Distinct from earlier language representation methods, BERT is designed to pretrain deep bidirectional representations from untagged text by jointly conditioning on both left and right context in all layers. That is, it is trained for language understanding on a large text corpus and then employed on natural language processing tasks. Moreover, BERT is the first model to combine an unsupervised learning approach with a deeply bidirectional architecture. This means that training with the BERT model is performed on raw text data, which is the novelty in the NLP task. Pretrained representations can be either context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as Word2Vec or GloVe produce only a single ‘word embedding' representation for each term in the vocabulary, while contextual models instead produce a representation of each word that is based on the other terms in the sentence (Devlin et al., 2018).
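The difference between context-free and contextual representations can be illustrated with a short sketch using the Hugging Face transformers library and the public bert-base-uncased checkpoint (an assumption for illustration, not the setup used in the study): the same word receives different vectors in different sentences.

```python
# Sketch: contextual BERT vectors differ by sentence, unlike Word2Vec/GloVe.
# Assumes the Hugging Face `transformers` package and bert-base-uncased.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def word_vector(sentence, word):
    """Return BERT's contextual vector for `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    idx = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()).index(word)
    return hidden[idx]

v1 = word_vector("He sat by the river bank.", "bank")
v2 = word_vector("She deposited cash at the bank.", "bank")
# The two vectors differ because each depends on the surrounding context.
print(torch.cosine_similarity(v1, v2, dim=0).item())
```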