Speaking Naturally: Text and Natural Language Processing
Published in Jesús Rogel-Salazar, Advanced Data Science and Analytics with Python, 2020
Welcome to the world of regular expressions! As the name implies, a regular expression is an utterance (typically in text form) that appears in a corpus with a certain frequency or regularity. Recognising those patterns in the corpus relies on determining the characters that make up the expression: letters, digits, punctuation and any other symbol, including special characters and even characters from other scripts, such as Japanese, Chinese, Arabic or Devanagari. A corpus is a large and structured set of texts upon which linguistic analysis can be performed.
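To make the idea concrete, here is a minimal sketch in Python using the standard re module; the date pattern and sample strings are invented for illustration and do not come from the book.

import re

# Match ISO-style dates such as "2020-05-17" anywhere in a text.
date_pattern = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
text = "The corpus was assembled on 2020-05-17 and revised on 2021-01-03."
print(date_pattern.findall(text))  # ['2020-05-17', '2021-01-03']

# Python 3 regular expressions are Unicode-aware, so \w also matches
# word characters from other scripts, e.g. Japanese.
print(re.findall(r"\w+", "日本語 corpus"))  # ['日本語', 'corpus']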
Pre-Processing of Dogri Text Corpus
Published in Durgesh Kumar Mishra, Nilanjan Dey, Bharat Singh Deora, Amit Joshi, ICT for Competitive Strategies, 2020
A corpus is a large collection of data, in either written or spoken form. A standard corpus is required for carrying out any NLP task. Since the Dogri language is new to the area of NLP, no such linguistic resources are available to researchers. Thus, the creation and pre-processing of a corpus are taken up in this research. In pre-processing, generating a stop-word list is the foremost requirement for any NLP task, and the chosen corpus must be diverse in nature so that results can be evaluated efficiently. As this is a prime need for natural language processing tasks, this paper proposes to build a corpus and apply pre-processing techniques to it.
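As a rough sketch of how a stop-word list might be generated (the paper's exact procedure is not shown in this excerpt), a common assumption for under-resourced languages is that the most frequent tokens in a diverse corpus are stop-word candidates; the function below uses naive whitespace tokenisation and a placeholder corpus.

from collections import Counter

def candidate_stopwords(documents, top_n=50):
    """Return the top_n most frequent tokens as stop-word candidates."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())  # naive whitespace tokenisation
    return [word for word, _ in counts.most_common(top_n)]

# Placeholder documents; in practice these would be raw Dogri texts.
corpus = ["this is a small sample text", "this text is another sample"]
print(candidate_stopwords(corpus, top_n=3))  # ['this', 'is', 'sample']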
Role of Computational Intelligence in Natural Language Processing
Published in Brojo Kishore Mishra, Raghvendra Kumar, Natural Language Processing in Artificial Intelligence, 2020
Bishwa Ranjan Das, Brojo Kishore Mishra
The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real-world examples (a corpus (plural, “corpora”) is a set of documents, possibly with human or computer annotations).
Classification of Customer Reviews Using Machine Learning Algorithms
Published in Applied Artificial Intelligence, 2021
Existing approaches to sentiment analytics can be classified into two broad categories: semantic orientation approaches and machine learning approaches (Fu et al. 2018). Semantic orientation approaches hold that text is classified into affect categories on the basis of the presence of fairly unambiguous affect words, such as “happy,” “sad,” “afraid,” and “bored.” Semantic orientation approaches are popular thanks to their accessibility and economy. However, the weaknesses of these approaches include poor affect recognition given complex linguistic rules, and heavy dependence on the depth and breadth of the employed lexicon resources (Fu et al. 2018). For a domain lacking such resources, machine learning approaches can mitigate the above limitations. By feeding a machine learning algorithm a training corpus of affectively annotated texts, machine learning approaches can not only learn the affective polarity of affect keywords but can also consider the polarity of other arbitrary keywords and word co-occurrence frequencies (Cambria 2017). However, machine learning approaches rely on statistical models that are meaningful when given a sufficiently large text input; therefore, the approaches can achieve better performance on the document or paragraph level compared to smaller text units, such as sentences or clauses (Fu et al. 2018).
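A minimal sketch of this machine learning route, using scikit-learn's bag-of-words features and a Naive Bayes classifier; the tiny annotated corpus and its labels are invented for illustration and are far smaller than the "sufficiently large text input" the passage calls for.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training corpus of affectively annotated texts.
texts = [
    "I am so happy with this product",
    "absolutely wonderful, exceeded my expectations",
    "this made me sad and disappointed",
    "terrible quality, I am afraid to use it again",
]
labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words counts + multinomial Naive Bayes: the model learns the
# polarity of arbitrary keywords from their co-occurrence with labels,
# not only from a fixed affect lexicon.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["wonderful product, very happy"]))  # ['positive'] on this toy data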
Automated trading systems statistical and machine learning methods and hardware implementation: a survey
Published in Enterprise Information Systems, 2019
Boming Huang, Yuxiang Huan, Li Da Xu, Lirong Zheng, Zhuo Zou
In this framework, both market and textual data are used as input. At the output end, decision-making is fulfilled by a machine learning algorithm. The researched text bases are derived from blogs, forums, news, social media and corporation filings. Before qualitative information can be used, the framework performs feature selection and feature representation to transform the text into a readable format that indicates a certain category of sentiment. An appropriate categorization methodology is crucial for acquiring sufficiently correct outputs. The most commonly used method is the dictionary-based method, whose baseline approach is known as the ‘bag-of-words’ algorithm, in which each single word of a given text is classified, regardless of order, based on predesigned dictionary categories (weights) (Manning and Schütze 1999). Further developed approaches, such as noun phrases, named entities and n-grams, are harnessed to categorize special patterns of word combinations or word sequences (Kazemian, Zhao, and Penn 2014; Schumaker and Chen 2009). Several studies have used machine learning to score the nature of the text, and supervised learning, i.e. SVM and Naive Bayes (NB), has been used to train predictive models based on a specially designated training corpus in which each word or phrase and the overall sentiment of articles were manually classified. Although the introduction of machine learning programs did improve the results, these programs were time consuming (Kearney and Liu 2014). However, the general accuracy presented in studies of financial qualitative information classification is not satisfactory. As alluded to by Hagenau, Liebmann, and Neumann (2013), this research area is still in its infancy, and few studies have demonstrated correct ratios higher than 70%, with most under 60%, which is only slightly above the probability of tossing a coin. Such low accuracy should inspire a greater focus on risk prevention in real practice, and it explains why pure text-based systems are rarely used in practice. An interesting exception is the research by Preis, Moat, and Stanley (2013), who used only the search volume of a single word in Google Trends and achieved a remarkable return from 2004 to 2011.
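To illustrate the baseline ‘bag-of-words’ dictionary method and its n-gram extension, here is a small sketch; the lexicon weights and the sample headline are invented, not drawn from the cited studies.

# Dictionary-based baseline: each word is scored independently of
# order against a predesigned dictionary of weights (a toy lexicon).
LEXICON = {"gain": 1.0, "growth": 1.0, "profit": 0.5,
           "loss": -1.0, "risk": -0.5, "decline": -1.0}

def bag_of_words_score(text):
    tokens = text.lower().split()
    return sum(LEXICON.get(tok, 0.0) for tok in tokens)

# N-gram variants instead score fixed-length word sequences, so that
# patterns such as "no growth" are not misread as positive.
def ngrams(tokens, n=2):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

headline = "quarterly profit growth despite currency risk"
print(bag_of_words_score(headline))   # 0.5 + 1.0 - 0.5 = 1.0
print(ngrams(headline.split(), n=2))  # ['quarterly profit', 'profit growth', ...]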
Identifying urban functional zones by capturing multi-spatial distribution patterns of points of interest
Published in International Journal of Digital Earth, 2022
Quan Qin, Shishuo Xu, Mingyi Du, Songnian Li
A variety of methods have been used in existing POI-based studies to identify UFZs, such as predefining rules (Song et al. 2018) and extracting low-level features (e.g. frequencies) of POIs to roughly identify UFZs using machine learning methods (Jiang et al. 2015; Hu et al. 2016; Gong et al. 2020; Tu et al. 2020; Zong et al. 2020). These methods miss the relation between POI data and regional socio-economic characteristics. Natural Language Processing (NLP) methods are promising solutions for understanding the potential relation between POI data and socio-economic characteristics from a textual description perspective and have been increasingly used for UFZ identification (Chen, Xu, and Gong 2021). In the NLP domain, a corpus generally refers to a substantial collection of organized texts (Ng and Zelle 1997). Correspondingly, a geo-corpus is constructed from geospatial data (e.g. the POI data used in this work) under specific sampling strategies within a region. As such, NLP methods can effectively paraphrase the potential relation between POI data and regional socio-economic characteristics in a geo-corpus (i.e. the urban functions of a region), which is similar to understanding the relation between words and sentences or paragraphs in a corpus. Earlier studies mainly tried Term Frequency-Inverse Document Frequency (TF-IDF) (Yuan, Zheng, and Xie 2012; Yuan et al. 2015; Qian et al. 2021), Latent Dirichlet Allocation (LDA) (Yuan, Zheng, and Xie 2012; Yuan et al. 2015; Chen, Huang, and Xu 2017; Xing and Meng 2018; Chang et al. 2020a), and other topic models to infer regional functional semantics, achieving higher accuracy than the aforementioned methods. However, the adapted Bag-of-Words (BOW)-based geo-corpora constructed by those NLP methods are unordered and thus lack sequential and contextual information. ‘Unordered’ means that there is no order or sequence among the words in the BOW-based geo-corpus, just as items in a bag have no order. In contrast, sequential and contextual information refers to the order and sequence of words within documents. Since different sequences and contexts of words produce different meanings for the words and even the whole document, the sequence of POIs in a geo-corpus affects the understanding of the urban functions of the corresponding zone. In other words, these methods ignore the spatial relation and spatial interaction of POIs, which help capture the spatial heterogeneity of UFZ semantics and play an important role in accurately identifying UFZs. For instance, it is difficult for a topic model to distinguish two categories of UFZs with similar POI statistical frequency features but different spatial distributions. As such, it is important to take the sequential and contextual information of POIs into consideration for UFZ identification.
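A minimal sketch of the TF-IDF step discussed above, treating each zone's POI category labels as one "document" of a geo-corpus; the zone contents are invented for illustration, and the example also shows why a BOW representation discards the POI order the authors argue matters.

from sklearn.feature_extraction.text import TfidfVectorizer

# Each string is one zone's geo-corpus document: the POI category
# labels sampled within that zone. Order is irrelevant to TF-IDF.
zone_docs = [
    "restaurant cafe bar restaurant cinema",        # entertainment-like zone
    "office office bank office conference_center",  # business-like zone
    "school library school playground",             # education-like zone
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(zone_docs)
terms = vectorizer.get_feature_names_out()

# High TF-IDF weights flag POI categories distinctive of a zone's
# function; two zones with the same category frequencies but different
# spatial layouts would be indistinguishable here.
for zone_id, row in enumerate(tfidf.toarray(), start=1):
    top = sorted(zip(terms, row), key=lambda pair: -pair[1])[:2]
    print(f"zone {zone_id}:", top)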