Management Rule Mining Computing
Published in Parveen Berwal, Jagjit Singh Dhatterwal, Kuldeep Singh Kaswan, Shashi Kant, Computer Applications in Engineering and Management, 2022
Parveen Berwal, Jagjit Singh Dhatterwal, Kuldeep Singh Kaswan, Shashi Kant
As the number of text documents accessible on the Internet grows every day, efficient text retrieval and screening have become vital to the organization and administration of numerous information tasks. Text classification is an increasingly important method for handling this huge volume of data. Feature selection (FS) is usually used to reduce the complexity of data sets containing thousands of features that could not otherwise be processed. Text classification is one of the problems in which FS is vital. The high dimensionality of the feature space is a major challenge in categorization; hence FS is the first step in text categorization. FS is critical for efficiency and accuracy, since it not only reduces the dimensionality of the input feature set but also reduces potential biases hidden in the original text. FS measures are commonly built from term frequency (TF) and inverse document frequency (IDF), combined as TF-IDF = TF × log(N/DF), where N is the total number of documents and DF is the number of documents containing the term in question. The objective of FS approaches is to reduce the data size by deleting characteristics that are regarded as unimportant for categorization. In this study, we offer a new FS algorithm named HCSGA. The proposed approach is applied to text features under the bag-of-words model, whereby each position in the input vector corresponds to a particular term in the original document, and a document is viewed as a set of words or phrases [50].
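As an illustration of the TF-IDF weighting referred to above (not of the HCSGA algorithm itself), the following minimal Python sketch computes TF × log(N/DF) over a tiny invented corpus:

```python
# Minimal sketch of TF-IDF: TF-IDF(t, d) = TF(t, d) * log(N / DF(t)),
# where N is the number of documents and DF(t) is the number of documents
# containing term t. The corpus below is invented for demonstration.
import math
from collections import Counter

docs = [
    "feature selection reduces the feature space",
    "text classification assigns labels to documents",
    "feature selection helps text classification",
]

tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: number of documents in which each term appears.
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term)                     # raw term frequency in the document
    idf = math.log(N / df[term]) if df[term] else 0.0
    return tf * idf

print(tf_idf("feature", tokenized[0]))  # appears in 2 of 3 documents, twice in doc 0
```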
A Study of Proximity of Domains for Text Categorization
Published in Sk Md Obaidullah, KC Santosh, Teresa Gonçalves, Nibaran Das, Kaushik Roy, Document Processing Using Machine Learning, 2019
Ankita Dhar, Niladri Sekhar Dash, Kaushik Roy
There are a number of techniques used in text categorization. One of the most common and widely used approaches is called a ‘bag of words’. This method is simple to represent and implement, although even such a simple method can take a long time in NLP to produce an encouraging outcome. The approach is also flexible enough to be used in innumerable procedures for extracting features from texts. The bag of words can be described as the occurrence of words within a text document. It is a procedure for representing textual information for machine learning algorithms. In natural language processing, the vectors developed from text data represent different linguistic characteristics of the texts. The approach considers only whether a particular term is present in the text document, not where in that document it occurs. In this approach, a histogram of the tokens within the text is treated as a feature. The underlying idea is that a text document is said to be similar to another text document if both have similar content. The bag-of-words model can be very simple or very complex depending on the problem; the complexity arises in designing the vocabulary of tokens and the procedures for scoring the presence of those tokens.
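A minimal sketch of this representation, using an invented two-document corpus: each document is reduced to a histogram of token counts over a shared vocabulary, with word order discarded.

```python
# Bag-of-words sketch: build a shared vocabulary, then represent each
# document as a vector of token counts (word order is ignored).
from collections import Counter

corpus = [
    "the service was great and the food was great",
    "the food was cold",
]

# Vocabulary design: here, simply every distinct token in the corpus.
vocabulary = sorted({token for doc in corpus for token in doc.split()})

def bag_of_words(doc):
    counts = Counter(doc.split())
    return [counts[token] for token in vocabulary]   # one count per vocabulary term

for doc in corpus:
    print(bag_of_words(doc))
```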
Performance of serial and parallel processing on online sentiment analysis of fast food restaurants
Published in Shin-ya Nishizaki, Masayuki Numao, Jaime Caro, Merlin Teodosia Suarez, Theory and Practice of Computation, 2019
B. Quijano, M.R. Nabus, L.L. Figueroa
Analysis of people’s sentiments and opinions was not an active area of research until the early 2000s. The surge of active research on sentiment analysis (Nasukawa & Yi, 2003), also known as opinion mining (Dave et al. 2003), was due to the massive increase of opinionated data coming from the World Wide Web in particular; examples of such opinionated data are online user reviews of service institutions such as hotels, restaurants, and travel agencies (Pang & Lee, 2008). While natural language processing made extensive use of non-probabilistic models such as Chomsky Normal Form (Chomsky, 1959) in the past, statistical and probabilistic models such as the bag-of-words model have made a resurgence owing to the increased processing power of computers, which can be used to handle huge amounts of data (Lee, 2016). The bag-of-words model is the representation of words in a text document as a set of words and their frequencies, regardless of the ordering of the words in that document (Brownlee, 2017). Document classification methods such as Naive Bayes classification make use of this model for applications such as spam filtering (Sahami et al. 1998).
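To make the last point concrete, the following toy sketch pairs bag-of-words counts with a Naive Bayes classifier in the spirit of the spam-filtering application cited above; the sentences, labels, and use of scikit-learn are illustrative assumptions, not part of the cited work.

```python
# Toy Naive Bayes spam filter over bag-of-words counts (invented data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "win a free prize now",               # spam
    "limited offer win money",            # spam
    "meeting agenda for monday",          # ham
    "please review the attached report",  # ham
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()            # bag-of-words: word counts, order ignored
X = vectorizer.fit_transform(texts)

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["free money offer"])))
```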
What we talk about when we talk about EEMs: using text mining and topic modeling to understand building energy efficiency measures (1836-RP)
Published in Science and Technology for the Built Environment, 2023
Apoorv Khanuja, Amanda L. Webb
After tokenization, the stop words were removed from this bag of words using the R package stopwords (Benoit, Muhr, and Watanabe 2021). Stop words are frequently occurring but uninformative words (e.g., and, or, to, the) and are often removed from textual data prior to text mining. The list of stop words used for this analysis came from the snowball lexicon within the stopwords package, which was selected because its relatively short list of stop words would retain most of the EEM text. In addition to removing stop words, tokens in which the first character was a number were also removed, because these tokens generally provided an unnecessary level of detail (e.g., specific temperature setpoints, COP values, or the name of a standard such as ASHRAE 62.1) that was not essential to describing the EEM. However, tokens that contained numbers but started with a letter (e.g., T8, T12, CO2, etc.) were not removed, since they provided useful information regarding the specific type of building component affected by an EEM.
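These filtering rules can be illustrated with a rough Python analogue (the study itself used the R stopwords package with the snowball lexicon; the stop-word list and example tokens below are placeholders):

```python
# Rough analogue of the token filtering described above:
# drop stop words and drop tokens whose first character is a digit,
# while keeping tokens like "T8" or "T12" that start with a letter.
stop_words = {"and", "or", "to", "the", "a", "of", "for", "with"}

tokens = ["replace", "T12", "lamps", "with", "T8", "and", "reset", "68F", "setpoint", "62.1"]

filtered = [
    t for t in tokens
    if t.lower() not in stop_words      # frequently occurring but uninformative words
    and not t[0].isdigit()              # tokens beginning with a number (e.g., 68F, 62.1)
]
print(filtered)   # T12 and T8 survive because they start with a letter
```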
An Optimized Crossover Framework for Social Media Sentiment Analysis
Published in Cybernetics and Systems, 2022
Surender Singh Samant, Vijay Singh, Arun Chauhan, Jagadish Dasarahalli Narasimaiah
The Bag of Words (BoW; Chen, Yap, and Chau 2011) model is the simplest form of representing text as numbers. The Bag-of-Words counts the total occurrences of the most frequently used words in the document. However, the Bag-of-Words suffers from drawbacks such as: “(a) If the new sentences contain new words, then our vocabulary size would increase and thereby, the length of the vectors would increase too and (b) Additionally, the vectors would also contain many 0s, thereby resulting in a sparse matrix (which is what we would like to avoid)”. In this study, a Bigram-BoW (B-BoW) model is proposed. The Bigram-BoW (B-BoW) compiles the list of all the words in the corpus; the words in each document are then scored based on their frequency using bigrams (instead of the unigrams used in the standard BoW). The frequency of each word is computed, and the positive weight is computed as per Eq. (1).
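A minimal sketch of the bigram counting that underlies such a B-BoW representation; the weighting of Eq. (1) is not reproduced here, and the example sentence is invented:

```python
# Bigram bag-of-words sketch: documents are scored by counts of adjacent
# word pairs instead of single words.
from collections import Counter

def bigram_bow(doc):
    tokens = doc.split()
    return Counter(zip(tokens, tokens[1:]))   # adjacent word pairs

doc = "the food was not good not good at all"
print(bigram_bow(doc))   # e.g. ('not', 'good') appears twice
```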
Focused domain contextual AI chatbot framework for resource poor languages
Published in Journal of Information and Telecommunication, 2019
Anirudha Paul, Asiful Haque Latif, Foysal Amin Adnan, Rashedur M Rahman
SVM is one of the most popular text categorization methods, based on the structural risk minimization principle introduced by Vapnik (Burges, 1998). We built a bag of words, which is a vector representation of specific word occurrences (Zhang, Jin, & Zhou, 2010). The target was to associate unique words with unique intents by considering the occurrence of each word in a certain classification. So the input was a vector of numbers, where a zero indicated that the word was not present in the current input and a nonzero number indicated how many times the word occurred in the current sentence. Since the specific words of all training intents were included in the vector, it became very sparse for each training intent; as a result, most of the elements in the vector were zero. For building the linear SVM, we started with a combination of the ‘hinge’ loss function and the ‘L2’ regularization method.
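A hedged sketch of this setup: sparse bag-of-words vectors fed to a hinge-loss, L2-regularized linear classifier. The intent labels and sentences are invented, and scikit-learn's SGDClassifier is used as one way to combine hinge loss with L2 regularization; it is not necessarily the library the authors used.

```python
# Linear classifier with hinge loss and L2 regularization over sparse
# bag-of-words vectors (toy intent-classification data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

sentences = [
    "what time do you open",
    "when are you open today",
    "i want to book a table",
    "reserve a table for two",
]
intents = ["hours", "hours", "booking", "booking"]

vectorizer = CountVectorizer()                    # sparse word-occurrence vectors
X = vectorizer.fit_transform(sentences)

clf = SGDClassifier(loss="hinge", penalty="l2")   # linear SVM-style objective
clf.fit(X, intents)
print(clf.predict(vectorizer.transform(["can i book a table"])))
```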