Performance of serial and parallel processing on online sentiment analysis of fast food restaurants
Published in Shin-ya Nishizaki, Masayuki Numao, Jaime Caro, Merlin Teodosia Suarez, Theory and Practice of Computation, 2019
B. Quijano, M.R. Nabus, L.L. Figueroa
Analysis of people's sentiments and opinions was not an active area of research until the early 2000s. The surge of active research on sentiment analysis (Nasukawa & Yi, 2003), also known as opinion mining (Dave et al., 2003), was driven by the massive increase of opinionated data coming from the World Wide Web; examples of such opinionated data are online user reviews of service institutions such as hotels, restaurants, and travel agencies (Pang & Lee, 2008). While natural language processing has in the past made extensive use of non-probabilistic models such as grammars in Chomsky Normal Form (Chomsky, 1959), statistical and probabilistic models such as the bag-of-words model have made a resurgence due to the increased processing power of computers, which can now handle huge amounts of data (Lee, 2016). The bag-of-words model represents a text document as a set of words and their frequencies, regardless of the ordering of the words in that document (Brownlee, 2017). Document classification methods such as the Naive Bayes classifier make use of this model for applications such as spam filtering (Sahami et al., 1998).
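As a minimal sketch of the bag-of-words representation described above (an illustration, not the authors' implementation), the following Python snippet counts word frequencies while discarding word order; the example review text is hypothetical:

```python
from collections import Counter

def bag_of_words(document: str) -> Counter:
    """Represent a document as word frequencies, ignoring word order."""
    tokens = document.lower().split()
    return Counter(tokens)

review = "great food great service but slow slow kitchen"
print(bag_of_words(review))
# e.g. Counter({'great': 2, 'slow': 2, 'food': 1, 'service': 1, 'but': 1, 'kitchen': 1})
```

Because only the multiset of words is kept, "great food" and "food great" map to exactly the same representation, which is what makes the model simple enough for classifiers such as Naive Bayes.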
Word Embeddings
Published in Jan Žižka, František Dařena, Arnošt Svoboda, Text Mining with Machine Learning, 2019
Jan Žižka, František Dařena, Arnošt Svoboda
The main problem with the bag-of-words model is that it does not capture relations between words. In the bag-of-words model, each word or other feature of a text is represented by one dimension in the multidimensional space used to represent the documents (this is known as one-hot representation [101]). Each such dimension is independent of the others because it is represented by only a single value, which prevents information from being shared across features. It is therefore not possible to say that, for example, the word football is more similar to the word soccer than to the word ballet.
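A small numpy sketch (an illustration, not from the chapter) makes this limitation concrete: any two distinct one-hot vectors are orthogonal, so football is no closer to soccer than it is to ballet:

```python
import numpy as np

vocabulary = ["football", "soccer", "ballet"]
# One-hot encoding: each word occupies its own independent dimension.
one_hot = {word: np.eye(len(vocabulary))[i] for i, word in enumerate(vocabulary)}

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Every pair of distinct words has similarity 0 -- no relation is captured.
print(cosine_similarity(one_hot["football"], one_hot["soccer"]))  # 0.0
print(cosine_similarity(one_hot["football"], one_hot["ballet"]))  # 0.0
```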
Comparative Study of Machine Learning Algorithms on Sentiment Analysis of Product Reviews
Published in Durgesh Kumar Mishra, Nilanjan Dey, Bharat Singh Deora, Amit Joshi, ICT for Competitive Strategies, 2020
Ujwala Baruah, Ratnadeep Das, Amitabha Deb, Shah Alam Mazumder
Bag-of-words model: Bag-of-words is a model in which the features are the individual words of a sentence, under the assumption that the words are conditionally independent. The text is converted into feature vectors in which each feature represents the existence of one word. Bag-of-words is essentially an unordered collection of words, and these words are selected from the texts through feature selection methods.
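As a hedged sketch of such a representation (using scikit-learn rather than the authors' pipeline, and assuming scikit-learn ≥ 1.0 for get_feature_names_out), binary=True records only the existence of each word, and max_features acts as a simple frequency-based feature selection; the review texts are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "the camera quality is excellent",
    "battery life is poor but the camera is good",
    "excellent battery life",
]

# binary=True marks word existence (not counts); max_features keeps only
# the most frequent terms, a simple form of feature selection.
vectorizer = CountVectorizer(binary=True, max_features=5)
features = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())
print(features.toarray())
```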
Deep mining of open source software bug repositories
Published in International Journal of Computers and Applications, 2022
Embedding (also known as distributed representation [22,37]) is a technique for learning the representation of entities (words or sentences) as real-valued, fixed-length vectors in a continuous vector space. The vectors are learned such that entities that have similar meanings are close to each other in the vector space [26]. Embedding provides a more expressive representation of text than classical methods like the bag-of-words model, where semantic similarity between words or tokens is ignored, or considered only through the use of n-grams. Word embedding is the vector representation of words; Word2vec is a widely used technique for learning such embeddings. Mikolov et al. [22] proposed two techniques for Word2vec, namely 'Continuous Bag of Words' (CBOW) and 'Skip-Gram' (SG), in which a neural network that captures the relations between a word and its contextual words is built [22]. CBOW learns word representations that maximize the classification of the current word based on the context words in the same sentence, while SG learns word representations that maximize the classification of the surrounding words based on the current word in the same sentence.
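A minimal sketch of training both Word2vec variants (using the gensim library, assuming gensim ≥ 4 where the parameter is vector_size; the toy bug-report corpus is hypothetical, not from the article):

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["fix", "crash", "in", "parser"],
    ["parser", "crash", "on", "empty", "input"],
    ["update", "documentation", "for", "parser"],
]

# sg=0 selects CBOW (predict the current word from its context);
# sg=1 selects Skip-Gram (predict the context words from the current word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Words appearing in similar contexts end up close in the vector space.
print(skipgram.wv.most_similar("crash", topn=2))
```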
An improved Urdu stemming algorithm for text mining based on multi-step hybrid approach
Published in Journal of Experimental & Theoretical Artificial Intelligence, 2018
Abdul Jabbar, Sajid Iqbal, Adnan Akhunzada, Qaisar Abbas
According to recent studies, NLP applications utilize the bag-of-words model, which breaks the input text stream into uni-grams known as features. Each word is defined as a feature in the bag-of-words model. If a word has multiple morphological forms, it contributes multiple features. However, if each inflectional or derivational form is reduced to its stem, the number of features is minimized and results are obtained with little computation. The benefits of stemming are thus multidimensional, chief among them the reduction of features. Such feature reduction also benefits the development of other types of algorithms, such as feature optimization through principal component analysis (PCA), rule-based approaches (which work well with small feature sets), case-based reasoning, and other approaches besides machine learning approaches, which are best suited to large numbers of features.
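As a hedged illustration of how stemming shrinks the bag-of-words feature set (shown here for English with NLTK's Porter stemmer, not the authors' multi-step hybrid Urdu algorithm):

```python
from nltk.stem import PorterStemmer

tokens = ["connect", "connected", "connecting", "connection", "connections"]

stemmer = PorterStemmer()
stems = {stemmer.stem(t) for t in tokens}

# Five inflectional/derivational surface forms collapse to a single feature.
print(len(tokens), "surface forms ->", len(stems), "feature:", stems)
```

A morphologically rich language such as Urdu benefits even more, since each lemma can surface in many more inflected forms than in English.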
Automatic text classification using BPLion-neural network and semantic word processing
Published in The Imaging Science Journal, 2018
Nihar M. Ranjan, Rajesh S. Prasad
A viable option for retrieving information is text mining [3]. It is considered an active research topic by both the research community and the business world [4]. Basically, the process of text mining involves the extraction of relevant information from unstructured data (text) [5]. Text categorization falls under text mining; it labels documents with a pre-defined set of topics [6]. The bag-of-words model (keywords) is used for text categorization, where the words in the text documents are treated as independent features for categorization. This has resulted in many inadequacies, such as ignorance of relationships between words and the dimensionality problem [7].
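A minimal sketch of bag-of-words text categorization against a pre-defined topic set (a scikit-learn illustration with hypothetical documents and labels, not the BPLion-neural-network approach of the article):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

documents = [
    "stock prices fell amid market uncertainty",
    "the team won the championship final",
    "central bank raises interest rates",
    "star striker scores twice in derby",
]
topics = ["finance", "sports", "finance", "sports"]  # pre-defined topic labels

# Each word is an independent bag-of-words feature for the classifier,
# which is exactly the independence assumption criticized above.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(documents, topics)

print(classifier.predict(["bank cuts rates again"]))  # expected: ['finance']
```

Because the vectorizer assigns one dimension per vocabulary word, the feature space grows with the vocabulary, which is the dimensionality problem the excerpt refers to.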