Drug Side Effect Frequency Mining over a Large Twitter Dataset Using Apache Spark
Published in Saravanan Krishnan, Ramesh Kesavan, B. Surendiran, G. S. Mahalakshmi, Handbook of Artificial Intelligence in Biomedical Engineering, 2021
Dennis Hsu, Melody Moh, Teng-Sheng Moh, Diane Moh
Sentiment analysis, also known as opinion mining, is a popular tool for extracting text features used in machine learning. It can be applied to n-gram features, which are sequences of letters or words in the text. n-Grams have been in use for over two decades: Cavnar and Trenkle first introduced them for text categorization of documents. There are two types of n-grams: word grams and character grams. Word grams convert documents into token count sequences based on the different words in the document, while character grams break the document into sets of n-character sequences. The reasoning behind using character sequences is tolerance of errors in the text, especially misspellings. Cavnar and Trenkle achieved a high accuracy of 80% in categorizing news articles into groups. Character n-grams are especially useful for Twitter, as tweets often contain misspellings as well as acronyms and shorthand words. Our work uses n-grams from unigrams (a single word or letter) up to four-grams (a sequence of four words or letters).
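To make the distinction concrete, here is a minimal Python sketch (ours, not the chapter authors' code) contrasting character and word n-grams; the misspelled example tweet is invented for illustration:

```python
def char_ngrams(text, n):
    # Character n-grams: all substrings of length n
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    # Word n-grams: all sequences of n consecutive tokens
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tweet = "this drug mkes me so sleepy"     # hypothetical misspelled tweet
print(char_ngrams("mkes", 3))             # ['mke', 'kes']
print(word_ngrams(tweet, 2))              # ['this drug', 'drug mkes', ...]
```

Note that the misspelled "mkes" still shares the trigram "kes" with the correctly spelled "makes", which is why character grams tolerate spelling errors that would break word-gram matching.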
Using literature-based discovery in built environment research
Published in Emmanuel Manu, Julius Akotia, Secondary Research Methods in the Built Environment, 2021
Nathan Kibwami, Apollo Tutesigensi
Depending on the procedure used for extracting terms (i.e. manual or automatic), a manageable number and length of terms should be considered. In well-structured online corpora (e.g. MEDLINE), the approximate number of terms to work with can be known in advance (Weeber et al., 2001, p. 551). For a semi-automated process such as the one suggested here, where articles are gathered manually from different databases, only an estimate is possible. For instance, for a literature of 20 articles with an average full-article length of 7,000 words, this would mean working with 140,000 terms. To steer the winnowing process towards precision, an initial working number of terms from each context should be set. The minimum length of terms (i.e. number of characters per term) depends on the desired precision and recall: shorter terms favour recall at the expense of precision. Terms can also be unigrams (i.e. one-word terms), bigrams (i.e. two-word terms), or n-grams generally (Ittipanuvat et al., 2013; Frantzi et al., 1998). For the current approach, unigrams were chosen because of some limitations highlighted later. Recall for unigrams is usually high, since unigrams can occur either on their own or as nested terms (i.e. sub-terms of bigrams or longer n-grams). In Ittipanuvat et al. (2014), unigrams accounted for over three quarters of the total terms extracted.
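A minimal sketch of the kind of unigram extraction described above; the alphabetic tokenizer and the minimum term length of four characters are illustrative assumptions, not the authors' specification:

```python
import re
from collections import Counter

def extract_unigrams(text, min_length=4):
    # Keep only alphabetic tokens of at least min_length characters;
    # a longer minimum raises precision but lowers recall
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if len(t) >= min_length)

# e.g. 20 articles of ~7,000 words each -> roughly 140,000 terms to winnow
articles = ["...full text of each manually gathered article..."]
terms = Counter()
for article in articles:
    terms.update(extract_unigrams(article))
print(terms.most_common(20))  # initial working set of terms per context
```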
Ideation
Published in Walter R. Paczkowski, Deep Data Analytics for New Product Development, 2020
Once the DTM is created, a number of multivariate statistical procedures can be used to extract information from the text data. A common procedure is to extract key words and phrases, and groups of words and phrases, as topics. Conceptually, phrase extraction takes groups of words from the tokenized document, each group limited to a prespecified maximum size. A group of size n is an n-gram. If n = 1, the group is a unigram; for n = 2, it is a bigram; for n = 3, a trigram; and so on. Creating n-grams is tantamount to creating a small window of a prespecified size, placing it over a vector of tokens, and treating all the words inside the window as a phrase. The window is then moved one token to the right, and the words inside the new placement of the window form a new phrase; this continues until the end of the vector of tokens. The phrases are then counted, and a report is created showing the frequency and (sometimes) length of each phrase. This is largely a counting function; see Sarkar [2016] for a discussion.
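The sliding-window procedure just described can be sketched in a few lines of Python (a schematic illustration, not Sarkar's implementation):

```python
from collections import Counter

def phrases(tokens, max_n=3):
    # Slide a window of size n over the token vector for n = 1..max_n,
    # treating the words inside each placement of the window as a phrase
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

doc = "deep data analytics for new product development".split()
counts = Counter(phrases(doc, max_n=3))
for phrase, freq in counts.most_common(5):
    # Report the frequency and (here) length in words of each phrase
    print(f"{phrase!r}: freq={freq}, length={len(phrase.split())}")
```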
Automated Creation of an Intent Model for Conversational Agents
Published in Applied Artificial Intelligence, 2023
Alberto Benayas, Miguel Angel Sicilia, Marçal Mora-Cantallops
In many cases, especially in very short sentences, the intent is determined by the presence of certain lexical units. This consideration is captured by introducing the frequent key n-gram feature V_K. An n-gram is a contiguous sequence of n items from a given sample of text. V_K = {x_1, ..., x_c} represents the n-gram information in the utterance. After removing stop words, the top K n-grams (for n = 1, 2, 3, 4) are chosen, and the occurrence frequency of each n-gram is counted to form a discrete vector V_K. If domain experts are available, they can also define or include a specific set of n-grams to be counted. The length of the resulting vector is thus 4K.
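A minimal sketch of how such a frequent key n-gram feature might be computed; the stop-word list and tokenizer are illustrative assumptions, and we read "top K n-grams for n = 1, 2, 3, 4" as K keys per value of n, giving a 4K-dimensional vector:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "to", "of"}   # illustrative list

def tokenize(utterance):
    return [t for t in utterance.lower().split() if t not in STOP_WORDS]

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_key_ngrams(corpus, k):
    # For each n in 1..4, keep the K most frequent n-grams -> 4K keys
    keys = []
    for n in range(1, 5):
        counts = Counter(g for utt in corpus for g in ngrams(tokenize(utt), n))
        keys += [g for g, _ in counts.most_common(k)]
    return keys

def feature_vector(utterance, keys):
    # V_K: occurrence frequency of each key n-gram in the utterance
    grams = Counter(g for n in range(1, 5) for g in ngrams(tokenize(utterance), n))
    return [grams[g] for g in keys]
```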
Keyphrase Extraction Using Enhanced Word and Document Embedding
Published in IETE Journal of Research, 2022
Fahd Saleh Alotaibi, Saurabh Sharma, Vishal Gupta, Savita Gupta
An n-gram is defined as a sequence of n contiguous items from a given sample of text; the items processed together may be characters, words, or phonetic units. N-gram techniques are mostly categorized as character-based, word-based, and bit-based. Here we assimilate word n-grams to improve unigram embeddings. We next present retrofitting, a more advanced approach that can improve the quality of distributional word vectors by assimilating the linguistic and semantic information of lexicons into them. Compared with other methods, it efficiently enriches the semantic information of word vectors. The improved, semantically enriched word vectors offer two merits: first, it is easy to improve word vectors obtained from different training models; second, it is simpler than approaches that incorporate semantic information during training. We now present the various embedding approaches retrofitted by word n-grams: Skip Gram retrofitted by word n-grams
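For reference, the core retrofitting update (in the style of Faruqui et al.'s post-processing method) can be sketched as follows; the uniform neighbour weighting and the toy n-gram-derived lexicon are our assumptions for illustration, not the authors' exact procedure:

```python
import numpy as np

def retrofit(vectors, lexicon, iterations=10, alpha=1.0):
    # Nudge each word vector toward the mean of its lexicon neighbours
    # while keeping it close to the original distributional vector
    new_vecs = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            nbrs = [n for n in neighbours if n in new_vecs]
            if word not in new_vecs or not nbrs:
                continue
            avg_nbr = sum(new_vecs[n] for n in nbrs) / len(nbrs)
            # Closed-form update: blend original vector and neighbour mean
            new_vecs[word] = (alpha * vectors[word] + avg_nbr) / (alpha + 1.0)
    return new_vecs

# Usage with a hypothetical lexicon linking a word n-gram to its unigrams
vectors = {w: np.random.rand(50) for w in ("new", "york", "new york")}
lexicon = {"new york": ["new", "york"]}
enriched = retrofit(vectors, lexicon)
```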
Hybrid Attention-based Approach for Arabic Paraphrase Detection
Published in Applied Artificial Intelligence, 2021
Given the pre-trained vectors of an input sequence x_1, ..., x_n, the CNN captures invariant contextual features through convolutional, pooling, and fully connected layers. It is used to extract the most descriptive and influential n-grams of different semantic aspects from the text. Given a window size h, a convolution is based on a filter weight w (64 filters are used). It is defined in Equation (2) as the dot product between w and each sequence of h word vectors:

c_i = ReLU(w · x_{i:i+h-1} + b)   (2)

where ReLU is a nonlinear activation function and b is a bias term.
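A minimal PyTorch sketch of this n-gram-extracting convolution (our reconstruction, not the authors' code), assuming 300-dimensional pre-trained vectors, 64 filters, and window sizes 2-4:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NGramCNN(nn.Module):
    def __init__(self, embed_dim=300, num_filters=64, window_sizes=(2, 3, 4)):
        super().__init__()
        # One Conv1d per window size h; each filter computes ReLU(w . x + b)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, kernel_size=h) for h in window_sizes
        )

    def forward(self, x):
        # x: (batch, seq_len, embed_dim) pre-trained word vectors
        x = x.transpose(1, 2)                  # Conv1d wants (batch, channels, len)
        feats = []
        for conv in self.convs:
            c = F.relu(conv(x))                # (batch, num_filters, len - h + 1)
            feats.append(c.max(dim=2).values)  # max-over-time pooling
        return torch.cat(feats, dim=1)         # concatenated n-gram features

out = NGramCNN()(torch.randn(8, 20, 300))      # 8 sentences, 20 tokens each
print(out.shape)                               # torch.Size([8, 192])
```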