Malware classification using neural network
Published in Sangeeta Jadhav, Rahul Desai, Ashwini Sapkal, Application of Communication Computational Intelligence and Learning, 2022
Deeptanshu Singh Rathore, Ashwini Sapkal, Geeta Patil, Rahul Desai, Aparna Joshi
Figure 5.1 is the flow diagram of the system containing all the steps involved in the experiment: feature extraction, model building, classification, and analysis of results.
Feature Extraction: Feature extraction starts with a set of measured data and generates derived values that are meant to be informative and non-redundant, making the learning and generalisation phases easier and, in some cases, improving human interpretability. The experiment considers two types of features for textual classification: count vectors and trigrams.
Model Building: Providing training data to an ML algorithm (that is, the learning algorithm) is the first step in training an ML model. The model artefact produced by the training process is referred to as an ML model. A five-layer convolutional neural network is used for the image-based classification, and a five-layer neural network is used for the textual classification.
Analysis of Results: The same set of samples is used for training and testing in all the approaches, and the results are compared based on their accuracy scores.
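The chapter does not list the exact vectorizer settings; a minimal sketch of the two textual feature types, assuming scikit-learn's CountVectorizer and hypothetical sample strings in place of the real corpus, might look like this:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical malware-report snippets standing in for the real corpus.
docs = [
    "registry key modified by dropper payload",
    "dropper payload contacts remote command server",
]

# Count-vector features: one column per vocabulary term, raw counts.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)

# Trigram features: contiguous three-word sequences form the vocabulary.
trigram_vec = CountVectorizer(ngram_range=(3, 3))
X_trigrams = trigram_vec.fit_transform(docs)

print(count_vec.get_feature_names_out())
print(trigram_vec.get_feature_names_out())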
Automatic Speech Recognition for Large Vocabularies
Published in John Holmes, Wendy Holmes, Speech Synthesis and Recognition, 2002
In order to accommodate cross-word triphone models, the state network for a Viterbi search needs to include multiple entries for each word to cover all possible different triphones that may end any one word. Similarly, a trigram language model can be used by expanding the network to keep multiple copies of each word so that each transition between words has a unique two-word history. To make the network manageable, it is usually represented as a tree structure, as shown in Figure 12.4. In the tree, different hypotheses that start with the same sequence of sub-word models share those models. This tree network is built dynamically as required.
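As a rough illustration of the prefix sharing described here (not the authors' implementation), a lexical tree can be built so that words whose pronunciations begin with the same sub-word models share nodes; the phone labels below are illustrative assumptions:

# Minimal sketch of a lexical prefix tree, assuming each word is given as a
# sequence of sub-word model names (the phone labels are illustrative only).
lexicon = {
    "seven": ["s", "eh", "v", "ah", "n"],
    "second": ["s", "eh", "k", "ah", "n", "d"],
    "six": ["s", "ih", "k", "s"],
}

def build_tree(lexicon):
    root = {}
    for word, models in lexicon.items():
        node = root
        for m in models:
            node = node.setdefault(m, {})  # shared prefix -> shared node
        node["#word"] = word               # word identity resolved at the leaf
    return root

tree = build_tree(lexicon)
# "seven" and "second" share the nodes for "s" and "eh", so hypotheses that
# begin with the same sub-word models are evaluated only once.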
Text mining and topic modeling
Published in Uwe Engel, Anabel Quan-Haase, Sunny Xun Liu, Lars Lyberg, Handbook of Computational Social Science, Volume 2, 2021
Raphael H. Heiberger, Sebastian Munoz-Najar Galvez
The third and last PS is using ngrams to concatenate words appearing next to each other in a text (Jurafsky & Martin, 2000). Bigrams (two-grams) consist of two neighboring words, trigrams of three, and so on. For instance, “United States” is a bigram. Not combining both into “United_States” would lose information, since each word on its own carries a different meaning. Different methods exist to find the most meaningful and frequent ngrams in a text, yet their results are rather similar. We will use the “lambda” method proposed by Blaheta and Johnson (2001) as implemented in quanteda (Benoit et al., 2018).
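The chapter applies the lambda statistic via R's quanteda; as an illustrative Python analogue (a different scoring statistic, not the authors' method), NLTK's likelihood-ratio collocation finder ranks candidate bigrams for merging:

# Illustrative substitute for quanteda's lambda method: likelihood-ratio
# collocation scoring from NLTK, applied to a toy token sequence.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ("the united states and the united kingdom "
          "signed the treaty in the united states").split()

finder = BigramCollocationFinder.from_words(tokens)
measures = BigramAssocMeasures()

# Highest-scoring bigrams are the best candidates to merge, e.g.
# ("united", "states") -> "united_states".
for bigram, score in finder.score_ngrams(measures.likelihood_ratio)[:3]:
    print("_".join(bigram), round(score, 2))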
SentiXGboost: enhanced sentiment analysis in social media posts with ensemble XGBoost classifier
Published in Journal of the Chinese Institute of Engineers, 2021
Roza Hikmat Hama Aziz, Nazife Dimililer
N-grams are sequences of n co-occurring words in a given set of documents. Based on the number of words considered, they are called unigrams (n = 1), bigrams (n = 2), or trigrams (n = 3). To compute n-grams, we use a moving window of n words, as described in Mtetwa, Awukam, and Yousefi (2019). In this work, we use bigrams to represent the samples more accurately.
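The moving-window computation the authors cite can be sketched in a few lines; the whitespace tokenizer and sample sentence below are illustrative assumptions:

def ngrams(tokens, n):
    """Slide a window of n tokens across the sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the service was not good at all".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams, as used in this work
print(ngrams(tokens, 3))  # trigrams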
Improving the service quality of telecommunication companies using online customer and employee review analysis
Published in Quality Management Journal, 2020
Akhouri Amitanand Sinha, Suchithra Rajendran, Roland Paul Nazareth, Wonjae Lee, Shoriat Ullah
The bigram and trigram analysis used in this study is based on the discussion provided in Jurafsky and Martin (2014). A bigram and a trigram are defined as the occurrence of two and three words in a sequence, respectively. In general, the appearance of n words in a series can be referred to as an n-gram.
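For review analysis of this kind, a typical next step is counting which bigrams and trigrams occur most often; a minimal, self-contained sketch with made-up review text follows:

from collections import Counter

# Made-up customer reviews standing in for the real data set.
reviews = [
    "customer service was slow and unhelpful",
    "the customer service team was very helpful",
]

def ngrams(tokens, n):
    return zip(*(tokens[i:] for i in range(n)))

counts = Counter()
for review in reviews:
    tokens = review.split()
    counts.update(ngrams(tokens, 2))  # bigrams
    counts.update(ngrams(tokens, 3))  # trigrams

# Most frequent word sequences across the reviews.
print(counts.most_common(5))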
Mapping near-real-time power outages from social media
Published in International Journal of Digital Earth, 2019
Huina Mao, Gautam Thakur, Kevin Sparks, Jibonananda Sanyal, Budhendra Bhaduri
Table 2 compares classification results across multiple machine learning models, including logistic regression, multinomial and Bernoulli naive Bayes, and linear support vector classification (SVC) (i.e. a linear support vector machine for classification). These classifiers are implemented with the Python scikit-learn library, using the default parameter settings (tuning the parameters did not significantly change classification performance). For each model, we compared different feature sets: (1) features generated by the bag-of-words model (i.e. individual words of the data set) weighted by term frequency and by term-frequency inverse-document-frequency (TFIDF) (Sparck Jones 1972), and (2) features generated by the bag-of-ngrams model weighted by TFIDF.

TFIDF measures the importance of a term $t$ to a document $d$ in a collection of documents $D$, calculated as $\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D)$, where $\mathrm{tf}(t,d)$ is the occurrence frequency of $t$ in $d$ and $\mathrm{idf}(t,D) = \log\bigl(|D| / |\{d \in D : t \in d\}|\bigr)$ is the logarithm of the inverse of the fraction of documents in $D$ that contain $t$. The TFIDF of ngrams is generated by the bag-of-ngrams model, where an ngram is a contiguous sequence of n words. Specifically, we used a combination of unigrams (i.e. single words) and bigrams (i.e. two consecutive words). We also added trigrams (i.e. three words) as features, but performance did not improve.

Results were evaluated in terms of four metrics: precision, recall, F-score, and accuracy, calculated as
$\mathrm{precision} = \frac{tp}{tp + fp}$, $\mathrm{recall} = \frac{tp}{tp + fn}$, $\mathrm{F} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$, $\mathrm{accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$,
where tp represents true positives, tn true negatives, fp false positives, and fn false negatives. The set of tweets labeled as true power outages that are also correctly detected by our method constitutes the true positive data. The set of tweets that are not about power outages and are correctly classified as negative constitutes the true negative data. The set of true power outage tweets missed by our method constitutes the false negatives. The set of false power outage data misclassified as true power outage tweets (i.e. false alarms) constitutes the false positive data. According to the definition in Equation 1, precision measures the fraction of relevant tweets among all the tweets identified as relevant, while recall is the fraction of relevant tweets retrieved out of all the actually relevant tweets. Precision typically decreases as recall increases, and vice versa. The F-score is a weighted average of precision and recall, which we use to compare the performance of the various machine learning classifiers.
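The paper does not print its scikit-learn code; a minimal sketch of one of the compared configurations (TFIDF-weighted unigrams plus bigrams feeding a linear SVC at default settings), with placeholder tweets and labels standing in for the real data, might look like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder tweets and labels; the study uses labeled outage tweets.
tweets = ["power outage in my neighborhood again",
          "great game last night",
          "no electricity since the storm hit",
          "watching a movie tonight"]
labels = [1, 0, 1, 0]  # 1 = power outage, 0 = not

# Bag-of-ngrams (unigrams + bigrams) weighted by TFIDF, then a linear SVC
# with default parameters, as in the paper's setup.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(),
)
model.fit(tweets, labels)
print(model.predict(["huge outage downtown"]))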