Natural Language Processing
Published in Subasish Das, Artificial Intelligence in Highway Safety, 2023
LDA is the most popular topic model used for extracting trends of topics from textual data. A detailed introduction to LDA can be found in Blei et al. (2003); a very short introduction is given here. Suppose there is a group of documents D = {d_1, d_2, ..., d_N}. A particular topic t is a discrete distribution over words with vector ϕ_t. A Dirichlet prior can be placed over Φ = {ϕ_1, ..., ϕ_T}. This prior is assumed to be symmetric with parameter β:

P(\Phi) = \prod_{t} \mathrm{Dir}(\phi_t; \beta) = \prod_{t} \frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}} \prod_{w} \phi_{w|t}^{\beta-1} \, \delta\!\left(\sum_{w} \phi_{w|t} - 1\right)

where W is the size of the vocabulary and δ(·) enforces that each ϕ_t sums to one.
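To make the prior concrete, the following NumPy sketch draws T topic-word distributions from a symmetric Dirichlet; the vocabulary size, topic count, and β value are illustrative assumptions, not values from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

W = 1000    # vocabulary size (illustrative)
T = 20      # number of topics (illustrative)
beta = 0.1  # symmetric Dirichlet hyperparameter (illustrative)

# Draw each topic's word distribution phi_t ~ Dir(beta, ..., beta).
# Each row of Phi is a discrete distribution over the W words.
Phi = rng.dirichlet(np.full(W, beta), size=T)

print(Phi.shape)        # (20, 1000)
print(Phi.sum(axis=1))  # every row sums to ~1.0
```

A small β concentrates each topic's probability mass on relatively few words, which is what makes the resulting topics interpretable.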
Mining Online Public Opinions on Megaprojects
Published in Johan Ninan, Social Media for Project Management, 2022
Zhipeng Zhou, Xingnan Zhou, Lingfei Qian, Haonan Qi
Online public opinion management often involves two main aspects of natural language processing: topic analysis and sentiment analysis. Topic analysis is a machine learning technique for identifying frequent topics in large collections of unstructured data (O'Callaghan et al., 2015). Topics, which are patterns across datasets, are crucial to the description of an issue or a phenomenon. Topic classification and modelling can be utilized to make sense of seemingly uncorrelated data (Zirn and Stuckenschmidt, 2019). Among the algorithms for topic modelling, the most popular is latent Dirichlet allocation (LDA), a three-layer hierarchical Bayesian model of word, topic, and document (Blei et al., 2003). The LDA model is widely used for its clear structure, high efficiency, and accuracy (Chen, 2017). LDA is an unsupervised machine learning technique that can be used to identify hidden topic information in large-scale document collections or corpora.
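As a concrete illustration of LDA-based topic analysis (a minimal sketch, not the authors' pipeline), the following gensim example fits a two-topic model to a toy corpus of tokenized opinion posts; the documents and parameter values are invented for illustration.

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy corpus (invented); in practice these would be tokenized opinion posts.
docs = [
    ["schedule", "delay", "cost", "overrun"],
    ["noise", "dust", "construction", "complaint"],
    ["schedule", "cost", "budget", "delay"],
    ["complaint", "noise", "residents", "dust"],
]

dictionary = corpora.Dictionary(docs)           # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words counts

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               random_state=42, passes=10)

for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```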
A Supervised Guest Satisfaction Classification with Review Text and Ratings
Published in Qurban A. Memon, Shakeel Ahmed Khoja, Data Science, 2019
Himanshu Sharma, A. Aakash, Anu G. Aggarwal
There exist latent dimensional variables that represent a large number of attributes which consumers might not mention explicitly. Techniques are therefore needed to evaluate these collections (termed documents) by sorting, probing, tagging, and searching with computers. With the help of machine learning (ML), researchers have proposed models that find patterns of words in these documents under hierarchical probabilistic models; this is referred to as topic modeling. The rationale behind topic modeling is to identify word-use patterns and how documents exhibiting similar patterns should be connected. Under text analytics, the model makes use of the bag-of-words concept and ignores word ordering [23]. Topic modeling generally relies on four methods: latent semantic analysis (LSA), probabilistic LSA (PLSA), latent Dirichlet allocation (LDA), and the correlated topic model (CTM). LSA, earlier known as latent semantic indexing (LSI), creates vector-based representations of texts to capture semantic content, making use of a predefined dictionary [24]. PLSA automates document indexing based on a statistical model for factor analysis of count data, without referring to a predefined dictionary [25]. LDA is a Bayesian unsupervised technique for topic discovery in abundant documents, without considering any parental distribution [24]. CTM helps discover the topics in a group of documents, underpinned by a logistic normal distribution [26].
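To make the bag-of-words idea and the contrast between LSA and LDA concrete, here is a minimal scikit-learn sketch (an illustration added here, not code from the chapter); the review snippets are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

reviews = [
    "great room clean staff friendly",
    "room dirty staff rude noisy",
    "excellent location friendly staff",
    "noisy room poor location",
]

# Bag-of-words: word order is ignored, only term counts are kept.
X = CountVectorizer().fit_transform(reviews)

# LSA: a deterministic vector-space decomposition of the count matrix.
lsa_docs = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# LDA: a Bayesian generative model; rows are document-topic proportions.
lda_docs = LatentDirichletAllocation(n_components=2,
                                     random_state=0).fit_transform(X)

print(lsa_docs.shape, lda_docs.shape)  # (4, 2) (4, 2)
```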
YouTube as a source of information: early coverage of the COVID-19 pandemic in the context of the construction industry
Published in Construction Management and Economics, 2023
S M Jamil Uddin, Alex Albert, Mahzabin Tamanna, Abdullah Alsharef
After the preprocessing was complete, the widely adopted Python library “gensim” (Rehurek and Sojka 2010) was leveraged to apply the LDA algorithm. The algorithm requires the number of topics as an input for model estimation. The appropriate number of topics was identified iteratively using the coherence score, which distinguishes semantically interpretable topics from topics that are mere artefacts of statistical inference (Stevens et al. 2012). The coherence of a model is measured by the degree of semantic similarity between the words that co-occur in a topic (Mimno et al. 2011). A model is said to be coherent if all or most of the keywords (most frequent words) within a topic are similar and can be interpreted in a context that expresses meaning. Coherence is a popular and widely used measure for determining the appropriate number of topics in topic modelling (AlSumait et al. 2009, Mimno et al. 2011, Evans 2014). A model’s coherence is determined as the sum of pairwise distributional similarity scores over the set of topic words (Stevens et al. 2012). To identify the appropriate number of topics for this study, the coherence score was estimated for models with between one and ten topics, and the model with the highest coherence score was selected. Figure 3 presents the coherence scores corresponding to the varying numbers of topics. Based on the highest coherence score, a total of six topics were selected for the LDA model.
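A minimal sketch of this selection procedure with gensim might look as follows; the helper name, the passes setting, and the choice of the "c_v" coherence variant are assumptions of this sketch, since the paper does not specify its exact configuration.

```python
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

def best_num_topics(texts, max_topics=10):
    """Fit LDA models with 1..max_topics topics; return the most coherent count."""
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    scores = {}
    for k in range(1, max_topics + 1):
        lda = LdaModel(corpus=corpus, id2word=dictionary,
                       num_topics=k, random_state=42, passes=10)
        cm = CoherenceModel(model=lda, texts=texts,
                            dictionary=dictionary, coherence="c_v")
        scores[k] = cm.get_coherence()  # higher is more coherent
    return max(scores, key=scores.get), scores
```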
Web service discovery with incorporation of web services clustering
Published in International Journal of Computers and Applications, 2023
Sunita Jalal, Dharmendra Kumar Yadav, Chetan Singh Negi
LDA is a well-known generative probabilistic topic model for collections of discrete data. A topic model is an effective tool that uses unsupervised learning for extracting topics from text corpora. LDA considers each document as a mixture of different topics, and each topic is characterized by a probability distribution over a collection of words. The LDA model is based on the following joint distribution:

p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)

where z is the set of N topic assignments, w is the set of N words, alpha (α) is the parameter of the Dirichlet distribution for topics in documents, beta (β) is the parameter of the Dirichlet distribution for words in topics, and theta (θ) is the topic mixture. The term p(z_n | θ) gives the distribution of topics for a document, given topic mixture θ. The term p(w_n | z_n, β) is the multinomial probability of word w_n conditioned on topic z_n, and p(θ | α) is the Dirichlet distribution. The parameter α controls θ: a high value of α gives a near-uniform distribution of θ, while a low value gives a sparse distribution of θ. LDA employs unsupervised learning to determine the distribution of topics within each document and the distribution of words within each topic. Please refer to the work done in [15] for details on LDA.
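The effect of α on θ can be seen directly by sampling from the Dirichlet distribution, as in this NumPy sketch with illustrative α values:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5  # number of topics (illustrative)

# High alpha -> theta close to uniform over topics;
# low alpha  -> theta concentrated on one or two topics (sparse).
theta_high = rng.dirichlet(np.full(T, 10.0))
theta_low = rng.dirichlet(np.full(T, 0.1))

print(np.round(theta_high, 3))  # roughly equal proportions
print(np.round(theta_low, 3))   # a few topics dominate
```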
Using topic modeling to infer the emotional state of people living with Parkinson’s disease
Published in Assistive Technology, 2021
Andrew P. Valenti, Meia Chita-Tegmark, Linda Tickle-Degnen, Alexander W. Bock, Matthias J. Scheutz
LDA is built around the intuition that documents exhibit multiple topics (Blei et al., 2003). LDA assumes that only a small set of topics is contained in a document and that these topics use a small set of words frequently. The result is that words are separated according to meaning, and documents can be accurately assigned to topics. LDA is a generative data model, which, as the name implies, describes how the data are generated. The idea is to treat the data as observations arising from a generative probabilistic process that includes hidden variables, which represent the structure we want to find in the data. For our data, the hidden variables represent the thematic structure (i.e., the topics) that we do not have direct access to in our documents. Simply put, a generative model describes how the data are generated, and inference is used to backtrack over the generative model to discover the set of hidden variables that best explains how the data were generated. To express the model as a generative probabilistic process, we start by assuming that a document contains some number of topics and that each topic is a distribution over terms (words) in the vocabulary. Every topic contains a probability for every word in the vocabulary, and each topic is described by a set of words with different probabilities reflecting their membership in the topic. The LDA generative process can be described as follows (a minimal simulation is sketched below):
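As a minimal simulation of this generative process (a sketch with an invented six-word vocabulary and illustrative hyperparameters, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["gene", "dna", "cell", "film", "music", "actor"]  # invented
W, T = len(vocab), 2
alpha, beta = 0.5, 0.5  # illustrative hyperparameters

# 1. For each topic, draw a word distribution phi_t ~ Dir(beta).
Phi = rng.dirichlet(np.full(W, beta), size=T)

def generate_document(n_words=8):
    # 2. For the document, draw a topic mixture theta ~ Dir(alpha).
    theta = rng.dirichlet(np.full(T, alpha))
    words = []
    for _ in range(n_words):
        z = rng.choice(T, p=theta)   # 3a. draw a topic assignment
        w = rng.choice(W, p=Phi[z])  # 3b. draw a word from that topic
        words.append(vocab[w])
    return words

print(generate_document())
```

Inference then runs this process in reverse: given only the observed words, it recovers the hidden Φ and θ that best explain them.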