A Computational Linguistic Approach to Modelling the Dynamics of Design Processes
Published in Bo T. Christensen, Linden J. Ball, & Kim Halskov (Eds.), Analysing Design Thinking: Studies of Cross-Cultural Co-Creation, 2017
Joel Chan, Christian D. Schunn
We used the gensim library (Řehůřek & Sojka, 2010) to learn the topic model. This library uses the online LDA algorithm by Hoffman, Bach, and Blei (2010). Since the model iterates on simultaneous estimation of topic content and topic distribution across documents, we ran the algorithm for each parameter setting combination for 5,000 iterations.
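As a rough illustration of this setup, the sketch below trains an online LDA model with gensim's LdaModel (which implements the Hoffman, Bach, and Blei algorithm). The toy documents, topic count, and pass count are illustrative assumptions, not the authors' actual corpus or parameter settings; only the 5,000-iteration bound mirrors the text.

```python
# Minimal sketch: training an online LDA topic model with gensim.
# The documents and most parameter values below are hypothetical.
from gensim import corpora, models

# Assumed pre-tokenized documents (placeholders for the real corpus).
documents = [
    ["design", "process", "iteration", "concept"],
    ["topic", "model", "word", "distribution"],
]

dictionary = corpora.Dictionary(documents)                    # token -> id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]   # bag-of-words vectors

# LdaModel uses the online variational Bayes algorithm of Hoffman, Bach,
# and Blei (2010); `iterations` bounds per-document inference loops.
lda = models.LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=10,       # illustrative value
    iterations=5000,     # matches the 5,000-iteration setting in the text
    passes=1,
)

print(lda.print_topics(num_topics=5, num_words=5))
```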
Learning Bilingual Word Embedding Mappings with Similar Words in Related Languages Using GAN
Published in Applied Artificial Intelligence, 2022
Ghafour Alipour, Jamshid Bagherzadeh Mohasefi, Mohammad-Reza Feizi-Derakhshi
To construct a text corpus from Wikipedia without article markup, punctuation, and links, we use the WikiCorpus tool from gensim, which parses Wikipedia XML dump files into a plain-text corpus in Python. To pre-process the text corpus for the Word2vec model, we convert all corpus text to lowercase and delete all special characters, digits, and extra spaces from the text. After that, we use the Word2vec implementation of the gensim library to provide a monolingual embedding model in each language. As for Word2vec parameters, no lemmatization was done, the window size was set to 5, and the output dimensions were set to 768. We only estimated representation vectors for words that occurred five times or more in the monolingual corpus. Figure 3 shows the learning process for word vectors in each language.
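A hedged sketch of this pipeline is shown below, assuming the gensim 4.x API (WikiCorpus and Word2Vec). The dump filename and worker count are hypothetical; the window size, vector dimensionality, and minimum frequency mirror the values reported above.

```python
# Sketch of the monolingual embedding pipeline (hypothetical dump path;
# window=5, 768 dimensions, and min_count=5 mirror the reported settings).
import re
from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec

DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"  # hypothetical dump file

# WikiCorpus strips article markup, punctuation, and links from the dump;
# passing an empty dictionary skips the (expensive) vocabulary-building step.
wiki = WikiCorpus(DUMP_PATH, dictionary={})


class CleanWikiSentences:
    """Restartable iterable of cleaned token lists (Word2Vec iterates twice)."""

    def __init__(self, wiki_corpus):
        self.wiki_corpus = wiki_corpus

    def __iter__(self):
        for tokens in self.wiki_corpus.get_texts():
            # Lowercase and drop special characters and digits.
            cleaned = [re.sub(r"[^a-z]", "", t.lower()) for t in tokens]
            yield [t for t in cleaned if t]


# Word2vec without lemmatization: window 5, 768-dimensional vectors,
# and words occurring fewer than five times are ignored.
model = Word2Vec(
    sentences=CleanWikiSentences(wiki),
    vector_size=768,
    window=5,
    min_count=5,
    workers=4,  # illustrative value
)
model.save("monolingual_word2vec.model")
```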
Analysing Information Diffusion in Natural Hazards using Retweets - a Case Study of 2018 Winter Storm Diego
Published in Annals of GIS, 2021
Three types of NLP tools were utilized for text mining. First, to recognize informative words from the tweet texts, the Python packages spaCy and NLTK were used (Isaak and Michael 2016; Honnibal and Montani 2017; Paramkusham 2017). Owing to spaCy's strength in syntactic parsing, the disorganized and lengthy tweet texts were tokenized and lemmatized into informative keywords. Second, Gensim, a Python package for LDA (Řehůřek and Sojka 2011), was used for classifying tweets into different topics. LDA is one of the most popular unsupervised soft-clustering methods and has been frequently applied in topic modelling (Blei, Ng, and Jordan 2003; Resch, Usländer, and Havas 2018). As shown in previous studies, LDA can detect topics covering different aspects of a crisis (Imran et al. 2015; Kireyev, Palen, and Anderson 2009). Gensim trains an LDA model that categorizes texts into different topics. The output of Gensim includes each topic's percentage contribution to a tweet, which represents the importance of that topic within the tweet. Third, the lexicon-based Python package VADER (Valence Aware Dictionary for sEntiment Reasoning) was used for sentiment analysis, which can help in understanding public opinion and perceptions towards the storm-related information (Caragea et al. 2014; Zou et al. 2018).
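A minimal sketch of these three steps is given below, using hypothetical tweet texts. The spaCy model name, topic count, and pass count are assumptions; the calls shown (spaCy lemmatization, gensim's per-document topic distribution, and VADER polarity scores) illustrate the kind of output described above, not the study's exact pipeline.

```python
# Sketch of the three NLP steps on hypothetical tweet texts.
import spacy
from gensim import corpora, models
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

tweets = [
    "Winter storm Diego is knocking out power across the Carolinas.",
    "Stay safe everyone, roads are icy and closed this morning!",
]

# 1) spaCy: tokenize and lemmatize tweets into informative keywords.
nlp = spacy.load("en_core_web_sm")  # assumed small English model
docs = [
    [tok.lemma_.lower() for tok in nlp(text)
     if not tok.is_stop and not tok.is_punct and tok.is_alpha]
    for text in tweets
]

# 2) gensim: train an LDA model; get_document_topics returns each tweet's
#    topic distribution (the per-topic percentage contribution).
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
for bow in corpus:
    print(lda.get_document_topics(bow))

# 3) VADER: lexicon-based sentiment scores for each tweet.
analyzer = SentimentIntensityAnalyzer()
for text in tweets:
    print(analyzer.polarity_scores(text))
```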
Natural language processing (NLP) in management research: A literature review
Published in Journal of Management Analytics, 2020
Yue Kang, Zhao Cai, Chee-Wee Tan, Qian Huang, Hefu Liu
Topic modeling mimics the data generation process in that the writer chooses topics to write about and then chooses words to express those topics. Topics are defined as distributions over words that commonly co-occur, so each word has a certain probability of appearing in a topic. A document, then, is described as a probabilistic mixture of topics. The most frequently used tool for topic modeling is latent Dirichlet allocation (LDA), as it can be easily implemented with the help of gensim in Python. The outputs of LDA are the topics and their corresponding keywords. The number of topics and keywords is specified by the user.
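As a toy illustration of that output (the documents and topic count below are hypothetical), gensim's LdaModel returns each topic as a weighted keyword list, with the number of topics and keywords chosen by the user.

```python
# Toy illustration: LDA output is a set of topics, each a weighted keyword list.
from gensim import corpora, models

docs = [
    ["merger", "acquisition", "firm", "value"],
    ["employee", "satisfaction", "turnover", "survey"],
    ["merger", "firm", "shareholder", "value"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# num_topics is user-specified; 2 is an arbitrary example here.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20)

# Print each topic as its top keywords with associated probabilities.
for topic_id, keywords in lda.print_topics(num_topics=2, num_words=4):
    print(topic_id, keywords)
```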