Information Retrieval Methods for Big Data Analytics on Text
Published in Mohiuddin Ahmed, Al-Sakib Khan Pathan, Data Analytics, 2018
Abhay Kumar Bhadani, Ankur Narang
Latent semantic analysis (LSA), also known as latent semantic indexing (LSI), analyzes documents so that their underlying meaning or concepts can be uncovered [14]. LSA was proposed in [15] for natural language processing (NLP) tasks. Ideally, each word would have a single meaning and concept, in which case LSA would only need to describe each concept by one word; however, this is not the case. In almost all languages, a single word carries different meanings when used in different contexts. This creates a certain level of ambiguity that obscures the concepts and makes it hard to understand the intended meaning. The English language has an estimated 13 million tokens, and many of them are not completely unrelated, for example, hotel and motel. Moreover, some of them convey different meanings depending on the context of usage. For example, the word “bank” used together with the name of a river means the land beside the river, whereas if it is used with respect to loans, credit cards, exchange rates, and mortgages, it probably means a financial institution. Now, we explain some of the basic differences between frequently used terms, such as latent semantic analysis (LSA), latent semantic indexing (LSI), and singular value decomposition (SVD).
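To anchor the comparison the excerpt promises, note that LSA and LSI both refer to applying a truncated SVD to a term-document matrix. A standard way to write the decomposition (notation ours, not the chapter's) is

$$X \approx X_k = U_k \Sigma_k V_k^{\top},$$

where $X$ is the $V \times D$ term-document matrix, $U_k$ and $V_k$ contain the first $k$ left and right singular vectors, and $\Sigma_k$ is the diagonal matrix of the $k$ largest singular values. Rows of $U_k \Sigma_k$ then serve as $k$-dimensional word representations, and rows of $V_k \Sigma_k$ as document representations.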
Dimension Reduction Techniques
Published in Rashmi Agrawal, Marcin Paprzycki, Neha Gupta, Big Data, IoT, and Machine Learning, 2020
Muhammad Kashif Hanif, Shaeela Ayesha, Ramzan Talib
Latent Semantic Analysis (LSA) is an unsupervised linear mapping designed for text documents. It is based on PCA or SVD computation and is used to eliminate redundant features while preserving the semantic structure of documents in the reduced representation. LSA was developed for information retrieval, especially for settings in which only a few documents from a huge collection are relevant to a given query. LSA is a vector-based technique that has been used to represent and compare the text of a high-dimensional corpus in a lower-dimensional space (Dokun and Celebi 2015).
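As an illustration of this retrieval use case, here is a minimal sketch of SVD-based LSA with scikit-learn; the corpus, the number of components, and the query are illustrative placeholders, not taken from the chapter.

```python
# Minimal LSA sketch: TF-IDF features reduced by truncated SVD, then a
# query ranked against the documents in the reduced semantic space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the bank approved the mortgage loan",
    "the river bank was covered in reeds",
    "exchange rates affect credit card fees",
]

# Build the weighted document-term representation (documents as rows here).
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)            # shape: (n_docs, vocabulary size)

# Truncated SVD yields the reduced semantic space; 2 components is arbitrary.
lsa = TruncatedSVD(n_components=2, random_state=0)
docs_lsa = lsa.fit_transform(X)            # shape: (n_docs, 2)

# Retrieval: project a query into the same space and rank the documents.
query = lsa.transform(tfidf.transform(["mortgage interest"]))
print(cosine_similarity(query, docs_lsa))
```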
Human Performance
Published in Valerie Jane Gawron, Human Performance and Situation Awareness Measures, 2019
General description – Latent Semantic Analysis (LSA) begins with the creation of a word-by-document matrix in which each cell holds the frequency of occurrence of that word in that document. Log-entropy term weighting is then applied. Afterwards, a singular value decomposition is used to identify the significant vectors. LSA does not consider word order or syntax.
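A short sketch of this procedure, assuming the common definition of log-entropy weighting (a local weight of log(1 + tf) times a global entropy weight), might look as follows in NumPy; the counts are illustrative.

```python
# Log-entropy weighting of a word-by-document count matrix, followed by SVD.
import numpy as np

# Rows = words, columns = documents; cells = raw occurrence counts.
counts = np.array([
    [2, 0, 1],
    [0, 3, 1],
    [1, 1, 0],
], dtype=float)

n_docs = counts.shape[1]
local = np.log1p(counts)                      # local weight: log(1 + tf)

gf = counts.sum(axis=1, keepdims=True)        # global frequency of each word
p = np.divide(counts, gf, out=np.zeros_like(counts), where=gf > 0)
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)
entropy = 1.0 + plogp.sum(axis=1, keepdims=True) / np.log(n_docs)

weighted = local * entropy                    # log-entropy weighted matrix

# SVD identifies the significant vectors (latent dimensions).
U, s, Vt = np.linalg.svd(weighted, full_matrices=False)
k = 2                                         # number of retained dimensions
word_vectors = U[:, :k] * s[:k]               # reduced word representations
```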
An Analysis of Neural Word Representations for Wikipedia Articles Classification
Published in Cybernetics and Systems, 2019
Julian Szymański, Nathan Kawalec
Latent Semantic Analysis (LSA) (Dumais 2005) is a method for constructing a low-rank approximation of a term-document matrix using Singular Value Decomposition (SVD). LSA starts by creating a sparse matrix of size V × D, where V is the size of the vocabulary and D is the number of documents in the chosen corpus. Each row corresponds to a unique word, and each column corresponds to a unique document. The cells of the matrix are populated with the counts of the words within the given document, and TF-IDF is typically used to weight the values. After this, SVD is used to reduce the dimension (rank) of the matrix while preserving its similarity structure. One reason for lowering the rank is that the original matrix may be too large for the available computing resources; another is to remove noise, such as anecdotal instances of terms that are deemed irrelevant. As a consequence of the rank lowering, some of the dimensions that depend on more than one term are linearly combined, e.g.: (car), (truck), (flower) → (1.33*car + 0.28*truck), (flower). Hidden representations of words are obtained after the SVD by reading the vector of the corresponding row of the reduced matrix. In our experiments, to compare against representations based on LSA, we use its first 100 components so that it has the same dimensionality as the document representation based on neural embeddings.
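A hedged sketch of the described pipeline with scikit-learn follows; 20 Newsgroups stands in for the Wikipedia corpus used in the article, and min_df is an assumed vocabulary-pruning parameter rather than a setting from the paper.

```python
# V x D TF-IDF term-document matrix reduced to 100 dimensions with SVD;
# each word's hidden representation is its row in the reduced matrix.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# 20 Newsgroups is a stand-in corpus, not the article's Wikipedia data.
documents = fetch_20newsgroups(
    subset="train", remove=("headers", "footers", "quotes")
).data

# The vectorizer yields a D x V matrix; transpose to V x D so that each
# row corresponds to a unique word, as in the description above.
tfidf = TfidfVectorizer(min_df=5)
term_doc = tfidf.fit_transform(documents).T          # sparse, shape (V, D)

# Rank reduction to 100 components, matching the dimensionality of the
# neural document embeddings the article compares against.
svd = TruncatedSVD(n_components=100, random_state=0)
word_vectors = svd.fit_transform(term_doc)           # shape (V, 100)

# Read off a word's representation via its vocabulary index.
car_vector = word_vectors[tfidf.vocabulary_["car"]]
```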