Big data text mining in the financial sector
Published in Noura Metawa, Mohamed Elhoseny, Aboul Ella Hassanien, M. Kabir Hassan, Expert Systems in Finance, 2019
Mirjana Pejić Bach, Živko Krstić, Sanja Seljan
The next approach, based on machine learning, would be to create a large dataset of documents that are first classified manually (by humans). From this classification, a machine-learning model can be developed that provides rules for automated classification. The problem can be framed as classification into two classes (positive or negative) or more (e.g., a scale from 1 to 5 for sentiment intensity). Features can be unigrams, bigrams, or a combination of both (Go et al., 2009). A document-term matrix is built from these features; its values can be frequencies such as TF (term frequency) or TF-IDF (term frequency–inverse document frequency), or a binary representation. In our example of big data architectures, a machine-learning model can be applied not only to batch data but also to real-time data in order to perform real-time classification. Accuracy can exceed 80% even with simple algorithms, given proper feature selection and noise removal (Narayanan et al., 2013).
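As a rough illustration of this supervised pipeline, the sketch below builds a TF-IDF document-term matrix over unigram and bigram features and trains a simple classifier. It is a minimal sketch: scikit-learn, the toy documents, and their labels are assumptions, not part of the chapter.

```python
# Minimal sketch of the supervised approach described above, assuming
# scikit-learn; the documents and labels are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["the stock rallied on strong earnings",
        "shares plunged after the profit warning",
        "record revenue lifted investor confidence",
        "the company missed estimates badly"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (manual annotation)

# Unigram + bigram features weighted by TF-IDF; a binary representation
# could be obtained instead with CountVectorizer(binary=True).
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(docs, labels)

# Classify an unseen document (e.g., arriving from a real-time stream).
print(model.predict(["earnings plunged"]))
```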
Text as data
Published in Benjamin S. Baumer, Daniel T. Kaplan, Nicholas J. Horton, Texts in Statistical Science, 2017
Benjamin S. Baumer, Daniel T. Kaplan, Nicholas J. Horton
Another important technique in text mining involves the calculation of a term frequency–inverse document frequency (tf-idf), or document-term, matrix. The term frequency of a term t in a document d is denoted tf(t, d) and is simply equal to the number of times that the term t appears in document d. The inverse document frequency, on the other hand, measures the prevalence of a term across a set of documents D. In particular, idf(t, D) = log(|D| / |{d ∈ D : t ∈ d}|), i.e., the logarithm of the total number of documents divided by the number of documents containing t; the tf-idf score is then the product tf-idf(t, d, D) = tf(t, d) · idf(t, D).
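These definitions are easy to compute directly. The short sketch below, over a hypothetical tokenized corpus, implements tf, idf, and their product exactly as defined above.

```python
# Direct implementation of the tf, idf, and tf-idf definitions above;
# the three tokenized documents are hypothetical.
import math

D = [["the", "students", "studying"],
     ["the", "exam", "results"],
     ["the", "students", "and", "exams"]]

def tf(t, d):
    # Number of times term t appears in document d.
    return d.count(t)

def idf(t, D):
    # log of (number of documents) / (number of documents containing t).
    n_containing = sum(1 for d in D if t in d)
    return math.log(len(D) / n_containing)

def tf_idf(t, d, D):
    return tf(t, d) * idf(t, D)

print(tf_idf("students", D[0], D))  # appears in 2 of 3 documents
print(tf_idf("the", D[0], D))       # appears in all documents: idf = 0
```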
Information Retrieval Methods for Big Data Analytics on Text
Published in Mohiuddin Ahmed, Al-Sakib Khan Pathan, Data Analytics, 2018
Abhay Kumar Bhadani, Ankur Narang
A document-term matrix is a numerical representation of the terms present in a corpus, with each document encoded as a vector. A corpus can contain multiple paragraphs; here, one document represents one paragraph. A paragraph is a sequential collection of terms (or words) whose order conveys meaning. The corpus can be a collection of documents in any language (such as English, French, German, Hindi, or Chinese).
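A minimal sketch of how such a matrix can be assembled by hand, assuming plain Python and a hypothetical three-paragraph corpus: one row per document (paragraph), one column per vocabulary term, and raw counts as entries.

```python
# Building a document-term matrix from scratch; each "document" is one
# paragraph of a hypothetical corpus.
from collections import Counter

corpus = ["terms appear in some order",
          "a paragraph conveys some meaning",
          "a corpus is a collection of documents"]

# Vocabulary: every distinct term in the corpus, in a fixed order.
tokenized = [doc.split() for doc in corpus]
vocabulary = sorted({term for doc in tokenized for term in doc})

# One row per document, one column per term, entries are raw counts.
dtm = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

print(vocabulary)
for row in dtm:
    print(row)
```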
Web service discovery with incorporation of web services clustering
Published in International Journal of Computers and Applications, 2023
Sunita Jalal, Dharmendra Kumar Yadav, Chetan Singh Negi
To speed up the discovery of relevant web services, web services from different domains can be organized into clusters. LDA and k-Medoids are combined to do so: LDA extracts latent topics and their dominant words from a corpus of web service descriptions. A web service description is represented as a collection of words called a bag of words. LDA assumes that each web service description has a probability distribution over different topics, where each topic is in turn a probability distribution over a set of words, i.e. the vocabulary. The LDA model takes as input a document-term matrix, the number of topics, and other parameters such as α and β. The document-term matrix gives the frequency of each term (word) within each document. The LDA model learns the latent document-topic and topic-word distributions using Gibbs sampling.
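A hedged sketch of this pipeline using scikit-learn (an assumption; the paper does not specify an implementation). Note one substitution: scikit-learn's LatentDirichletAllocation fits the model with online variational Bayes rather than the Gibbs sampling named above, and α and β map to its doc_topic_prior and topic_word_prior parameters. The service descriptions are hypothetical.

```python
# LDA over a document-term matrix of web service descriptions.
# scikit-learn uses variational inference, not Gibbs sampling.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

descriptions = ["weather forecast temperature service",
                "currency exchange rate conversion service",
                "city weather humidity and wind service",
                "convert currency amounts between rates"]

# Document-term matrix of raw term frequencies.
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(descriptions)

lda = LatentDirichletAllocation(n_components=2,       # number of topics
                                doc_topic_prior=0.1,   # alpha
                                topic_word_prior=0.01, # beta
                                random_state=0)
doc_topic = lda.fit_transform(dtm)  # document-topic distributions
topic_word = lda.components_        # unnormalized topic-word weights
print(doc_topic.round(2))
```

The resulting document-topic distributions could then feed a clustering step such as k-Medoids.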
Analysis using natural language processing of feedback data from two mathematics support centres
Published in International Journal of Mathematical Education in Science and Technology, 2019
Anthony Cronin, Gizem Intepe, Donald Shearman, Alison Sneyd
Once documents have been cleaned and preprocessed, a standard way of representing them is the bag-of-words model. This model views a document as a bag of words, or collection of the terms it contains, where word order is ignored. For example, ‘the students studying’ and ‘studying the students’ would have the same representation. Document collections can then be represented as a document-term matrix or a term-document matrix. The columns of the document-term matrix are indexed by the corpus vocabulary, and the rows are indexed by the corpus documents; the ij-th entry is the count of the j-th vocabulary term in the i-th document. The term-document matrix is the transpose of the document-term matrix.
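The word-order claim is easy to verify. The sketch below uses scikit-learn's CountVectorizer as one common bag-of-words implementation (an assumption, not the authors' tooling): the two example phrases receive identical rows, and transposing yields the term-document matrix.

```python
# Bag-of-words representation ignores word order: both example phrases
# map to the same row of the document-term matrix.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the students studying", "studying the students"]
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs).toarray()

print(vectorizer.get_feature_names_out())  # vocabulary indexes the columns
print(dtm)    # both rows are identical
print(dtm.T)  # the term-document matrix is the transpose
```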
Review and Implementation of Topic Modeling in Hindi
Published in Applied Artificial Intelligence, 2019
Santosh Kumar Ray, Amir Ahmad, Ch. Aswani Kumar
Latent Semantic Analysis (LSA), also referred to as Latent Semantic Indexing (LSI), is a knowledge representation technique that creates a vector-based representation of the content of a text (Dumais et al. 1988; Landauer, Foltz, and Laham 1998). The underlying idea behind LSA is that the aggregate of all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each other. LSA does not require any human-constructed dictionary, grammar, parser, or other tool; its input is only raw text files. Each document in the corpus is represented as a word-count vector of length W, where W is the number of words in the corpus dictionary. The dictionary is usually created from the corpus itself. Thus, the corpus can be represented by a matrix, called the document-term matrix, of dimension D × W, where D is the number of documents in the corpus. Each cell of the matrix contains the TF-IDF score of the word in the corresponding document. LSA then applies Singular Value Decomposition (SVD) to this matrix to map documents and terms to a vector space of reduced dimensionality (equal to the number of desired topics), the latent semantic space (Deerwester et al. 1990). This reduced latent semantic space is further used to find similar words and documents with techniques such as cosine similarity. The LSA model has been used to replicate the semantic categorical clustering of words found in certain neuropsychological tests (Laham 1997), in sentence comprehension (Kintsch 1998), in the selection of reviewers for a paper (Dumais and Nielsen 1992), and in research article recommendation (Foltz and Dumais 1992). Demonstrations of some of the applications of LSA are available online.
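A minimal sketch of the LSA steps described above, assuming scikit-learn and a hypothetical four-document corpus (the article itself does not prescribe an implementation): a TF-IDF document-term matrix, truncated SVD down to the desired number of topics, then cosine similarity in the latent space.

```python
# LSA: TF-IDF document-term matrix -> truncated SVD -> cosine similarity.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["cats and dogs are pets",
          "dogs chase cats",
          "stock markets fell sharply",
          "investors sold stocks as markets dropped"]

# D x W matrix of TF-IDF scores.
dtm = TfidfVectorizer().fit_transform(corpus)

# Map documents into a latent semantic space with as many dimensions
# as desired topics (here 2).
svd = TruncatedSVD(n_components=2, random_state=0)
latent = svd.fit_transform(dtm)

# Pairwise document similarity in the reduced space; related documents
# (e.g., the two pet sentences) score close to 1.
print(cosine_similarity(latent).round(2))
```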