Document Clustering: The Next Frontier
Published in Charu C. Aggarwal, Chandan K. Reddy, Data Clustering, 2018
David C. Anastasiu, Andrea Tagarelli, George Karypis
A widely used refinement to the vector space model is to weight each term based on its inverse document frequency (IDF) in the document collection. The motivation behind this weighting is that terms appearing frequently in many documents have limited discrimination power and thus need to be de-emphasized. This is commonly done [91] by multiplying the frequency of the ith term by log(N/df_i), where N is the number of documents in the collection and df_i is the number of documents that contain the ith term (its document frequency). This leads to the tf-idf representation of the document: d_tf-idf = (tf-idf_1, tf-idf_2, …, tf-idf_M).
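The weighting described above can be sketched in a few lines. This is an illustrative implementation, not code from the chapter; the function name and the use of token lists as input are assumptions.

```python
import math

def tf_idf_vectors(docs):
    """Weight raw term counts by log(N / df_i).

    `docs` is a list of token lists (one per document); returns one
    {term: weight} dict per document.
    """
    n_docs = len(docs)
    # df_i: number of documents that contain term i (document frequency)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        # tf_i: raw frequency of term i in this document
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        vectors.append({t: f * math.log(n_docs / df[t]) for t, f in tf.items()})
    return vectors
```

Note that a term occurring in every document gets weight log(N/N) = 0, which is exactly the de-emphasis of non-discriminative terms the weighting is after.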
Introduction
Published in Sugato Basu, Ian Davidson, Kiri L. Wagstaff, Constrained Clustering, 2008
Sugato Basu, Ian Davidson, Kiri L. Wagstaff
In order to represent the documents, we used the vector space model [17]. In the vector space model, it is assumed that each document can be represented as a term vector of the form ā = (a_1, a_2, …, a_n). Each of the terms a_i has a weight w_i associated with it, where w_i denotes the normalized frequency of the word in the vector space. A well-known normalization technique is cosine normalization, in which the weight w_i of term i is computed as follows: w_i = (tf_i · idf_i) / sqrt(Σ_{j=1}^{n} (tf_j · idf_j)²)
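Cosine normalization amounts to dividing each tf·idf weight by the Euclidean norm of the whole vector, so every document vector ends up with unit length. A minimal sketch (the function name is an assumption):

```python
import math

def cosine_normalize(weights):
    """Given a list of tf*idf weights, divide each by the Euclidean
    norm of the vector, yielding a unit-length vector."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else list(weights)
```

After normalization, the dot product of two document vectors directly equals their cosine similarity, which is why this normalization pairs naturally with the vector space model.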
Towards intelligent geospatial data discovery: a machine learning framework for search ranking
Published in International Journal of Digital Earth, 2018
Yongyao Jiang, Yun Li, Chaowei Yang, Fei Hu, Edward M. Armstrong, Thomas Huang, David Moroni, Lewis J. McGibbney, Christopher J. Finch
The relevance score is based on the practical scoring function developed for Lucene (Gormley and Tong 2015). This function borrows concepts from term frequency/inverse document frequency and the vector space model but adds more modern features such as field-length normalization, as described in the Appendix. Term frequency represents how often a term appears in a document – the more often, the higher the weight. Inverse document frequency refers to how often a term appears in all documents in a collection – the more often, the lower the weight. Field-length normalization represents the normalized length of a field – the shorter the field, the higher the weight. The vector space model is a way of comparing a multiterm query against a document by representing both the query and the document as vectors.
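The three ingredients named above can be combined in a simplified per-term sketch. The constants below (square-root tf, the +1 smoothing in the idf, the inverse-square-root length norm) follow Lucene's classic TF/IDF similarity as described in Gormley and Tong's guide; this sketch omits query normalization and coordination factors, and the function name is an assumption.

```python
import math

def practical_score(term_freq, doc_freq, num_docs, field_len):
    """Simplified per-term relevance score in the Lucene style."""
    tf = math.sqrt(term_freq)                        # more often in the doc -> higher weight
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))  # in more docs overall -> lower weight
    norm = 1.0 / math.sqrt(field_len)                # shorter field -> higher weight
    return tf * idf * norm
```

Each factor moves the score in the direction the text describes: tf rises with in-document frequency, idf falls with collection-wide frequency, and the field-length norm rewards shorter fields.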
The Bitwise Hashing Trick for Personalized Search
Published in Applied Artificial Intelligence, 2019
Cosine similarity is popular as a similarity measure in the vector space model for text retrieval (Ida 2008). In vector space text retrieval, the discrimination of syntactic elements of text is commonly used to weight each dimension in the vector space. Syntactic elements include words, phrases, or overlapping N-grams. The weights are often the output of a TF-IDF calculation (term frequency times inverse document frequency).
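Over sparse TF-IDF vectors, cosine similarity is the dot product divided by the product of the vector norms. A minimal sketch using {term: weight} dicts (the representation and function name are assumptions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse weight vectors,
    each given as a {term: weight} dict."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

The result is 1.0 for vectors pointing in the same direction and 0.0 for vectors sharing no terms, which makes it insensitive to document length, one reason for its popularity in retrieval.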