Document Clustering: The Next Frontier
Published in Charu C. Aggarwal, Chandan K. Reddy, Data Clustering, 2018
David C. Anastasiu, Andrea Tagarelli, George Karypis
A widely used refinement to the vector space model is to weight each term based on its inverse document frequency (IDF) in the document collection. The motivation behind this weighting is that terms appearing frequently in many documents have limited discrimination power and thus need to be de-emphasized. This is commonly done [91] by multiplying the frequency of the ith term by log(N/df_i), where N is the number of documents in the collection and df_i is the number of documents that contain the ith term (its document frequency). This leads to the tf-idf representation of the document: d_tf-idf = (tf-idf_1, tf-idf_2, …, tf-idf_M).
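The weighting described above can be sketched in a few lines. This is an illustrative implementation, not code from the chapter; the function name and the use of token lists as input are assumptions.

```python
import math

def tf_idf_vectors(docs):
    """Weight raw term counts by log(N / df_i).

    `docs` is a list of token lists (one per document); returns one
    {term: weight} dict per document.
    """
    n_docs = len(docs)
    # df_i: number of documents that contain term i (document frequency)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        # tf_i: raw frequency of term i in this document
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        vectors.append({t: f * math.log(n_docs / df[t]) for t, f in tf.items()})
    return vectors
```

Note that a term occurring in every document gets weight log(N/N) = 0, which is exactly the de-emphasis of non-discriminative terms the weighting is after.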
Introduction
Published in Sugato Basu, Ian Davidson, Kiri L. Wagstaff, Constrained Clustering, 2008
Sugato Basu, Ian Davidson, Kiri L. Wagstaff
In order to represent the documents, we used the vector space model [17]. In the vector space model, it is assumed that each document can be represented as a term vector of the form ā = (a_1, a_2, …, a_n). Each of the terms a_i has a weight w_i associated with it, where w_i denotes the normalized frequency of the word in the vector space. A well-known normalization technique is cosine normalization, in which the weight w_i of term i is computed as follows: w_i = (tf_i · idf_i) / sqrt(Σ_{j=1}^{n} (tf_j · idf_j)²)
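Cosine normalization amounts to dividing each tf·idf weight by the Euclidean norm of the whole vector, so every document vector ends up with unit length. A minimal sketch (the function name is an assumption):

```python
import math

def cosine_normalize(weights):
    """Given a list of tf*idf weights, divide each by the Euclidean
    norm of the vector, yielding a unit-length vector."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else list(weights)
```

After normalization, the dot product of two document vectors directly equals their cosine similarity, which is why this normalization pairs naturally with the vector space model.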
Towards intelligent geospatial data discovery: a machine learning framework for search ranking
Published in International Journal of Digital Earth, 2018
Yongyao Jiang, Yun Li, Chaowei Yang, Fei Hu, Edward M. Armstrong, Thomas Huang, David Moroni, Lewis J. McGibbney, Christopher J. Finch
The relevance score is based on the practical scoring function developed for Lucene (Gormley and Tong 2015). This function borrows concepts from term frequency/inverse document frequency and the vector space model but adds more modern features such as field-length normalization, as described in the Appendix. Term frequency represents how often a term appears in a document – the more often, the higher the weight. Inverse document frequency refers to how often a term appears in all documents in a collection – the more often, the lower the weight. Field-length normalization represents the normalized length of a field – the shorter the field, the higher the weight. The vector space model is a way of comparing a multiterm query against a document by representing both the query and the document as vectors.
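The three ingredients named above can be combined in a simplified per-term sketch. The constants below (square-root tf, the +1 smoothing in the idf, the inverse-square-root length norm) follow Lucene's classic TF/IDF similarity as described in Gormley and Tong's guide; this sketch omits query normalization and coordination factors, and the function name is an assumption.

```python
import math

def practical_score(term_freq, doc_freq, num_docs, field_len):
    """Simplified per-term relevance score in the Lucene style."""
    tf = math.sqrt(term_freq)                        # more often in the doc -> higher weight
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))  # in more docs overall -> lower weight
    norm = 1.0 / math.sqrt(field_len)                # shorter field -> higher weight
    return tf * idf * norm
```

Each factor moves the score in the direction the text describes: tf rises with in-document frequency, idf falls with collection-wide frequency, and the field-length norm rewards shorter fields.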
The Bitwise Hashing Trick for Personalized Search
Published in Applied Artificial Intelligence, 2019
Cosine similarity is popular as a similarity measure in the vector space model for text retrieval (Ida 2008). In vector space text retrieval, the discrimination of syntactic elements of text is commonly used to weight each dimension in the vector space. Syntactic elements include words, phrases, or overlapping N-grams. The weights are often the output of a TF-IDF calculation (term frequency times inverse document frequency).
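Over sparse TF-IDF vectors, cosine similarity is the dot product divided by the product of the vector norms. A minimal sketch using {term: weight} dicts (the representation and function name are assumptions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse weight vectors,
    each given as a {term: weight} dict."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

The result is 1.0 for vectors pointing in the same direction and 0.0 for vectors sharing no terms, which makes it insensitive to document length, one reason for its popularity in retrieval.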