Transformer Modifications
Published in Uday Kamath, Kenneth L. Graham, Wael Emara, Transformers for Machine Learning, 2022
Uday Kamath, Kenneth L. Graham, Wael Emara
Locality-sensitive hashing, or LSH, was introduced in 1998 in [129] as a method of approximate similarity search based on hashing. In formal terms, LSH is based on a family of hash functions $F$ that operate on a collection of items such that, for any two items $x$ and $y$, $\Pr_{h \in F}[h(x) = h(y)] = \mathrm{sim}(x, y)$, where $\mathrm{sim}(x, y) \in [0, 1]$ is a similarity function defined on the collection of items [39]. In other words, you have an LSH method whenever you can hash the items in a dataset such that the collision probability of any two items is much higher when those items are close together than when they are far apart.
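To make the defining property concrete, here is a minimal Python sketch (not from the book) using MinHash, a classic LSH family for which the collision probability of two sets equals their Jaccard similarity; the sets, universe size, and trial count are arbitrary illustrative choices:

```python
# Illustrative sketch: for MinHash, Pr[h(A) = h(B)] equals the Jaccard
# similarity J(A, B) = |A ∩ B| / |A ∪ B| of the two sets.
import random

def minhash(items, perm):
    """Return the minimum rank that the permutation assigns to any member."""
    return min(perm[i] for i in items)

universe = range(100)
A = set(range(0, 60))                              # two overlapping sets
B = set(range(30, 90))
jaccard = len(A & B) / len(A | B)                  # 30 / 90 = 0.333

random.seed(0)
trials = 10_000
collisions = 0
for _ in range(trials):
    ranks = random.sample(range(100), 100)         # a random permutation
    perm = dict(zip(universe, ranks))
    collisions += minhash(A, perm) == minhash(B, perm)

print(f"Jaccard similarity:    {jaccard:.3f}")
print(f"Collision probability: {collisions / trials:.3f}")   # close to Jaccard
```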
Toward Practical Anomaly Detection in Network Big Data
Published in Yulei Wu, Fei Hu, Geyong Min, Albert Y. Zomaya, Big Data and Computational Intelligence in Networking, 2017
Chengqiang Huang, Yulei Wu, Zuo Yuan, Geyong Min
Locality-sensitive hashing (LSH) is a type of hashing that maintains the similarity among data instances even after dimension reduction [21]. Thus, LSH is mainly used by applications in which similarities among high-dimensional data instances need to be measured. As a concrete example, SimHash [22], a well-known LSH method, has been deployed by Google to detect near-duplicate web pages, which typically contain hundreds or thousands of features. LSH is a strong candidate for data approximation and dimension reduction, both of which suggest its potential for anomaly detection in network big data. However, only recently [13, 14] has LSH attracted major research interest in this area.
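The following toy SimHash sketch (an illustrative simplification, not Google's production system; the hash function, 64-bit width, unit weights, and sample documents are all assumptions) shows the core mechanism: each document is reduced to a fixed-size fingerprint, and near-duplicate documents yield fingerprints with a small Hamming distance:

```python
# Toy SimHash: per-token hashes vote on each fingerprint bit, so documents
# that share most tokens end up with nearby fingerprints.
import hashlib

def simhash(tokens, bits=64):
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1      # per-bit vote, weight 1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brown fox leaps over the lazy dog".split()
doc3 = "completely unrelated text about network anomaly detection".split()

print(hamming(simhash(doc1), simhash(doc2)))       # small: near-duplicates
print(hamming(simhash(doc1), simhash(doc3)))       # large: unrelated
```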
O(1) Search by Hashing
Published in Suman Saha, Shailendra Shukla, Advanced Data Structures, 2019
Locality-sensitive hashing (LSH) is a data structure to perform probabilistic dimension reduction of high-dimensional data and efficiently handle similarity searches, thus defeating the “curse of dimensionality.” The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).
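A toy Python sketch of this bucketing idea, assuming cosine similarity and random-hyperplane hash functions (illustrative choices, not prescribed by the excerpt); with K signature bits there are at most 2^K buckets, far fewer than the universe of possible vectors:

```python
# Bucketed LSH: the K-bit signature is the bucket key, and a query only
# needs to scan the items that share its bucket.
from collections import defaultdict
import random

random.seed(1)
DIM, K = 8, 4                                      # vector dim, bits per key
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(K)]

def bucket_key(vec):
    # one bit per hyperplane: which side of the plane the vector lies on
    return tuple(int(sum(p * x for p, x in zip(plane, vec)) >= 0)
                 for plane in planes)

data = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(50)]
buckets = defaultdict(list)
for idx, vec in enumerate(data):
    buckets[bucket_key(vec)].append(idx)           # at most 2**K = 16 buckets

query = data[7]
print("candidate neighbors:", buckets[bucket_key(query)])
```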
Optimizing user profile matching: a text-based approach
Published in International Journal of Computers and Applications, 2023
Youcef Benkhedda, Faical Azouaou
The user text is clustered using the Locality-Sensitive Hashing (LSH) technique to group user document vectors into clusters based on their similarity. LSH works by using a set of mathematical functions to assign data points to different buckets [46]. The aim is to place similar items in the same bucket and dissimilar items in separate buckets. By mapping the original high-dimensional vectors to lower-dimensional hashed vectors, LSH simplifies retrieval by reducing the search space, which leaves a smaller number of candidate neighbors. This is especially useful when dealing with large datasets. LSH generates multiple hyper-planes to create a hash signature for each item. Each hyper-plane has the same dimensionality as the item's embedding, while the resulting signature is generally much smaller than the original embedding, which allows for efficient computation and storage of the signatures. In addition to reducing the search space, LSH also provides a way to balance the trade-off between precision and recall: adjusting the number of hyper-planes used in the hash function controls the numbers of false positives and false negatives. More hyper-planes yield higher precision but lower recall; fewer hyper-planes yield higher recall but lower precision. The choice of the number of hyper-planes is therefore critical to achieving optimal performance for a given application.
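A back-of-the-envelope sketch of that trade-off, using the standard random-hyperplane collision probability (1 − θ/π)^k for two vectors at angle θ under k hyper-planes; the angles 15° and 75° are arbitrary stand-ins for a similar and a dissimilar pair:

```python
# As k grows, collisions for dissimilar pairs vanish quickly (precision up)
# but collisions for similar pairs also drop (recall down).
import math

for k in (2, 8, 16, 32):                           # number of hyper-planes
    p_near = (1 - math.radians(15) / math.pi) ** k   # similar pair, 15 deg
    p_far  = (1 - math.radians(75) / math.pi) ** k   # dissimilar pair, 75 deg
    print(f"k={k:2d}: P(collide | similar)={p_near:.3f}  "
          f"P(collide | dissimilar)={p_far:.6f}")
```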
A review of approaches for topic detection in Twitter
Published in Journal of Experimental & Theoretical Artificial Intelligence, 2021
Zeynab Mottaghinia, Mohammad-Reza Feizi-Derakhshi, Leili Farzinvash, Pedram Salehpour
Petrovic et al. (Petrović et al., 2010) use the streaming First Story Detection (FSD) model to analyse Twitter in real time. The authors presented an algorithm based on locality-sensitive hashing (LSH) (Indyk & Motwani, 1998) to achieve constant processing time. LSH is a randomized technique that reduces the time needed to find the nearest neighbour in vector space while keeping memory use constant. In their experiments, they did not consider replies, retweets, or hashtags. The paper showed that the method overcomes the limitations of traditional approaches and achieves a significant speedup. The results also showed that ranking by the number of users is better than ranking by the number of tweets, and that the amount of spam in the output is reduced by considering the entropy of the messages.
Big Data Retrieval Using Locality-Sensitive Hashing with Document-Based NoSQL Database
Published in IETE Journal of Research, 2021
N.R. Gayathiri, A.M. Natarajan
LSH is a technique for creating hash codes for data points with the property that similar data points receive similar hash codes. Ideally, similar documents or images share common hash codes, so that the codes preserve data-point similarity and can be used for rapid search. Intuitively, data points that are close together in the space have a lower probability of being separated by any given random line than points that are farther apart.
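A small Monte Carlo sketch of that intuition (illustrative, not from the article): for two unit vectors at angle θ, a random line through the origin separates them with probability θ/π, which grows with the angle between them:

```python
# Estimate the probability that a random line (Gaussian normal vector)
# puts two unit vectors on opposite sides, and compare to theta / pi.
import math
import random

def split_by(plane, u, v):
    side = lambda w: plane[0] * w[0] + plane[1] * w[1] >= 0
    return side(u) != side(v)

random.seed(0)
u = (1.0, 0.0)
for deg in (10, 45, 90, 170):
    t = math.radians(deg)
    v = (math.cos(t), math.sin(t))
    planes = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(20_000)]
    rate = sum(split_by(p, u, v) for p in planes) / len(planes)
    print(f"angle={deg:3d} deg  split rate={rate:.3f}  theory={t / math.pi:.3f}")
```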