Perception
Published in Hanky Sjafrie, Introduction to Self-Driving Vehicle Technology, 2019
One of the most popular feature-based visual localization algorithms is ORB-SLAM2 [45], which works with monocular, stereo, and RGB-D cameras. ORB-SLAM2 uses Oriented FAST and Rotated BRIEF (ORB) features [49], which combine a variant of the FAST keypoint detector with a rotation-aware BRIEF binary descriptor. As shown in Figure 3.1, ORB-SLAM2 runs three main threads in parallel: tracking, local mapping, and loop closure detection. The tracking thread performs ORB feature detection and matches the features against the local map. The local mapping thread manages the local map and performs local bundle adjustment. Finally, the loop closure detection thread detects loop closures to avoid map duplication and corrects any accumulated drift. To detect loop closures or to relocalize, e.g., after a tracking failure, the algorithm maintains a database of ORB features based on Discriminative Bags of Visual Words (DBoW2) [24] in its place recognition module. A bag of visual words is a concept inspired by natural language processing. A visual word is an informative region described by a set of local features, and a visual vocabulary is a collection of visual words, typically generated by clustering the features extracted from a large set of training images. In the bag-of-visual-words model, an image is therefore represented by a histogram of the frequencies of the visual words found in that image, regardless of their spatial arrangement.
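The histogram representation described above can be sketched in a few lines. This is a minimal illustration with synthetic data, not ORB-SLAM2's actual DBoW2 implementation: the "vocabulary" and "descriptors" below are random stand-ins for real ORB output, and nearest-word assignment uses plain Euclidean distance rather than the binary Hamming matching a real system would use.

```python
import numpy as np

# Synthetic stand-ins for a learned vocabulary and one image's ORB descriptors.
rng = np.random.default_rng(0)
vocabulary = rng.normal(size=(5, 32))    # 5 visual words, 32-D each
descriptors = rng.normal(size=(40, 32))  # 40 local features from one image

# Assign each descriptor to its nearest visual word, then count occurrences.
# The spatial layout of the features is discarded entirely.
dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
words = dists.argmin(axis=1)
histogram = np.bincount(words, minlength=len(vocabulary))
```

The resulting `histogram` has one bin per visual word, so every image maps to a vector of the same length regardless of how many features it contains.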
Clustering Multimedia Data
Published in Charu C. Aggarwal, Chandan K. Reddy, Data Clustering, 2018
Shen-Fu Tsai, Guo-Jun Qi, Shiyu Chang, Min-Hsuan Tsai, Thomas S. Huang
Visual words learning, which involves vector quantization, is one of the earliest adaptations of clustering algorithms in multimedia applications. Inspired by the success of the bag-of-words (BoW) model in the text domain, the bag-of-visual-words (BoVW), or bag-of-features (BoF), model was proposed to represent an image with a histogram of visual words. Similar to the BoW model in the text domain, the BoVW model uses a visual vocabulary to map low-level image patches or visual features to a higher-level visual word representation. However, unlike the text domain, where text can be naturally tokenized by punctuation and white space (for certain languages) to obtain the vocabulary, BoVW models involve an additional step, usually referred to as visual words learning or vocabulary learning. Visual words learning employs vector quantization (VQ) or more sophisticated methods (such as sparse coding) to learn a set of visual words as the basis functions, or vocabulary, which is then used in the second stage of the BoVW model to encode the low-level visual feature vectors into a new feature space. The bag-of-visual-words model has a number of advantages, such as the ability to handle partial occlusion.
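The vocabulary-learning stage via vector quantization can be sketched as a few Lloyd iterations of k-means, the simplest VQ method the passage mentions. This is an illustrative toy on synthetic descriptors, assuming a pooled training set and a chosen vocabulary size; a production system would use a tuned k-means (or sparse coding) implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(size=(200, 16))  # descriptors pooled from training images
k = 8                               # chosen vocabulary size

# Initialize cluster centers from random training descriptors.
centers = train[rng.choice(len(train), k, replace=False)].copy()
for _ in range(10):  # a few Lloyd iterations
    # Assign each descriptor to its nearest center, then move each
    # center to the mean of its assigned descriptors.
    assign = np.linalg.norm(train[:, None] - centers[None], axis=2).argmin(1)
    for j in range(k):
        members = train[assign == j]
        if len(members):
            centers[j] = members.mean(axis=0)

vocabulary = centers  # the k cluster centers serve as the visual words
```

The learned `vocabulary` is then fixed and reused in the second, encoding stage for every new image.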
Making Content-Based Medical Image Retrieval Systems Worth for Computer-Aided Diagnosis: From Theory to Application
Published in de Azevedo-Marques Paulo Mazzoncini, Mencattini Arianna, Salmeri Marcello, Rangayyan Rangaraj M., Medical Image Analysis and Informatics: Computer-Aided Diagnosis and Therapy, 2018
Agma Juci Machado Traina, Marcos Vinícius Naves Bedo, Lucio Fernandes Dutra Santos, Luiz Olmes Carvalho, Glauco Vítor Pedrosa, Alceu Ferraz Costa, Caetano Traina Jr.
One of the key issues in dealing with local features is that each image may contain a different number of feature points, which increases the cost of comparing images. A popular strategy to overcome this problem is the Bag-of-Visual-Words (BoVW) representation (Boureau et al. 2010; Jégou et al. 2010). This model encodes each local feature vector as a visual word. Visual words are generated by clustering the local feature vectors detected in a set of training images: each cluster is considered a visual word, and the set of visual words forms a visual dictionary. This representation yields a final feature vector of fixed size, making it easier to compute the similarity between images based on local features. A drawback of BoVW is that different images may have identical histograms of visual words, although several works have addressed this problem by encoding spatial information into the BoVW representation (Lazebnik et al. 2006; Penatti et al. 2014; Tao et al. 2014; Savarese et al. 2006).
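The fixed-size property described above is what makes image comparison cheap: two images with very different numbers of local features still map to histograms of the same length. The sketch below illustrates this with synthetic data and cosine similarity; the dictionary, feature counts, and similarity measure are all illustrative choices, not taken from the cited works.

```python
import numpy as np

rng = np.random.default_rng(2)
dictionary = rng.normal(size=(6, 8))  # 6 visual words, 8-D descriptors

def bovw_histogram(descs, words):
    # Map each descriptor to its nearest word, count, and L1-normalize.
    idx = np.linalg.norm(descs[:, None] - words[None], axis=2).argmin(1)
    h = np.bincount(idx, minlength=len(words)).astype(float)
    return h / h.sum()

img_a = rng.normal(size=(30, 8))  # 30 local features
img_b = rng.normal(size=(55, 8))  # 55 local features -- a different count
ha = bovw_histogram(img_a, dictionary)
hb = bovw_histogram(img_b, dictionary)

# Both histograms have length 6, so they can be compared directly.
similarity = ha @ hb / (np.linalg.norm(ha) * np.linalg.norm(hb))
```

Because the histograms are nonnegative, the cosine similarity falls in [0, 1]; other common choices include chi-square and histogram-intersection distances.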
Scene Recognition by Joint Learning of DNN from Bag of Visual Words and Convolutional DCT Features
Published in Applied Artificial Intelligence, 2021
Abdul Rehman, Summra Saleem, Usman Ghani Khan, Saira Jabeen, M. Omair Shafiq
We use a BOVW model built on SIFT and GIST features, which gives us a feature set containing both local and global representations of the scene. The BOVW model is given the extracted features, and a clustering model is then constructed using k-means; the details can be reviewed in Section 3.1. In the clustering process, we must specify the total number of clusters, which determines the feature size of the input image. In the extraction phase, the feature set of the input image is an array in which each entry gives the occurrence count of a visual word.
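The point that the cluster count fixes the feature size can be demonstrated directly. In this toy sketch (synthetic descriptors standing in for SIFT/GIST output), changing `k` changes only the dimensionality of the resulting histogram, not the encoding procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
descs = rng.normal(size=(120, 12))  # stand-in for one image's descriptors

def encode(descs, vocab):
    # Nearest-word assignment followed by an occurrence count per word.
    idx = np.linalg.norm(descs[:, None] - vocab[None], axis=2).argmin(1)
    return np.bincount(idx, minlength=len(vocab))

for k in (16, 64):
    # Crude "vocabulary": k descriptors sampled as cluster centers
    # (a real pipeline would run k-means here).
    vocab = descs[rng.choice(len(descs), k, replace=False)]
    feat = encode(descs, vocab)
    assert feat.shape == (k,)  # feature size equals the cluster count
```

Choosing `k` is thus a trade-off: a larger vocabulary gives a more discriminative but sparser and more expensive representation.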