Pre-trained and Application-Specific Transformers
Published in Transformers for Machine Learning, 2022
Uday Kamath, Kenneth L. Graham, Wael Emara
Lastly, the Vision Transformer investigates a modification to the self-attention mechanism: axial attention [114, 126]. In axial attention, attention is computed only between patches in the same row or the same column. ViT builds axial transformer blocks, in which a row attention mechanism is followed by a column attention mechanism.
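The following is a minimal sketch of this row-then-column pattern over a grid of patch embeddings, assuming a (batch, height, width, dim) layout; the module names and hyperparameters are illustrative and not taken from the ViT implementation.

```python
# Sketch of an axial transformer block: row attention followed by column attention.
import torch
import torch.nn as nn


class AxialAttentionBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, dim) grid of patch embeddings
        b, h, w, d = x.shape

        # Row attention: each patch attends only to patches in the same row.
        rows = x.reshape(b * h, w, d)
        rows, _ = self.row_attn(rows, rows, rows)
        x = x + rows.reshape(b, h, w, d)  # residual connection

        # Column attention: each patch attends only to patches in the same column.
        cols = x.permute(0, 2, 1, 3).reshape(b * w, h, d)
        cols, _ = self.col_attn(cols, cols, cols)
        x = x + cols.reshape(b, w, h, d).permute(0, 2, 1, 3)
        return x


# Example: a 14x14 grid of 64-dimensional patch embeddings.
block = AxialAttentionBlock(dim=64)
out = block(torch.randn(2, 14, 14, 64))
print(out.shape)  # torch.Size([2, 14, 14, 64])
```

Because each attention step only spans one axis, the cost per block grows with the row or column length rather than with the full number of patches.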
Automatic pancreas anatomical part detection in endoscopic ultrasound videos
Published in Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 2023
Antoine Fleurentin, Jean-Paul Mazellier, Adrien Meyer, Julieta Montanelli, Lee Swanstrom, Benoit Gallix, Leonardo Sosa Valencia, Nicolas Padoy
Our results show that models using temporal context (C2 and C3) obtain significantly better results than models based only on static frames (C1). For example, adding an LSTM (C2) improved the results by an average of 4.7 points, and by up to 10 points (Figure 4), compared to the corresponding C1 configuration. Moreover, the ViT-based backbones generally obtain better results than the CNN-based backbones: on average, the ViT models reach an accuracy of 58.6 and 61.6 in C1 and C2, respectively, while the CNN models reach 53.3 and 57.5. MViT also outperformed all CNN-based video classifier models (C3). We suspect that ViT's intrinsic behaviour may capture more information about the anatomical context than CNNs do.
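As a rough illustration of the C2 idea (a per-frame backbone followed by temporal aggregation), the sketch below feeds pre-extracted per-frame features into an LSTM before classification; the feature dimension, hidden size, number of classes, and module names are assumptions for illustration, not details from the paper's implementation.

```python
# Sketch of frame-feature + LSTM temporal aggregation (the C2 configuration idea),
# assuming per-frame features were already extracted by a ViT or CNN backbone.
import torch
import torch.nn as nn


class TemporalClassifier(nn.Module):
    def __init__(self, feat_dim: int = 768, hidden: int = 256, num_classes: int = 8):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim) backbone features per frame
        out, _ = self.lstm(frame_feats)
        return self.head(out[:, -1])  # classify from the last time step


# Example: a clip of 16 frames with 768-dimensional ViT features per frame.
model = TemporalClassifier()
logits = model(torch.randn(4, 16, 768))
print(logits.shape)  # torch.Size([4, 8])
```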
Towards accurate surgical workflow recognition with convolutional networks and transformers
Published in Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 2021
Bokai Zhang, Julian Abbing, Amer Ghanem, Danyal Fer, Jocelyn Barker, Rami Abukhalil, Varun Kejriwal Goel, Fausto Milletarì
The success of applying transformers in NLP has recently inspired researchers to use them for computer vision problems. For image classification, Vision Transformer (ViT) (Dosovitskiy et al. 2020) has shown better results than several convolutional methods. For video classification, Video Vision Transformer (ViViT) (Arnab et al. 2021) has achieved state-of-the-art results on multiple video classification benchmarks.
MDvT: introducing mobile three-dimensional convolution to a vision transformer for hyperspectral image classification
Published in International Journal of Digital Earth, 2023
Xinyao Zhou, Wenzuo Zhou, Xiaoli Fu, Yichen Hu, Jinlian Liu
To address the limitations of the above model, the HSI classification task is here reconsidered from the perspective of sequence data using a vision transformer (ViT) (Hong et al. 2022), currently one of the most popular methods in the computer vision (CV) field (Dosovitskiy et al. 2020). ViT has achieved success comparable to that of CNNs in the CV field using only the self-attention mechanism. As a new backbone network, ViT excels at modelling long-term dependencies and extracting global features (Bazi et al. 2021). However, ViT lacks some of the inductive biases inherent to CNNs, such as translation equivariance and locality (Yang et al. 2019; Wu et al. 2020), and therefore does not generalize well when trained on insufficient data. The Swin Transformer (Liu et al. 2021) builds on ViT with a hierarchical construction similar to that of CNNs, fusing information across feature maps of different scales and adding shifted windows so that information can be passed between adjacent windows. Tokens-to-Token (T2T) (Yuan et al. 2021) proposes a progressive tokenization module that merges adjacent tokens, which not only models local information but also reduces the length of the token sequence.

The information extraction methods of ViT-based and CNN-based models differ, and many scholars have tried to integrate the advantages of each. The convolutional vision transformer (CvT) removes positional embedding, uses convolutional layers to enlarge the local receptive field, and uses down-sampling operations to reduce the number of model parameters, further improving the performance and robustness of ViT (Wu et al. 2021). The structure of the pyramid vision transformer (PvT) is roughly similar to that of the CvT, in that it controls the patch size through a linear layer to obtain the number of features for the next layer (Wang et al. 2021). The Conformer model designs a two-branch CNN-transformer structure, using a feature coupling unit (FCU) to exchange information between the two branches and obtain two classification results; the outputs are then averaged to make the final prediction (Peng et al. 2021).
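To make the "HSI as sequence data" perspective concrete, the sketch below turns a hyperspectral patch into a token sequence (one token per pixel spectrum, plus a class token and positional embedding) and passes it through a small transformer encoder; the cube size, embedding dimension, and layer names are assumptions for illustration, not the MDvT architecture.

```python
# Sketch of ViT-style tokenization of a hyperspectral patch.
import torch
import torch.nn as nn


class HSIPatchEmbedding(nn.Module):
    def __init__(self, bands: int = 200, spatial: int = 9, dim: int = 64):
        super().__init__()
        # Each pixel's full spectrum is projected to one token embedding.
        self.proj = nn.Linear(bands, dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, spatial * spatial + 1, dim))

    def forward(self, cube: torch.Tensor) -> torch.Tensor:
        # cube: (batch, bands, height, width) hyperspectral patch
        b = cube.shape[0]
        tokens = cube.flatten(2).transpose(1, 2)   # (batch, h*w, bands)
        tokens = self.proj(tokens)                 # (batch, h*w, dim)
        cls = self.cls_token.expand(b, -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed


# Example: a 9x9 spatial patch with 200 spectral bands.
embed = HSIPatchEmbedding()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)
seq = encoder(embed(torch.randn(2, 200, 9, 9)))
print(seq.shape)  # torch.Size([2, 82, 64])
```

Hybrid designs such as CvT or MDvT differ mainly in replacing this purely linear tokenization with convolutional operations that restore locality before the self-attention layers.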