Pre-trained and Application-Specific Transformers
Published in Transformers for Machine Learning, 2022
Uday Kamath, Kenneth L. Graham, Wael Emara
The most obvious application of transformers to image processing is image recognition (also known as image classification). Prior to transformers, the highest quality in image recognition came from convolutional neural networks (CNNs) [150]. A few pure transformer models applied to image recognition are competitive with state-of-the-art CNN models. In this section, we focus on the Vision Transformer (ViT) [78], which was introduced to study how effective pure transformer models could be for computer vision.
Small object detection in UAV image based on improved YOLOv5
Published in Systems Science & Control Engineering, 2023
Jian Zhang, Guoyang Wan, Ming Jiang, Guifu Lu, Xiuwen Tao, Zhiyuan Huang
Recently, the Transformer module has achieved great success in vision, with the Vision Transformer (ViT) (Dosovitskiy et al., 2020) being the first Transformer applied to a visual recognition task, where it achieved good results. In this paper, the last C3 module in the backbone extraction network is replaced with a Transformer combined with C3. Compared with the C3 module of the original network, the improved module strengthens global information acquisition and enriches contextual feature information, which makes it better suited to high-density object detection in UAV aerial images. The Transformer structure is shown in Figure 5 and contains two sub-layers: the first is a multi-headed attention layer and the second is a fully connected layer. Each sub-layer is wrapped in a residual connection.
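The two-sublayer structure described above can be sketched in a few lines. This is a minimal NumPy illustration, assuming a single attention head and omitting layer normalization and multi-head splitting for brevity; it is not the paper's exact module, only the attention-plus-feed-forward pattern with residual connections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # Scaled dot-product self-attention (single head for brevity)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def encoder_block(x, Wq, Wk, Wv, W1, W2):
    # Sublayer 1: self-attention, with a residual connection
    x = x + attention(x, Wq, Wk, Wv)
    # Sublayer 2: position-wise feed-forward (ReLU), with a residual connection
    return x + np.maximum(x @ W1, 0) @ W2

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))                 # 4 tokens, d-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1 = rng.normal(size=(d, 2 * d)) * 0.1
W2 = rng.normal(size=(2 * d, d)) * 0.1
out = encoder_block(x, Wq, Wk, Wv, W1, W2)
print(out.shape)  # (4, 8) — residuals require input and output shapes to match
```

Because both sublayers add their output back to their input, the block preserves the token and embedding dimensions, which is what allows such blocks to be stacked.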
ViT-TB: Ensemble Learning Based ViT Model for Tuberculosis Recognition
Published in Cybernetics and Systems, 2022
Lassaad Ben Ammar, Karim Gasmi, Ibtihel Ben Ltaifa
When it comes to natural language processing (NLP), the Transformer architecture proposed in (Vaswani et al. 2017) is at the cutting edge of new research. The success of self-attention-based deep neural Transformer models in NLP inspired Dosovitskiy et al. in (Yu et al. 2021) to develop the Vision Transformer (ViT) architecture for image classification. Training these models typically involves breaking the input image into constituent patches and then treating each embedded patch as though it were a word in an NLP system. To capture the relationships between patches, these models use self-attention modules.
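The patch-embedding step described above can be sketched as follows. This is an illustrative NumPy version, assuming a toy 32x32 RGB image, 8x8 non-overlapping patches, and a 16-dimensional embedding; in a trained ViT the projection matrix would be learned rather than random:

```python
import numpy as np

def image_to_patch_embeddings(img, patch, W):
    # Cut the image into non-overlapping patch x patch tiles,
    # flatten each tile, and project it to the embedding dimension,
    # so each patch plays the role of a word embedding.
    H, W_img, _ = img.shape
    patches = []
    for i in range(0, H, patch):
        for j in range(0, W_img, patch):
            patches.append(img[i:i + patch, j:j + patch].reshape(-1))
    return np.stack(patches) @ W  # (num_patches, embed_dim)

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))                  # toy RGB image
W = rng.normal(size=(8 * 8 * 3, 16)) * 0.02    # projection (random stand-in)
tokens = image_to_patch_embeddings(img, 8, W)
print(tokens.shape)  # (16, 16): a 4x4 grid of patches, each a 16-dim "word"
```

The resulting sequence of patch tokens is what the self-attention layers then operate on, exactly as they would on a sentence of word embeddings.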
SRDD: a lightweight end-to-end object detection with transformer
Published in Connection Science, 2022
Yuan Zhu, Qingyuan Xia, Wen Jin
The Convolutional Neural Network (CNN) (Krizhevsky et al., 2012) has been the dominant model for vision tasks since 2012, and more efficient structures have been developed in recent years. The Transformer (Vaswani et al., 2017), which achieved great success in Natural Language Processing, has gradually become a new research direction for vision problems (Bai et al., 2021; C. Wang et al., 2020; Yan et al., 2021), under the name ViT (Vision Transformer). Unlike the complex detection structures in mainstream detectors, ViT-based detectors turn object detection into a direct set-prediction problem (Carion et al., 2020). This simplifies the detection pipeline and eliminates many hand-designed components of previous detection algorithms, such as non-maximum suppression (NMS) and anchor boxes, thereby removing much of the computation that was considered hard to parallelise. However, ViT models usually take longer to converge during training and perform relatively poorly on small objects, since they are not multi-scale networks. To address this, X. Zhu et al. (2021) proposed deformable attention, inspired by deformable convolution (Dai et al., 2017). Using deformable attention, Deformable DETR (Detection Transformer) addresses the slow convergence and high complexity of DETR, enables the transformer encoder to use multi-scale features as input, and significantly improves small-object detection. Meanwhile, Zheng et al. (2021) proposed the ACT (Adaptive Clustering Transformer) to reduce the computational complexity of the attention module, Y. Wang et al. (2022) proposed a variation called Row-Column Decoupled Attention (RCDA) to handle the one-region-multiple-objects problem, and Conditional DETR (Meng et al., 2021) speeds up the convergence of DETR by explicitly finding the extremity regions of the object.
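The set-prediction idea above hinges on matching each ground-truth box to exactly one prediction, which is what removes the need for NMS and anchors. The sketch below is an illustrative stand-in: brute-force search over permutations replaces the Hungarian algorithm DETR actually uses (so it only works for tiny numbers of boxes), and the cost is a plain L1 distance between box coordinates rather than DETR's full class-plus-box cost:

```python
import numpy as np
from itertools import permutations

def match(pred_boxes, gt_boxes):
    # Pairwise L1 cost between every prediction and every ground-truth box
    cost = np.abs(pred_boxes[:, None] - gt_boxes[None, :]).sum(-1)  # (P, G)
    best, best_cost = None, np.inf
    # Brute-force one-to-one assignment (stand-in for Hungarian matching)
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        c = sum(cost[p, g] for g, p in enumerate(perm))
        if c < best_cost:
            best, best_cost = perm, c
    return list(best)  # prediction index assigned to each ground-truth box

preds = np.array([[0.1, 0.1, 0.3, 0.3],
                  [0.6, 0.6, 0.9, 0.9],
                  [0.4, 0.1, 0.5, 0.2]])   # 3 predicted boxes
gts = np.array([[0.62, 0.58, 0.88, 0.92],
                [0.12, 0.08, 0.28, 0.33]]) # 2 ground-truth boxes
print(match(preds, gts))  # [1, 0]: each gt box paired with its nearest prediction
```

Unmatched predictions (here, the third box) are trained toward a "no object" class, so duplicate detections are suppressed by the loss itself rather than by a post-processing step.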