Unmanned Aerial Vehicle Navigation Using Deep Learning
Published in Fei Hu, Xin-Lin Huang, DongXiu Ou, UAV Swarm Networks, 2020
Yongzhi Yang, Kenneth G. Ricks, Fei Hu
DRQN is the model used for estimating Q-values. The recurrent network can learn temporal dependencies by using information from an arbitrarily long sequence of observations, while the temporal attention weighs each of the recent observations according to its importance in decision making [26]. DRQN with LSTM [34] approximates the Q-value as Q(o_t, h_{t−1}, a_t), where o_t is the current observation, h_{t−1} is the recurrent hidden state from the previous step, and a_t is the action.
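As a rough illustration of this idea, the sketch below (PyTorch) runs an LSTM over a window of recent observations, weighs the per-step outputs with a softmax temporal attention, and maps the attended summary to Q-values. The layer sizes, the observation encoder, and the class name AttentionDRQN are illustrative assumptions, not the architecture from the chapter.

```python
import torch
import torch.nn as nn

class AttentionDRQN(nn.Module):
    """Minimal sketch of a DRQN-style Q-network: an LSTM over recent
    observations plus a temporal-attention layer that weighs each step.
    Sizes and the feature encoder are assumptions for illustration."""

    def __init__(self, obs_dim, n_actions, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.attn_score = nn.Linear(hidden_dim, 1)      # one score per time step
        self.q_head = nn.Linear(hidden_dim, n_actions)  # Q(o_t, h_{t-1}, a) for all a

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, T, obs_dim) -- the window of recent observations
        feats = self.encoder(obs_seq)
        outputs, hidden = self.lstm(feats, hidden)       # (batch, T, hidden_dim)
        # Temporal attention: softmax over the T recent steps.
        weights = torch.softmax(self.attn_score(outputs), dim=1)  # (batch, T, 1)
        context = (weights * outputs).sum(dim=1)         # attention-weighted summary
        return self.q_head(context), hidden              # (batch, n_actions)

# Example: Q-values for a batch of 4 sequences of 8 observations.
q_net = AttentionDRQN(obs_dim=16, n_actions=5)
q_values, _ = q_net(torch.randn(4, 8, 16))
```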
Attention-based 3D convolutional networks
Published in Journal of Experimental & Theoretical Artificial Intelligence, 2023
Enjie Ding, Dawei Xu, Yingfei Zhao, Zhongyu Liu, Yafeng Liu
Given an input, temporal and spatial attention modules are applied to it so that the network can identify when and where the action takes place. Experiments show that the temporal attention module is more effective than the spatial attention module.
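A minimal sketch of what such a temporal attention module can look like is shown below (PyTorch); the 1x1x1 convolutional scoring layer and the spatial averaging are assumptions for illustration, not the paper's exact design. It assigns one softmax weight per frame of a 3D-CNN feature map and reweights the frames accordingly.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Sketch of a temporal-attention block for 3D-CNN features of shape
    (batch, channels, frames, height, width). Scoring and pooling choices
    are illustrative assumptions."""

    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv3d(channels, 1, kernel_size=1)  # per-voxel score map

    def forward(self, x):
        # Collapse the spatial dimensions so each frame gets one importance score.
        s = self.score(x).mean(dim=(3, 4))           # (batch, 1, frames)
        w = torch.softmax(s, dim=2)                  # attention over time
        return x * w.unsqueeze(-1).unsqueeze(-1)     # reweight each frame

# Example: emphasise the frames in which the action happens.
feats = torch.randn(2, 64, 16, 14, 14)
out = TemporalAttention(64)(feats)
```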
Multi-feature fusion refine network for video captioning
Published in Journal of Experimental & Theoretical Artificial Intelligence, 2022
Guan-Hong Wang, Ji-Xiang Du, Hong-Bo Zhang
With the development of deep learning, many sequence learning methods have flourished. Inspired by neural machine translation (NMT), Venugopalan, Huijuan et al. (2015) and Ran et al. (2015) exploited the encoder-decoder framework, in which recurrent neural networks were used to translate videos to sentences. Venugopalan, Rohrbach et al. (2015) adopted an encoder-decoder framework to learn the temporal structure of the sequence of frames and generate sentences, but because they averaged the individual frame features, the temporal dynamics of the video sequences were not well captured. Li Yao et al. (2015) introduced a temporal attention mechanism that automatically selects the most relevant temporal segments of the visual input and can thus capture the temporal dynamics. To leverage more of the temporal information in video, Baraldi et al. (2017) and Haonan Yu et al. (2016) designed hierarchical recurrent neural networks. In addition, Song et al. (2017) proposed a hierarchical Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) with adjusted temporal attention that simultaneously considers both visual information and language context to support video caption generation. Xiang et al. (2018) proposed LSTMs with two multi-faceted attention layers to jointly leverage multiple kinds of visual features and semantic attributes. Gan et al. (2017) proposed a factored LSTM to produce attractive visual captions with a desired style. Pan et al. (2016) proposed a model that simultaneously explores the learning of an LSTM and a visual-semantic embedding. Pan et al. (2017) proposed LSTMs with transferred semantic attributes to incorporate transferred semantic attributes into the video captioning task, and Gan et al. (2017) used semantic concepts detected from images to develop a semantic compositional network for image captioning. In addition, Duan et al. (2018) introduced weakly supervised dense event captioning, which aims at dense event captioning using only the captioning annotations for training. At the same time, Li et al. (2018) integrated descriptiveness regression into a single-shot detection structure to infer the descriptive complexity of each detected proposal. Chen et al. (2018) proposed a plug-and-play PickNet to pick informative frames and reduce unnecessary computation cost in video captioning. Chen et al. (2019) employed convolutions in both the encoder and decoder networks for video captioning to address the vanishing/exploding gradient problem of conventional RNNs. Jiang (2019) exploited pre-fusion and post-fusion to combine multiple features, and then several LSTMs were used to generate the natural language sentences. Pan et al. (2020) introduced an X-Linear attention block that fully employs bilinear pooling to selectively capitalise on visual information or perform multi-modal reasoning.
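For reference, the sketch below (PyTorch) outlines the kind of soft temporal attention used in such encoder-decoder captioners, in the spirit of Li Yao et al. (2015): at each decoding step the decoder state scores every frame feature, and the softmax-weighted average of the frame features is returned as the context for word generation. The additive scoring form, dimensions, and names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SoftTemporalAttention(nn.Module):
    """Minimal sketch of soft temporal attention for video captioning:
    the decoder state selects the most relevant frames at each step.
    Scoring form and sizes are illustrative assumptions."""

    def __init__(self, frame_dim, dec_dim, attn_dim=256):
        super().__init__()
        self.w_frame = nn.Linear(frame_dim, attn_dim)
        self.w_state = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, frame_feats, dec_state):
        # frame_feats: (batch, n_frames, frame_dim); dec_state: (batch, dec_dim)
        scores = self.v(torch.tanh(self.w_frame(frame_feats)
                                   + self.w_state(dec_state).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)           # relevance of each frame
        context = (alpha * frame_feats).sum(dim=1)     # (batch, frame_dim)
        return context, alpha.squeeze(-1)

# Example: attend over 20 frame features at one decoding step.
attn = SoftTemporalAttention(frame_dim=2048, dec_dim=512)
ctx, weights = attn(torch.randn(3, 20, 2048), torch.randn(3, 512))
```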