Explore chapters and articles related to this topic
Basic Approaches of Artificial Intelligence and Machine Learning in Thermal Image Processing
Published in U. Snekhalatha, K. Palani Thanaraj, Kurt Ammer, Artificial Intelligence-Based Infrared Thermal Image Processing and Its Applications, 2023
U. Snekhalatha, K. Palani Thanaraj, Kurt Ammer
ResNet V2 is an ANN that is widely used for image classification and regression. It uses skip connections that jump over some layers; the main purpose of skipping is to avoid the vanishing gradient problem by reusing the activations of a previous layer until the next layer has learned its weights. This residual learning with skip connections is what defines ResNet V2. In ResNet, each convolution is followed by batch normalization and a non-linear activation function. The ResNet V2 network strengthens the learned features, which increases classification accuracy. The pre-trained ResNet V2 is modified using a transfer learning approach, in which the earlier layers are frozen to avoid overfitting and the higher-level portions of the pre-trained network are fine-tuned for the current study. In the modified ResNet V2 network, the filter convolution layer and the inception ResNet layers are kept fixed, and the later layers are fine-tuned by adding a global average pooling (GAP) layer followed by three fully connected layers (FCLs) and a softmax layer.
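A minimal sketch of such a transfer-learning setup, assuming the Keras InceptionResNetV2 backbone and illustrative dense-layer widths and class count (the chapter does not specify these values here), could look like the following:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

num_classes = 2  # assumed number of output classes for illustration

# Pre-trained backbone with its convolution / inception-ResNet blocks frozen.
base = tf.keras.applications.InceptionResNetV2(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3))
base.trainable = False

# Fine-tuned head: GAP layer, three fully connected layers, softmax classifier.
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(512, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```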
Understanding and Building Generative Adversarial Networks
Published in Monika Mangla, Subhash K. Shinde, Vaishali Mehta, Nonita Sharma, Sachi Nandan Mohanty, Handbook of Research on Machine Learning, 2022
The Vanishing Gradient Problem occurs when the gradient updates of a NN model become too small to pass any information during backpropagation, so the model essentially stops learning at a premature stage of training. In GANs, vanishing gradients can occur in different parts of the architecture for different reasons. The discriminator may suffer from the vanishing gradient problem because of the activation functions (AFs) used in its hidden layers: logistic AFs make the model more prone to the problem than AFs such as ReLU, ELU, and Leaky ReLU. The generator, on the other hand, suffers from vanishing gradients when the discriminator becomes too good at distinguishing real data from fake data; the generator's gradients, which are backpropagated through the discriminator (see Generator Loss Function), are then reduced to a negligible magnitude and effectively vanish.
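As an illustrative sketch (not taken from the chapter) of where these gradients flow, the standard discriminator loss and the non-saturating generator loss can be written in TensorFlow as below; the function names and the use of logits are assumptions for the example:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_logits, fake_logits):
    # The discriminator tries to label real samples 1 and generated samples 0.
    return bce(tf.ones_like(real_logits), real_logits) + \
           bce(tf.zeros_like(fake_logits), fake_logits)

def generator_loss(fake_logits):
    # Non-saturating form: the generator tries to make D label fakes as real.
    # If the discriminator becomes near-perfect, fake_logits are pushed far
    # into the negative range and the gradients reaching the generator
    # (backpropagated through the discriminator) shrink towards zero.
    return bce(tf.ones_like(fake_logits), fake_logits)
```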
Handwritten Character Recognition for Palm-Leaf Manuscripts
Published in Sk Md Obaidullah, KC Santosh, Teresa Gonçalves, Nibaran Das, Kaushik Roy, Document Processing Using Machine Learning, 2019
Papangkorn Inkeaw, Jeerayut Chaijaruwanich, Jakramate Bootkrajang
In segmentation-free character recognition, a word image can be fed directly as input to the classifier. A popular recognition model in this approach is the combination of a recurrent neural network with long short-term memory (RNN-LSTM) [60] and a connectionist temporal classification (CTC) layer [61]. RNNs are similar to feed-forward neural networks except that they maintain an internal state (memory) that lets them process sequences of inputs. However, the vanishing gradient problem, in which the gradients of the weight vectors become too small to be useful for learning, is often encountered during training. The LSTM architecture was introduced to deal with the vanishing gradient problem: the LSTM network adds multiplicative gates and additive feedback to the RNN [60]. Recent studies on word recognition using the RNN-LSTM can be found in [62,63].
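A hedged illustration of this kind of architecture (not the specific models of [60-63]): a bidirectional LSTM reads the columns of a word image as a sequence and is trained with a CTC loss. The alphabet size, image width, and per-column feature dimension below are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

num_chars = 80                    # assumed alphabet size
time_steps, feat_dim = 128, 32    # assumed image columns and features per column

inputs = layers.Input(shape=(time_steps, feat_dim))
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(inputs)
logits = layers.Dense(num_chars + 1)(x)   # last index reserved for the CTC blank
model = models.Model(inputs, logits)

def ctc_loss(labels, logits, label_length, logit_length):
    # CTC aligns the per-column predictions with the target character sequence
    # without requiring explicit character segmentation.
    return tf.nn.ctc_loss(labels, logits, label_length, logit_length,
                          logits_time_major=False, blank_index=num_chars)
```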
DNN based approach to classify Covid’19 using convolutional neural network and transfer learning
Published in International Journal of Computers and Applications, 2022
Bhavya Joshi, Akhilesh Kumar Sharma, Narendra Singh Yadav, Shamik Tiwari
The ReLU function eliminates the vanishing gradient problem present in earlier activation functions. It does this by rectifying inputs that are less than zero, setting them to zero. Rectified linear units also allow faster computation, since they require no exponentials or divisions, which speeds up training [31]. A drawback of ReLU, however, is that, compared with functions such as the sigmoid, it overfits easily. The dropout method is used to reduce this overfitting, which improves the performance of deep architectures built with rectified units [32,33]. Thus, dropout layers are added after the dense layers. Finally, a fully connected dense layer is added to produce the output: Covid-19 positive or negative. The softmax activation function is added as the last layer to give the probability of the image belonging to each class. The softmax function can be defined as in Equation (11), $\sigma(z)_j = e^{z_j} / \sum_{k=1}^{K} e^{z_k}$, where $j$ goes from 1 to the number of classes, $K$.
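A minimal sketch of the classification head described above, with assumed feature-vector and layer sizes (the article does not fix them in this excerpt): ReLU-activated dense layers, dropout after each dense layer to curb overfitting, and a final softmax giving the Covid-19 positive/negative probabilities of Equation (11).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(2048,)),            # assumed feature-vector size
    layers.Dense(256, activation="relu"),   # ReLU zeroes out negative inputs
    layers.Dropout(0.5),                    # dropout reduces ReLU overfitting
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),  # K = 2 classes, Equation (11)
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```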
Designing a lightweight 1D convolutional neural network with Bayesian optimization for wheel flat detection using carbody accelerations
Published in International Journal of Rail Transportation, 2021
Dachuan Shi, Yunguang Ye, Marco Gillwald, Markus Hecht
The skip connection in the residual network [50] has proved very effective for deep networks. An empirical study [51] reveals that the skip connection preserves gradient flow by shortcutting the long paths of a deep network rather than by solving the vanishing gradient problem itself: the effective forward path of a deep network is actually much shorter than the designed one. Shortcutting creates many different possible paths during training, so that the entire network can be regarded as an ensemble of many paths rather than a single deep path. Because of this ensemble effect, the skip connection also benefits a normal network for the MFD task, which is not nearly as deep as the networks proposed for computer vision applications (which can have more than 100 layers). To implement the skip connection, the input layer must have the same dimensionality as the last convolution layer before the addition. The left graph in Figure 4 illustrates the standard residual block. Given the size of the input layer, the subsequent convolution layers should contain as many filters as the input has channels (although two convolution layers are shown in the graph for illustration, there could be more). The output of the last convolution layer should have the same size as the input (meaning that the convolutions are executed with stride one and padding), so that it can be added to the input and then transformed by the activation function.
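An illustrative sketch (not the authors' exact architecture) of such a standard 1D residual block: the convolutions keep the input shape by using stride one, 'same' padding, and as many filters as the input has channels, so the input can be added back before the final activation. The signal length and channel count are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block_1d(x, kernel_size=3):
    c = x.shape[-1]                      # number of channels of the input
    y = layers.Conv1D(c, kernel_size, strides=1, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv1D(c, kernel_size, strides=1, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([x, y])             # skip connection: add input to output
    return layers.Activation("relu")(y)

inputs = layers.Input(shape=(256, 16))   # assumed signal length and channels
outputs = residual_block_1d(inputs)
```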
MPC policy learning using DNN for human following control without collision
Published in Advanced Robotics, 2018
N. Hirose, R. Tajima, K. Sukigara
Figure 4 demonstrates that the three- and four-layer neural networks achieve a lower mean square error with more nodes in the hidden layers; in both cases, the error nearly reaches its minimum value at 100 nodes. In addition, the four-layer neural network achieved a smaller mean square error than the three-layer network. In contrast, the five-layer neural network performed worse than the four-layer network with 100, 500, and 1000 nodes in the hidden layers, despite having more degrees of freedom. The vanishing gradient problem is expected to have occurred during training. Pre-training based on an autoencoder can be used to address the vanishing gradient problem. However, the four-layer neural network with 100 nodes already achieves sufficient performance to mimic the original model predictive controller, and a smaller network is preferable for reducing calculation time as long as it achieves the required performance. Therefore, the four-layer neural network with 100 nodes was selected for the experimental and numerical evaluation to decrease the computational load.
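A minimal sketch of the selected network: a fully connected model with 100-node hidden layers trained with a mean-square-error loss to mimic the MPC policy. The state and control dimensions, and the reading of "four-layer" as two hidden layers plus input and output layers, are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

state_dim, control_dim = 10, 2   # assumed input (state) and output (command) sizes

policy = models.Sequential([
    layers.Input(shape=(state_dim,)),
    layers.Dense(100, activation="relu"),   # hidden layer, 100 nodes
    layers.Dense(100, activation="relu"),   # hidden layer, 100 nodes
    layers.Dense(control_dim),              # regression output (control command)
])
policy.compile(optimizer="adam", loss="mse")  # mean square error, as in the text
```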