Development of the image processing method for estimating axle load by use of AI
Published in Hiroshi Yokota, Dan M. Frangopol, Bridge Maintenance, Safety, Management, Life-Cycle Sustainability and Innovations, 2021
The weights are updated by an "optimization algorithm" based on the back-propagation method; in this study, "Momentum SGD" was used. Momentum SGD is "SGD (Stochastic Gradient Descent)", an optimization method in common use, extended with an inertia (momentum) term. Many choices exist for the "activation function", "loss function", and "optimization algorithm", each with different characteristics. Because each intermediate layer captures detailed features of the relationship between input and output, the network can express complex relationships. The theory itself has existed for a long time, but high accuracy could not be achieved because of limited computer performance. The dramatic improvement in computer performance now makes it possible to increase the number of intermediate layers and the number of neurons in each layer, which is what enables the accuracy of today's deep learning.
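As an illustration of the inertia term described above, the following is a minimal NumPy sketch of a Momentum SGD update; the learning rate, momentum coefficient, and toy quadratic loss are illustrative assumptions, not values from the chapter.

```python
import numpy as np

def momentum_sgd_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One Momentum SGD update: the velocity term carries 'inertia'
    from previous gradients, smoothing the plain SGD step."""
    velocity = momentum * velocity - lr * grad   # accumulate past descent directions
    w = w + velocity                             # apply the combined step
    return w, velocity

# Illustrative usage on a toy quadratic loss L(w) = ||w||^2 / 2, whose gradient is w
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(100):
    w, v = momentum_sgd_step(w, grad=w, velocity=v)
print(w)  # approaches the minimizer at the origin
```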
Opportunities and challenges in radiomics and radiogenomics
Published in Ruijiang Li, Lei Xing, Sandy Napel, Daniel L. Rubin, Radiomics and Radiogenomics, 2019
Ruijiang Li, Yan Wu, Michael Gensheimer, Masoud Badiei Khuzani, Lei Xing
In a convolutional neural network, errors are back-propagated so that the gradient of every parameter can be calculated according to the chain rule.65 Given the gradients, parameters are updated using stochastic gradient descent algorithms. To speed up convergence and reduce the risk of falling into a local minimum, algorithms that adapt the update rule or the learning rate have been proposed, such as Momentum, RMSProp, and Adam. In Momentum, the current update is a linear combination of the gradient and the previous update.80 In RMSProp (root mean square propagation), the learning rate of every parameter is scaled by a running average of recent squared gradients for that weight.81 In Adam (adaptive moment estimation), adaptive learning rates are computed from estimates of the first and second moments of the gradients.82 These optimization methods offer a faster convergence rate than conventional stochastic gradient descent.
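For reference, the following is a minimal NumPy sketch of the per-parameter update rules behind RMSProp and Adam as summarized above; the default hyperparameters (lr, rho, beta1, beta2, eps) are commonly used values assumed here, not taken from the chapter.

```python
import numpy as np

def rmsprop_step(w, g, s, lr=1e-3, rho=0.9, eps=1e-8):
    """RMSProp: scale each parameter's step by a running average of squared gradients."""
    s = rho * s + (1 - rho) * g**2
    w = w - lr * g / (np.sqrt(s) + eps)
    return w, s

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: adaptive learning rates from bias-corrected estimates of the
    first and second moments of the gradients."""
    m = beta1 * m + (1 - beta1) * g          # first moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g**2       # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Illustrative usage on a toy quadratic loss L(w) = ||w||^2 / 2, whose gradient is w
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 201):
    w, m, v = adam_step(w, g=w, m=m, v=v, t=t)
print(w)  # moves toward the origin
```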
Artificial Neural Networks
Published in Harry G. Perros, An Introduction to IoT Analytics, 2021
In the stochastic gradient descent training method, we run the backpropagation algorithm for each data point in the training set D, using the updated weights obtained from the previous data point. That is, we start with a randomly generated initial set of weights and run the algorithm to obtain new values of the weights; then, using the new weights, we run the algorithm for another data point in D. We iterate in this fashion until all the data points in D have been used. The final cost EQ is the sum of all the individual costs. This method, also known as online training, is useful when the training data points are not all available at the beginning but become available one at a time.
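The per-data-point (online) procedure described above can be sketched as follows; the linear model, learning rate, and toy data are illustrative assumptions, not the book's example.

```python
import numpy as np

def online_sgd_epoch(weights, data, lr=0.05):
    """One pass of online (per-data-point) training: the weights updated on
    one data point are carried into the next, and the final cost is the sum
    of the individual costs."""
    total_cost = 0.0
    for x, target in data:                 # use each data point in D once
        pred = weights @ x                 # simple linear model as a stand-in
        error = pred - target
        total_cost += 0.5 * error**2       # individual squared-error cost
        grad = error * x                   # gradient of that cost w.r.t. the weights
        weights = weights - lr * grad      # update before moving to the next data point
    return weights, total_cost

# Toy data: target = 2*x1 - x2 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=50)
w = np.zeros(2)
for _ in range(20):                        # repeated passes, still point by point
    w, cost = online_sgd_epoch(w, zip(X, y))
print(w, cost)                             # w approaches [2, -1]
```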
The role of local steps in local SGD
Published in Optimization Methods and Software, 2023
Tiancheng Qin, S. Rasoul Etesami, César A. Uribe
Stochastic Gradient Descent (SGD) is one of the most commonly used algorithms for parameter optimization of machine learning models. SGD tries to minimize a function $f$ by iteratively updating the parameters as $x_{t+1} = x_t - \eta_t g_t$, where $g_t$ is a stochastic gradient of $f$ at $x_t$ and $\eta_t$ is the learning rate. However, given the massive scale of many modern ML models and datasets, and taking into account data ownership, privacy, fault tolerance, and scalability, distributed training approaches have recently emerged as a suitable alternative to centralized ones, e.g. the parameter server [4], federated learning [7,12,20,25], decentralized stochastic gradient descent [1,10,15,31], decentralized momentum SGD [36], and decentralized ADAM [21], among others [3,17,32].
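The local SGD setting studied in this line of work can be illustrated with a minimal sketch in which each worker takes several local SGD steps on its own data shard before the parameters are averaged; the number of workers, local steps, learning rate, and toy least-squares objective are assumptions for demonstration, not the paper's setup.

```python
import numpy as np

def local_sgd(shards, x0, rounds=50, local_steps=4, lr=0.1):
    """Local SGD sketch: each worker takes `local_steps` SGD steps on its own
    data shard, then all workers average their parameters (one communication
    round). Each shard holds (a, b) pairs defining least-squares terms
    0.5*(a @ x - b)**2, so the stochastic gradient is a*(a @ x - b)."""
    workers = [x0.copy() for _ in shards]
    for _ in range(rounds):
        for k, shard in enumerate(shards):
            x = workers[k]
            for _ in range(local_steps):
                a, b = shard[np.random.randint(len(shard))]  # sample a local data point
                x = x - lr * a * (a @ x - b)                 # local SGD step
            workers[k] = x
        avg = np.mean(workers, axis=0)                       # synchronize by averaging
        workers = [avg.copy() for _ in workers]
    return workers[0]

# Toy setup: 4 workers, each holding its own shard of a shared linear problem
rng = np.random.default_rng(1)
x_true = np.array([1.0, -2.0, 0.5])
shards = [[(a, a @ x_true) for a in rng.normal(size=(20, 3))] for _ in range(4)]
print(local_sgd(shards, x0=np.zeros(3)))                     # approaches x_true
```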
Handwritten MODI Character Recognition Using Transfer Learning with Discriminant Feature Analysis
Published in IETE Journal of Research, 2023
Savitri Chandure, Vandana Inamdar
Result optimization is achieved by hyperparameter tuning during training. The stochastic gradient descent algorithm updates the parameters using a mini-batch-sized subset of the training set. The batch size selected here is 64 and the activation function is ReLU. The learning rate (LR), a positive value in the range 0–1, is a crucial factor for the generalization of the network. Experiments were carried out with increasing learning rates using the retrained network. Figure 3(a) shows the effect of the learning rate on network performance: a small learning rate leads to very slow convergence, while a large one results in an unstable network. Based on the performance curve, the learning rate selected is 0.001. Experimentation then continued by varying the number of epochs for the chosen learning rate and batch size. As shown in Figure 3(b), eight epochs give a better result, so eight epochs are used. Partitioning of the data samples into training and testing sets is also found to play a vital role; the best split found for the given dataset is 80% training and 20% testing.
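A minimal PyTorch-style sketch of a training loop with the reported settings (mini-batch size 64, ReLU activations, learning rate 0.001, eight epochs, 80/20 split) is given below; the stand-in dataset, network shape, and class count are placeholders, not the authors' retrained network.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split

# Placeholder data: the actual MODI character images are not reproduced here.
X = torch.randn(1000, 64)
y = torch.randint(0, 46, (1000,))          # 46 classes is a placeholder count
dataset = TensorDataset(X, y)

# 80% training / 20% testing split, as selected in the passage
n_train = int(0.8 * len(dataset))
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)   # mini-batch size 64

# Simple classifier with ReLU activations (stand-in for the retrained network)
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 46))
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)            # selected learning rate
loss_fn = nn.CrossEntropyLoss()

for epoch in range(8):                                               # eight epochs selected
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
```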
A deep neural network approach for pedestrian trajectory prediction considering flow heterogeneity
Published in Transportmetrica A: Transport Science, 2023
Hossein Nasr Esfahani, Ziqi Song, Keith Christensen
This study used the stochastic gradient descent method as the optimisation algorithm, with L2-norm regularisation to avoid overfitting the training data. The learning rate was set to 1.5. Since the number of data points belonging to individuals without disabilities is about 15 times that of individuals with disabilities, learning for the latter could be undermined. Thus, we randomly oversampled the data points belonging to individuals with disabilities for training purposes (not for evaluation purposes) to match the two groups in quantity. The dropout method was also applied to all fully connected layers to avoid overfitting and learning-slowdown problems (with a rate of 0.5). The mini-batch size and the maximum number of epochs were 100 and 5,000, respectively. Regardless of the number of epochs, the algorithm terminates if the loss function drops below 0.0001 (this value is a lower bound for the MSE calculated on the normalised data). Table 3 summarises the hyperparameter values for the proposed network. A system with an Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz (3,408 MHz, 4 cores, 8 logical processors) and 15.9 GB of physical memory was used to train and test the model. On this system, a DEN training run takes four days on average to complete; nonetheless, trained models can be applied to other datasets in a matter of minutes. Among all the models, DEN has the longest training time and ANN the shortest.
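A minimal PyTorch-style sketch mirroring the reported settings (SGD with L2-norm regularisation expressed as weight decay, dropout 0.5 on fully connected layers, learning rate 1.5, mini-batch size 100, at most 5,000 epochs, termination when the MSE drops below 0.0001) is given below; the placeholder data, network shape, and weight-decay value are assumptions, not the authors' DEN.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors; the pedestrian trajectory features are not reproduced here.
X = torch.randn(2000, 32)
y = torch.randn(2000, 2)

model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(), nn.Dropout(0.5),   # dropout rate 0.5 on fully connected layers
    nn.Linear(64, 2),
)
# L2-norm regularisation via SGD's weight_decay; the coefficient 1e-4 is an assumption.
# Note: the reported learning rate of 1.5 is quoted from the study and may diverge
# on this placeholder random data.
optimizer = torch.optim.SGD(model.parameters(), lr=1.5, weight_decay=1e-4)
loss_fn = nn.MSELoss()
loader = DataLoader(TensorDataset(X, y), batch_size=100, shuffle=True)  # mini-batch size 100

for epoch in range(5000):                           # maximum number of epochs
    epoch_loss = 0.0
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item() * len(xb)
    epoch_loss /= len(X)
    if epoch_loss < 0.0001:                         # early termination on the MSE threshold
        break
```

In PyTorch, passing weight_decay to the optimizer adds the L2 penalty's gradient to every update, which is how the L2-norm regularisation described in the passage is commonly realised.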