Predictive modeling, machine learning, and statistical issues
Published in Ruijiang Li, Lei Xing, Sandy Napel, Daniel L. Rubin, Radiomics and Radiogenomics, 2019
Panagiotis Korfiatis, Timothy L. Kline, Zeynettin Akkus, Kenneth Philbrick, Bradley J. Erickson
The key concept of a neural network is that while the initial set of weights is random, the errant outputs produced by these random weights provide feedback that allows the weights to be improved. A popular method for improving the weights in a CNN is stochastic gradient descent (SGD) (Bottou 2011). SGD is simple to implement and fast even on large datasets. An important parameter when training a CNN is the learning rate, which defines the magnitude of the change applied to the weights. If the learning rate is too high, the CNN may never “settle” on a good solution. If it is too low, learning will be slow and may become trapped in a local minimum. Currently, there is no easy way to select the learning rate, so grid search and cross validation are often used for this purpose. Learning rate schedulers have also been proposed (Bengio 2012).
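To make the role of the learning rate and the grid-search idea concrete, here is a minimal Python sketch (not taken from the chapter) of a single SGD weight update and a cross-validated search over candidate rates; the candidate values and the `train_and_score` helper are illustrative assumptions.

```python
import numpy as np

def sgd_step(weights, gradient, learning_rate):
    # One stochastic gradient descent update: the learning rate scales
    # how far the weights move along the negative gradient.
    return weights - learning_rate * gradient

def pick_learning_rate(train_and_score, folds,
                       candidate_rates=(1e-1, 1e-2, 1e-3, 1e-4)):
    # Grid search with cross-validation: train_and_score(train, val, lr)
    # is a stand-in for training the CNN at rate lr and returning a
    # validation score; the rate with the best mean score is kept.
    mean_scores = {lr: np.mean([train_and_score(train, val, lr)
                                for train, val in folds])
                   for lr in candidate_rates}
    return max(mean_scores, key=mean_scores.get)
```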
A Review of the Predictive Modeling Process
Published in Max Kuhn, Kjell Johnson, Feature Engineering and Selection, 2019
The fitting procedure for neural network coefficients can be very numerically challenging. There are usually a large number of coefficients to estimate, and there is a significant risk of finding a local optimum. Here, we use a gradient-based optimization method called RMSProp to fit the model. This is a modern algorithm for finding coefficient values, and there are several model tuning parameters for this procedure:

- The batch size controls how many of the training set data points are randomly exposed to the optimization process at each epoch (i.e., optimization iteration). This has the effect of reducing potential overfitting by providing some randomness to the optimization process. Batch sizes between 10 and 40K were considered.
- The learning rate parameter controls the rate of descent during the parameter estimation iterations; these values were constrained to be between zero and one.
- A decay rate that decreases the learning rate over time (ranging between zero and one).
- The root mean square gradient scaling factor (ρ) controls how much the gradient is normalized by recent values of the squared gradients. Smaller values of this parameter give more emphasis to recent gradients. The range of this parameter was set to be [0.0, 1.0].
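As a rough illustration of how the learning rate, decay rate, and ρ interact, the following NumPy sketch writes out the RMSProp update rule; it is a simplified rendering assumed for illustration, not the book's code, and the default values shown are only placeholders.

```python
import numpy as np

def rmsprop_step(w, grad, cache, learning_rate=0.001, rho=0.9, eps=1e-7):
    # Exponential moving average of squared gradients; smaller rho
    # puts more emphasis on the most recent gradients.
    cache = rho * cache + (1.0 - rho) * grad ** 2
    # The step is the gradient scaled by the learning rate and
    # normalized by the root of the accumulated squared gradients.
    w = w - learning_rate * grad / (np.sqrt(cache) + eps)
    return w, cache

def decayed_learning_rate(initial_rate, decay, epoch):
    # A simple decay schedule that shrinks the learning rate over time.
    return initial_rate / (1.0 + decay * epoch)
```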
Machine learning methods for computational social science
Published in Uwe Engel, Anabel Quan-Haase, Sunny Xun Liu, Lars Lyberg, Handbook of Computational Social Science, Volume 2, 2021
After a model (i.e., a particular organization of neurons) is chosen, additional hyperparameters are critical to the learning process. For instance, the learning rate determines how aggressively the fitting algorithm updates the parameters of the model: a higher learning rate might cause the model to converge more quickly and thus take less time for training, whereas a lower learning rate causes more conservative changes to the model and often slower convergence to better parameters. Commonly, the optimal learning rate decreases as the number of layers in the network increases: values around 0.01–0.1 often work well for traditional neural networks (1–2 layers), whereas values around 0.00001–0.001 often work well for deep neural networks. The number of epochs to use during training determines the maximum number of times that the weights in the network will be updated: more epochs produce more changes, but also take more time to run. For lower learning rates, it is often better to use more epochs due to the slower convergence of the weights. Often more important than the number of epochs is a special count frequently called patience, which determines how many epochs training should continue after a new best performance (i.e., minimal loss) is achieved on the validation set. During these subsequent epochs, the fitting algorithm is trying to find a better set of weights than what was already achieved, and it gives up if better weights are not found within the patience window, so that training does not continue longer than necessary. The value to use for patience depends largely on how much the weights change each epoch, but values of 25–50 epochs are not uncommon, as they strike a balance between optimistically hoping for improvement and not wasting time training more than is beneficial to the neural network.
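A hedged sketch of how these hyperparameters might be wired together in Keras is shown below; the synthetic data, network size, learning rate, epoch budget, and patience value are all illustrative assumptions rather than recommendations from the chapter.

```python
import numpy as np
import tensorflow as tf

# Synthetic data as a stand-in for a real training/validation split.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(800, 20)).astype("float32")
y_train = rng.integers(0, 2, 800).astype("float32")
x_val = rng.normal(size=(200, 20)).astype("float32")
y_val = rng.integers(0, 2, 200).astype("float32")

# A small illustrative network.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# A moderate learning rate paired with a generous epoch budget;
# patience stops training once the validation loss has gone 30 epochs
# without improving, and the best weights seen so far are restored.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy")
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=30, restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=500, callbacks=[early_stop], verbose=0)
```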
An optimisation method for anti-blast performance of corrugated sandwich plate structure based on neural network and sparrow search algorithm
Published in Ships and Offshore Structures, 2023
Wei-Jian Qiu, Kun Liu, Shuai Zong, Tong-qiang Yu, Jia-xia Wang, Zhen-guo Gao
It can be seen from the graph that when the preset accuracy is 10⁻⁴, the traditional BP neural network model and the GA-BP model do not reach the preset accuracy within the specified number of training iterations, which shows that the training efficiency of a BP neural network based on the traditional gradient descent method is low: when the learning rate is small, the model cannot converge quickly. The Adam-BP model and the GA-Adam-BP model converge to the preset accuracy in fewer training iterations because the adaptive Adam algorithm adjusts the learning rate. In addition, compared with the traditional BP neural network model, the GA-BP model converges to a smaller loss value, indicating that the genetic algorithm can help the traditional BP neural network model escape the local-optimum trap and improve the learning ability of the model. Similarly, compared with the Adam-BP model, the GA-Adam-BP model converges faster, which demonstrates the optimisation effect brought by the genetic algorithm: by pre-screening the initial parameters, the network model can start training from relatively good parameters. Therefore, in terms of both training accuracy and convergence speed, the GA-Adam-BP model is the best, which shows that the performance of the BP neural network is effectively improved by the modification.
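For reference, the adaptive behaviour attributed to Adam above comes from its per-parameter moment estimates; the following NumPy sketch of the standard Adam update (with its usual default hyperparameters) is only an illustration and not the authors' GA-Adam-BP implementation.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First and second moment estimates of the gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the early iterations (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # The effective step size adapts per parameter through v_hat,
    # so a single global learning rate is less critical than in
    # plain gradient descent.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```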
HRUNET: Hybrid Residual U - Net for automatic severity prediction of Diabetic Retinopathy
Published in Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 2023
Deva Kumar Salluri, Venkatramaphanikumar Sistla, Venkata Krishna Kishore Kolli
Hyperparameters are adjustable parameters that control how a model learns; they include the number of layers, the activation functions, the learning rate, the number of epochs, and other settings. The hyperparameter configurations utilised in these pre-trained models are shown in Table 2. The authors determined the learning-rate momentum and the regularisation factor experimentally after multiple attempts. For all the pre-trained models and the proposed model, images are resized to 224 × 224, and the models are evaluated with 30 epochs and a batch size of 8 for the detection of DR severity. Different optimisers such as Adam(), Adamax(), NADAM(), RMSprop() and SGD() are used with varying learning rates (0.001, 0.0001 and 0.00001) for model building.
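A minimal sketch of such an optimiser and learning-rate sweep in Keras is given below; the placeholder classifier, the assumed five severity grades, and the commented-out training call stand in for the pre-trained models and data described in the paper.

```python
import tensorflow as tf

# Candidate optimisers and learning rates mentioned in the excerpt.
learning_rates = [1e-3, 1e-4, 1e-5]
optimisers = {
    "Adam": tf.keras.optimizers.Adam,
    "Adamax": tf.keras.optimizers.Adamax,
    "Nadam": tf.keras.optimizers.Nadam,
    "RMSprop": tf.keras.optimizers.RMSprop,
    "SGD": tf.keras.optimizers.SGD,
}

def build_model():
    # Placeholder classifier over 224 x 224 RGB inputs; the published
    # work uses pre-trained networks, which are not reproduced here.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(224, 224, 3)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(5, activation="softmax"),  # assumed 5 DR grades
    ])

for name, opt_cls in optimisers.items():
    for lr in learning_rates:
        model = build_model()
        model.compile(optimizer=opt_cls(learning_rate=lr),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        # model.fit(x_train, y_train, epochs=30, batch_size=8,
        #           validation_data=(x_val, y_val))  # user-supplied data
```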
Gravel road classification based on loose gravel using transfer learning
Published in International Journal of Pavement Engineering, 2022
Nausheen Saeed, Roger G. Nyberg, Moudud Alam
In this study, discriminative learning, also called cyclic learning, is used to determine the optimal learning rate for training the layers of pre-trained CNN models. The learning rate is a hyperparameter that controls the speed at which the model learns: the weight updates are scaled by the learning rate to minimise the loss. A lower learning rate can help avoid missing optimal solutions, but it also means training will take longer to converge. Discriminative fine-tuning was introduced by Jeremy Howard and Sebastian Ruder (Howard and Ruder 2018). Their proposal is that, because different layers in a CNN capture different information, the layers should be fine-tuned to different extents (Howard and Ruder 2018). Instead of the conventional practice of increasing or decreasing a single learning rate, the model layers are grouped. The earlier layers recognise general details such as lines and curves, which are helpful in most tasks; these layers are trained at a lower learning rate so that the model has more time to train on small details. The later layers, by contrast, are more task-specific and less directly useful for a new task such as gravel road condition classification, so they are trained at a higher learning rate. This way, the weights of the earlier layers are changed less than those of the later layers (Mushtaq et al. 2021, F. Zhang et al. 2021). In the first phase, the layers before the newly added classification layer are frozen, since they are already well trained; this keeps their weights unchanged during training. In the second phase, all the layers are unfrozen and trained.
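The following PyTorch sketch illustrates the general idea of discriminative learning rates using parameter groups; the backbone, layer grouping, rate values, and number of classes are assumptions for illustration and do not reproduce the study's exact setup.

```python
import torch
import torchvision

# Illustrative pre-trained backbone with a new classification head
# (the number of road-condition classes here is hypothetical).
model = torchvision.models.resnet34(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 3)

# Phase 1: freeze the pre-trained layers and train only the new head.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("fc")

# Phase 2: unfreeze everything and fine-tune with layer-wise rates,
# smaller for early general-purpose layers and larger for later
# task-specific layers and the new classifier (the stem layers are
# omitted from the groups for brevity).
for p in model.parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam([
    {"params": list(model.layer1.parameters())
             + list(model.layer2.parameters()), "lr": 1e-5},
    {"params": list(model.layer3.parameters())
             + list(model.layer4.parameters()), "lr": 1e-4},
    {"params": model.fc.parameters(), "lr": 1e-3},
])
```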