Reinforcement Learning for Out-of-the-box Parameter Control for Evolutionary and Swarm-based Algorithm
Published in Wellington Pinheiro dos Santos, Juliana Carneiro Gomes, Valter Augusto de Freitas Barbosa, Swarm Intelligence Trends and Applications, 2023
Marcelo Gomes Pereira de Lacerda
Q-Learning-based methods indirectly learn optimal policies by maximizing their state-action value functions, rather than by directly evolving the policy itself. Policy Gradient Methods, on the other hand, directly learn the optimal policy by dynamically adjusting the parameters of the evolving policy (Lapan, 2018). In such methods, π(a|s,θ) = Pr{at = a | st = s, θt = θ} is the probability of taking an action a in a state s when the policy's parameter vector is θ ∈ ℝd′, where d′ is the number of policy parameters. These algorithms learn θ by gradient ascent on some scalar performance measure J(θ) (see Equation 22), thus maximizing the agent's performance (Sutton et al., 1999).
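The gradient-ascent step on J(θ) can be illustrated with a minimal REINFORCE-style sketch. The linear softmax parameterisation and the names used below (softmax_policy, policy_gradient_step, the toy state vector) are assumptions made here for illustration, not part of the chapter:

```python
import numpy as np

# Illustrative only: a linear softmax policy pi(a|s, theta) over discrete actions.
def softmax_policy(theta, state):
    """Return action probabilities pi(.|s, theta) for a feature vector `state`."""
    logits = theta @ state                      # one logit per action
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def policy_gradient_step(theta, state, action, G, alpha=0.01, gamma=1.0, t=0):
    """One REINFORCE update: theta <- theta + alpha * gamma^t * G * grad log pi(a|s,theta)."""
    probs = softmax_policy(theta, state)
    # Gradient of log softmax w.r.t. theta: (one_hot(action) - probs) outer state.
    grad_log_pi = -np.outer(probs, state)
    grad_log_pi[action] += state
    return theta + alpha * (gamma ** t) * G * grad_log_pi

# Toy usage: 3 actions, 4 state features, a fabricated return G.
theta = np.zeros((3, 4))
state = np.array([1.0, 0.5, -0.2, 0.3])
theta = policy_gradient_step(theta, state, action=1, G=2.0)
```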
Dynamic Graphical Games
Published in Magdi S Mahmoud, Multiagent Systems, 2020
Q-Learning is a Reinforcement Learning algorithm [281] that does not need a model of its environment and can be used on-line. Q-learning algorithms operate by estimating the values of state-action pairs. The value Q(s, a) is defined as the expected discounted sum of future payoffs obtained by taking action a from state s and following an optimal policy thereafter. Once these values have been learned, the optimal action from any state is the one with the highest Q-value. After being initialized, Q-values are estimated on the basis of experience as follows: from the current state s, select an action a; this yields an immediate payoff r and leads to a next state s′. Then update Q(s, a) based on this experience: Q(s,a) = (1 − k)Q(s,a) + k(r + γ max_{a′} Q(s′,a′)), where k is the learning rate and 0 < γ < 1 is the discount factor.
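A minimal tabular sketch of this update rule; the dictionary-based Q-table, the state encoding, and the helper name q_update are illustrative assumptions, not part of the cited text:

```python
from collections import defaultdict

# Q-table mapping (state, action) pairs to estimated values, initialised to 0.
Q = defaultdict(float)

def q_update(Q, s, a, r, s_next, actions, k=0.1, gamma=0.9):
    """Apply Q(s,a) = (1 - k)Q(s,a) + k(r + gamma * max_a' Q(s',a'))."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] = (1 - k) * Q[(s, a)] + k * (r + gamma * best_next)

# Toy usage: two actions, one observed transition (s0, a=0, r=1.0, s1).
actions = [0, 1]
q_update(Q, s="s0", a=0, r=1.0, s_next="s1", actions=actions)
```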
Reinforcement Learning
Published in Mark Chang, Artificial Intelligence for Drug Development, Precision Medicine, and Healthcare, 2020
Q-learning (Watkins, 1989, Gosavi, 2003, Chang, 2010) is a form of model-free reinforcement learning. It is a forward-induction and asynchronous form of dynamic programming. The power of reinforcement learning lies in its ability to solve the Markov decision process without computing the transition probabilities that are needed in value and policy iteration. The key algorithm in Q-learning is the recursive formulation for the Q-value:

Qi(s,a) = (1 − αi)Qi−1(s,a) + αi[gi + γVi−1(s′i)]  if s = si and a = ai,
Qi(s,a) = Qi−1(s,a)  otherwise.
DRL-based adaptive signal control for bus priority service under connected vehicle environment
Published in Transportmetrica B: Transport Dynamics, 2023
Xinshao Zhang, Zhaocheng He, Yiting Zhu, Linlin You
The Q-learning algorithm constructs a Q-table over states and actions and then carries out the action with the greatest estimated value (Watkins 1989). The optimal state-action value function Q*(s,a) represents the expected payoff of taking action a in state s and follows Bellman's equation: Q*(s,a) = E[r + γ max_{a′} Q*(s′,a′)]. The parameter γ is the discount factor, which balances the significance of present and future returns. When γ < 1, the estimates tend to converge and Q approaches Q*. The update equation of the Q-value is Q(s,a) ← Q(s,a) + α[r + γ max_{a′} Q(s′,a′) − Q(s,a)], where r is the immediate reward for action a and α is the learning rate. Q-learning is an off-policy temporal-difference control algorithm that requires a behavioral strategy to select the current action, such as the ε-greedy strategy (Hausknecht and Stone 2015).
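The ε-greedy behavioral strategy mentioned above can be sketched as follows; the Q-table layout and the function name epsilon_greedy are illustrative assumptions rather than the paper's implementation:

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """Behavior policy: explore with probability epsilon, otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.randrange(n_actions)                       # explore: uniform random action
    values = [Q.get((state, a), 0.0) for a in range(n_actions)]
    return max(range(n_actions), key=lambda a: values[a])        # exploit: argmax_a Q(s,a)

# Toy usage with a plain dict as the Q-table.
Q = {("s0", 0): 0.2, ("s0", 1): 0.5}
action = epsilon_greedy(Q, "s0", n_actions=2, epsilon=0.1)
```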
Real-Time multi-objective optimization of safety and mobility at signalized intersections
Published in Transportmetrica B: Transport Dynamics, 2023
A Q-table is usually used to store all state-action pairs in the context of the Q-learning method. In the Q-learning technique, the learning process is unsupervised. However, applying RL techniques to stochastic environments may lead to a Q-table with an infinite number of states. As a result, the agent will not be able to visit most of the states, which will degrade the quality of the learning process. To solve this problem, function approximation can be applied by generalizing from previously visited states to states that the agent has never experienced. Two standard methods for such generalization appear in the literature: artificial neural networks and statistical curve fitting (e.g. Sutton and Barto 1998; Abdulhai and Kattan 2003). Artificial neural networks, the most widely used and applied approach, typically involve supervised learning: an extensive set of training examples of inputs and their associated outcomes is required to cover the full range of environmental conditions, and generating such training sets is a challenging task in several cases, even for a domain expert. Therefore, dividing the state variables into ranges and storing them in a simple look-up table can represent the possible states of the environment. This discretization method is used to cover all the possible states of the environment and solve the problem of having an infinite number of states (e.g. Wiering 2000; Abdulhai et al. 2003; Camponogara and Kraus 2003; Shoufeng et al. 2008; Salkham et al. 2008; Balaji et al. 2010; El-Tantawy et al. 2014).
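A minimal sketch of this range-based discretization, assuming a hypothetical intersection state described by queue length and elapsed green time; the bin edges and names below are illustrative, not taken from the cited studies:

```python
import numpy as np

# Illustrative bin edges for two continuous state variables.
QUEUE_BINS = [5, 10, 20, 40]        # vehicles waiting on the approach
GREEN_BINS = [10, 20, 40, 60]       # seconds of elapsed green time

def discretize_state(queue_length, green_time):
    """Map continuous measurements to a discrete (row, col) index for a look-up Q-table."""
    q_idx = int(np.digitize(queue_length, QUEUE_BINS))
    g_idx = int(np.digitize(green_time, GREEN_BINS))
    return q_idx, g_idx

# The Q-table then needs only (len(bins)+1)^2 entries per action instead of a continuous space.
n_actions = 2
Q = np.zeros((len(QUEUE_BINS) + 1, len(GREEN_BINS) + 1, n_actions))

state = discretize_state(queue_length=12.0, green_time=35.0)   # -> (2, 2)
Q[state][0] += 0.1   # update the value of action 0 in that discrete state
```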
A-DQRBRL: attention based deep Q reinforcement battle royale learning model for sports video classification
Published in The Imaging Science Journal, 2023
G. Srilakshmi, I. R. Praveen Joe
Where Q(s,a) denotes the expected value of action a at state s, the policy is represented as π, and the discount factor in Q-learning is denoted as γ; it evaluates the trade-off between the immediate reward and the prediction of future reward. The number of possible actions is reduced during the feature-dropping process, and it is hard to represent the state as the input. The frame evaluation network acquires the feature vector and the dropping features, which capture the relation between the dropping features and the other frames. Initially, the frame evaluation network is represented as a powerful feature and combined with a vector; the network takes these combined features to generate its output. The first-order statistics feature for a frame is evaluated as follows.