Model-Free Reinforcement Learning
Published in Chong Li, Meikang Qiu, Reinforcement Learning for Cyber-Physical Systems, 2019
However, in most cases, it is hard for the agent to know the environment before interacting with it. Specifically, the agent has no clue how the environment will respond to its actions or what immediate reward it will receive for taking them. For example, a navigating robot explores an unknown underground cave; an autonomous driving vehicle is designed to avoid all kinds of unpredictable collisions on the road. This challenge gives rise to one of the greatest merits of reinforcement learning, model-free reinforcement learning, which is completely separate from a planning problem. The basic idea of model-free reinforcement learning is that the agent takes actions based on its historical experience, observes how the environment responds, and then finds an optimal policy in the long run.
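The loop below is a minimal sketch of this idea, assuming a gym-style environment with reset()/step() and a hypothetical agent object that selects actions and learns from the observed transitions; none of these names come from the chapter.

```python
# Minimal model-free learning loop (illustrative sketch).
# Assumes a gym-style environment exposing reset() and step(action),
# and a hypothetical agent with select_action() and update().
def train(agent, env, num_episodes=500):
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = agent.select_action(state)            # act from experience, no model of the environment
            next_state, reward, done = env.step(action)    # observe how the environment responds
            agent.update(state, action, reward, next_state, done)  # improve the policy from the transition
            state = next_state
    return agent
```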
Information-Centric Sensor Networks for Cognitive IoT
Published in Fadi Al-Turjman, Cognitive Sensors and IoT, 2017
In model-free reinforcement learning, the agent is free to learn from the environment by exploring it completely on its own. The agent learns from the positive reinforcement it receives for moving toward a goal and the negative reinforcement for moving away from it. Q-learning is a form of model-free RL in which the learning agent converges to an optimal policy even while acting suboptimally; this is called off-policy learning. It is the most extensively used form of RL, as it is easy to implement and a relatively low-cost solution. However, Q-learning has its limitations too: the agent has to explore enough and eventually make the learning rate small, but not decrease it too quickly, so that its experience covers a state space large enough to include all possible actions and policies.
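For illustration, here is a hedged sketch of the two ingredients this passage mentions, epsilon-greedy exploration and a learning rate that shrinks slowly with experience; the particular schedule and function names are assumptions, not taken from the text.

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the greedy action.

    q_values maps an action to its current Q-value estimate; epsilon=0.1 is
    an illustrative default, not a recommendation from the text.
    """
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values.get(a, 0.0))

def learning_rate(visit_count):
    """A common schedule: alpha shrinks with the number of visits to (s, a),
    but slowly enough that every state-action pair keeps being updated."""
    return 1.0 / (1.0 + visit_count)
```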
Reinforcement Learning
Published in Mark Chang, Artificial Intelligence for Drug Development, Precision Medicine, and Healthcare, 2020
Q-learning (Watkins, 1989; Gosavi, 2003; Chang, 2010) is a form of model-free reinforcement learning. It is a forward-induction and asynchronous form of dynamic programming. The power of reinforcement learning lies in its ability to solve the Markov decision process without computing the transition probabilities that are needed in value and policy iteration. The key algorithm in Q-learning is the recursive formulation for the Q-value:

$$Q_i(s,a)=\begin{cases}(1-\alpha_i)\,Q_{i-1}(s,a)+\alpha_i\bigl[g_i+\gamma V_{i-1}(s'_i)\bigr], & \text{if } s=s_i,\ a=a_i,\\ Q_{i-1}(s,a), & \text{otherwise.}\end{cases}$$
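A minimal tabular sketch of this update rule, under the common assumption that V_{i-1}(s'_i) = max_a Q_{i-1}(s'_i, a); the dictionary-based Q-table and the function signature are illustrative and not taken from the chapter.

```python
def q_update(Q, s, a, g, s_next, actions, alpha, gamma):
    """Apply the recursive Q-value update to the visited pair (s_i, a_i) = (s, a).

    All other entries of Q stay unchanged, matching the 'otherwise' branch.
    V_{i-1}(s'_i) is taken as max_a Q_{i-1}(s'_i, a), an assumption of this sketch.
    Q is a plain dict mapping (state, action) -> value, defaulting to 0.0.
    """
    v_next = max(Q.get((s_next, b), 0.0) for b in actions)   # V_{i-1}(s'_i)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (g + gamma * v_next)
    return Q
```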
An adaptive algorithm for consensus improving in group decision making based on reinforcement learning
Published in Journal of the Chinese Institute of Engineers, 2022
Zhang Hengsheng, Zhu Rui, Wang Quantao, Shi Haobin, Kao-Shing Hwang
As a typical algorithm in reinforcement learning, Q-learning is usually used to achieve automatic control. Benefiting from advanced deep learning technologies, Q-learning has been extended to continuous spaces (Henderson et al. 2018) with high interpretability (Li, Shi, and Hwang 2021). Q-learning is a model-free reinforcement learning algorithm based on the Bellman equation; it uses the reward of the state-action pair and Q(s,a) as the estimation function in each iteration. Therefore, the agent needs to examine every action in every learning iteration, which ensures the convergence of the learning process. The basic form of the Q-learning algorithm is as follows:
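The excerpt does not reproduce the equation itself; the standard tabular Q-learning update it refers to is usually written as

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_{t+1} + \gamma \max_{a} Q(s_{t+1},a) - Q(s_t,a_t)\right].$$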
Asynchronous n-step Q-learning adaptive traffic signal control
Published in Journal of Intelligent Transportation Systems, 2018
We propose reinforcement learning as the appropriate method for traffic signal control because it offers specific advantages compared to other solutions. First, reinforcement learning utilizes the structure of the considered problem, observing which actions in certain states achieve reward (Sutton & Barto, 1998). This is in contrast to other optimization techniques, such as evolutionary methods (e.g. genetic algorithms; Lee, Abdulhai, Shalaby, & Chung, 2005), which ignore the information yielded from individual environment interactions. Second, Q-learning is a model-free reinforcement learning algorithm, requiring no model of the environment once it has been sufficiently trained. In contrast to many traffic signal control optimization systems (e.g. SCOOT – Hunt, Robertson, Bretherton, & Winton, 1981; SCATS – Lowrie, 1990; RHODES – Mirchandani & Head, 2001), which require traffic models, or control theory approaches (Gregoire, Qian, Frazzoli, De La Fortelle, & Wongpiromsarn, 2015; Timotheou, Panayiotou, & Polycarpou, 2015; Wongpiromsarn, Uthaicharoenpong, Wang, Frazzoli, & Wang, 2012), which use other models such as backpressure, model-free reinforcement learning makes no assumptions about the model because no model is required. This arguably makes reinforcement learning more robust to traffic dynamics; model-based methods require that their models accurately reflect reality, and as the disparity between the model and reality increases, the performance of the system suffers. Model-free reinforcement learning is also parsimonious compared to model-based methods, requiring less information to function.
A deep reinforcement learning based hyper-heuristic for modular production control
Published in International Journal of Production Research, 2023
Marcel Panzer, Benedict Bender, Norbert Gronau
To facilitate the fundamental integration of deep RL into production control, the problem under consideration must satisfy the Markov property and correspond to a Markov Decision Process (MDP). Besides the rigid definition of the considered scope, the Markov assumption must be met, which implies that all future production states depend only on the current state. This constitutes the underlying assumption of our approach and of the later designed discrete-event based simulation (Sutton and Barto 2017). Q-learning is a model-free, off-policy variant of RL that exploits an action-value, or Q-value, function. The Q-value function, see Equation (1), is typically defined based on an agent's expected cumulative reward, see Equation (2), which follows its current policy and is derived from the Bellman equation. In the following functions, Q(s_t, a_t) denotes the Q-value for a state s_t and action a_t at a certain time t, r_t is the immediate reward received after taking action a_t in state s_t, γ is the discount factor, and max_a Q(s_{t+1}, a) is the maximum Q-value over the next states and actions that can be executed from state s_{t+1} (Sutton and Barto 2017).
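Equations (1) and (2) themselves are not reproduced in this excerpt; in standard notation, the quantities they describe are usually written as follows (a sketch of the conventional definitions, not the paper's exact equations):

$$Q^{\pi}(s_t,a_t)=\mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^{k}\,r_{t+k}\,\middle|\,s_t,a_t\right],\qquad Q(s_t,a_t)=r_t+\gamma\max_{a}Q(s_{t+1},a).$$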