Intelligent Systems
Published in R.S. Chauhan, Kavita Taneja, Rajiv Khanduja, Vishal Kamra, Rahul Rattan, Evolutionary Computation with Intelligent Systems, 2022
J. Senthil Kumar, G. Sivasankar
To handle human–robot interaction, RRT-based path planning alone is not sufficient, so we propose a reinforcement learning (RL)-based RRT, where RL is used to learn the human–robot interaction through trial and error to achieve the desired performance. In RL, the agent earns positive and negative rewards for its actions in each state. A Markov decision process (MDP) is defined by the state, action, and reward triple (s, a, r), where the state space is S ⊆ ℝᴺ, the action space is A ⊆ ℝᴺ, and the reward function is R: S × A → ℝ. If action a is taken in state s, then the probability of the next state s′ and the next reward r is given in Equation 6.4.
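To make the (s, a, r) formalism above concrete, here is a minimal sketch of sampling from p(s′, r | s, a) for a toy discrete MDP; the state and action names, the dynamics table, and the reward values are invented for illustration and are not taken from the chapter's RL-based RRT planner.

```python
import random

# A minimal, hypothetical discrete MDP illustrating the (s, a, r) formalism.
# dynamics[(s, a)] lists (probability, next_state, reward) triples, i.e. p(s', r | s, a).
dynamics = {
    ("near_human", "slow_down"):  [(0.9, "safe_pass", +1.0), (0.1, "near_human", -0.1)],
    ("near_human", "keep_speed"): [(0.3, "safe_pass", +0.5), (0.7, "collision_risk", -1.0)],
}

def step(state, action):
    """Sample (next_state, reward) according to p(s', r | s, a)."""
    outcomes = dynamics[(state, action)]
    probs = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probs, k=1)[0]
    return next_state, reward

next_state, reward = step("near_human", "slow_down")
print(next_state, reward)
```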
Flight Planning
Published in Yasmina Bestaoui Sebbane, Multi-UAV Planning and Task Allocation, 2020
The objective is to find a suitable policy that simultaneously minimizes the service delay and maximizes the information gained upon loitering. A stochastic optimal control problem is thus considered [376]. A Markov decision process (MDP) is solved in order to determine the optimal control policy [67,182]. However, its large size renders exact dynamic programming methods intractable. Therefore, a state aggregation-based approximate linear programming method is used instead to construct provably good sub-optimal patrol policies [382]. The state space is partitioned, and the optimal cost-to-go or value function is restricted to be constant over each partition. The resulting restricted system of linear inequalities embeds a family of Markov chains (MCs) of lower dimension, one of which can be used to construct a lower bound on the optimal value function. The perimeter patrol problem exhibits a special structure that enables a tractable linear programming formulation for the lower bound [27].
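As a rough illustration of the state-aggregation approximate linear programming idea, the sketch below restricts the value function to be constant on each partition and solves the resulting LP with scipy; the MDP data, the partition, and the discount factor are arbitrary placeholders, not the perimeter patrol model of [382].

```python
import numpy as np
from scipy.optimize import linprog

# Toy state-aggregation approximate LP for a discounted-cost MDP (made-up data).
n_states, n_actions, gamma = 6, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, s']
c = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # stage cost c(s, a)
partition = np.array([0, 0, 1, 1, 2, 2])                          # k(s): state -> partition
K = partition.max() + 1

# Exact LP: maximize sum_s V(s) s.t. V(s) <= c(s,a) + gamma * sum_s' P(s'|s,a) V(s').
# Restricting V to be constant on each partition, V(s) = w_{k(s)}, keeps every feasible
# point a pointwise lower bound on the optimal cost-to-go V*.
A_ub, b_ub = [], []
for s in range(n_states):
    for a in range(n_actions):
        row = np.zeros(K)
        row[partition[s]] += 1.0
        for s_next in range(n_states):
            row[partition[s_next]] -= gamma * P[a, s, s_next]
        A_ub.append(row)
        b_ub.append(c[s, a])

# Maximize sum_s w_{k(s)}  <=>  minimize -(partition size) . w
obj = -np.bincount(partition, minlength=K).astype(float)
res = linprog(obj, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=[(None, None)] * K)
print("lower-bound value per partition:", res.x)
```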
Security Concerns in Cognitive Radio Networks
Published in Mohamed Ibnkahla, Cooperative Cognitive Radio Networks, 2018
For the Markov decision process, a policy is defined as a mapping from a state to an action, π: S(n) → a(n). This says that Wu–Wang–Liu’s policy π specifies an action π(S), which we shall take when we are in a certain state S. Out of all possible policies, the optimal policy is the one which maximizes the discounted payoff. Wu–Wang–Liu’s model defines the value of state S as the highest expected payoff given that the MDP started in state S:
V*(S) = max_π E[ Σ_{n=1}^{∞} δⁿ U(n) | initial state is S ]
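A common way to compute V*(S) under this definition is value iteration on the Bellman optimality operator. The snippet below is a generic sketch on an invented 3-state, 2-action MDP with discount δ; it is not Wu–Wang–Liu’s cognitive-radio model.

```python
import numpy as np

# Value iteration for V*(S) = max_pi E[ sum_n delta^n U(n) | initial state S ].
delta = 0.9                                       # discount factor
P = np.array([[[0.8, 0.2, 0.0],                   # P[a, s, s']: transition probabilities
               [0.1, 0.8, 0.1],
               [0.0, 0.2, 0.8]],
              [[0.5, 0.5, 0.0],
               [0.0, 0.5, 0.5],
               [0.5, 0.0, 0.5]]])
U = np.array([[1.0, 0.0],                         # U[s, a]: expected one-step payoff
              [0.0, 2.0],
              [0.5, 0.5]])

V = np.zeros(3)
for _ in range(500):                              # iterate the Bellman optimality operator
    Q = U + delta * np.einsum("asn,n->sa", P, V)  # Q[s, a] = U(s, a) + delta * E[V(s')]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)                         # greedy optimal policy pi*(S)
print("V*:", V, "pi*:", policy)
```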
Leveraging vehicle connectivity and autonomy for highway bottleneck congestion mitigation using reinforcement learning
Published in Transportmetrica A: Transport Science, 2023
Paul (Young Joun) Ha, Sikai Chen, Jiqian Dong, Samuel Labi
Reinforcement learning, in essence, maximises the discounted reward of a finite-horizon Markov decision process (MDP). An MDP is typically defined by a tuple (S, A, T, r), where S is the state space with states s ∈ S, A is the action space with actions a ∈ A, T is a transition operator, and r is a scalar reward. In RL, the agent receives feedback from the environment on its actions, with the ultimate goal of maximising its cumulative discounted reward. The MDP provides the data to optimise a policy that maps states to actions so as to maximise the cumulative discounted reward (Bellman 1957; Sutton and Barto 2018).
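The tuple and the objective can be written down schematically as follows; the names MDP, transition, reward, and discounted_return, and the (S, A, T, r) notation filled in above, are illustrative placeholders rather than quotations from the paper.

```python
from typing import NamedTuple, Callable, Sequence

# Schematic container for the MDP tuple (S, A, T, r) described in the text.
class MDP(NamedTuple):
    states: Sequence        # state space S
    actions: Sequence       # action space A
    transition: Callable    # T(s, a) -> next state (or a distribution over S)
    reward: Callable        # r(s, a) -> scalar reward

def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward sum_t gamma^t * r_t that the RL agent maximises."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # 1.0 + 0.0 + 0.81 * 2.0 = 2.62
```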
Inspection strategies for quality products with rewards in a multi-stage production
Published in Journal of Control and Decision, 2022
R. Satheesh Kumar, A. Nagarajan
The contributions of the present paper are as follows. In this system, the arrival of entities is Poisson with rate λ, and the inter-service time among the stages of sequence processing has an Erlang distribution with parameters m and γ. The re-manufacturing time for non-conforming items and the packing time for conforming items are exponentially distributed. The system is formulated as a Markov decision process (MDP) model, and decisions are implemented in each state to maximize the total expected reward of the system. The remainder of the paper is organized as follows. Section 2 reviews the literature on queues, the re-manufacturing process, and applications of MDPs. Section 3 presents the notation and model description in a stochastic environment, and Section 4 describes the formulation of the model. A numerical example illustrating the output of the model, along with a real case study of the system, is discussed in Section 5. Section 6 draws conclusions and outlines future research directions.
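For readers who want to simulate the stochastic primitives described above, the following sketch samples Poisson inter-arrival times, Erlang(m, γ) service times, and exponential re-manufacturing and packing times; all numeric parameter values (lam, m, gamma_rate, mu_r, mu_p) are assumptions chosen only for illustration, not the paper's data.

```python
import numpy as np

# Sample the stochastic primitives of the production model (parameter values assumed).
rng = np.random.default_rng(1)
lam, m, gamma_rate = 2.0, 3, 5.0   # Poisson arrival rate; Erlang shape m and rate gamma
mu_r, mu_p = 1.5, 4.0              # rates of exponential re-manufacturing / packing times

inter_arrival = rng.exponential(1.0 / lam, size=1000)             # Poisson arrival process
service = rng.gamma(shape=m, scale=1.0 / gamma_rate, size=1000)   # Erlang(m, gamma) service
remanufacture = rng.exponential(1.0 / mu_r, size=1000)            # non-conforming items
packing = rng.exponential(1.0 / mu_p, size=1000)                  # conforming items

print(inter_arrival.mean(), service.mean(), remanufacture.mean(), packing.mean())
```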
Deep reinforcement learning-based path planning of underactuated surface vessels
Published in Cyber-Physical Systems, 2019
Hongwei Xu, Ning Wang, Hong Zhao, Zhongjiu Zheng
When we study RL and control problems, an agent selects control behaviours sequentially over a series of time steps in an unknown environment, in order to maximise a cumulative reward. This setting is the so-called Markov decision process (MDP), defined by a state space S, an action space A, a reward function r, and a transition function p. At each time step t, the agent observes the current state sₜ ∈ S. Then, an action aₜ ∈ A is selected, generating a transition from the current state sₜ to a new state sₜ₊₁. The agent receives an immediate reward rₜ in the process. In addition, for any trajectory in state–action space, the transition satisfies the Markovian property:
p(sₜ₊₁ | s₁, a₁, …, sₜ, aₜ) = p(sₜ₊₁ | sₜ, aₜ)
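A minimal agent–environment loop in this MDP sense might look like the sketch below, where the step function depends only on (sₜ, aₜ), which is exactly the Markovian property; the transition matrix, reward, and random policy are invented placeholders rather than the underactuated-vessel model of the paper.

```python
import numpy as np

# Toy MDP interaction loop: the next state depends only on the current (s_t, a_t).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.99
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s' | s, a)

def step(s_t, a_t):
    """Sample s_{t+1} ~ p(. | s_t, a_t) and return an immediate reward r_t."""
    s_next = rng.choice(n_states, p=P[s_t, a_t])
    r_t = 1.0 if s_next == n_states - 1 else 0.0   # illustrative reward function
    return s_next, r_t

s, ret = 0, 0.0
for t in range(50):                                # roll out one trajectory
    a = rng.integers(n_actions)                    # a random policy, just for illustration
    s, r = step(s, a)
    ret += gamma**t * r                            # cumulative discounted reward
print("discounted return:", ret)
```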