Reinforcement Learning for Out-of-the-box Parameter Control for Evolutionary and Swarm-based Algorithm
Published in Wellington Pinheiro dos Santos, Juliana Carneiro Gomes, Valter Augusto de Freitas Barbosa, Swarm Intelligence Trends and Applications, 2023
Marcelo Gomes Pereira de Lacerda
One of the most widely known modern policy gradient algorithms in the literature is Proximal Policy Optimization (PPO), proposed by Schulman et al. in 2017 (Schulman et al., 2017). The authors' motivation was to insert a regularization mechanism into VPG that prevents a new policy from being created too far from the previous one. PPO follows the same sequence of steps as VPG, but uses a different equation to calculate the loss used to update the policy parameters θ. Equation 32 shows the function used in PPO for such an update, where ϵ is a hyperparameter that limits the distance between the new and the old policies, and g(ϵ, A(st,at)) returns (1 + ϵ)A(st,at) if A(st,at) ≥ 0, and (1 − ϵ)A(st,at) otherwise.
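As a minimal sketch of the clipped surrogate described above (not the authors' implementation), the snippet below computes the PPO-Clip objective from per-sample log-probabilities under the new and old policies and the advantage estimates; the function name, the default ϵ = 0.2, and the example numbers are illustrative assumptions.

```python
import numpy as np

def ppo_clip_objective(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """PPO-Clip surrogate: mean over samples of min(ratio * A, g(eps, A)),
    where g(eps, A) = (1 + eps) * A if A >= 0 and (1 - eps) * A otherwise."""
    ratio = np.exp(log_probs_new - log_probs_old)        # probability ratio pi_new / pi_old
    g = np.where(advantages >= 0,
                 (1.0 + epsilon) * advantages,
                 (1.0 - epsilon) * advantages)            # clipped branch g(eps, A)
    return np.minimum(ratio * advantages, g).mean()       # objective to maximize w.r.t. theta

# Illustrative call with made-up numbers:
obj = ppo_clip_objective(
    log_probs_new=np.array([-0.9, -1.2, -0.3]),
    log_probs_old=np.array([-1.0, -1.0, -0.5]),
    advantages=np.array([0.7, -0.4, 1.1]),
)
```

Taking the minimum of the unclipped and clipped terms removes the incentive to push the probability ratio outside the interval [1 − ϵ, 1 + ϵ], which is what keeps the new policy close to the old one.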
A bibliometric analysis and review on reinforcement learning for transportation applications
Published in Transportmetrica B: Transport Dynamics, 2023
Can Li, Lei Bai, Lina Yao, S. Travis Waller, Wei Liu
Further policy-based algorithms have also been designed. For instance, Trust Region Policy Optimization (TRPO) (Schulman et al. 2015) tends to give monotonic improvement over iterations by constraining the Kullback–Leibler divergence between the old and updated policies, so that the change in the parameter space never becomes large enough to cause a collapse of state values due to wrong decisions. Similarly, Proximal Policy Optimization (PPO) (Schulman et al. 2017) is a widely adopted algorithm that also keeps the difference between the old and updated policies small, by limiting the probability ratio between the updated and old policies to a range set by a hyper-parameter.
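The TRPO constraint mentioned above is the average KL divergence between the old and updated policies over visited states. A small hedged sketch of that quantity for discrete action distributions is shown below; the function name, the probability arrays, and the trust-region bound `delta` are illustrative assumptions, not part of either cited paper's code.

```python
import numpy as np

def mean_kl_divergence(old_probs, new_probs, eps=1e-8):
    """Mean KL(old || new) over states for discrete action distributions,
    the quantity TRPO constrains to keep each policy update small."""
    kl_per_state = np.sum(
        old_probs * (np.log(old_probs + eps) - np.log(new_probs + eps)),
        axis=-1,
    )
    return kl_per_state.mean()

# Rows are per-state action distributions (each row sums to 1).
old_probs = np.array([[0.6, 0.4], [0.3, 0.7]])
new_probs = np.array([[0.55, 0.45], [0.35, 0.65]])

delta = 0.01  # illustrative trust-region bound
kl = mean_kl_divergence(old_probs, new_probs)
update_accepted = kl <= delta  # TRPO only accepts updates within the bound
```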
Deep-reinforcement-learning-based gait pattern controller on an uneven terrain for humanoid robots
Published in International Journal of Optomechatronics, 2023
Ping-Huan Kuo, Chieh-Hsiu Pao, En-Yi Chang, Her-Terng Yau
With the rapid development of technology, several advancements have been observed in the field of robotics, and mobile robots are currently the most popular robotic application. Over the years, robotics has progressed from robotic arms to robots with a variety of gait patterns, such as wheeled robots and quadruped robots. The robot discussed in this study was a biped robot, whose gait pattern is highly similar to that of humans, and the model used was ROBOTIS OP3. The term gait pattern, describing movement controlled by the body, was initially used by biologists to characterize the movement pattern of an organism; the concept has since been commonly applied to robots to record the movements of each robot leg and its gait pattern. This study focused on reinforcement learning (RL), a subset of machine learning whose aim is to explore how agents decide their actions depending on the environment (state) and which policies increase the reward. Given this concept, the present study also investigated how a humanoid robot can learn to walk on uneven terrain. Li et al.[1] studied the tracking and control of aircraft; because the tracking data were difficult to obtain due to occlusion, an estimator and a controller were used to solve this problem. In the present study, a controller was used to control the robot's gait pattern: the robot's data were collected using a sensor, and the controller was used to change the robot's gait pattern. Yu et al.[2] used an RL boundary controller to address traffic congestion on highways; proximal policy optimization (PPO), a neural-network-based[3] policy gradient algorithm, was applied to the controller. Although the training was conducted with an analog model, it was of academic value. Hence, in the present study, PPO2, which was derived from PPO, was used, and RL was embedded in the controller.
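For orientation only, the following is a minimal sketch of training a PPO-style gait controller; it is not the authors' code. The excerpt refers to PPO2 (the TensorFlow-era Stable Baselines implementation), whereas this sketch assumes the successor Stable-Baselines3 API with a classic Gym environment; the environment, training budget, and file name are placeholders.

```python
import gym
from stable_baselines3 import PPO

# Placeholder environment; the actual ROBOTIS OP3 / uneven-terrain setup is not public here.
env = gym.make("BipedalWalker-v3")

model = PPO("MlpPolicy", env, verbose=1)   # PPO with the clipped surrogate objective
model.learn(total_timesteps=100_000)       # training budget is illustrative
model.save("ppo_gait_controller")          # hypothetical file name
```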