Friend or frenemy? The role of trust in human-machine teaming and lethal autonomous weapons systems
Published in Ash Rossiter, Robotics, Autonomous Systems and Contemporary International Security, 2020
To understand what some of the numbers could involve, consider the power required to train and then run the AlphaZero algorithm, developed by Google DeepMind. In 2017, AlphaZero used self-play reinforcement learning to train ‘itself’ to play chess and defeat the world-champion chess engine, Stockfish, after four hours of training. The hardware required included 5,000 first-generation Tensor Processing Units (TPUs) – Google’s custom-designed, application-specific integrated circuit. These were complemented by 64 second-generation TPUs for training the algorithm’s neural networks.60 After ‘it’ completed this learning, the AlphaZero algorithm needed only 4 TPUs, comparatively little computing power, to play chess during a competition. For context, the first-generation TPU was 15–30 times more powerful, as well as 30–80 times more energy efficient (as measured by performance per watt), than the benchmark Graphics Processing Units and Central Processing Units that were available in 2015.61 As an alternative to volunteer computing, which requires large public appeal for scientific research,62 anyone with the right amount of cash can train an algorithm on a platform such as Google Cloud,63 provided they meet the Terms of Service. The price for the public to ‘rent’ a single second-generation TPU in December 2019 began at 1.35 USD per hour.64 This raises the question of whether the increasing availability of cloud-based infrastructure could remove a barrier for anyone – including non-state actors – to do intensive training for algorithms.
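To put those figures in perspective, the following is a back-of-envelope sketch (not a statement about Google's actual costs): it applies the quoted 1.35 USD per device-hour rate to every device, including the 5,000 first-generation TPUs, which were never actually offered for public rental, and it ignores storage, networking, preemption and any price differences between TPU generations.

```python
# Back-of-envelope estimate of the cloud rental cost of an AlphaZero-style
# training run, using only the figures quoted in the excerpt above.
# Assumption (hypothetical): every TPU rents at the December 2019 rate of
# 1.35 USD per device-hour quoted for a single second-generation TPU.

SELF_PLAY_TPUS = 5_000    # first-generation TPUs generating self-play games
TRAINING_TPUS = 64        # second-generation TPUs training the neural networks
HOURS = 4                 # reported time for AlphaZero to reach superhuman chess
USD_PER_TPU_HOUR = 1.35   # public rental price quoted in the excerpt

device_hours = (SELF_PLAY_TPUS + TRAINING_TPUS) * HOURS
estimated_cost = device_hours * USD_PER_TPU_HOUR

print(f"Device-hours: {device_hours:,}")                 # 20,256 device-hours
print(f"Estimated rental cost: ${estimated_cost:,.2f}")  # about $27,345.60
```

Even under this optimistic flat-rate assumption, the estimate lands in the tens of thousands of dollars rather than the millions, which underlines the passage's question about whether cloud infrastructure lowers the barrier for well-funded non-state actors.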
Know Where to Start – Select the Right Project
Published in James Luke, David Porter, Padmanabhan Santhanam, Beyond Algorithms, 2022
However, Google DeepMind’s AlphaZero program can play championship-level chess today with NO human input, using only the game rules augmented by self-play with reinforcement learning [10]. Given the advantage demonstrated recently by AlphaZero over Stockfish (the best current chess program in the Deep Blue genre) [11], if we were to create a chess application today, we would be foolish not to consider ML-based approaches.
Ancillary mechanism for autonomous decision-making process in asymmetric confrontation: a view from Gomoku
Published in Journal of Experimental & Theoretical Artificial Intelligence, 2022
The agent’s different tendencies in offensive and defensive learning are the main reason for the inefficiency of self-play. Broadly speaking, reinforcement learning, whether Q-learning (Schmidhuber, 2015) or the subsequent, more efficient DQN (Mnih et al., 2016), has trial-and-error methods at its core. Through a large number of attempts, and by reinforcing the behaviours that obtain positive reward, the ultimate goal is to achieve machine learning without human experience. As in any other board game, winning or losing in Gomoku provides a set of absolutely correct rewards.4 By providing positive and negative rewards to the agent during self-play, the agent can gradually learn to seek advantages and avoid disadvantages, thereby mastering the game. However, in practice, the internal difference between offensive and defensive learning through trial-and-error methods results in a significant distinction in learning efficiency (García et al., 2020). In asymmetric games like Gomoku, such differences in learning rates have a detrimental impact on the machine learning training process.
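To make the reward structure described above concrete, here is a minimal, hypothetical self-play sketch in Python: a tiny 5×5, four-in-a-row Gomoku variant with a tabular after-state value function, epsilon-greedy move selection, and Monte Carlo-style updates driven only by the terminal win/loss reward. This is not the paper's ancillary mechanism; the board size, win length and hyperparameters (ALPHA, EPSILON) are arbitrary choices made for readability.

```python
import random
from collections import defaultdict

SIZE = 5          # 5x5 board: a deliberately tiny, hypothetical Gomoku variant
IN_A_ROW = 4      # four in a row wins here (real Gomoku uses 15x15 and five)
ALPHA = 0.1       # step size for value updates
EPSILON = 0.2     # exploration probability during self-play

# values[board] estimates how good a position is for the player who just moved
values = defaultdict(float)

def wins_after(board, r, c, player):
    """Return True if the stone just placed at (r, c) completes a line."""
    for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
        count = 1
        for sign in (1, -1):
            rr, cc = r + sign * dr, c + sign * dc
            while 0 <= rr < SIZE and 0 <= cc < SIZE and board[rr][cc] == player:
                count += 1
                rr += sign * dr
                cc += sign * dc
        if count >= IN_A_ROW:
            return True
    return False

def choose_move(board, player):
    """Epsilon-greedy selection over after-state values."""
    moves = [(r, c) for r in range(SIZE) for c in range(SIZE) if board[r][c] == 0]
    if random.random() < EPSILON:
        return random.choice(moves)
    def after_value(move):
        r, c = move
        board[r][c] = player
        v = values[tuple(map(tuple, board))]
        board[r][c] = 0
        return v
    return max(moves, key=after_value)

def self_play_game():
    board = [[0] * SIZE for _ in range(SIZE)]
    history = {1: [], 2: []}        # after-states created by each player
    player, result = 1, 0           # result: 0 = draw, otherwise winning player
    for _ in range(SIZE * SIZE):
        r, c = choose_move(board, player)
        board[r][c] = player
        history[player].append(tuple(map(tuple, board)))
        if wins_after(board, r, c, player):
            result = player
            break
        player = 3 - player
    # The terminal outcome is the only learning signal: the winner's positions
    # are nudged toward +1, the loser's toward -1, both toward 0 on a draw.
    for p in (1, 2):
        target = 0.0 if result == 0 else (1.0 if p == result else -1.0)
        for state in history[p]:
            values[state] += ALPHA * (target - values[state])
    return result

if __name__ == "__main__":
    outcomes = [self_play_game() for _ in range(2000)]
    print("first-player wins:", outcomes.count(1),
          "second-player wins:", outcomes.count(2),
          "draws:", outcomes.count(0))
```

In a setup like this, both offensive and defensive behaviour must be learned from the same terminal reward: moves that complete a winning line receive direct credit, while defensive blocks are rewarded only indirectly by avoiding the eventual loss. That shared, delayed signal is the structural feature the authors point to when explaining why offence and defence are acquired at different rates in self-play.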