Reinforcement Learning Problems
Published in Chong Li, Meikang Qiu, Reinforcement Learning for Cyber-Physical Systems, 2019
The multi-armed bandit (MAB) problem, also referred to as the k-armed bandit problem, is one in which an agent must pick among a discrete set of actions to maximize the expected return. The problem takes its name from a toy example in which an agent enters a casino and seeks to maximize earnings at a row of slot machines. Upon entering the casino, however, the agent does not know which of the machines has the highest expected payout. The agent must therefore devise a strategy that simultaneously learns the payout distribution of each slot machine and exploits its existing knowledge of which machine is most lucrative. For now, we assume the payout distribution is stationary, that is, the payout distribution does not change over time. Despite the colorful setting, the agent is constrained to play according to the following repetitive process: choose a slot machine to play, pull the selected machine's lever, and observe the machine's payout. As we build up to the full reinforcement learning problem, we want to point out some common elements and themes that emerge from the MAB problem and that will also be critical to the RL problem.
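To make the explore/exploit tension concrete, here is a minimal epsilon-greedy sketch of the slot-machine setup described above; the payout probabilities, the value of epsilon, and the function name are illustrative assumptions rather than anything specified in the excerpt.

```python
import random

# Epsilon-greedy sketch for the k-armed bandit with stationary Bernoulli payouts.
# payout_probs, steps, and epsilon are assumed values for illustration only.
def epsilon_greedy_bandit(payout_probs, steps=10_000, epsilon=0.1):
    k = len(payout_probs)
    counts = [0] * k          # number of pulls per machine
    estimates = [0.0] * k     # running estimate of each machine's expected payout
    total_reward = 0.0

    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit the best current estimate.
        if random.random() < epsilon:
            arm = random.randrange(k)
        else:
            arm = max(range(k), key=lambda a: estimates[a])

        # Pull the lever: Bernoulli payout with the machine's (unknown) probability.
        reward = 1.0 if random.random() < payout_probs[arm] else 0.0
        total_reward += reward

        # Incremental update of the sample-average payout estimate.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates, total_reward

if __name__ == "__main__":
    est, reward = epsilon_greedy_bandit([0.2, 0.5, 0.7])
    print("estimated payouts:", [round(e, 3) for e in est])
    print("total reward:", reward)
```

With enough pulls, the estimates converge toward the true payout probabilities while most traffic goes to the machine currently believed to be best, which is exactly the learn-while-exploiting behavior the MAB problem asks for.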
Thought-leadership pieces
Published in Nawal K. Taneja, 21st Century Airlines, 2017
Determining the price for the bundle requires estimates of a customer's willingness to pay, which should be calibrated for each trip segment. The customer's willingness to pay can be estimated from A/B testing or from surveys such as conjoint analysis. In general, A/B testing is more reliable than surveys, because survey responses are biased when customers do not report what they really think. With estimates of price elasticity by trip segment, the discount for the bundle can be determined to arrive at the bundle price. A/B testing can also be used to improve conversion rates on the proposed bundles. In an A/B testing framework, alternate versions, the current page (control) and a proposed page (variation), split the incoming traffic and are compared against each other to determine whether the change has a positive, negative, or neutral impact on a metric such as conversion rate. An alternative approach to A/B testing is to deploy a multi-armed bandit [11,12], which tries each alternative (arm) on a small percentage of the traffic (e.g., 10%) during an exploratory phase to find the best one and then, in the exploitation phase, sends the bulk of the traffic to the arm with the best payoff.
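As a rough illustration of that explore-then-exploit traffic split, the sketch below spreads a small exploratory share of traffic evenly across variants and then routes the remaining traffic to the best-performing one; the variant conversion rates, traffic volume, and 10% exploration share are made-up assumptions, not figures from the text.

```python
import random

# Explore-then-exploit traffic allocation across page/bundle variants.
# true_conversion holds simulated (unknown-in-practice) conversion rates per variant.
def explore_then_exploit(true_conversion, total_visitors=100_000, explore_share=0.10):
    k = len(true_conversion)
    conversions = [0] * k
    shown = [0] * k

    explore_visitors = int(total_visitors * explore_share)

    # Exploratory phase: spread a small share of traffic evenly across all variants.
    for i in range(explore_visitors):
        arm = i % k
        shown[arm] += 1
        conversions[arm] += random.random() < true_conversion[arm]

    # Pick the variant with the highest observed conversion rate so far.
    best = max(range(k), key=lambda a: conversions[a] / shown[a])

    # Exploitation phase: send the bulk of the traffic to the winning variant.
    for _ in range(total_visitors - explore_visitors):
        shown[best] += 1
        conversions[best] += random.random() < true_conversion[best]

    return best, [conversions[a] / shown[a] for a in range(k)]

if __name__ == "__main__":
    # control vs. two proposed bundle variants (illustrative rates)
    best, rates = explore_then_exploit([0.030, 0.034, 0.041])
    print("chosen variant:", best, "observed rates:", [round(r, 4) for r in rates])
```

Compared with a fixed 50/50 A/B split, this allocation limits how much traffic is spent on losing variants once the exploratory phase has identified a likely winner.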
Introduction
Published in Joseph Y.-T. Leung, Handbook of SCHEDULING, 2004
[18] Gittins, J.C. (1989). Multi-armed Bandit Allocation Indices. John Wiley & Sons, Chichester.
[19] Gordon, V.S., J.M. Proth, and C.B. Chu (2002). Due date assignment and scheduling: SLK, TWK and other due date assignment models. Production Planning & Control, 13, 117–132.
Contextual Bandit Approach-based Recommendation System for Personalized Web-based Services
Published in Applied Artificial Intelligence, 2021
Akshay Pilani, Kritagya Mathur, Himanshu Agrawald, Deeksha Chandola, Vinay Anand Tikkiwal, Arun Kumar
The multi-armed bandit problem is a classical problem in computer science, and multi-armed bandit algorithms provide a solution to the exploration-exploitation dilemma. Contextual bandits are a variant of the bandit problem in which the expected payoff is used to recommend news articles. The expected payoff for a user is calculated from the context and the unknown bandit parameters, where the context is a feature vector built from information about both the user and the news articles. Some contextual bandit algorithms treat users as independent of one another, i.e., the unknown bandit parameters are estimated for each user separately.
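One concrete instance of this idea is a LinUCB-style disjoint model, sketched below, in which each arm (article) keeps its own ridge-regression estimate of the bandit parameters and scores a context vector by an upper-confidence payoff estimate; the feature dimension, alpha value, and simulated click feedback are assumptions for illustration and not necessarily the exact algorithm used by the paper's authors.

```python
import numpy as np

# Per-arm LinUCB sketch (disjoint model). Each arm learns its own parameter vector
# from d-dimensional context vectors combining user and article features.
class LinUCBArm:
    def __init__(self, d, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(d)        # ridge-regression design matrix for this arm
        self.b = np.zeros(d)      # reward-weighted sum of observed contexts

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                          # estimated bandit parameters
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def recommend(arms, context):
    """Pick the article (arm) with the highest upper-confidence payoff estimate."""
    return max(range(len(arms)), key=lambda a: arms[a].ucb(context))

if __name__ == "__main__":
    d, n_articles = 6, 4
    arms = [LinUCBArm(d) for _ in range(n_articles)]
    rng = np.random.default_rng(0)
    for _ in range(100):
        x = rng.normal(size=d)               # combined user + article feature vector
        a = recommend(arms, x)
        reward = float(rng.random() < 0.3)   # simulated click feedback
        arms[a].update(x, reward)
```

Treating users independently, as the excerpt notes some algorithms do, would amount to keeping a separate set of such per-arm estimates for each user rather than sharing parameters across users.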