Key Takeaways
- The multi-armed bandit (MAB) problem is a special class of Reinforcement Learning (RL) problem that has wide applications and is gaining popularity.
- Multi-armed bandits simplify RL by ignoring the state of the environment and instead focus on balancing exploration and exploitation.
- Website design and clinical trials are some areas where MAB algorithms shine.
- Contextual bandits take MAB further by adding an element of Context from the problem domain.
- Contextual bandits can improve the effectiveness of MAB solutions and can also be applied to new areas like recommender systems.
Machine Learning (ML) is often thought of as either Supervised (learning from labeled data) or Unsupervised (finding patterns in raw data). A less talked about area of ML is Reinforcement Learning (RL), where we train an agent to learn by "observing" an environment rather than from a static dataset. RL is considered closer to a true form of Artificial Intelligence (AI), because it is analogous to how we, as humans, learn new things: by observing, and by trial and error. Although RL has been studied by researchers for decades, a particular formulation of RL that is fast gaining popularity is the Multi-armed bandit (MAB). In this article, we will explore multi-armed bandits and understand how they can be applied to areas like improving the design of websites and making clinical trials more effective.
As shown in figure 1 below, the traditional RL problem is modeled as an environment with a state (S1) which is observed by our agent and changed to state (S2) by taking an action (A). The action transitions the state of the environment from (S1) to (S2), and in return the agent gets a reward (R). The reward may be positive or negative. Over a series of such trials and errors, the agent learns an optimal policy for taking actions based on a state so as to maximize the long-term reward. An example of this is a game of chess, where the player's actions change the state of the board, there are immediate rewards like capturing or losing a piece, and a long-term reward of winning or losing the game. RL is heavily used in the gaming industry, and you can imagine these agents and environments becoming more and more complex.
Figure 1: Pure Reinforcement Learning
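To make this loop concrete, here is a minimal sketch of the observe-act-reward cycle in Python. The Environment and Agent classes, their methods, and the toy reward are invented purely for illustration and are not tied to any particular RL library.

import random

# hypothetical environment: step() applies an action and returns (next_state, reward)
class Environment:
    def __init__(self):
        self.state = 0
    def step(self, action):
        # toy dynamics: the action nudges the state; reward is highest near state 10
        self.state += action
        return self.state, -abs(10 - self.state)

# hypothetical agent with a placeholder random policy
class Agent:
    def act(self, state):
        return random.choice([-1, 1])

env, agent = Environment(), Agent()
state = env.state
for _ in range(20):
    action = agent.act(state)         # agent observes state S1 and takes action A
    state, reward = env.step(action)  # environment moves to S2 and returns reward R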
A simpler abstraction of the RL problem is the multi-armed bandit problem. A multi-armed bandit problem does not account for the environment and its state changes. As shown in figure 2 below, the agent only observes the actions it takes and the rewards it receives, and tries to devise the optimal strategy. The idea in solving multi-armed bandit problems is to explore the action space and understand the distribution of the unknown reward function. The name "bandit" comes from the analogy of casinos, where we have multiple slot machines and have to decide whether to keep playing a single machine (exploitation) or move to a new machine (exploration), all with the objective of maximizing our winnings. We don't know anything about the state of these machines, only the actions we take and the rewards we get. We need to decide between multiple choices purely by taking actions and observing the returns. Bandit algorithms ultimately trade off exploration against exploitation to identify the optimal strategy.
Figure 2: Multi-armed Bandit problem
Let’s understand multi-armed bandit algorithms using the example of website design. Say we have a new version of our website (B) and need to decide how well it performs compared to the older version (A). Let’s assume that we have an effective feedback mechanism by which we know whether the user liked or disliked the version shown (the reward). Our action here is to present the user with option A or B of the website, our environment is the user visiting the website, and the reward is the feedback received. As mentioned earlier, by studying the responses from users we want to understand the distribution of the reward function and thus make an informed decision about which design is better.
The most obvious strategy would be to divide the users into two groups randomly and show website A to one group and B to the other. So if we have 100 users, we show a random group of 50 users option A and the other 50 option B. Then, based on which website is liked more, we decide to go ahead with that design. This is the approach of pure exploration and is popularly known as A/B testing. A/B testing has its roots in inferential statistics, where we want to accept or reject a hypothesis (an assumption) based on evidence. So if our hypothesis is that website B is liked more by users than website A, then we try to accept or reject this hypothesis with evidence from the data. The drawback of this approach is that we wait till the end of the test, that is, till all 100 users have seen the website, to make a decision. So if option B is really bad, we waste our valuable users' time on that option.
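To illustrate the inferential step, here is a minimal sketch of such a hypothesis test using scipy's chi-square test of independence. The like/dislike counts are made up for illustration, and scipy is an assumption; the article's notebook does not include this step.

from scipy.stats import chi2_contingency

# hypothetical feedback counts: [liked, disliked] for each website version
observed = [[30, 20],   # website A: 50 users
            [41, 9]]    # website B: 50 users
chi2, p_value, dof, expected = chi2_contingency(observed)
print("p-value = %.4f" % p_value)
# a small p-value is evidence that the difference in preference is unlikely to be chance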
Maybe for website testing we can afford to wait until we have enough feedback points before deciding which design is better. However, imagine the case of clinical trials. Here our hypothesis is that our new drug B is better than the earlier drug A. We experiment on fixed patient groups and observe the results. With a traditional A/B testing approach, we have to wait till we have tried both drugs on enough people before making a decision. In the process, we are subjecting some patients to an inferior drug only to collect data, which is clearly unacceptable.
Multi-armed bandit algorithms try to optimize the exploration vs. exploitation trade-off so that we get maximum rewards and do not waste our resources. Let’s look at a simple multi-armed bandit solution approach. Instead of dividing users into fixed groups, say for the first 10% of users we randomly show website A or B to get an idea, from the observed feedback, of which is liked more. Then for the next 40%, we only show the website that we calculated to be the preferred option. For the following 10% we again do more exploration, and for the remaining 40% of users we show the winning option. This is the epsilon-greedy algorithm, where with probability epsilon (20% here) we explore, and the rest of the time we exploit the winning option. The epsilon value can be tweaked over time based on how much exploration you want to do. Another approach looks at the uncertainty in the reward function that we model from sampling: for actions where we see more uncertainty in rewards, we explore more. There are also Bayesian methods for solving this problem, where we use the prior probability of rewards and the observed data to estimate the posterior probability. Many flavours of such algorithms exist for solving multi-armed bandit problems, but in essence they all manage the exploration vs. exploitation trade-off and develop a strategy for taking actions.
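The Bayesian approach mentioned above is often implemented as Thompson sampling: keep a Beta posterior over each arm's reward probability, sample from each posterior, and pull the arm with the highest sample. The sketch below is our own illustration for a two-armed Bernoulli bandit and is not taken from the article's notebook; the hidden reward probabilities are made up.

import numpy as np

# Thompson sampling for a 2-armed Bernoulli bandit (illustrative sketch)
true_probs = [0.8, 0.35]        # hidden reward probabilities, used only to simulate feedback
successes = np.ones(2)          # Beta prior parameter alpha for each arm
failures = np.ones(2)           # Beta prior parameter beta for each arm

for _ in range(1000):
    # sample a plausible reward rate for each arm from its posterior
    sampled_rates = np.random.beta(successes, failures)
    arm = int(np.argmax(sampled_rates))      # pull the arm that looks best under this sample
    reward = int(np.random.random() < true_probs[arm])
    successes[arm] += reward                 # update the posterior with the observed reward
    failures[arm] += 1 - reward

print("Posterior mean reward rate per arm:", successes / (successes + failures))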
Let’s look at some code in Python to experiment with the A/B testing and epsilon-greedy approaches. All the code is available for online execution in my Google Colab Notebook.
We will first create a simple simulator class for the multi-armed bandit problem. It defines a 2-armed bandit, with each arm pull generating a reward of either 0 or 1. We design the simulation such that arm 1 gives maximum rewards. We will draw 100 samples from each arm and plot the distribution of rewards in figure 3 below.
import numpy as np

# simulate the bandit problem
class MABandit_Problem:
    def __init__(self):
        # we build two bandits
        self.samples = {}
        self.num_samples = 1000
        # create samples for both arms
        self.samples[0] = np.random.normal(0, 0.1, self.num_samples)
        self.samples[0] = self.samples[0] < 0.1
        self.samples[1] = np.random.normal(0.2, 0.3, self.num_samples)
        self.samples[1] = self.samples[1] < 0.1
        self.pointer = 0

    # method for acting on the bandits - return reward
    def pull_arm(self, k):
        reward = 0
        # arms should be 0 or 1
        if k < 0 or k > 1:
            return -1
        # get the reward
        reward = self.samples[k][self.pointer]
        # update pointer
        self.pointer += 1
        # rotate the pointer
        if self.pointer >= self.num_samples:
            self.pointer = 0
        return reward
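The notebook also plots the reward distribution for each arm (figure 3 below). That plotting code is not shown in the article, but a minimal sketch with matplotlib, assuming 100 pulls per arm, could look like this.

import matplotlib.pyplot as plt

# pull each arm 100 times and collect the 0/1 rewards
mab = MABandit_Problem()
rewards = {0: [], 1: []}
for _ in range(100):
    for arm in (0, 1):
        rewards[arm].append(int(mab.pull_arm(arm)))

# histogram of observed rewards per arm
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for arm in (0, 1):
    axes[arm].hist(rewards[arm], bins=[-0.5, 0.5, 1.5])
    axes[arm].set_title("Arm %d rewards" % (arm + 1))
plt.show()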
Figure 3: Distribution of rewards for simulation
Next, we will run a test using the A/B testing strategy, where we split traffic evenly between the 2 arms and evaluate the rewards. We measure the two arms by the rate of reward generated. We see that arm 1 gives a better rate (0.826), but to find this we end up running half of the tests through the lower-reward arm 2.
# Classic A/B testing - Pure exploration
class AB_Testing:
    # initialize bandit class
    def __init__(self):
        self.mab = MABandit_Problem()
        self.total_runs = 1000
        self.arm_flag = False
        self.arm_rewards = [0, 0]
        self.arm_counts = [0, 0]
        self.arm_rew_rate = [0, 0]

    # method for acting on the bandits - return reward
    def run_test(self):
        # for each experiment
        for _ in range(self.total_runs):
            # alternate between the two arms to split traffic evenly
            arm = int(self.arm_flag)
            self.arm_flag = not self.arm_flag
            # accumulate reward
            self.arm_rewards[arm] += self.mab.pull_arm(arm)
            self.arm_counts[arm] += 1
        # calculate total rate of reward
        self.arm_rew_rate = np.divide(self.arm_rewards, self.arm_counts)
        print("Number of each arm pulls = ", self.arm_counts)
        print("Rate of reward for each arm = ", self.arm_rew_rate)
        print('-'*100)
        print("Finding: Arm 1 gives maximum reward - we only end up running %d runs through Arm 2"%(self.arm_counts[1]))

# run the test
test = AB_Testing()
test.run_test()
RESULTS:
Number of each arm pulls = [500, 500]
Rate of reward for each arm = [0.826 0.354]
---------------------------------------------------
Finding: Arm 1 gives maximum reward - we only end up running 500 runs through Arm 2
Now we will run the same test using an epsilon-greedy policy. We will explore the arms 20% of the time (epsilon = 0.2), and the rest of the time we will pull the arm with the maximum reward rate so far, that is, exploit it. We see that using this approach we still find that arm 1 yields a better reward rate (0.832), but we only pass 95 data points through the other arm 2.
import random

# Epsilon greedy - explore vs exploit trade-off
class EPS_Greedy_Testing:
    def __init__(self):
        self.mab = MABandit_Problem()
        self.total_runs = 1000
        self.epsilon = 0.2
        self.arm_rewards = [0, 0]
        self.arm_counts = [0, 0]
        self.arm_rew_rate = [0, 0]

    # method for acting on the bandits - return reward
    def run_test(self):
        # for each experiment
        for _ in range(self.total_runs):
            # pull arm following epsilon greedy policy
            if random.random() < self.epsilon:
                # exploration - randomly select arm
                arm = random.randrange(2)
            else:
                # exploitation - choose max reward rate arm
                arm = int(np.argmax(self.arm_rew_rate))
            # add reward
            self.arm_rewards[arm] += self.mab.pull_arm(arm)
            self.arm_counts[arm] += 1
            # update rate of reward (guard against arms not pulled yet)
            self.arm_rew_rate = np.divide(
                self.arm_rewards, self.arm_counts,
                out=np.zeros(2), where=np.array(self.arm_counts) > 0)
        print("Number of each arm pulls = ", self.arm_counts)
        print("Rate of reward for each arm = ", self.arm_rew_rate)
        print('-'*100)
        print("Finding: Arm 1 gives maximum reward - we only end up running %d runs through Arm 2"%(self.arm_counts[1]))

test = EPS_Greedy_Testing()
test.run_test()
RESULTS:
Number of each arm pulls = [905, 95]
Rate of reward for each arm = [0.8320442 0.35789474]
---------------------------------------------------
Finding: Arm 1 gives maximum reward - we only end up running 95 runs through Arm 2
From the above experiments, we see that the epsilon-greedy algorithm finds the better arm with fewer data points being processed through the lower-reward arm. This was a simple example, but the same approach applies when we have many options to choose from, for example multiple ads to show users, multiple website designs, or clinical trials with multiple treatment options.
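To show that the epsilon-greedy logic carries over unchanged to more than two options, here is a hedged sketch of a k-armed version written as a standalone function; the reward probabilities are simulated and the function name is ours, not from the notebook.

import numpy as np

def eps_greedy(true_probs, runs=1000, epsilon=0.2):
    # illustrative k-armed epsilon-greedy; true_probs simulates the environment
    k = len(true_probs)
    rewards, counts = np.zeros(k), np.zeros(k)
    for _ in range(runs):
        if np.random.random() < epsilon:
            arm = np.random.randint(k)        # explore a random option
        else:
            rates = np.divide(rewards, counts, out=np.zeros(k), where=counts > 0)
            arm = int(np.argmax(rates))       # exploit the current best estimate
        reward = int(np.random.random() < true_probs[arm])
        rewards[arm] += reward
        counts[arm] += 1
    return counts, np.divide(rewards, counts, out=np.zeros(k), where=counts > 0)

# example: five website designs with different hidden like-rates
print(eps_greedy([0.8, 0.35, 0.5, 0.2, 0.6]))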
A drawback of multi-armed bandits is that they totally ignore the state of the environment. The state can provide very valuable insights that can help us learn a policy much faster and more effectively. Incorporating some element of state from the problem environment has given rise to a new set of algorithms called Contextual Bandits, shown in figure 4 below. Here, instead of blindly managing the exploration vs. exploitation trade-off, we try to obtain some Context about the environment and use it to guide the action. Context here is different from the State we talked about for a traditional RL problem earlier. Context is just some knowledge that we have of the environment that helps us take action. For example, if website B uses design elements that are known to be preferred by people under the age of 30, then for users under 30 we may choose to "exploit" this Context information and always show website B.
Figure 4: Contextual Bandits
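As a toy illustration of this idea, the sketch below keeps separate reward estimates per context bucket (users under 30 versus 30 and over) and runs epsilon-greedy within each bucket. The context buckets and reward probabilities are invented for illustration and are not from the article's notebook.

import numpy as np

contexts = ["under_30", "30_plus"]
# hypothetical preferences: younger users prefer design B (arm 1), older users design A (arm 0)
true_probs = {"under_30": [0.4, 0.8], "30_plus": [0.7, 0.3]}

epsilon = 0.2
rewards = {c: np.zeros(2) for c in contexts}
counts = {c: np.zeros(2) for c in contexts}

for _ in range(2000):
    ctx = contexts[np.random.randint(2)]       # a user with some context visits the site
    if np.random.random() < epsilon:
        arm = np.random.randint(2)             # explore
    else:
        rates = np.divide(rewards[ctx], counts[ctx],
                          out=np.zeros(2), where=counts[ctx] > 0)
        arm = int(np.argmax(rates))            # exploit the best design for this context
    reward = int(np.random.random() < true_probs[ctx][arm])
    rewards[ctx][arm] += reward
    counts[ctx][arm] += 1

for ctx in contexts:
    print(ctx, "-> mostly shown design:", "B" if np.argmax(counts[ctx]) == 1 else "A")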
Contextual bandits are extensively used in areas like personalized advertising to show relevant ads to consumers. In areas like website design and clinical trials too, having the right Context can go a long way in making our actions more effective. In the website design example, Context is typically modeled as a vector that contains information like user location, browsing history, browser type, etc. For clinical trials, we can have even more granular information like patient age, sex, medical history, etc. There may even be pre-checks and screenings used to generate Context information that helps make informed decisions later.
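As a small example of how such a Context vector might be assembled, the sketch below one-hot encodes a few hypothetical user attributes; real systems would use far richer features, and the field names and vocabularies here are invented.

import numpy as np

# hypothetical raw context for one website visitor
user = {"location": "IN", "browser": "chrome", "age_group": "under_30"}

# fixed vocabularies used for one-hot encoding (illustrative only)
vocab = {
    "location": ["IN", "US", "UK"],
    "browser": ["chrome", "firefox", "safari"],
    "age_group": ["under_30", "30_plus"],
}

def context_vector(user):
    parts = []
    for field, values in vocab.items():
        one_hot = np.zeros(len(values))
        one_hot[values.index(user[field])] = 1.0
        parts.append(one_hot)
    return np.concatenate(parts)

print(context_vector(user))   # an 8-dimensional context vector for this visitor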
Contextual bandits are also being used in recommendation systems, where we predict a list of items to recommend to a particular user. Examples are your product recommendations on Amazon or movie recommendations on Netflix. Traditionally, Netflix has used techniques like Content and Collaborative filtering for recommendations. Content filtering tries to find movies similar to the ones the user has liked, using features like genre, actors, country, release year, etc. Collaborative filtering uses a 'wisdom of the crowd' approach: it finds users similar to the current user and recommends items those users liked, on the assumption that the current user will probably like them too. Content filtering needs a lot of feature engineering, and Collaborative filtering has a cold-start problem: when we have new users or new movies, it's difficult to get good recommendations. More recently, companies have started exploring Contextual and Multi-armed bandits for improving their movie recommendations. Having user Context is an invaluable asset that can help provide more relevant recommendations. This is particularly important when there is a constantly changing user base and item catalogue, as is the case for Netflix or Amazon.
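To make the content-filtering idea concrete, here is a tiny sketch that scores unseen movies by their cosine similarity to a movie the user liked; the movie names and feature vectors are made up for illustration.

import numpy as np

# hypothetical movie feature vectors: [action, comedy, drama, sci-fi]
movies = {
    "Movie A": np.array([1.0, 0.0, 0.2, 0.9]),
    "Movie B": np.array([0.0, 1.0, 0.5, 0.0]),
    "Movie C": np.array([0.9, 0.1, 0.0, 1.0]),
}
liked = movies["Movie A"]   # a movie the user liked

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# rank the remaining movies by similarity to the liked one
for name in ("Movie B", "Movie C"):
    print(name, "similarity:", round(cosine(liked, movies[name]), 3))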
The next generation of intelligent systems will need more than purely data-driven insights to be effective. The need of the hour is to augment Data with Context obtained from history or domain Knowledge. It's time to build learning machines that are Context-aware and can tune their actions based on the environment they work in and the people they are meant to serve.
About the Author
Dattaraj Jagdish Rao is the author of the book “Keras to Kubernetes: The journey of a Machine Learning model to Production”. Dattaraj leads the AI Research Lab at Persistent and is responsible for driving thought leadership in AI/ML across the company. He leads a team that explores state-of-the-art algorithms in Computer Vision, Natural Language Understanding, Probabilistic Programming, Reinforcement Learning, Explainable AI, etc. and demonstrates applicability in Healthcare, Banking and Industrial domains. Earlier, he worked at General Electric (GE) for 19 years building Industrial IoT solutions for Predictive Maintenance, Digital Twins and Machine Vision. He held several Technology Leadership roles at Global Research, GE Power and Transportation (now part of Wabtec). He led the Innovation team out of Bangalore that incubated video track-inspection from an idea into a commercial Product. Dattaraj has 11 patents in Machine Learning and Computer Vision areas. You can connect via Twitter.