A Multi-Armed Bandit problem is a hypothetical experiment where an agent must choose between multiple actions, each with an unknown payout.
A $k$-armed bandit provides $k$ actions, each with an expected or mean reward.

$$q_*(a) \doteq \mathbb{E}\left[R_t \mid A_t = a\right]$$

where $q_*(a)$ is the expected reward (also referred to as the value) when action $a$ is selected, $A_t$ is the action selected at time step $t$, and $R_t$ is the corresponding reward. $Q_t(a)$ is the estimated value of $q_*(a)$ at time step $t$.
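To make the setup concrete, here is a minimal Python sketch of a $k$-armed bandit environment. The class name `KArmedBandit`, the `q_star`/`step` names, and the Gaussian reward noise are illustrative assumptions (a common testbed choice), not part of the definition above.

```python
import numpy as np

class KArmedBandit:
    """A stationary k-armed bandit with Gaussian rewards (illustrative testbed)."""

    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        # q_*(a): the true expected reward of each action, drawn once at construction.
        self.q_star = self.rng.normal(0.0, 1.0, size=k)
        self.k = k

    def step(self, action):
        # R_t: a noisy sample centered on the true value q_*(A_t).
        return self.rng.normal(self.q_star[action], 1.0)
```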
Exploration vs Exploitation
Exploration is the task of trying out new actions to learn their rewards.
Exploitation is the task of preferring actions that appear to have the best rewards.
A good solution to the multi-armed bandit problem should have a good balance of exploration and exploitation.
ε-Greedy Algorithm
The $\epsilon$-Greedy Algorithm involves choosing random exploration with probability $\epsilon$ and choosing exploitation (the greedy action) with probability $1 - \epsilon$.
Incremental Implementation
We can use the average of the rewards received so far for an action as an estimate for its value:

$$Q_n \doteq \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1}$$

Rather than storing every reward, this average can be computed incrementally:

$$Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$$
Pseudocode
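A Python sketch of the $\epsilon$-greedy algorithm with incremental sample-average updates. The `bandit` argument is assumed to expose `k` and `step(action) -> reward`, as in the illustrative environment above; the function name and defaults are likewise illustrative.

```python
import numpy as np

def epsilon_greedy_bandit(bandit, epsilon=0.1, steps=1000, seed=0):
    """Epsilon-greedy action selection with incremental sample-average estimates."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(bandit.k)   # Q(a): estimated value of each action
    N = np.zeros(bandit.k)   # N(a): number of times each action was selected
    rewards = []

    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.integers(bandit.k)    # explore: pick a random action
        else:
            action = int(np.argmax(Q))         # exploit: pick the greedy action
        reward = bandit.step(action)
        N[action] += 1
        # Incremental update: Q_{n+1} = Q_n + (R_n - Q_n) / n
        Q[action] += (reward - Q[action]) / N[action]
        rewards.append(reward)
    return Q, rewards
```

For example, `Q, rewards = epsilon_greedy_bandit(KArmedBandit(k=10), epsilon=0.1)` runs one 1000-step experiment on the sketch environment above.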
Nonstationary Problems
For bandits that are nonstationary (i.e. they change over time), it makes sense to assign higher weight to more recently recorded rewards than to old ones. Thus, we can use a weighted average estimation method by using a defined step size $\alpha \in (0, 1]$:

$$Q_{n+1} = Q_n + \alpha\left[R_n - Q_n\right]$$
If $\alpha_n(a)$ varies from step to step, the variation must conform to the following conditions in order to ensure $Q_n$ converges to $q_*(a)$:

$$\sum_{n=1}^{\infty} \alpha_n(a) = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty$$

For example, the sample-average step size $\alpha_n(a) = \frac{1}{n}$ satisfies both conditions, while a constant $\alpha$ violates the second, which is why a constant step size keeps tracking a nonstationary problem rather than converging.
Pseudocode
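A sketch of the same loop with a constant step size $\alpha$, which weights recent rewards more heavily (an exponential recency-weighted average). The `bandit` interface again follows the illustrative environment above and is an assumption.

```python
import numpy as np

def constant_step_bandit(bandit, epsilon=0.1, alpha=0.1, steps=1000, seed=0):
    """Epsilon-greedy selection with a constant step size, suited to nonstationary bandits."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(bandit.k)   # Q(a): recency-weighted estimate of each action's value

    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.integers(bandit.k)    # explore
        else:
            action = int(np.argmax(Q))         # exploit
        reward = bandit.step(action)
        # Constant step size: Q_{n+1} = Q_n + alpha * (R_n - Q_n)
        Q[action] += alpha * (reward - Q[action])
    return Q
```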
Upper-Confidence-Bound Action Selection
Instead of selecting exploration or exploitation based on probability, it makes sense to explore more at the beginning and reduce exploration over time. Upper-Confidence-Bound Action Selection involves using an additional term that accounts for the uncertainty in the estimated value of $a$:

$$A_t \doteq \operatorname*{argmax}_a \left[ Q_t(a) + c\,\sqrt{\frac{\ln t}{N_t(a)}} \right]$$
where $N_t(a)$ denotes the number of times action $a$ has been selected prior to time $t$ and $c > 0$ controls the degree of exploration.
Pseudocode
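A sketch of UCB action selection with sample-average value estimates. Selecting each untried action once before applying the UCB formula is an assumed tie-breaking choice (it avoids division by zero when $N_t(a) = 0$); the `bandit` interface is the same illustrative one as above.

```python
import numpy as np

def ucb_bandit(bandit, c=2.0, steps=1000):
    """Upper-Confidence-Bound action selection with sample-average estimates."""
    Q = np.zeros(bandit.k)   # Q(a): estimated value of each action
    N = np.zeros(bandit.k)   # N(a): number of times each action was selected

    for t in range(1, steps + 1):
        if np.any(N == 0):
            # Try every action at least once before using the UCB formula.
            action = int(np.argmin(N))
        else:
            # A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]
            ucb = Q + c * np.sqrt(np.log(t) / N)
            action = int(np.argmax(ucb))
        reward = bandit.step(action)
        N[action] += 1
        Q[action] += (reward - Q[action]) / N[action]
    return Q
```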
Gradient Bandit Algorithm
The Gradient Bandit Algorithm involves learning a numerical preference $H_t(a)$ for each action $a$ and using that to determine the current action.
We define $\pi_t(a)$ as the probability of taking action $a$ at time $t$:

$$\pi_t(a) \doteq \Pr\{A_t = a\} \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$$
The learning algorithm for $H_t(a)$ is based on stochastic gradient ascent:

$$H_{t+1}(A_t) \doteq H_t(A_t) + \alpha\left(R_t - \bar{R}_t\right)\left(1 - \pi_t(A_t)\right)$$

$$H_{t+1}(a) \doteq H_t(a) - \alpha\left(R_t - \bar{R}_t\right)\pi_t(a) \qquad \text{for all } a \neq A_t$$

where $\bar{R}_t$ is the average of the rewards received up through time $t$ and serves as a baseline.
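A sketch of the gradient bandit update, again assuming the illustrative `bandit` interface. Using a running average that includes the current reward as the baseline is one common variant; the one-hot trick below applies both preference updates in a single line.

```python
import numpy as np

def gradient_bandit(bandit, alpha=0.1, steps=1000, seed=0):
    """Gradient bandit: learn preferences H(a) and act via a softmax over them."""
    rng = np.random.default_rng(seed)
    H = np.zeros(bandit.k)    # H_t(a): numerical preference for each action
    baseline = 0.0            # running average of all rewards (R-bar)

    for t in range(1, steps + 1):
        # pi_t(a): softmax over preferences (shifted by max for numerical stability)
        pi = np.exp(H - H.max())
        pi /= pi.sum()
        action = rng.choice(bandit.k, p=pi)
        reward = bandit.step(action)
        baseline += (reward - baseline) / t

        # Stochastic gradient ascent:
        #   H(A_t) += alpha * (R_t - R_bar) * (1 - pi(A_t))
        #   H(a)   -= alpha * (R_t - R_bar) * pi(a)        for a != A_t
        one_hot = np.zeros(bandit.k)
        one_hot[action] = 1.0
        H += alpha * (reward - baseline) * (one_hot - pi)
    return H
```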