Multi-Armed Bandits

A Multi-Armed Bandit problem is a hypothetical experiment where an agent must choose between multiple actions, each with an unknown payout.

A k-armed bandit provides k actions, each with an expected (mean) reward:

$$q(a) = \mathbb{E}\left[R_t \mid A_t = a\right]$$

where $q(a)$ is the expected reward (also referred to as the value) when action $a$ is selected, $A_t$ is the action selected at time step $t$, and $R_t$ is the corresponding reward. $Q_t(a)$ is the estimated value of $q(a)$ at time step $t$.

Exploration vs Exploitation

Exploration is the task of trying out new actions to learn their rewards.

Exploitation is the task of preferring actions that appear to have the best rewards.

A good solution to the multi-armed bandit problem should have a good balance of exploration and exploitation.

ϵ-Greedy Algorithm

The ϵ-greedy algorithm chooses a random action (exploration) with probability $\epsilon$ and the action with the highest estimated value (exploitation) with probability $1 - \epsilon$.

Incremental Implementation

We can use the average of the observed rewards $R_i$ as the estimate $Q_{n+1}$, and compute it incrementally:

$$Q_{n+1} = \frac{R_1 + R_2 + \dots + R_n}{n} = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$$
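As a quick sanity check, here is a minimal Python sketch (the reward values are illustrative, not from the text) showing that the incremental update reproduces the plain sample average:

```python
import random

rewards = [random.gauss(1.0, 0.5) for _ in range(100)]  # simulated rewards R_1..R_100

# Incremental estimate: Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
q = 0.0
for n, r in enumerate(rewards, start=1):
    q += (r - q) / n

# The incremental estimate matches the batch sample average.
assert abs(q - sum(rewards) / len(rewards)) < 1e-9
print(q)
```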

Pseudocode

$$
\begin{aligned}
&\text{Initialize, for } a = 0 \text{ to } k-1: \\
&\qquad Q = \text{zeros}(k) && \text{Estimated reward for each action} \\
&\qquad N = \text{zeros}(k) && \text{Number of times each action was chosen} \\
&\text{Loop:} \\
&\qquad A = \begin{cases} \operatorname{argmax}(Q) & \text{with probability } 1-\epsilon \\ \text{rand\_index}(Q) & \text{with probability } \epsilon \end{cases} \\
&\qquad R = \text{bandit}(A) \\
&\qquad N[A] = N[A] + 1 \\
&\qquad Q[A] = Q[A] + \tfrac{1}{N[A]}\left(R - Q[A]\right)
\end{aligned}
$$
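Below is a minimal Python sketch of this loop. The `bandit(a)` environment and its `TRUE_MEANS` are illustrative stand-ins (not part of the pseudocode), simulated here as Gaussian rewards:

```python
import random

TRUE_MEANS = [0.2, 0.5, 0.8]          # hypothetical true values q(a), unknown to the agent
K = len(TRUE_MEANS)

def bandit(a):
    """Return a noisy reward for arm a (stand-in for the real environment)."""
    return random.gauss(TRUE_MEANS[a], 1.0)

def epsilon_greedy(steps=10_000, epsilon=0.1):
    Q = [0.0] * K   # estimated reward for each action
    N = [0] * K     # number of times each action was chosen
    for _ in range(steps):
        if random.random() < epsilon:
            A = random.randrange(K)                    # explore
        else:
            A = max(range(K), key=lambda a: Q[a])      # exploit (greedy)
        R = bandit(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]                      # incremental sample average
    return Q, N

print(epsilon_greedy())
```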

Nonstationary Problems

For bandits that are nonstationary (i.e., their reward distributions change over time), it makes sense to assign higher weight to recently recorded rewards than to old ones. We can do this with a weighted-average estimate that uses a constant step size α:

$$Q_{n+1} = Q_n + \alpha\left[R_n - Q_n\right] = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i$$
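The closed-form expansion can be checked numerically. The sketch below (reward values are illustrative only) iterates the constant-α update and compares the result with the weighted sum:

```python
alpha = 0.1
Q1 = 0.0
rewards = [1.0, 0.0, 2.0, 1.5, 0.5]   # R_1..R_5, arbitrary illustrative values

# Iterate Q_{n+1} = Q_n + alpha * (R_n - Q_n)
q = Q1
for r in rewards:
    q += alpha * (r - q)

# Closed form: (1 - alpha)^n * Q_1 + sum_i alpha * (1 - alpha)^(n - i) * R_i
n = len(rewards)
closed = (1 - alpha) ** n * Q1 + sum(
    alpha * (1 - alpha) ** (n - i) * r for i, r in enumerate(rewards, start=1)
)
assert abs(q - closed) < 1e-12
print(q, closed)
```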

If the step size $\alpha_n$ varies from step to step, it must satisfy the following conditions to guarantee that $Q(a)$ converges to $q(a)$:

$$\sum_{n=1}^{\infty} \alpha_n = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \alpha_n^2 < \infty$$
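As a point of reference (these are the standard stochastic-approximation conditions, not derived here), the sample-average step size satisfies both conditions, while a constant step size satisfies the first but violates the second:

$$\alpha_n = \tfrac{1}{n}: \;\; \sum_{n=1}^{\infty} \tfrac{1}{n} = \infty, \;\; \sum_{n=1}^{\infty} \tfrac{1}{n^2} < \infty \qquad\qquad \alpha_n = \alpha: \;\; \sum_{n=1}^{\infty} \alpha^2 = \infty$$

A constant α therefore never fully converges; instead it keeps adapting to the most recent rewards, which is exactly what we want for nonstationary problems.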

Pseudocode

$$
\begin{aligned}
&\text{Initialize, for } a = 0 \text{ to } k-1: \\
&\qquad Q = \text{zeros}(k) && \text{Estimated reward for each action} \\
&\qquad N = \text{zeros}(k) && \text{Number of times each action was chosen} \\
&\text{Loop:} \\
&\qquad A = \begin{cases} \operatorname{argmax}(Q) & \text{with probability } 1-\epsilon \\ \text{rand\_index}(Q) & \text{with probability } \epsilon \end{cases} \\
&\qquad R = \text{bandit}(A) \\
&\qquad N[A] = N[A] + 1 \\
&\qquad Q[A] = Q[A] + \alpha\left(R - Q[A]\right)
\end{aligned}
$$

Upper-Confidence-Bound Action Selection

Instead of deciding between exploration and exploitation with a fixed probability, it makes sense to explore more at the beginning and reduce exploration over time. Upper-Confidence-Bound (UCB) action selection adds a term to the estimate $Q_t(a)$ that accounts for the uncertainty in that estimate:

$$A_t = \operatorname*{argmax}_a \left[\, Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \,\right]$$

where $N_t(a)$ denotes the number of times action $a$ has been selected prior to time $t$ and $c > 0$ controls the degree of exploration.

Pseudocode

$$
\begin{aligned}
&\text{Initialize, for } a = 0 \text{ to } k-1: \\
&\qquad Q = \text{zeros}(k) && \text{Estimated reward for each action} \\
&\qquad N = \text{zeros}(k) && \text{Number of times each action was chosen} \\
&\text{Loop for } t = 1, 2, \dots: \\
&\qquad A = \operatorname*{argmax}_a \left[\, Q[a] + c \sqrt{\frac{\ln t}{N[a]}} \,\right] \\
&\qquad R = \text{bandit}(A) \\
&\qquad N[A] = N[A] + 1 \\
&\qquad Q[A] = Q[A] + \alpha\left(R - Q[A]\right)
\end{aligned}
$$
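A minimal Python sketch of the UCB loop, reusing the same kind of hypothetical Gaussian `bandit(a)` as in the ϵ-greedy sketch above; untried arms are given an infinite bound so each arm is selected at least once:

```python
import math
import random

TRUE_MEANS = [0.2, 0.5, 0.8]          # hypothetical true values, unknown to the agent
K = len(TRUE_MEANS)

def bandit(a):
    return random.gauss(TRUE_MEANS[a], 1.0)

def ucb(steps=10_000, c=2.0, alpha=0.1):
    Q = [0.0] * K   # estimated reward for each action
    N = [0] * K     # number of times each action was chosen

    def bound(a, t):
        # Untried arms get +inf so every arm is selected at least once.
        if N[a] == 0:
            return float("inf")
        return Q[a] + c * math.sqrt(math.log(t) / N[a])

    for t in range(1, steps + 1):
        A = max(range(K), key=lambda a: bound(a, t))
        R = bandit(A)
        N[A] += 1
        Q[A] += alpha * (R - Q[A])    # constant step size, as in the pseudocode
    return Q, N

print(ucb())
```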

Gradient Bandit Algorithm

The Gradient Bandit Algorithm involves learning a numerical preference $H_t(a) \in \mathbb{R}$ for each action $a$ and using these preferences to determine the current action.

We define $\pi_t(a)$ as the probability of taking action $a$ at time $t$:

$$\pi_t(a) = \Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum_{i=1}^{k} e^{H_t(i)}}$$
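A minimal Python sketch of this softmax; subtracting the maximum preference before exponentiating is a standard numerical-stability trick and is not required by the formula itself:

```python
import math

def softmax(H):
    """Convert preferences H(a) into action probabilities pi(a)."""
    m = max(H)                                   # shift for numerical stability
    exps = [math.exp(h - m) for h in H]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([1.0, 2.0, 0.5]))   # ≈ [0.231, 0.629, 0.140]
```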

The learning rule for $H_t(a)$ is based on stochastic gradient ascent, where $\bar{R}_t$ is the average of all rewards up to and including time $t$ and serves as a baseline:

$$
\begin{aligned}
H_{t+1}(a) &= H_t(a) + \alpha\left(R_t - \bar{R}_t\right)\left(1 - \pi_t(a)\right), && \text{where } a = A_t \\
H_{t+1}(a) &= H_t(a) - \alpha\left(R_t - \bar{R}_t\right)\pi_t(a), && \forall a \neq A_t
\end{aligned}
$$

or equivalently:

$$H_{t+1}(a) = H_t(a) + \alpha\left(R_t - \bar{R}_t\right)\left(\mathbb{1}_{a=A_t} - \pi_t(a)\right), \quad \forall a$$

where $\mathbb{1}_{a=A_t}$ is an indicator function:

$$\mathbb{1}_{a=A_t} = \begin{cases} 1 & \text{if } a = A_t \\ 0 & \text{if } a \neq A_t \end{cases}$$
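Putting the pieces together, here is a minimal Python sketch of the gradient bandit update. The `softmax` helper from the earlier sketch is repeated so the block runs on its own, the baseline $\bar{R}_t$ is maintained as an incremental average of observed rewards, and the `bandit` function and its means are illustrative assumptions:

```python
import math
import random

def softmax(H):
    """Convert preferences H(a) into action probabilities pi(a)."""
    m = max(H)                                    # shift for numerical stability
    exps = [math.exp(h - m) for h in H]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_bandit(bandit, k, steps=10_000, alpha=0.1):
    H = [0.0] * k          # numerical preferences H_t(a)
    avg_reward = 0.0       # baseline: average of all rewards so far
    for t in range(1, steps + 1):
        pi = softmax(H)
        A = random.choices(range(k), weights=pi)[0]   # sample A_t ~ pi_t
        R = bandit(A)
        avg_reward += (R - avg_reward) / t            # baseline includes R_t
        # Preference update: H(a) += alpha * (R - baseline) * (1{a=A_t} - pi(a))
        for a in range(k):
            indicator = 1.0 if a == A else 0.0
            H[a] += alpha * (R - avg_reward) * (indicator - pi[a])
    return H

# Example usage with an illustrative 3-armed Gaussian bandit.
TRUE_MEANS = [0.2, 0.5, 0.8]
print(gradient_bandit(lambda a: random.gauss(TRUE_MEANS[a], 1.0), k=len(TRUE_MEANS)))
```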