Multi-Armed Bandits

A Multi-Armed Bandit problem is a hypothetical experiment where an agent must choose between multiple actions, each with an unknown payout.

A k-armed bandit provides k actions, each with an expected (mean) reward:

$$q(a) = \mathbb{E}\left[R_t \mid A_t = a\right]$$

where $q(a)$ is the expected reward (also referred to as the value) when action $a$ is selected, $A_t$ is the action selected at time step $t$, and $R_t$ is the corresponding reward. $Q_t(a)$ is the estimated value of $q(a)$ at time step $t$.

Exploration vs Exploitation

Exploration is the task of trying out new actions to learn their rewards.

Exploitation is the task of preferring actions that appear to have the best rewards.

A good solution to the multi-armed bandit problem should have a good balance of exploration and exploitation.

ϵ-Greedy Algorithm

The ϵ-greedy algorithm chooses a random action (exploration) with probability $\epsilon$ and the action with the highest estimated value (exploitation) with probability $1 - \epsilon$.

Incremental Implementation

We can use the average of the observed rewards $R_i$ as the estimate $Q_{n+1}$, and compute it incrementally:

$$Q_{n+1} = \frac{R_1 + R_2 + \dots + R_n}{n} = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$$
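As a quick sanity check, here is a minimal Python sketch (the reward values are illustrative, not from the text) showing that the incremental update reproduces the plain sample average:

```python
import random

rewards = [random.gauss(1.0, 0.5) for _ in range(100)]  # simulated rewards R_1..R_100

# Incremental estimate: Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
q = 0.0
for n, r in enumerate(rewards, start=1):
    q += (r - q) / n

# The incremental estimate matches the batch sample average.
assert abs(q - sum(rewards) / len(rewards)) < 1e-9
print(q)
```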

Pseudocode

$$
\begin{aligned}
&\text{Initialize, for } a = 0 \text{ to } k-1: \\
&\qquad Q = \text{zeros}(k) && \text{Estimated reward for each action} \\
&\qquad N = \text{zeros}(k) && \text{Number of times each action was chosen} \\
&\text{Loop:} \\
&\qquad A = \begin{cases} \operatorname{argmax}(Q) & \text{with probability } 1-\epsilon \\ \text{rand\_index}(Q) & \text{with probability } \epsilon \end{cases} \\
&\qquad R = \text{bandit}(A) \\
&\qquad N[A] = N[A] + 1 \\
&\qquad Q[A] = Q[A] + \tfrac{1}{N[A]}\left(R - Q[A]\right)
\end{aligned}
$$
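Below is a minimal Python sketch of this loop. The `bandit(a)` environment and its `TRUE_MEANS` are illustrative stand-ins (not part of the pseudocode), simulated here as Gaussian rewards:

```python
import random

TRUE_MEANS = [0.2, 0.5, 0.8]          # hypothetical true values q(a), unknown to the agent
K = len(TRUE_MEANS)

def bandit(a):
    """Return a noisy reward for arm a (stand-in for the real environment)."""
    return random.gauss(TRUE_MEANS[a], 1.0)

def epsilon_greedy(steps=10_000, epsilon=0.1):
    Q = [0.0] * K   # estimated reward for each action
    N = [0] * K     # number of times each action was chosen
    for _ in range(steps):
        if random.random() < epsilon:
            A = random.randrange(K)                    # explore
        else:
            A = max(range(K), key=lambda a: Q[a])      # exploit (greedy)
        R = bandit(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]                      # incremental sample average
    return Q, N

print(epsilon_greedy())
```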

Nonstationary Problems

For bandits that are nonstationary (i.e., their reward distributions change over time), it makes sense to assign higher weight to recently recorded rewards than to old ones. We can do this with a weighted-average estimate that uses a constant step size α:

$$Q_{n+1} = Q_n + \alpha\left[R_n - Q_n\right] = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i$$
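The closed-form expansion can be checked numerically. The sketch below (reward values are illustrative only) iterates the constant-α update and compares the result with the weighted sum:

```python
alpha = 0.1
Q1 = 0.0
rewards = [1.0, 0.0, 2.0, 1.5, 0.5]   # R_1..R_5, arbitrary illustrative values

# Iterate Q_{n+1} = Q_n + alpha * (R_n - Q_n)
q = Q1
for r in rewards:
    q += alpha * (r - q)

# Closed form: (1 - alpha)^n * Q_1 + sum_i alpha * (1 - alpha)^(n - i) * R_i
n = len(rewards)
closed = (1 - alpha) ** n * Q1 + sum(
    alpha * (1 - alpha) ** (n - i) * r for i, r in enumerate(rewards, start=1)
)
assert abs(q - closed) < 1e-12
print(q, closed)
```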

If the step size $\alpha_n$ varies from step to step, it must satisfy the following conditions to guarantee that $Q(a)$ converges to $q(a)$:

$$\sum_{n=1}^{\infty} \alpha_n = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \alpha_n^2 < \infty$$
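As a point of reference (these are the standard stochastic-approximation conditions, not derived here), the sample-average step size satisfies both conditions, while a constant step size satisfies the first but violates the second:

$$\alpha_n = \tfrac{1}{n}: \;\; \sum_{n=1}^{\infty} \tfrac{1}{n} = \infty, \;\; \sum_{n=1}^{\infty} \tfrac{1}{n^2} < \infty \qquad\qquad \alpha_n = \alpha: \;\; \sum_{n=1}^{\infty} \alpha^2 = \infty$$

A constant α therefore never fully converges; instead it keeps adapting to the most recent rewards, which is exactly what we want for nonstationary problems.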

Pseudocode

$$
\begin{aligned}
&\text{Initialize, for } a = 0 \text{ to } k-1: \\
&\qquad Q = \text{zeros}(k) && \text{Estimated reward for each action} \\
&\qquad N = \text{zeros}(k) && \text{Number of times each action was chosen} \\
&\text{Loop:} \\
&\qquad A = \begin{cases} \operatorname{argmax}(Q) & \text{with probability } 1-\epsilon \\ \text{rand\_index}(Q) & \text{with probability } \epsilon \end{cases} \\
&\qquad R = \text{bandit}(A) \\
&\qquad N[A] = N[A] + 1 \\
&\qquad Q[A] = Q[A] + \alpha\left(R - Q[A]\right)
\end{aligned}
$$

Upper-Confidence-Bound Action Selection

Instead of deciding between exploration and exploitation with a fixed probability, it makes sense to explore more at the beginning and reduce exploration over time. Upper-Confidence-Bound (UCB) action selection adds a term to the estimate $Q_t(a)$ that accounts for the uncertainty in that estimate:

$$A_t = \operatorname*{argmax}_a \left[\, Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \,\right]$$

where $N_t(a)$ denotes the number of times action $a$ has been selected prior to time $t$ and $c > 0$ controls the degree of exploration.

Pseudocode

$$
\begin{aligned}
&\text{Initialize, for } a = 0 \text{ to } k-1: \\
&\qquad Q = \text{zeros}(k) && \text{Estimated reward for each action} \\
&\qquad N = \text{zeros}(k) && \text{Number of times each action was chosen} \\
&\text{Loop for } t = 1, 2, \dots: \\
&\qquad A = \operatorname*{argmax}_a \left[\, Q[a] + c \sqrt{\frac{\ln t}{N[a]}} \,\right] \\
&\qquad R = \text{bandit}(A) \\
&\qquad N[A] = N[A] + 1 \\
&\qquad Q[A] = Q[A] + \alpha\left(R - Q[A]\right)
\end{aligned}
$$
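A minimal Python sketch of the UCB loop, reusing the same kind of hypothetical Gaussian `bandit(a)` as in the ϵ-greedy sketch above; untried arms are given an infinite bound so each arm is selected at least once:

```python
import math
import random

TRUE_MEANS = [0.2, 0.5, 0.8]          # hypothetical true values, unknown to the agent
K = len(TRUE_MEANS)

def bandit(a):
    return random.gauss(TRUE_MEANS[a], 1.0)

def ucb(steps=10_000, c=2.0, alpha=0.1):
    Q = [0.0] * K   # estimated reward for each action
    N = [0] * K     # number of times each action was chosen

    def bound(a, t):
        # Untried arms get +inf so every arm is selected at least once.
        if N[a] == 0:
            return float("inf")
        return Q[a] + c * math.sqrt(math.log(t) / N[a])

    for t in range(1, steps + 1):
        A = max(range(K), key=lambda a: bound(a, t))
        R = bandit(A)
        N[A] += 1
        Q[A] += alpha * (R - Q[A])    # constant step size, as in the pseudocode
    return Q, N

print(ucb())
```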

Gradient Bandit Algorithm

The Gradient Bandit Algorithm involves learning a numerical preference $H_t(a) \in \mathbb{R}$ for each action $a$ and using these preferences to determine the current action.

We define $\pi_t(a)$ as the probability of taking action $a$ at time $t$:

$$\pi_t(a) = \Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum_{i=1}^{k} e^{H_t(i)}}$$
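A minimal Python sketch of this softmax; subtracting the maximum preference before exponentiating is a standard numerical-stability trick and is not required by the formula itself:

```python
import math

def softmax(H):
    """Convert preferences H(a) into action probabilities pi(a)."""
    m = max(H)                                   # shift for numerical stability
    exps = [math.exp(h - m) for h in H]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([1.0, 2.0, 0.5]))   # ≈ [0.231, 0.629, 0.140]
```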

The learning rule for $H_t(a)$ is based on stochastic gradient ascent, where $\bar{R}_t$ is the average of all rewards up to and including time $t$ and serves as a baseline:

$$
\begin{aligned}
H_{t+1}(a) &= H_t(a) + \alpha\left(R_t - \bar{R}_t\right)\left(1 - \pi_t(a)\right), && \text{where } a = A_t \\
H_{t+1}(a) &= H_t(a) - \alpha\left(R_t - \bar{R}_t\right)\pi_t(a), && \forall a \neq A_t
\end{aligned}
$$

or equivalently:

$$H_{t+1}(a) = H_t(a) + \alpha\left(R_t - \bar{R}_t\right)\left(\mathbb{1}_{a=A_t} - \pi_t(a)\right), \quad \forall a$$

where $\mathbb{1}_{a=A_t}$ is an indicator function:

$$\mathbb{1}_{a=A_t} = \begin{cases} 1 & \text{if } a = A_t \\ 0 & \text{if } a \neq A_t \end{cases}$$
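Putting the pieces together, here is a minimal Python sketch of the gradient bandit update. The `softmax` helper from the earlier sketch is repeated so the block runs on its own, the baseline $\bar{R}_t$ is maintained as an incremental average of observed rewards, and the `bandit` function and its means are illustrative assumptions:

```python
import math
import random

def softmax(H):
    """Convert preferences H(a) into action probabilities pi(a)."""
    m = max(H)                                    # shift for numerical stability
    exps = [math.exp(h - m) for h in H]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_bandit(bandit, k, steps=10_000, alpha=0.1):
    H = [0.0] * k          # numerical preferences H_t(a)
    avg_reward = 0.0       # baseline: average of all rewards so far
    for t in range(1, steps + 1):
        pi = softmax(H)
        A = random.choices(range(k), weights=pi)[0]   # sample A_t ~ pi_t
        R = bandit(A)
        avg_reward += (R - avg_reward) / t            # baseline includes R_t
        # Preference update: H(a) += alpha * (R - baseline) * (1{a=A_t} - pi(a))
        for a in range(k):
            indicator = 1.0 if a == A else 0.0
            H[a] += alpha * (R - avg_reward) * (indicator - pi[a])
    return H

# Example usage with an illustrative 3-armed Gaussian bandit.
TRUE_MEANS = [0.2, 0.5, 0.8]
print(gradient_bandit(lambda a: random.gauss(TRUE_MEANS[a], 1.0), k=len(TRUE_MEANS)))
```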