
# Overview

📗 Readings: RL Chapter 3.
📗 Wikipedia page: Link

# Multi-Armed Bandit

📗 Multi-Armed Bandits are MDPs with no states (or a single state) and random rewards, \(r_{t} \sim D\left(\mu = R\left(a_{t}\right)\right)\).
📗 There is a trade-off between exploration (trying arms to learn their average rewards) and exploitation (picking the arm that currently looks best).
📗 A no-regret algorithm is one where the average regret, \(\dfrac{1}{T} \displaystyle\sum_{t=1}^{T} \left(R\left(a^\star\right) - R\left(a_{t}\right)\right)\), the average loss of reward from the chosen actions \(a_{t}\) compared to always playing the optimal action \(a^\star\), converges to \(0\) as \(T \to \infty\).
➭ Epsilon greedy: at \(t\), pick a random action with probability \(\varepsilon\), and with probability \(1 - \varepsilon\) pick the empirically best action \(\hat{a} = \mathop{\mathrm{argmax}}_{a} \mu_{t}\left(a\right)\), where \(\mu_{t}\left(a\right) = \dfrac{1}{n_{t}\left(a\right)} \displaystyle\sum_{s \leq t} r_{s} 1_{\left\{a_{s} = a\right\}}\) and \(n_{t}\left(a\right) = \displaystyle\sum_{s \leq t} 1_{\left\{a_{s} = a\right\}}\).
➭ Epsilon first: in the first \(\varepsilon\) fraction of periods, pick a random action, and after that, pick the empirically best action at \(t\).
➭ Upper Confidence Bound (UCB): always pick the optimistically best action: \(\hat{a} = \mathop{\mathrm{argmax}}_{a} \mu_{t}\left(a\right) + c \sqrt{\dfrac{\log\left(t\right)}{n_{t}\left(a\right)}}\).
➭ Exponential-weight for Exploration and Exploitation (EXP3): pick action \(a\) with probability \(p_{t}\left(a\right) = \left(1 - \gamma\right) \dfrac{w_{t}\left(a\right)}{\displaystyle\sum_{a} w_{t}\left(a\right)} + \dfrac{\gamma}{\left| A \right|}\) and update the weights by \(w_{t+1}\left(a_{t}\right) = w_{t}\left(a_{t}\right) e^{\gamma \dfrac{r_{t}}{p_{t}\left(a_{t}\right) \left| A \right|}}\).
📗 UCB and EXP3 are no-regret algorithms. EXP3 is also no-regret against adversarial bandits (where the choice of \(r_{t}\) is made by an adversary): this will be useful in game theory. Minimal simulations of these strategies are sketched below.
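
The epsilon greedy and UCB rules can be tested on a simulated stochastic bandit. The sketch below is a minimal illustration, assuming three arms with Gaussian rewards around fixed means; the means, noise level, horizon \(T\), and the constants \(\varepsilon\) and \(c\) are made-up values, not from the notes. It reports the average regret relative to always playing the best arm, which should shrink as \(T\) grows for a no-regret algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stationary bandit: fixed means and Gaussian noise (assumptions).
true_means = np.array([0.2, 0.5, 0.8])
sigma = 0.1
T = 10_000


def pull(a):
    """Sample a reward r_t ~ N(R(a), sigma^2) for arm a."""
    return rng.normal(true_means[a], sigma)


def epsilon_greedy(eps=0.1):
    """With probability eps pick a random arm, otherwise the empirically best arm."""
    k = len(true_means)
    counts = np.zeros(k)   # n_t(a): number of times each arm was pulled
    sums = np.zeros(k)     # running sum of rewards for each arm
    total = 0.0
    for t in range(T):
        if counts.min() == 0 or rng.random() < eps:
            a = int(rng.integers(k))                  # explore
        else:
            a = int(np.argmax(sums / counts))         # exploit: argmax_a mu_t(a)
        r = pull(a)
        counts[a] += 1
        sums[a] += r
        total += r
    return total


def ucb(c=1.0):
    """Pick the optimistically best arm: argmax_a mu_t(a) + c * sqrt(log(t) / n_t(a))."""
    k = len(true_means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    total = 0.0
    for t in range(1, T + 1):
        if counts.min() == 0:
            a = int(np.argmin(counts))                # pull every arm once first
        else:
            a = int(np.argmax(sums / counts + c * np.sqrt(np.log(t) / counts)))
        r = pull(a)
        counts[a] += 1
        sums[a] += r
        total += r
    return total


best_total = true_means.max() * T   # expected reward of always playing a*
for name, total in [("epsilon greedy", epsilon_greedy()), ("UCB", ucb())]:
    print(f"{name}: average regret = {(best_total - total) / T:.4f}")
```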
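
EXP3 can be sketched the same way. Here the rewards are Bernoulli draws in \(\left\{0, 1\right\}\) only for illustration; the update itself makes no assumption about how \(r_{t}\) is generated, which is why it also applies to adversarial bandits. The means and \(\gamma\) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

true_means = np.array([0.2, 0.5, 0.8])   # made-up means; rewards stay in [0, 1]
T = 10_000


def exp3(gamma=0.1):
    """EXP3: sample arms from a mixture of the weight distribution and uniform."""
    k = len(true_means)
    w = np.ones(k)          # w_t(a)
    total = 0.0
    for t in range(T):
        p = (1 - gamma) * w / w.sum() + gamma / k        # p_t(a)
        a = int(rng.choice(k, p=p))
        r = float(rng.random() < true_means[a])          # Bernoulli reward with mean R(a)
        total += r
        # importance-weighted exponential update for the pulled arm only
        w[a] *= np.exp(gamma * r / (p[a] * k))
        w /= w.max()        # rescaling does not change p_t; it avoids overflow
    return total


total = exp3()
print(f"EXP3: average regret = {(true_means.max() * T - total) / T:.4f}")
```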

# Boxes



📗 The arms have fixed but possibly different average rewards between 0 and 1. Click on one of them to view a random realization of the reward from the arm (i.e., a random number drawn from a distribution with that fixed average). The goal is to maximize the total reward (or minimize the regret) given a fixed number of arm pulls. Which arm has the largest mean reward? A text-only simulation of this demo is sketched after the settings list below.



📗 Output:
➭ Number of pulls left:
➭ Current total reward:
➭ Number of pulls for each arm:
➭ Current means:

📗 Settings:
➭ Number of arms:
➭ Number of pulls:
➭ Standard deviation:
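
Below is a text-only sketch of the demo, assuming Gaussian noise around each arm's fixed average (a guess based on the standard deviation setting above); the default settings and prompts are illustrative, not the page's actual defaults.

```python
import numpy as np

rng = np.random.default_rng()

# Settings mirroring the demo; these default values are assumptions.
n_arms = 5
n_pulls = 20
std_dev = 0.1

# Each arm gets a fixed but unknown average reward between 0 and 1.
true_means = rng.uniform(0.0, 1.0, size=n_arms)

counts = np.zeros(n_arms, dtype=int)   # number of pulls for each arm
sums = np.zeros(n_arms)                # running sum of rewards per arm
total_reward = 0.0

for pulls_left in range(n_pulls, 0, -1):
    a = int(input(f"{pulls_left} pulls left, pick an arm (0-{n_arms - 1}): "))
    r = rng.normal(true_means[a], std_dev)   # random realization around the fixed average
    counts[a] += 1
    sums[a] += r
    total_reward += r
    means = np.divide(sums, counts, out=np.zeros(n_arms), where=counts > 0)
    print(f"reward {r:.3f} | total {total_reward:.3f} | "
          f"pulls {counts.tolist()} | current means {np.round(means, 3).tolist()}")

best = true_means.max()
print(f"best arm was {int(true_means.argmax())} (mean {best:.3f}); "
      f"regret = {n_pulls * best - total_reward:.3f}")
```

With only a few pulls per arm the empirical means can still point to the wrong arm, which is exactly the exploration-exploitation trade-off described above.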

Last Updated: May 07, 2024 at 12:22 AM