
# Overview

📗 Readings: RL Chapter 3.
📗 Wikipedia page: Link

# Multi-Armed Bandit

📗 Multi-Armed Bandits are MDPs with no states (or a single state) and random rewards, \(r_{t} \sim D\left(\mu = R\left(a_{t}\right)\right)\).
📗 There is a trade-off between exploration (trying arms to learn their average rewards) and exploitation (picking the arm that currently looks best).
📗 A no-regret algorithm is one where the average regret, \(\dfrac{1}{T} \displaystyle\sum_{t=1}^{T} \left(R\left(a^\star\right) - R\left(a_{t}\right)\right)\), the average loss of reward from the chosen actions \(a_{t}\) compared to always playing the optimal action \(a^\star\), converges to \(0\) as \(T \to \infty\).
➭ Epsilon greedy: at \(t\), pick a random action with probability \(\varepsilon\), and with probability \(1 - \varepsilon\) pick the empirically best action \(\hat{a} = \mathop{\mathrm{argmax}}_{a} \mu_{t}\left(a\right)\), where \(\mu_{t}\left(a\right) = \dfrac{1}{n_{t}\left(a\right)} \displaystyle\sum_{s \leq t} r_{s} 1_{\left\{a_{s} = a\right\}}\) and \(n_{t}\left(a\right) = \displaystyle\sum_{s \leq t} 1_{\left\{a_{s} = a\right\}}\).
➭ Epsilon first: in the first \(\varepsilon\) fraction of periods, pick a random action, and after that, pick the empirically best action at \(t\).
➭ Upper Confidence Bound (UCB): always pick the optimistically best action: \(\hat{a} = \mathop{\mathrm{argmax}}_{a} \mu_{t}\left(a\right) + c \sqrt{\dfrac{\log\left(t\right)}{n_{t}\left(a\right)}}\).
➭ Exponential-weight for Exploration and Exploitation (EXP3): pick action \(a\) with probability \(p_{t}\left(a\right) = \left(1 - \gamma\right) \dfrac{w_{t}\left(a\right)}{\displaystyle\sum_{a} w_{t}\left(a\right)} + \dfrac{\gamma}{\left| A \right|}\) and update the weights by \(w_{t+1}\left(a_{t}\right) = w_{t}\left(a_{t}\right) e^{\gamma \dfrac{r_{t}}{p_{t}\left(a_{t}\right) \left| A \right|}}\).
📗 UCB and EXP3 are no-regret algorithms. EXP3 is also no-regret against adversarial bandits (where the choice of \(r_{t}\) is made by an adversary): this will be useful in game theory. Minimal simulations of these strategies are sketched below.
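
The epsilon greedy and UCB rules can be tested on a simulated stochastic bandit. The sketch below is a minimal illustration, assuming three arms with Gaussian rewards around fixed means; the means, noise level, horizon \(T\), and the constants \(\varepsilon\) and \(c\) are made-up values, not from the notes. It reports the average regret relative to always playing the best arm, which should shrink as \(T\) grows for a no-regret algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stationary bandit: fixed means and Gaussian noise (assumptions).
true_means = np.array([0.2, 0.5, 0.8])
sigma = 0.1
T = 10_000


def pull(a):
    """Sample a reward r_t ~ N(R(a), sigma^2) for arm a."""
    return rng.normal(true_means[a], sigma)


def epsilon_greedy(eps=0.1):
    """With probability eps pick a random arm, otherwise the empirically best arm."""
    k = len(true_means)
    counts = np.zeros(k)   # n_t(a): number of times each arm was pulled
    sums = np.zeros(k)     # running sum of rewards for each arm
    total = 0.0
    for t in range(T):
        if counts.min() == 0 or rng.random() < eps:
            a = int(rng.integers(k))                  # explore
        else:
            a = int(np.argmax(sums / counts))         # exploit: argmax_a mu_t(a)
        r = pull(a)
        counts[a] += 1
        sums[a] += r
        total += r
    return total


def ucb(c=1.0):
    """Pick the optimistically best arm: argmax_a mu_t(a) + c * sqrt(log(t) / n_t(a))."""
    k = len(true_means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    total = 0.0
    for t in range(1, T + 1):
        if counts.min() == 0:
            a = int(np.argmin(counts))                # pull every arm once first
        else:
            a = int(np.argmax(sums / counts + c * np.sqrt(np.log(t) / counts)))
        r = pull(a)
        counts[a] += 1
        sums[a] += r
        total += r
    return total


best_total = true_means.max() * T   # expected reward of always playing a*
for name, total in [("epsilon greedy", epsilon_greedy()), ("UCB", ucb())]:
    print(f"{name}: average regret = {(best_total - total) / T:.4f}")
```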
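
EXP3 can be sketched the same way. Here the rewards are Bernoulli draws in \(\left\{0, 1\right\}\) only for illustration; the update itself makes no assumption about how \(r_{t}\) is generated, which is why it also applies to adversarial bandits. The means and \(\gamma\) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

true_means = np.array([0.2, 0.5, 0.8])   # made-up means; rewards stay in [0, 1]
T = 10_000


def exp3(gamma=0.1):
    """EXP3: sample arms from a mixture of the weight distribution and uniform."""
    k = len(true_means)
    w = np.ones(k)          # w_t(a)
    total = 0.0
    for t in range(T):
        p = (1 - gamma) * w / w.sum() + gamma / k        # p_t(a)
        a = int(rng.choice(k, p=p))
        r = float(rng.random() < true_means[a])          # Bernoulli reward with mean R(a)
        total += r
        # importance-weighted exponential update for the pulled arm only
        w[a] *= np.exp(gamma * r / (p[a] * k))
        w /= w.max()        # rescaling does not change p_t; it avoids overflow
    return total


total = exp3()
print(f"EXP3: average regret = {(true_means.max() * T - total) / T:.4f}")
```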

# Boxes



📗 The arms have fixed but possibly different average rewards between 0 and 1. Click on one of them to view a random realization of the reward from the arm (i.e., a random number drawn from a distribution with that fixed average). The goal is to maximize the total reward (or minimize the regret) given a fixed number of arm pulls. Which arm has the largest mean reward? A text-only simulation of this demo is sketched after the settings list below.



📗 Output:
➭ Number of pulls left:
➭ Current total reward:
➭ Number of pulls for each arm:
➭ Current means:

📗 Settings:
➭ Number of arms:
➭ Number of pulls:
➭ Standard deviation:
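
Below is a text-only sketch of the demo, assuming Gaussian noise around each arm's fixed average (a guess based on the standard deviation setting above); the default settings and prompts are illustrative, not the page's actual defaults.

```python
import numpy as np

rng = np.random.default_rng()

# Settings mirroring the demo; these default values are assumptions.
n_arms = 5
n_pulls = 20
std_dev = 0.1

# Each arm gets a fixed but unknown average reward between 0 and 1.
true_means = rng.uniform(0.0, 1.0, size=n_arms)

counts = np.zeros(n_arms, dtype=int)   # number of pulls for each arm
sums = np.zeros(n_arms)                # running sum of rewards per arm
total_reward = 0.0

for pulls_left in range(n_pulls, 0, -1):
    a = int(input(f"{pulls_left} pulls left, pick an arm (0-{n_arms - 1}): "))
    r = rng.normal(true_means[a], std_dev)   # random realization around the fixed average
    counts[a] += 1
    sums[a] += r
    total_reward += r
    means = np.divide(sums, counts, out=np.zeros(n_arms), where=counts > 0)
    print(f"reward {r:.3f} | total {total_reward:.3f} | "
          f"pulls {counts.tolist()} | current means {np.round(means, 3).tolist()}")

best = true_means.max()
print(f"best arm was {int(true_means.argmax())} (mean {best:.3f}); "
      f"regret = {n_pulls * best - total_reward:.3f}")
```

With only a few pulls per arm the empirical means can still point to the wrong arm, which is exactly the exploration-exploitation trade-off described above.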

Last Updated: May 07, 2024 at 12:22 AM