# Multi-Armed Bandit
📗 Multi-Armed Bandits are MDPs with no states (or a single state) and random rewards \(r_{t} \sim D\left(\mu = R\left(a_{t}\right)\right)\), i.e. each reward is drawn from a distribution \(D\) whose mean is \(R\left(a_{t}\right)\), the expected reward of the chosen action.
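📗 As a concrete setup for the code sketches in this section, the following is a minimal Bernoulli bandit in Python, where the reward distribution \(D\) is Bernoulli with mean \(R\left(a\right)\); the class name, arm means, and seed are illustrative assumptions, not part of the notes.

```python
import numpy as np

class BernoulliBandit:
    """Stateless bandit: each arm a has a fixed mean R(a); pulling arm a
    returns a random reward r_t ~ Bernoulli(R(a))."""

    def __init__(self, means, seed=0):
        self.means = np.asarray(means, dtype=float)  # R(a) for each arm a
        self.rng = np.random.default_rng(seed)

    @property
    def n_arms(self):
        return len(self.means)

    def pull(self, a):
        # Reward drawn from a distribution whose mean is R(a).
        return float(self.rng.random() < self.means[a])
```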
📗 There is a trade-off between exploration (trying actions to estimate their rewards) and exploitation (picking the action that currently looks best).
📗 A no-regret algorithm is one whose average regret, the average loss of reward from the sequence of actions \(a_{t}\) compared to always playing the optimal action \(a^\star\), \(\dfrac{1}{T} \displaystyle\sum_{t=1}^{T} \left(R\left(a^\star\right) - R\left(a_{t}\right)\right)\), goes to \(0\) as \(T \to \infty\).
➩ Epsilon greedy: at \(t\), pick a random action with probability \(\varepsilon\), and otherwise pick the empirically best action \(\hat{a} = \mathop{\mathrm{argmax}}_{a} \mu_{t}\left(a\right)\), where \(\mu_{t}\left(a\right) = \dfrac{1}{n_{t}\left(a\right)} \displaystyle\sum_{s \leq t} r_{s} 1_{\left\{a_{s} = a\right\}}\) and \(n_{t}\left(a\right) = \displaystyle\sum_{s \leq t} 1_{\left\{a_{s} = a\right\}}\) (see the code sketches after this list).
➩ Epsilon first: pick a random action in the first \(\varepsilon\) fraction of periods, and after that always pick the empirically best action \(\hat{a}\).
➩ Upper Confidence Bound (UCB): always pick the optimistically best action: \(\hat{a} = \mathop{\mathrm{argmax}}_{a} \mu_{t}\left(a\right) + c \sqrt{\dfrac{\log\left(t\right)}{n_{t}\left(a\right)}}\).
➩ Exponential-weight for Exploration and Exploitation (EXP3): pick action \(a\) with probability \(p_{t}\left(a\right) = \left(1 - \gamma\right) \dfrac{w_{t}\left(a\right)}{\displaystyle\sum_{a} w_{t}\left(a\right)} + \dfrac{\gamma}{\left| A \right|}\) and update the weights by \(w_{t+1}\left(a_{t}\right) = w_{t}\left(a_{t}\right) e^{\gamma \dfrac{r_{t}}{p_{t}\left(a_{t}\right) \left| A \right|}}\).
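📗 A minimal sketch of epsilon greedy against the hypothetical `BernoulliBandit` above; forcing a random pull while some arm is untried, and the incremental-mean update, are implementation choices rather than part of the notes. Epsilon first differs only in that all exploration happens during the first \(\lceil \varepsilon T \rceil\) rounds instead of being spread over time.

```python
import numpy as np

def epsilon_greedy(bandit, T, eps=0.1, seed=1):
    """Epsilon greedy: with probability eps pick a random arm,
    otherwise pick the arm with the highest empirical mean mu_t(a)."""
    rng = np.random.default_rng(seed)
    n = np.zeros(bandit.n_arms)    # n_t(a): number of pulls of arm a
    mu = np.zeros(bandit.n_arms)   # mu_t(a): empirical mean reward of arm a
    rewards = []
    for t in range(T):
        if rng.random() < eps or n.min() == 0:
            a = int(rng.integers(bandit.n_arms))   # explore uniformly at random
        else:
            a = int(np.argmax(mu))                 # exploit the empirically best arm
        r = bandit.pull(a)
        n[a] += 1
        mu[a] += (r - mu[a]) / n[a]                # incremental update of the mean
        rewards.append(r)
    return np.array(rewards)
```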
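📗 A sketch of UCB under the same assumptions; pulling every arm once before applying the bound, and the choice \(c = \sqrt{2}\), are common conventions rather than part of the formula above.

```python
import numpy as np

def ucb(bandit, T, c=np.sqrt(2)):
    """UCB: pick the arm maximizing mu_t(a) + c * sqrt(log(t) / n_t(a))."""
    n = np.zeros(bandit.n_arms)
    mu = np.zeros(bandit.n_arms)
    rewards = []
    for t in range(1, T + 1):
        if n.min() == 0:
            a = int(np.argmin(n))                  # pull each arm once first
        else:
            bonus = c * np.sqrt(np.log(t) / n)     # optimism bonus shrinks as n_t(a) grows
            a = int(np.argmax(mu + bonus))         # optimistically best arm
        r = bandit.pull(a)
        n[a] += 1
        mu[a] += (r - mu[a]) / n[a]
        rewards.append(r)
    return np.array(rewards)
```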
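📗 A sketch of EXP3, assuming rewards in \(\left[0, 1\right]\) (as with the Bernoulli arms above); the rescaling of the weights is only for numerical stability and does not change \(p_{t}\).

```python
import numpy as np

def exp3(bandit, T, gamma=0.1, seed=3):
    """EXP3: sample arm a with p_t(a) = (1 - gamma) w_t(a) / sum_a w_t(a) + gamma / |A|,
    then update w_{t+1}(a_t) = w_t(a_t) * exp(gamma * r_t / (p_t(a_t) * |A|))."""
    rng = np.random.default_rng(seed)
    K = bandit.n_arms                               # |A|, the number of arms
    w = np.ones(K)                                  # weights w_t(a), initially uniform
    rewards = []
    for t in range(T):
        p = (1 - gamma) * w / w.sum() + gamma / K   # mix weights with uniform exploration
        a = int(rng.choice(K, p=p))
        r = bandit.pull(a)                          # reward assumed to lie in [0, 1]
        w[a] *= np.exp(gamma * r / (p[a] * K))      # exponential weight update
        w /= w.max()                                # rescale to avoid overflow (p unchanged)
        rewards.append(r)
    return np.array(rewards)
```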
📗 UCB and EXP3 are no-regret algorithms. EXP3 is also no-regret against adversarial bandits (where the rewards \(r_{t}\) are chosen by an adversary rather than drawn from a fixed distribution): this will be useful in game theory.
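📗 A quick illustrative check of the no-regret behavior using the hypothetical helpers above: the reward-based estimate of the average regret relative to the best arm should shrink as the horizon \(T\) grows; the arm means are made up for the example.

```python
means = [0.2, 0.5, 0.7]                  # hypothetical arm means R(a)
best = max(means)                        # R(a*), the mean of the optimal arm
for T in (1_000, 10_000, 100_000):
    rewards = ucb(BernoulliBandit(means), T)
    avg_regret = best - rewards.mean()   # approximates (1/T) * sum_t (R(a*) - R(a_t))
    print(T, round(avg_regret, 3))
```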