# Warning: this is a draft and will be updated one day before the lecture.
📗 A simple reinforcement learning problem where the state does not change is called the multi-armed bandit: Wikipedia.
➩ Multi-armed bandits.
➩ Clinical trials.
➩ Stock selection.
📗 There is a set of actions \(1, 2, ..., K\), and the reward from action \(k\) follows some distribution with mean \(\mu_{k}\), for example a normal distribution with mean \(\mu_{k}\) and fixed variance \(\sigma^{2}\), that is, \(r \sim N\left(\mu_{k}, \sigma^{2}\right)\).
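📗 The agent's goal is to maximize the total reward from choosing one action in each of \(T\) rounds.
📗 The regret is the difference between the expected total reward from always taking the best action, \(T \displaystyle\max_{k} \mu_{k}\), and the total reward the agent actually collects. An algorithm is called no-regret if, as \(T \to \infty\), the average regret per round approaches \(0\) with probability \(1\): Wikipedia.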
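📗 The setup above can be simulated with a short sketch. The mean rewards, the standard deviation, and the helper names `pull` and `expected_regret` are hypothetical choices for illustration. Regret is measured here in expectation, as \(T \displaystyle\max_{k} \mu_{k}\) minus the sum of the means of the actions taken, which is one common definition and an assumption of this example.

```python
import random

def pull(mu, sigma, k, rng):
    """Draw one reward for action k from N(mu[k], sigma^2)."""
    return rng.gauss(mu[k], sigma)

def expected_regret(mu, actions):
    """T * max_k mu_k minus the sum of the mean rewards of the actions taken."""
    best = max(mu)
    return sum(best - mu[k] for k in actions)

rng = random.Random(0)
mu = [0.2, 0.5, 0.8]   # hypothetical mean rewards
sigma = 0.1            # hypothetical standard deviation
T = 1000

# A policy that ignores the rewards and picks actions uniformly at random.
actions = [rng.randrange(len(mu)) for _ in range(T)]
rewards = [pull(mu, sigma, k, rng) for k in actions]

print("total reward:", sum(rewards))
print("average regret per round:", expected_regret(mu, actions) / T)
```

With these means, the uniformly random policy loses \(0.3\) per round in expectation no matter how large \(T\) is, so its average regret does not approach \(0\).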
In-class Discussion
📗 [1 point] The boxes have different mean rewards between 0 and 1. Click on one of them to collect the reward from the box. The goal is to maximize the total reward (or minimize the regret) given a fixed number of clicks. Which box has the largest mean reward?
📗 There is a trade-off between taking actions for exploration vs exploitation:
➩ Exploration: taking actions to get more information (e.g. figure out the expected reward from each action).
➩ Exploitation: taking actions to get the highest rewards based on existing information (e.g. take the best action based on the current estimates of rewards).
📗 Epsilon-first strategy: \(\varepsilon T\) rounds of pure exploration, then use the empirically best action in the remaining \(\left(1 - \varepsilon\right) T\) rounds.
➩ The empirically best action is \(\mathop{\mathrm{argmax}}_{k} \hat{\mu}_{k}\), where \(\hat{\mu}_{k}\) is the average reward from the rounds in which action \(k\) is used.
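📗 A minimal sketch of the epsilon-first strategy follows. It assumes that exploration means choosing an action uniformly at random, and that an action never tried during exploration is ranked last; neither detail is fixed by the description above. The names `make_bandit` and `epsilon_first` and all numbers are hypothetical.

```python
import random

def make_bandit(mu, sigma, rng):
    """Return pull(k), which draws a reward from N(mu[k], sigma^2)."""
    return lambda k: rng.gauss(mu[k], sigma)

def epsilon_first(pull, K, T, eps, rng):
    """Explore for int(eps * T) rounds, then commit to the empirically best action."""
    counts = [0] * K
    sums = [0.0] * K
    total = 0.0
    explore_rounds = int(eps * T)
    best = None
    for t in range(T):
        if t < explore_rounds:
            k = rng.randrange(K)  # pure exploration (assumed uniform)
        else:
            if best is None:
                # Average reward of each action; unused actions rank last.
                mu_hat = [sums[j] / counts[j] if counts[j] else float("-inf")
                          for j in range(K)]
                best = max(range(K), key=lambda j: mu_hat[j])
            k = best
        r = pull(k)
        counts[k] += 1
        sums[k] += r
        total += r
    return total, best

rng = random.Random(0)
pull = make_bandit([0.2, 0.5, 0.8], 0.1, rng)
total, best = epsilon_first(pull, K=3, T=1000, eps=0.1, rng=rng)
print("empirically best action:", best, "total reward:", total)
```

The committed action is chosen once, right after exploration ends, and is never revised, so a poor estimate from a short exploration phase is paid for in every remaining round.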
📗 Epsilon-greedy strategy: in every round, use the empirically best action with probability \(1 - \varepsilon\), and use a random action with probability \(\varepsilon\).
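📗 The epsilon-greedy rule can be sketched in the same way. This example assumes that untried actions start with an estimate of \(0\) and that ties go to the lowest-numbered action; both are illustrative choices, not part of the strategy as stated. The function name `epsilon_greedy` and the rewards used are hypothetical.

```python
import random

def epsilon_greedy(pull, K, T, eps, rng):
    """With probability eps pick a random action, otherwise the empirically best one."""
    counts = [0] * K
    mu_hat = [0.0] * K  # assumed initial estimate for untried actions
    total = 0.0
    for _ in range(T):
        if rng.random() < eps:
            k = rng.randrange(K)                         # explore
        else:
            k = max(range(K), key=lambda j: mu_hat[j])   # exploit
        r = pull(k)
        counts[k] += 1
        mu_hat[k] += (r - mu_hat[k]) / counts[k]         # running average
        total += r
    return total, counts

rng = random.Random(0)
mu = [0.2, 0.5, 0.8]  # hypothetical mean rewards
total, counts = epsilon_greedy(lambda k: rng.gauss(mu[k], 0.1), 3, 1000, 0.1, rng)
print("times each action was used:", counts)
```

Unlike epsilon-first, the estimates keep being updated in every round, so an early mistake can be corrected later.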
📗 The "best" action (based on current information) can also be defined based on the principle of optimism under uncertainty.
📗 An optimistic guess of the average reward (adjusted for uncertainty) in period \(t\), called the upper confidence bound (UCB), is \(\hat{\mu}_{k} + c \sqrt{\dfrac{2 \log\left(t\right)}{n_{k}}}\), where \(n_{k}\) is the number of rounds in which action \(k\) has been used so far and \(c > 0\) is a constant that controls how much exploration is done.
➩ The term \(\sqrt{\dfrac{2 \log\left(t\right)}{n_{k}}}\) represents the amount of uncertainty in the estimate \(\hat{\mu}_{k}\): the more action \(k\) is explored (higher \(n_{k}\)), the smaller the value of this term.
📗 The algorithm that always uses the action with the highest UCB is called the UCB1 algorithm: Wikipedia.
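📗 A minimal sketch of UCB1 is given below. It assumes that each action is used once at the start so that every \(n_{k}\) is positive before the bound is computed, and it takes \(c = 1\) by default. The function name `ucb1` and the rewards used are hypothetical.

```python
import math
import random

def ucb1(pull, K, T, c=1.0):
    """Use the action with the highest mu_hat_k + c * sqrt(2 log(t) / n_k)."""
    counts = [0] * K
    mu_hat = [0.0] * K
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1  # assumed initialization: use every action once
        else:
            k = max(
                range(K),
                key=lambda j: mu_hat[j] + c * math.sqrt(2 * math.log(t) / counts[j]),
            )
        r = pull(k)
        counts[k] += 1
        mu_hat[k] += (r - mu_hat[k]) / counts[k]  # running average
        total += r
    return total, counts

rng = random.Random(0)
mu = [0.2, 0.5, 0.8]  # hypothetical mean rewards
total, counts = ucb1(lambda k: rng.gauss(mu[k], 0.1), K=3, T=1000)
print("times each action was used:", counts)
```

No randomization is needed for exploration here: an action that has been used rarely has a large uncertainty term, which eventually lifts its UCB above the others.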
Math Note
📗 Technically, \(\sqrt{\dfrac{2 \log\left(t\right)}{n_{k}}}\) is half of the width of the confidence interval computed based on Hoeffding's inequality: Wikipedia.
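📗 The link to Hoeffding's inequality can be sketched as follows, under two assumptions that are not stated above: the rewards lie in \([0, 1]\), and \(n_{k}\) is treated as a fixed number of independent draws. For any \(a > 0\), Hoeffding's inequality gives
\[ P\left(\mu_{k} \geq \hat{\mu}_{k} + a\right) \leq e^{-2 n_{k} a^{2}}. \]
Substituting \(a = \sqrt{\dfrac{2 \log\left(t\right)}{n_{k}}}\) gives
\[ e^{-2 n_{k} \cdot \dfrac{2 \log\left(t\right)}{n_{k}}} = e^{-4 \log\left(t\right)} = t^{-4}, \]
so, under these assumptions, the true mean exceeds the optimistic guess with probability at most \(t^{-4}\).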
📗 If the environment is adversarial, for example, if the rewards are chosen by an adversary, then any deterministic algorithm can be made to fail, because the adversary can anticipate its choices.
📗 EXP3 (EXPonential weight algorithm for EXPloration and EXPloitation) keeps track of a weight vector and selects actions randomly based on the weights: Wikipedia
Math Notes
📗 The EXP3 weights are updated based on the rewards:
➩ Probability of choosing action \(k\): \(p_{k} = \left(1 - \varepsilon\right) \dfrac{w_{k}}{w_{1} + w_{2} + ... + w_{K}} + \dfrac{\varepsilon}{K}\).
➩ The update is \(w_{k} \leftarrow w_{k} \exp\left(\dfrac{\varepsilon r_{t}}{p_{k} K}\right)\) when action \(k\) is used in round \(t\) and receives reward \(r_{t}\). The other weights are not updated.
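📗 The two EXP3 formulas above can be put together in a short sketch. It assumes that all weights start at \(1\) and that rewards lie in \([0, 1]\). It also rescales the weights after each update, which is a numerical safeguard added for this example and leaves the probabilities unchanged. The names `exp3` and `reward_fn` are hypothetical.

```python
import math
import random

def exp3(reward_fn, K, T, eps, rng):
    """EXP3 with exploration rate eps; reward_fn(t, k) returns a reward in [0, 1]."""
    weights = [1.0] * K  # assumed initial weights
    total = 0.0
    probs = [1.0 / K] * K
    for t in range(T):
        s = sum(weights)
        # p_k = (1 - eps) * w_k / (w_1 + ... + w_K) + eps / K
        probs = [(1 - eps) * w / s + eps / K for w in weights]
        k = rng.choices(range(K), weights=probs)[0]
        r = reward_fn(t, k)
        # Only the weight of the action that was used is updated.
        weights[k] *= math.exp(eps * r / (probs[k] * K))
        # Safeguard (not part of the update rule): rescale to avoid overflow.
        m = max(weights)
        weights = [w / m for w in weights]
        total += r
    return total, probs

rng = random.Random(0)
fixed = [0.2, 0.5, 0.8]  # hypothetical rewards, fixed per action
total, probs = exp3(lambda t, k: fixed[k], K=3, T=2000, eps=0.1, rng=rng)
print("final action probabilities:", probs)
```

Dividing the reward by \(p_{k}\) in the exponent compensates for how rarely an action is chosen, and the \(\dfrac{\varepsilon}{K}\) term keeps every probability positive so that every action continues to be sampled.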
📗 If you have questions, please use (i) Zoom chat, (ii) Piazza: Link, (iii) Office hours and discussion sessions. Please do NOT use Canvas mail. Use email only for grading issues, and send it to the course instructor (not the TAs).
📗 To get full points on the in-class quizzes for a lecture:
➩ Submit relevant answers to the questions discussed during the lecture: incorrect answers are okay.
➩ Some questions require [notes] to earn the point.
➩ Some questions require special ID (given during the lecture) to earn the point.
➩ Do not submit answers to questions that are not discussed during the lectures. Each such submission will result in a deduction of one point.
➩ Submissions after the lecture, before the midterm (first 14 lectures) and the final exam (last 14 lectures), are accepted. After the exams, no in-class quiz submissions will be accepted.
➩ The grade on Canvas Assignment Q21 is computed as number of points divided by the number of questions asked (out of 1) and updated on Canvas every weekend.
📗 If there are any issues with submission on the website, please use this Google form: Link.
📗 Bonus point opportunities during a few lectures (added to in-class quiz above 20 points).
📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Blerina Gkotse, Yudong Chen, Yingyu Liang, Charles Dyer. Some content is generated using Copilot.