Young Wu's Homepage

Prev: L20, Next: L22


# Warning: this is a draft and will be updated one day before the lecture.


# Reinforcement Learning

📗 An agent interacts with an environment and receives a reward (or incurs a cost) based on the state of the environment and the agent's action.

📗 The goal of reinforcement learning is to maximize the cumulative reward by learning the optimal actions in every state.

📗 Unlike in search problems, the agent is not given the reward or cost function and needs to learn it.

➩ Board games.

➩ Video games.

➩ Robotic control.

➩ Autonomous vehicle control.

➩ Economic models.

➩ Large language models (RLHF: Reinforcement Learning from Human Feedback).

# Multi-Armed Bandit

📗 A simple reinforcement learning problem where the state does not change is called the multi-armed bandit: Wikipedia.

➩ Multi-armed bandits.

➩ Clinical trials.

➩ Stock selection.

📗 There is a set of actions \(1, 2, ..., K\). The reward from action \(k\) follows some distribution with mean \(\mu_{k}\), for example, a normal distribution with mean \(\mu_{k}\) and fixed variance \(\sigma^{2}\): \(r \sim N\left(\mu_{k}, \sigma^{2}\right)\).

📗 The agent's goal is to maximize the total reward from taking one action in each of \(T\) rounds.

➩ Reward maximization: \(\displaystyle\max_{a_{1}, a_{2}, ..., a_{T}} \left(r_{1}\left(a_{1}\right) + r_{2}\left(a_{2}\right) + ... + r_{T}\left(a_{T}\right)\right)\).

➩ Regret minimization: \(\displaystyle\min_{a_{1}, a_{2}, ..., a_{T}} \left(\displaystyle\max_{k} \mu_{k} - \dfrac{1}{T} \left(r_{1}\left(a_{1}\right) + r_{2}\left(a_{2}\right) + ... + r_{T}\left(a_{T}\right)\right)\right)\).

📗 An algorithm is called no-regret if as \(T \to \infty\), the regret approaches \(0\) with probability \(1\): Wikipedia.
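📗 The regret definition above can be checked numerically. The following is a minimal sketch, assuming normal rewards; the helper names `pull`, `average_regret`, and `run_uniform_policy` are hypothetical and chosen only for illustration. It computes the average regret of a baseline that picks actions uniformly at random, which is not a no-regret algorithm: its regret stays near \(\displaystyle\max_{k} \mu_{k}\) minus the mean of the \(\mu_{k}\) no matter how large \(T\) is.

```python
import random

def pull(mu, k, sigma=1.0, rng=random):
    """Draw one reward r ~ N(mu[k], sigma^2) for action k."""
    return rng.gauss(mu[k], sigma)

def average_regret(mu, rewards):
    """max_k mu_k minus the average reward collected over T = len(rewards) rounds."""
    return max(mu) - sum(rewards) / len(rewards)

def run_uniform_policy(mu, T, seed=0):
    """Baseline (illustrative only): choose an action uniformly at random every round."""
    rng = random.Random(seed)
    rewards = [pull(mu, rng.randrange(len(mu)), rng=rng) for _ in range(T)]
    return average_regret(mu, rewards)
```

For example, `run_uniform_policy([0.2, 0.5, 0.8], T=20000)` should stay close to \(0.8 - 0.5 = 0.3\) rather than approach \(0\).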

In-class Discussion

📗 [1 point] The boxes have different mean rewards between 0 and 1. Click on one of them to collect the reward from the box. The goal is to maximize the total reward (or minimize the regret) given a fixed number of clicks. Which box has the largest mean reward?

# Exploration vs Exploitation

📗 There is a trade-off between taking actions for exploration vs exploitation:

➩ Exploration: taking actions to get more information (e.g. figure out the expected reward from each action).

➩ Exploitation: taking actions to get the highest rewards based on existing information (e.g. take the best action based on the current estimates of rewards).

📗 Epsilon-first strategy: \(\varepsilon T\) rounds of pure exploration, then use the empirically best action in the remaining \(\left(1 - \varepsilon\right) T\) rounds.

➩ The empirically best action is \(\mathop{\mathrm{argmax}}_{k} \hat{\mu}_{k}\), where \(\hat{\mu}_{k}\) is the average reward from the rounds in which action \(k\) is used.
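📗 A minimal sketch of the epsilon-first strategy, under a few assumptions that are not in the notes: rewards are normal, exploration cycles through the actions in round-robin order, and the empirically best action is fixed once at the end of exploration. The function name `epsilon_first` and its parameters are hypothetical.

```python
import random

def epsilon_first(mu, T, eps=0.1, sigma=1.0, seed=0):
    """Explore for int(eps * T) rounds, then commit to the empirically best action."""
    rng = random.Random(seed)
    K = len(mu)
    counts = [0] * K      # n_k: number of rounds action k was used
    sums = [0.0] * K      # total reward collected from action k
    total = 0.0
    explore_rounds = int(eps * T)
    best = None           # stays None if every round is an exploration round
    for t in range(T):
        if t < explore_rounds:
            k = t % K     # assumption: round-robin exploration
        else:
            if best is None:
                # Average reward per action; actions never tried are ranked last.
                mu_hat = [sums[j] / counts[j] if counts[j] > 0 else float("-inf")
                          for j in range(K)]
                best = max(range(K), key=lambda j: mu_hat[j])
            k = best
        r = rng.gauss(mu[k], sigma)
        counts[k] += 1
        sums[k] += r
        total += r
    return best, total / T
```

The sketch returns the committed action and the average reward, so the average regret is \(\displaystyle\max_{k} \mu_{k}\) minus the second return value.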

📗 Epsilon-greedy strategy: in every round, use the empirically best action with probability \(1 - \varepsilon\), and use a random action with probability \(\varepsilon\).
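📗 A minimal sketch of the epsilon-greedy strategy, assuming normal rewards and an initial estimate of \(0\) for actions that have not been tried yet (both are assumptions made for illustration, as is the function name `epsilon_greedy`).

```python
import random

def epsilon_greedy(mu, T, eps=0.1, sigma=1.0, seed=0):
    """In each round: random action with probability eps, else the empirically best."""
    rng = random.Random(seed)
    K = len(mu)
    counts = [0] * K     # n_k: number of rounds action k was used
    means = [0.0] * K    # mu_hat_k; assumption: untried actions start at 0
    total = 0.0
    for _ in range(T):
        if rng.random() < eps:
            k = rng.randrange(K)                       # explore
        else:
            k = max(range(K), key=lambda j: means[j])  # exploit
        r = rng.gauss(mu[k], sigma)
        counts[k] += 1
        means[k] += (r - means[k]) / counts[k]         # running average of rewards
        total += r
    return counts, total / T
```

Unlike epsilon-first, this sketch keeps updating \(\hat{\mu}_{k}\) in every round, so a poor early estimate can still be corrected later.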

# Upper Confidence Bound

📗 The "best" action (based on current information) can also be defined based on the principle of optimism under uncertainty.

📗 An optimistic guess of the average reward (adjusted for uncertainty) in round \(t\) is \(\hat{\mu}_{k} + c \sqrt{\dfrac{2 \log\left(t\right)}{n_{k}}}\), where \(n_{k}\) is the number of rounds in which action \(k\) has been used so far and \(c\) is a constant that scales the uncertainty term.

➩ The term \(\sqrt{\dfrac{2 \log\left(t\right)}{n_{k}}}\) represents the amount of uncertainty in the estimate \(\hat{\mu}_{k}\): the more action \(k\) is explored (higher \(n_{k}\)), the smaller the value of this term.

📗 The algorithm that always uses the action with the highest upper confidence bound (UCB) is called the UCB1 algorithm: Wikipedia.
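📗 A minimal sketch of the UCB rule described above, assuming normal rewards and assuming each action is tried once at the start so that \(n_{k} > 0\) before the bound is evaluated. The names `ucb_index` and `ucb1` are hypothetical.

```python
import math
import random

def ucb_index(mean, n, t, c=1.0):
    """Optimistic estimate: mu_hat_k + c * sqrt(2 * log(t) / n_k)."""
    return mean + c * math.sqrt(2 * math.log(t) / n)

def ucb1(mu, T, c=1.0, sigma=1.0, seed=0):
    """Always use the action with the highest upper confidence bound."""
    rng = random.Random(seed)
    K = len(mu)
    counts = [0] * K     # n_k
    means = [0.0] * K    # mu_hat_k
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1    # assumption: use every action once first
        else:
            k = max(range(K),
                    key=lambda j: ucb_index(means[j], counts[j], t, c))
        r = rng.gauss(mu[k], sigma)
        counts[k] += 1
        means[k] += (r - means[k]) / counts[k]   # running average of rewards
    return counts, means
```

No explicit exploration probability is needed here: an action that has rarely been used keeps a large uncertainty term, so it is eventually selected again.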

Math Note

📗 Technically, \(\sqrt{\dfrac{2 \log\left(t\right)}{n_{k}}}\) is half of the width of the confidence interval computed based on Hoeffding's inequality: Wikipedia.
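📗 A sketch of where this term comes from, under two assumptions that are not stated above: the rewards lie in \(\left[0, 1\right]\) (as in the box example, but not for normal rewards), and \(n_{k}\) is treated as a fixed number. Hoeffding's inequality then gives, for any \(\delta > 0\),

\[
P\left(\left| \hat{\mu}_{k} - \mu_{k} \right| \geq \delta\right) \leq 2 e^{-2 n_{k} \delta^{2}} .
\]

Substituting \(\delta = \sqrt{\dfrac{2 \log\left(t\right)}{n_{k}}}\) gives

\[
2 e^{-2 n_{k} \cdot \frac{2 \log\left(t\right)}{n_{k}}} = 2 e^{-4 \log\left(t\right)} = \dfrac{2}{t^{4}} ,
\]

so the probability that \(\mu_{k}\) falls outside \(\hat{\mu}_{k} \pm \sqrt{\dfrac{2 \log\left(t\right)}{n_{k}}}\) shrinks quickly as \(t\) grows.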


# EXP3 Algorithm

📗 If the environment is adversarial (for example, the rewards are chosen by an adversary), then any deterministic algorithm would fail, since the adversary can anticipate its actions.

📗 EXP3 (EXPonential weight algorithm for EXPloration and EXPloitation) keeps track of a weight vector and selects actions randomly based on the weights: Wikipedia.

Math Notes

📗 The EXP3 weights are updated based on the rewards:

➩ Probability of choosing action \(k\): \(p_{k} = \left(1 - \varepsilon\right) \dfrac{w_{k}}{w_{1} + w_{2} + ... + w_{K}} + \dfrac{\varepsilon}{K}\).

➩ The updates are given by \(w_{k} = w_{k} e^{\dfrac{\varepsilon r_{t}}{p_{k} K}}\) when action \(k\) is used in round \(t\), where \(r_{t}\) is the reward received in round \(t\). Other weights are not updated.
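📗 The two formulas above can be sketched as follows. This is a minimal illustration, assuming rewards in \(\left[0, 1\right]\), all weights starting at \(1\), and a caller-supplied `reward_fn(t, k)` that plays the role of the environment. The function names, `reward_fn`, and the rescaling step are assumptions and not part of the notes.

```python
import math
import random

def exp3_probabilities(weights, eps):
    """p_k = (1 - eps) * w_k / (w_1 + ... + w_K) + eps / K."""
    K = len(weights)
    total = sum(weights)
    return [(1 - eps) * w / total + eps / K for w in weights]

def exp3(reward_fn, K, T, eps=0.1, seed=0):
    """Run EXP3 for T rounds; reward_fn(t, k) is a hypothetical environment."""
    rng = random.Random(seed)
    weights = [1.0] * K   # assumption: all weights start at 1
    total = 0.0
    for t in range(T):
        p = exp3_probabilities(weights, eps)
        k = rng.choices(range(K), weights=p)[0]   # sample action k with probability p_k
        r = reward_fn(t, k)                       # assumed to lie in [0, 1]
        # Only the weight of the action that was used is updated.
        weights[k] *= math.exp(eps * r / (p[k] * K))
        # Added for numerical safety (not in the notes): rescaling all weights
        # by the same factor leaves the probabilities p_k unchanged.
        largest = max(weights)
        weights = [w / largest for w in weights]
        total += r
    return weights, total / T
```

Dividing the reward by \(p_{k}\) in the exponent gives a larger boost to actions that are rarely selected, which compensates for how seldom their rewards are observed.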

# Questions?

📗 If you have questions, please use (i) Zoom chat, (ii) Piazza, (iii) Office hours and discussion sessions. Please do NOT use Canvas mail; use email only for grading issues, and send it to the course instructor (not the TAs).


# In-class Quiz Instructions

📗 To get full points on the in-class quizzes for a lecture:

➩ Submit relevant answers to the questions discussed during the lecture: incorrect answers are okay.

➩ Some questions require [notes] to earn the point.

➩ Some questions require a special ID (given during the lecture) to earn the point.

➩ Do not submit answers to questions that are not discussed during the lectures. Each such submission will result in a deduction of one point.

➩ Submissions after the lecture, before the midterm (first 14 lectures) and the final exam (last 14 lectures), are accepted. After the exams, no in-class quiz submissions will be accepted.

➩ The grade on Canvas Assignment Q21 is computed as the number of points divided by the number of questions asked (out of 1) and is updated on Canvas every weekend.

📗 If there are any issues with submission on the website, please use this Google form: Link.

📗 Bonus point opportunities during a few lectures (added to in-class quiz above 20 points).

📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Blerina Gkotse, Yudong Chen, Yingyu Liang, Charles Dyer. Some content is generated using Copilot.


Last Updated: July 16, 2026 at 12:17 PM