# Reinforcement Learning

📗 An agent interacts with an environment and receives a reward (or incurs a cost) based on the state of the environment and the agent's action.
📗 The goal of reinforcement learning is to maximize the cumulative reward by learning the optimal actions in every state.
📗 Unlike search problems, the agent needs to learn the reward or cost function.
➩ Board games: Link.
➩ Video games: Link.
➩ Robotic control: Link, Link.
➩ Autonomous vehicle control: Link.
➩ Economic models: Link.
➩ Large language models (RLHF: Reinforcement Learning from Human Feedback): Link.
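📗 A minimal sketch of the agent-environment interaction loop described above (the two-state environment, its reward rule, and the random policy are hypothetical placeholders for illustration, not part of the course material):
```python
import random

# Hypothetical toy environment: two states, and the reward depends on the
# current state and the agent's action.
class Environment:
    def __init__(self):
        self.state = 0
    def step(self, action):
        reward = 1.0 if action == self.state else 0.0   # reward based on (state, action)
        self.state = random.choice([0, 1])              # environment moves to a new state
        return self.state, reward

env = Environment()
total_reward = 0.0
for t in range(100):
    action = random.choice([0, 1])      # an unlearned policy: pick an action at random
    state, reward = env.step(action)    # the environment returns the next state and a reward
    total_reward += reward              # the agent's goal: maximize the cumulative reward
print(total_reward)
```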



# Multi Armed Bandit

📗 A simple reinforcement learning problem where the state does not change is called the multi-armed bandit: Wikipedia.
➩ Multi-armed bandits.
➩ Clinical trials.
➩ Stock selection.
📗 There is a set of actions \(1, 2, ..., K\), and the reward from action \(k\) follows some distribution with mean \(\mu_{k}\), for example a normal distribution with mean \(\mu_{k}\) and fixed variance \(\sigma^{2}\), that is \(r \sim N\left(\mu_{k}, \sigma^{2}\right)\).
📗 The agent's goal is to maximize the total reward from taking one action in each of \(T\) rounds.
➩ Reward maximization: \(\displaystyle\max_{a_{1}, a_{2}, ..., a_{T}} \left(r_{1}\left(a_{1}\right) + r_{2}\left(a_{2}\right) + ... + r_{T}\left(a_{T}\right)\right)\).
➩ Regret minimization: \(\displaystyle\min_{a_{1}, a_{2}, ..., a_{T}} \left(\displaystyle\max_{k} \mu_{k} - \dfrac{1}{T} \left(r_{1}\left(a_{1}\right) + r_{2}\left(a_{2}\right) + ... + r_{T}\left(a_{T}\right)\right)\right)\).
📗 An algorithm is called no-regret if as \(T \to  \infty\), the regret approaches \(0\) with probability \(1\): Wikipedia.
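📗 A minimal simulation sketch of this setup (the means \(\mu_{k}\), \(\sigma\), and \(T\) below are made-up values, and the policy is simply uniformly random), computing the total reward and the average regret:
```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.8])   # hypothetical mean rewards for K = 3 actions
sigma = 0.1                      # fixed standard deviation shared by all actions
T = 1000                         # number of rounds

def pull(k):
    """Draw a reward r ~ N(mu_k, sigma^2) from action k."""
    return rng.normal(mu[k], sigma)

# Baseline policy: pick an action uniformly at random in every round.
rewards = [pull(rng.integers(len(mu))) for _ in range(T)]
total_reward = sum(rewards)
avg_regret = mu.max() - total_reward / T   # max_k mu_k - (1/T) sum_t r_t(a_t)
print(f"total reward = {total_reward:.1f}, average regret = {avg_regret:.3f}")
```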
TopHat Discussion
📗 [1 points] The boxes have different mean rewards between 0 and 1. Click on one of them to collect its reward. The goal is to maximize the total reward (or minimize the regret) given a fixed number of clicks. Which box has the largest mean reward?




# Exploration vs Exploitation

📗 There is a trade-off between taking actions for exploration vs exploitation:
➩ Exploration: taking actions to get more information (e.g. figure out the expected reward from each action).
➩ Exploitation: taking actions to get the highest rewards based on existing information (e.g. take the best action based on the current estimates of rewards). 
📗 Epsilon-first strategy: \(\varepsilon T\) rounds of pure exploration, then use the empirically best action in the remaining \(\left(1 - \varepsilon\right) T\) rounds.
➩ Empirically best action is \(\mathop{\mathrm{argmax}}_{k} \hat{\mu}_{k}\), where \(\hat{\mu}_{k}\) is the average reward from rounds where action \(k\) is used.
📗 Epsilon-greedy strategy: in every round, use the empirically best action with probability \(1 - \varepsilon\), and use a random action with probability \(\varepsilon\).
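📗 A minimal sketch of the epsilon-greedy strategy on the same hypothetical Gaussian bandit (the means, \(\sigma\), \(T\), and \(\varepsilon\) are assumed values for illustration):
```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.8])   # hypothetical mean rewards
sigma, T, eps = 0.1, 1000, 0.1   # assumed values for this sketch
K = len(mu)

counts = np.zeros(K)      # n_k: number of times each action has been used
mu_hat = np.zeros(K)      # empirical average reward of each action

total = 0.0
for t in range(T):
    if rng.random() < eps:
        k = int(rng.integers(K))        # explore: random action with probability eps
    else:
        k = int(np.argmax(mu_hat))      # exploit: empirically best action
    r = rng.normal(mu[k], sigma)        # observe the reward
    counts[k] += 1
    mu_hat[k] += (r - mu_hat[k]) / counts[k]   # incremental update of the average
    total += r

print(f"average regret = {mu.max() - total / T:.3f}")
```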



# Upper Confidence Bound

📗 The "best" action (based on current information) can also be defined based on the principle of optimism under uncertainty.
📗 An optimistic guess of the average reward of action \(k\) (adjusted for uncertainty) in period \(t\) is its upper confidence bound (UCB) \(\hat{\mu}_{k} + c \sqrt{\dfrac{2 \log\left(t\right)}{n_{k}}}\), where \(n_{k}\) is the number of rounds in which action \(k\) has been used.
➩ The term \(\sqrt{\dfrac{2 \log\left(t\right)}{n_{k}}}\) represents the amount of uncertainty in the estimate \(\hat{\mu}_{k}\): the more action \(k\) is explored (the higher \(n_{k}\)), the smaller this term becomes.
📗 The algorithm that always uses the action with the highest UCB is called the UCB1 algorithm: Wikipedia.
Math Note (Optional)
📗 Technically, \(\sqrt{\dfrac{2 \log\left(t\right)}{n_{k}}}\) is half of the width of the confidence interval computed based on Hoeffding's inequality: Wikipedia.
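📗 A minimal sketch of the UCB1 algorithm on the same hypothetical bandit (the means and the exploration constant \(c\) are assumed values for illustration):
```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.8])   # hypothetical mean rewards
sigma, T, c = 0.1, 1000, 1.0     # assumed values; c scales the uncertainty term
K = len(mu)

counts = np.zeros(K)      # n_k: number of times each action has been used
mu_hat = np.zeros(K)      # empirical average reward mu_hat_k

total = 0.0
for t in range(1, T + 1):
    if t <= K:
        k = t - 1                                            # use every action once first
    else:
        ucb = mu_hat + c * np.sqrt(2 * np.log(t) / counts)   # optimistic estimates
        k = int(np.argmax(ucb))                              # highest upper confidence bound
    r = rng.normal(mu[k], sigma)
    counts[k] += 1
    mu_hat[k] += (r - mu_hat[k]) / counts[k]                 # incremental update of the average
    total += r

print(f"average regret = {mu.max() - total / T:.3f}")
```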
TopHat Discussion
📗 [1 points] The boxes have different mean rewards between 0 and 1. Click on one of them to collect its reward. The goal is to maximize the total reward (or minimize the regret) given a fixed number of clicks. Which box has the largest mean reward?




📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Yingyu Liang, and Charles Dyer.
📗 Content from note blocks marked "optional" and content from Wikipedia and other demo links are helpful for understanding the materials, but will not be explicitly tested on the exams.






Last Updated: December 10, 2024 at 3:36 AM