# Reinforcement Learning

📗 An agent interacts with an environment and receives a reward (or incurs a cost) based on the state of the environment and the agent's action.
📗 The goal of reinforcement learning is to maximize the cumulative reward by learning the optimal actions in every state.
📗 Unlike in search problems, the reward (or cost) function is not known in advance, so the agent has to learn it from experience.
➩ Board games.
➩ Video games.
➩ Robotic control.
➩ Autonomous vehicle control.
➩ Economic models.
➩ Large language models (RLHF: Reinforcement Learning from Human Feedback).



# Multi-Armed Bandit

📗 A simple reinforcement learning problem in which the state does not change is called the multi-armed bandit problem: Wikipedia.
➩ Slot machines (the original "one-armed bandits").
➩ Clinical trials.
➩ Stock selection.
📗 There is a set of actions \(1, 2, ..., K\); the reward from action \(k\) follows some distribution with mean \(\mu_{k}\), for example a normal distribution with mean \(\mu_{k}\) and fixed variance \(\sigma^{2}\): \(r \sim N\left(\mu_{k}, \sigma^{2}\right)\).
📗 The agent's goal is to maximize the total reward from repeatedly taking an action in \(T\) rounds.
➩ Reward maximization: \(\displaystyle\max_{a_{1}, a_{2}, ..., a_{T}} \left(r_{1}\left(a_{1}\right) + r_{2}\left(a_{2}\right) + ... + r_{T}\left(a_{T}\right)\right)\).
➩ Regret minimization: \(\displaystyle\min_{a_{1}, a_{2}, ..., a_{T}} \left(\displaystyle\max_{k} \mu_{k} - \dfrac{1}{T} \left(r_{1}\left(a_{1}\right) + r_{2}\left(a_{2}\right) + ... + r_{T}\left(a_{T}\right)\right)\right)\).
📗 An algorithm is called no-regret if as \(T \to  \infty\), the regret approaches \(0\) with probability \(1\): Wikipedia.
TopHat Discussion
📗 [1 point] Interactive demo: several boxes have different mean rewards between 0 and 1, and clicking a box collects a reward from it. The goal is to maximize the total reward (or minimize the regret) given a fixed number of clicks. Which box has the largest mean reward?




# Exploration vs Exploitation

📗 There is a trade-off between taking actions for exploration vs exploitation:
➩ Exploration: taking actions to get more information (e.g. figure out the expected reward from each action).
➩ Exploitation: taking actions to get the highest rewards based on existing information (e.g. take the best action based on the current estimates of rewards). 
📗 Epsilon-first strategy: use \(\varepsilon T\) rounds of pure exploration, then use the empirically best action in the remaining \(\left(1 - \varepsilon\right) T\) rounds.
➩ The empirically best action is \(\mathop{\mathrm{argmax}}_{k} \hat{\mu}_{k}\), where \(\hat{\mu}_{k}\) is the average reward from the rounds in which action \(k\) was used.
📗 Epsilon-greedy strategy: in every round, use the empirically best action with probability \(1 - \varepsilon\), and use a random action with probability \(\varepsilon\).
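📗 A minimal sketch of the epsilon-greedy strategy on a simulated Gaussian bandit (the mean rewards, variance, \(\varepsilon\), and number of rounds below are made-up values for illustration):
```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.7])        # true (unknown) mean rewards, made up for illustration
K, T, eps = len(mu), 1000, 0.1        # number of arms, rounds, exploration probability

counts = np.zeros(K)                   # n_k: times each action was used
sums = np.zeros(K)                     # total reward collected from each action
total = 0.0

for t in range(T):
    if rng.random() < eps or counts.min() == 0:
        a = rng.integers(K)            # explore: random action
    else:
        a = np.argmax(sums / counts)   # exploit: empirically best action
    r = rng.normal(mu[a], 0.1)         # reward r ~ N(mu_k, sigma^2)
    counts[a] += 1
    sums[a] += r
    total += r

print("estimated means:", sums / np.maximum(counts, 1))
print("regret per round:", mu.max() - total / T)
```
With a small \(\varepsilon\), most rounds exploit the empirically best arm, so the per-round regret stays small once the estimates are accurate.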



# Upper Confidence Bound

📗 The "best" action (based on current information) can also be defined based on the principle of optimism under uncertainty.
📗 An optimistic guess of the average reward (adjusted for uncertainty) is \(\hat{\mu}_{k} + c \sqrt{\dfrac{2 \log\left(T\right)}{n_{k}}}\), where \(n_{k}\) is the number of rounds action \(k\) is used.
➩ The term \(\sqrt{\dfrac{2 \log\left(T\right)}{n_{k}}}\) represents the amount of uncertainty in the estimate \(\hat{\mu}_{k}\): the more action \(k\) has been explored (higher \(n_{k}\)), the smaller this term becomes.
📗 The algorithm that always uses the action with the highest upper confidence bound is called the UCB1 algorithm: Wikipedia.
Math Note (Optional)
📗 Technically, \(\sqrt{\dfrac{2 \log\left(T\right)}{n_{k}}}\) is half of the width of the confidence interval computed based on Hoeffding's inequality: Wikipedia.
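📗 A sketch of UCB1 on the same kind of simulated bandit, using the bonus term from the notes with \(c = 1\) (the mean rewards and horizon are made-up values; some presentations use \(\log t\) of the current round in place of \(\log T\)):
```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.7])          # true mean rewards (unknown to the agent), made up
K, T, c = len(mu), 1000, 1.0

counts = np.zeros(K)
sums = np.zeros(K)

for t in range(T):
    if counts.min() == 0:
        a = int(np.argmin(counts))       # try each action once first
    else:
        mu_hat = sums / counts
        bonus = c * np.sqrt(2 * np.log(T) / counts)   # uncertainty term from the notes
        a = int(np.argmax(mu_hat + bonus))            # optimism under uncertainty
    r = rng.normal(mu[a], 0.1)
    counts[a] += 1
    sums[a] += r

print("pull counts:", counts)            # most pulls should go to the best arm
```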
TopHat Discussion
📗 [1 point] Interactive demo (same bandit game as above): click boxes with different mean rewards between 0 and 1 to maximize the total reward (or minimize the regret) given a fixed number of clicks. Which box has the largest mean reward?




# Markov Decision Process

📗 A reinforcement learning problem with multiple states is usually represented as a Markov Decision Process (MDP): Wikipedia.
📗 The Markov property on states and actions is assumed: \(\mathbb{P}\left\{s_{t+1} | s_{t}, a_{t}, s_{t-1}, a_{t-1}, ...\right\} = \mathbb{P}\left\{s_{t+1} | s_{t}, a_{t}\right\}\). The state in round \(t+1\), \(s_{t+1}\), only depends on the state and action in round \(t\), \(s_{t}\) and \(a_{t}\).
📗 The goal is to learn a policy \(\pi\) that chooses action \(\pi\left(s_{t}\right)\) to maximize the total expected discounted reward: \(\mathbb{E}\left[r_{t} + \beta r_{t+1} + \beta^{2} r_{t+2} + ...\right]\).
➩ A discount factor is used so that the infinite sum is finite.
➩ Note that if the rewards are between \(0\) and \(1\), then the total discounted reward is at most \(1 + \beta + \beta^{2} + ... = \dfrac{1}{1 - \beta}\).
Math Note
📗 To compute the sum, \(S = 1 + \beta + \beta^{2} + ...\), note that \(\beta S = \beta + \beta^{2} + \beta^{3} + ...\), so \(\left(1 - \beta\right) S = 1\) or \(S = \dfrac{1}{1 - \beta}\).
➩ This requires \(0 \leq \beta < 1\).
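📗 A quick numerical check of the geometric series identity above, with \(\beta = 0.9\) as an illustrative value:
```python
beta = 0.9                          # illustrative discount factor, 0 <= beta < 1
S = sum(beta ** t for t in range(10000))   # truncated sum 1 + beta + beta^2 + ...
print(S, 1 / (1 - beta))            # both are approximately 10
```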



# Value Function

📗 The value function is the expected discounted reward given a policy function \(\pi\), or \(V^{\pi}\left(s_{t}\right) = \mathbb{E}\left[r_{t}\right] + \beta \mathbb{E}\left[r_{t+1}\right] + \beta^{2} \mathbb{E}\left[r_{t+2}\right] + ...\), where \(r_{t}\) is generated based on \(\pi\).
➩ To be precise, \(V^{\pi}\left(s_{t}\right) = \mathbb{E}\left[r_{t} | s_{t}, \pi\left(s_{t}\right)\right] + \beta \mathbb{E}\left[r_{t+1} | s_{t+1}, \pi\left(s_{t+1}\right)\right] + \beta^{2} \mathbb{E}\left[r_{t+2} | s_{t+2}, \pi\left(s_{t+2}\right)\right] + ...\).
📗 For a given policy, the value function is the (only) function that satisfies the Bellman equation: \(V\left(s_{t}\right) = \mathbb{E}\left[r_{t} | s_{t}, a_{t}\right] + \beta \mathbb{E}\left[V\left(s_{t+1}\right) | s_{t}, a_{t}\right]\), where \(a_{t} = \pi\left(s_{t}\right)\).
📗 The optimal policy \(\pi^\star\) is the policy that maximizes the value function at every state.
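📗 A minimal sketch of computing \(V^{\pi}\) by repeatedly applying the Bellman equation on a small made-up MDP (the transition probabilities, expected rewards, and discount factor below are chosen only for illustration):
```python
import numpy as np

# Made-up 2-state MDP under a fixed policy pi (illustration only).
# P[s, s'] = P(s' | s, pi(s)),  R[s] = E[r | s, pi(s)],  beta = discount factor.
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
R = np.array([1.0, 0.0])
beta = 0.9

V = np.zeros(2)
for _ in range(1000):                 # iterate the Bellman equation until V stops changing
    V = R + beta * P @ V

print("V^pi =", V)
# For a fixed policy the Bellman equation is linear, so V also solves (I - beta P) V = R:
print("check:", np.linalg.solve(np.eye(2) - beta * P, R))
```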



# Q Function

📗 The Q function is the value function given that a specific action is taken in the current period (and \(\pi\) is followed in future periods).
➩ By definition, \(Q^{\pi}\left(s_{t}, a_{t}\right) = \mathbb{E}\left[r_{t} | s_{t}, a_{t}\right] + \beta \mathbb{E}\left[r_{t+1} | s_{t+1}, \pi\left(s_{t+1}\right)\right] + \beta^{2} \mathbb{E}\left[r_{t+2} | s_{t+2}, \pi\left(s_{t+2}\right)\right] + ...\), which can be written as \(Q^{\pi}\left(s_{t}, a_{t}\right) = \mathbb{E}\left[r_{t} | s_{t}, a_{t}\right] + \beta \mathbb{E}\left[V^{\pi}\left(s_{t+1}\right) | s_{t}, a_{t}\right]\).
📗 Under the optimal policy \(\pi^\star\), \(V^\star\left(s\right) = \displaystyle\max_{a} Q^\star\left(s, a\right)\) for every state, and this can be used to iteratively solve for the Q function: \(Q^\star\left(s, a\right) = \mathbb{E}\left[r | s, a\right] + \beta \mathbb{E}\left[\displaystyle\max_{a'} Q^\star\left(s', a'\right) | s, a\right]\).
➩ Given the expected rewards and transition probabilities, the \(Q^\star\) function can be approximated by iteratively updating \(Q\left(s, a\right) = R\left(s, a\right) + \beta \mathbb{E}\left[\displaystyle\max_{a'} Q\left(s', a'\right) | s, a\right]\), where \(R\left(s, a\right)\) is the expected reward at state \(s\) when action \(a\) is used.
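📗 A sketch of Q value iteration on a small made-up MDP with known transition probabilities and expected rewards (all numbers below are for illustration only):
```python
import numpy as np

# Made-up MDP for illustration: 2 states, 2 actions.
# P[s, a, s'] = P(s' | s, a),  R[s, a] = expected immediate reward.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
beta = 0.9

Q = np.zeros((2, 2))
for _ in range(1000):
    # Q(s, a) = R(s, a) + beta * sum_s' P(s'|s,a) * max_a' Q(s', a')
    Q = R + beta * P @ Q.max(axis=1)

print("Q* =\n", Q)
print("optimal policy:", Q.argmax(axis=1))   # argmax_a Q*(s, a) for each state
print("V* =", Q.max(axis=1))
```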
TopHat Discussion
📗 [1 point] Interactive demo: compute the optimal policy for a car on a grid (at each state the car can go Up, Down, Left, Right, or Stay). The color of a cell represents the reward from moving to that state (more red means more negative, more blue means more positive). Clicking the plot on the right updates the Q function once, and the demo displays the Q function (columns are U, D, L, R, S), the V function, and the policy for a given discount factor.




# Q Learning

📗 The Q function can be learned by iteratively updating the Q estimate using the Bellman equation: Wikipedia.
➩ In the deterministic case where \(\mathbb{P}\left\{s' | s, a\right\}\) is either \(0\) or \(1\), \(\hat{Q}\left(s, a\right) = r + \beta \displaystyle\max_{a'} \hat{Q}\left(s', a'\right)\).
➩ In the non-deterministic case, \(\hat{Q}\left(s, a\right) = \left(1 - \alpha\right) \hat{Q}\left(s, a\right) + \alpha \left(r + \beta \displaystyle\max_{a'} \hat{Q}\left(s', a'\right)\right)\), where \(\alpha\) is the learning rate and is sometimes set to \(\dfrac{1}{1 + n\left(s, a\right)}\), \(n\left(s, a\right)\) is the number of visits to state \(s\) and action \(a\) in the past.
📗 Q learning converges to the correct Q function (provided every state-action pair is visited infinitely often and the learning rate decreases appropriately), and the optimal policy can then be obtained by \(\pi^\star\left(s\right) = \mathop{\mathrm{argmax}}_{a} Q^\star\left(s, a\right)\) for every state.
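📗 A minimal tabular Q learning sketch on a made-up two-state environment (the transition rule, rewards, discount factor, and epsilon-greedy exploration below are assumptions for illustration):
```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up environment for illustration: 2 states (0, 1), 2 actions (0 = stay, 1 = move).
# "move" switches states; the reward values are chosen arbitrarily here.
def step(s, a):
    s_next = s if a == 0 else 1 - s
    r = 1.0 if (s == 0 and a == 1) else 0.0   # illustrative reward: moving out of state 0 pays 1
    return s_next, r

beta, eps = 0.9, 0.1
Q = np.zeros((2, 2))
n = np.zeros((2, 2))                            # visit counts n(s, a)

s = 0
for t in range(20000):
    a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    n[s, a] += 1
    alpha = 1.0 / (1.0 + n[s, a])               # learning rate from the notes
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + beta * np.max(Q[s_next]))
    s = s_next

print("Q =\n", Q)
print("policy:", Q.argmax(axis=1))              # argmax_a Q(s, a) for each state
```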
TopHat Quiz (Past Exam Question)
📗 [4 points] Consider the following Markov Decision Process. It has two states \(s\), A and B, and two actions \(a\), move and stay. The state transition is deterministic: "move" moves to the other state, while "stay" stays at the current state. The reward \(r\) for move, the reward for stay, and the discount rate \(\gamma\) are given in TopHat.
Find the Q table \(Q_{i}\) after \(i\) updates of every entry using Q value iteration (\(i = 0\) initializes all values to \(0\)), entered as a two by two matrix in the following format.
State \ Action | stay | move
A | ? | ?
B | ? | ?




# SARSA

📗 An alternative to Q learning is SARSA (State-Action-Reward-State-Action). It uses the action the current policy actually takes in the next period, instead of the optimal action based on the current Q estimate: Wikipedia.
➩ In the deterministic case, \(\hat{Q}\left(s, a\right) = r + \beta \hat{Q}\left(s', a'\right)\).
➩ In the non-deterministic case, \(\hat{Q}\left(s, a\right) = \left(1 - \alpha\right) \hat{Q}\left(s, a\right) + \alpha \left(r + \beta \hat{Q}\left(s', a'\right)\right)\).
📗 The main difference is the action used in state \(s'\).
➩ Q learning is an off-policy learning algorithm since \(a'\) is the greedy (optimal) action under the current Q estimate, not the action chosen by the policy that generates the data (the Q function during learning does not correspond to any fixed policy).
➩ SARSA is an on-policy learning algorithm since it computes the Q function of the policy that is actually being followed.
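📗 A sketch of the SARSA update on the same kind of made-up two-state environment as in the Q learning sketch; the only change is that the next action \(a'\) is the one the epsilon-greedy behavior policy actually takes:
```python
import numpy as np

rng = np.random.default_rng(0)

# Same made-up 2-state environment as in the Q learning sketch (illustration only).
def step(s, a):
    s_next = s if a == 0 else 1 - s
    r = 1.0 if (s == 0 and a == 1) else 0.0
    return s_next, r

def eps_greedy(Q, s, eps):
    return int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))

beta, alpha, eps = 0.9, 0.1, 0.1
Q = np.zeros((2, 2))

s = 0
a = eps_greedy(Q, s, eps)
for t in range(20000):
    s_next, r = step(s, a)
    a_next = eps_greedy(Q, s_next, eps)          # action chosen by the behavior policy
    # SARSA: use Q(s', a') for the action actually taken, not max_a' Q(s', a')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + beta * Q[s_next, a_next])
    s, a = s_next, a_next

print("Q =\n", Q)
```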



# Exploration vs Exploitation

📗 The policy used to generate the data for Q learning or SARSA can be epsilon-greedy or UCB (UCB requires some modification for MDPs).
📗 The choice of action can also be randomized: \(\mathbb{P}\left\{a | s\right\} = \dfrac{c^{\hat{Q}\left(s, a\right)}}{c^{\hat{Q}\left(s, 1\right)} + c^{\hat{Q}\left(s, 2\right)} + ... + c^{\hat{Q}\left(s, K\right)}}\), where \(c\) is a parameter controlling the trade-off between exploration and exploitation (for \(c > 1\), values close to \(1\) give nearly uniform exploration, and large values give nearly greedy exploitation).
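📗 A sketch of this randomized action choice for one state, using the base-\(c\) form from the notes with made-up Q estimates:
```python
import numpy as np

rng = np.random.default_rng(0)

Q_row = np.array([1.0, 2.0, 1.5])            # made-up Q estimates for one state
c = 2.0                                      # larger c -> more exploitation (for c > 1)

p = c ** Q_row
p = p / p.sum()                              # P(a | s) = c^Q(s,a) / sum_a' c^Q(s,a')
print("action probabilities:", p)

a = rng.choice(len(Q_row), p=p)              # sample an action from this distribution
print("sampled action:", a)
```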



# Deep Q Learning

📗 In practice, the Q function stored as a table is too large if the number of states is large or infinite (the action space is usually finite): Wikipedia.
📗 If there are \(m\) binary features that represent the state, then the Q table contains \(2^{m} \left| A \right|\) entries, which can be intractable.
📗 In this case, a neural network can be used to store the Q function; if there is a single hidden layer with \(m\) units, then only \(m^{2} + m \left| A \right|\) weights are needed.
➩ The input of the network \(\hat{Q}\) is the features of the state \(s\), and the outputs are the Q values, one for each action \(a\) (the output layer does not need to be softmax since the Q values for different actions do not need to sum up to 1).
➩ After every iteration, the network can be updated using one training item with input \(s_{t}\) and target \(\left(1 - \alpha\right) \hat{Q}\left(s_{t}, a_{t}\right) + \alpha \left(r_{t} + \beta \displaystyle\max_{a_{t+1}} \hat{Q}\left(s_{t+1}, a_{t+1}\right)\right)\) for the output unit of action \(a_{t}\).
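📗 A minimal sketch (using PyTorch) of representing the Q function with a single-hidden-layer network and updating it toward the target above for one made-up transition; the feature size, action count, and all numbers below are assumptions for illustration:
```python
import torch
import torch.nn as nn

torch.manual_seed(0)

m, n_actions = 4, 2                                    # made-up feature and action counts
# One hidden layer with m units: roughly m^2 + m*|A| weights (plus bias terms).
q_net = nn.Sequential(nn.Linear(m, m), nn.ReLU(), nn.Linear(m, n_actions))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
beta, alpha = 0.9, 0.5

# One made-up transition (s_t, a_t, r_t, s_{t+1}) for illustration.
s = torch.randn(1, m)
a = torch.tensor([1])
r = torch.tensor([1.0])
s_next = torch.randn(1, m)

with torch.no_grad():                                   # build the training target
    q_sa = q_net(s)[0, a]
    target = (1 - alpha) * q_sa + alpha * (r + beta * q_net(s_next).max(dim=1).values)

pred = q_net(s)[0, a]                                   # current estimate Q(s_t, a_t)
loss = nn.functional.mse_loss(pred, target)
opt.zero_grad()
loss.backward()
opt.step()
print("loss:", loss.item())
```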



# Deep Q Network

📗 The deep Q learning algorithm can be unstable since the training label is computed using the network itself. The following improvements, together called DQN (Deep Q Network), make the algorithm more stable.
➩ Target network: two networks can be used, the Q network \(\hat{Q}\) and the target network \(Q'\), and the new item for training \(\hat{Q}\) can be changed to \(\left(s_{t}, \left(1 - \alpha\right) Q'\left(s_{t}, a_{t}\right) + \alpha \left(r_{t} + \beta \displaystyle\max_{a_{t+1}} Q'\left(s_{t+1}, a_{t+1}\right)\right)\right)\).
➩ Experience replay: past transitions can be saved in a buffer, and a mini-batch sampled from the buffer can be used to update \(\hat{Q}\) instead of a single item.
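📗 A sketch of how the two DQN changes alter the update loop: targets come from a separate target network \(Q'\) that is synced periodically, and each update uses a mini-batch sampled from a replay buffer of saved transitions. All sizes and the transitions themselves are made up, and for simplicity the target here is \(r + \beta \displaystyle\max_{a'} Q'\left(s', a'\right)\), with the learning-rate averaging left to the optimizer's step size:
```python
import random
import torch
import torch.nn as nn

torch.manual_seed(0); random.seed(0)

m, n_actions, beta = 4, 2, 0.9
q_net = nn.Sequential(nn.Linear(m, m), nn.ReLU(), nn.Linear(m, n_actions))
target_net = nn.Sequential(nn.Linear(m, m), nn.ReLU(), nn.Linear(m, n_actions))
target_net.load_state_dict(q_net.state_dict())          # Q' starts as a copy of Q
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Experience replay: a buffer of made-up transitions (s, a, r, s').
replay = [(torch.randn(m), random.randrange(n_actions), random.random(), torch.randn(m))
          for _ in range(100)]

for step in range(50):
    batch = random.sample(replay, 16)                    # sample a mini-batch
    s = torch.stack([b[0] for b in batch])
    a = torch.tensor([b[1] for b in batch])
    r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    s_next = torch.stack([b[3] for b in batch])

    with torch.no_grad():                                # targets use the target network Q'
        target = r + beta * target_net(s_next).max(dim=1).values
    pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1) # Q(s, a) for the chosen actions
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()

    if step % 10 == 0:                                   # periodically sync Q' with Q
        target_net.load_state_dict(q_net.state_dict())
```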



# Policy Gradient

📗 Deep Q Network estimates the Q function as a neural network, and the policy can be computed as \(\pi\left(s\right) = \mathop{\mathrm{argmax}}_{a} \hat{Q}\left(s, a\right)\), which is always deterministic.
➩ In single-agent reinforcement learning, there is always a deterministic optimal policy, so DQN can be used to solve for an optimal policy.
➩ In multi-agent reinforcement learning, the equilibrium policies can all be stochastic (called a mixed strategy equilibrium), so DQN would not work in this case: Wikipedia.
📗 The policy can also be represented by a neural network called the policy network and it can be trained with or without using a Q network.
➩ Without a Q network: REINFORCE (REward Increment = Nonnegative Factor x Offset Reinforcement x Characteristic Eligibility).
➩ With Q network: A2C (Advantage Actor Critic) and PPO (Proximal Policy Optimization).
📗 To train the policy network \(\hat{\pi}\), the input is the features of the state \(s\) and the output units (softmax output layer) are the probabilities of using each action \(a\). The objective the network maximizes can be the value function or the advantage function (the difference between the Q function and the value function).
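📗 A minimal REINFORCE sketch (using PyTorch, without a Q network): a softmax policy network is updated to increase the log-probability of the actions taken, weighted by the discounted return. The episode generator, sizes, and rewards below are made up for illustration:
```python
import torch
import torch.nn as nn

torch.manual_seed(0)

m, n_actions, beta = 4, 2, 0.9
policy = nn.Sequential(nn.Linear(m, m), nn.ReLU(), nn.Linear(m, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def episode():
    """Made-up 5-step episode: random state features, reward 1 whenever action 0 is chosen."""
    states, actions, rewards = [], [], []
    with torch.no_grad():
        for t in range(5):
            s = torch.randn(m)
            probs = torch.softmax(policy(s), dim=0)      # stochastic policy pi(a | s)
            a = int(torch.multinomial(probs, 1))
            states.append(s); actions.append(a); rewards.append(1.0 if a == 0 else 0.0)
    return states, actions, rewards

for it in range(200):
    states, actions, rewards = episode()
    # Discounted return G_t = r_t + beta r_{t+1} + ..., computed backwards.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + beta * G
        returns.insert(0, G)
    loss = 0.0
    for s, a, G in zip(states, actions, returns):
        logp = torch.log_softmax(policy(s), dim=0)[a]
        loss = loss - logp * G                           # gradient ascent on sum_t G_t log pi(a_t | s_t)
    opt.zero_grad(); loss.backward(); opt.step()
```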



📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Yingyu Liang, and Charles Dyer.






Last Updated: November 18, 2024 at 11:43 PM