# Reinforcement Learning

📗 An agent interacts with an environment and receives a reward (or incurs a cost) based on the state of the environment and the agent's action.
📗 The goal of reinforcement learning is to maximize the cumulative reward by learning the optimal actions in every state.
📗 Unlike search problems, the agent needs to learn the reward or cost function.
➩ Board games: Link.
➩ Video games: Link.
➩ Robotic control: Link, Link.
➩ Autonomous vehicle control: Link.
➩ Economic models: Link.
➩ Large language models (RLHF: Reinforcement Learning from Human Feedback): Link.

# Multi-Armed Bandit

📗 A simple reinforcement learning problem where the state does not change is called the multi-armed bandit: Wikipedia.
➩ Multi-armed bandits.
➩ Clinical trials.
➩ Stock selection.
📗 There is a set of actions \(1, 2, ..., K\). The reward from action \(k\) follows some distribution with mean \(\mu_{k}\): for example, a normal distribution with mean \(\mu_{k}\) and fixed variance \(\sigma^{2}\), that is, \(r \sim N\left(\mu_{k}, \sigma^{2}\right)\).
📗 The agent's goal is to maximize the total reward collected by choosing one action in each of \(T\) rounds.
➩ Reward maximization: \(\displaystyle\max_{a_{1}, a_{2}, ..., a_{T}} \left(r_{1}\left(a_{1}\right) + r_{2}\left(a_{2}\right) + ... + r_{T}\left(a_{T}\right)\right)\).
➩ Regret minimization: \(\displaystyle\min_{a_{1}, a_{2}, ..., a_{T}} \left(\displaystyle\max_{k} \mu_{k} - \dfrac{1}{T} \left(r_{1}\left(a_{1}\right) + r_{2}\left(a_{2}\right) + ... + r_{T}\left(a_{T}\right)\right)\right)\).
📗 An algorithm is called no-regret if, as \(T \to \infty\), the regret (the gap between the best mean reward and the average reward per round, as defined above) approaches \(0\) with probability \(1\): Wikipedia.
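📗 As a rough illustration of the regret formula above, the following sketch draws rewards from \(N\left(\mu_{k}, \sigma^{2}\right)\) and computes \(\displaystyle\max_{k} \mu_{k}\) minus the average realized reward for a given sequence of actions. The function names `pull` and `average_regret`, the zero-based arm indices, and the default values of `sigma` and `seed` are assumptions made for this example, not part of the notes.

```python
import random


def pull(mu, sigma, rng):
    """Draw one reward r ~ N(mu, sigma^2) for an arm with mean mu."""
    return rng.gauss(mu, sigma)


def average_regret(mus, actions, sigma=1.0, seed=0):
    """Best mean reward minus the average realized reward.

    mus     : list of mean rewards, one per arm (arms indexed from 0 here)
    actions : list of chosen arm indices, one per round (T = len(actions))
    """
    rng = random.Random(seed)
    total = sum(pull(mus[a], sigma, rng) for a in actions)
    return max(mus) - total / len(actions)


if __name__ == "__main__":
    mus = [0.2, 0.8]  # hypothetical mean rewards
    print(average_regret(mus, [0] * 1000))  # always the worse arm
    print(average_regret(mus, [1] * 1000))  # always the better arm
```

With noisy rewards the printed values fluctuate around the gap between the best mean and the mean of the arm actually played, so always playing the best arm should give a value near \(0\).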
📗 [1 point] The boxes have different mean rewards between 0 and 1. Click on one of them to collect the reward from that box. The goal is to maximize the total reward (or minimize the regret) given a fixed number of clicks.

# Exploration vs Exploitation

📗 There is a trade-off between taking actions for exploration vs exploitation:
➩ Exploration: taking actions to get more information (e.g. figure out the expected reward from each action).
➩ Exploitation: taking actions to get the highest rewards based on existing information (e.g. take the best action based on the current estimates of rewards). 
📗 Epsilon-first strategy: use the first \(\varepsilon T\) rounds for pure exploration, then use the empirically best action in the remaining \(\left(1 - \varepsilon\right) T\) rounds.
➩ The empirically best action is \(\mathop{\mathrm{argmax}}_{k} \hat{\mu}_{k}\), where \(\hat{\mu}_{k}\) is the average reward over the rounds in which action \(k\) was used.
📗 Epsilon-greedy strategy: in every round, use the empirically best action with probability \(1 - \varepsilon\), and use a random action with probability \(\varepsilon\).
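📗 The two strategies above can be sketched as follows. This is a minimal sketch under several assumptions that the notes do not specify: rewards are normal with a common `sigma`, "pure exploration" in epsilon-first cycles through the arms in order, untried arms are treated as having average reward \(0\), and ties go to the lowest index. All function names (`empirical_best`, `run_bandit`, `epsilon_first`, `epsilon_greedy`) are hypothetical.

```python
import random


def empirical_best(sums, counts):
    """Arm with the highest average reward so far (untried arms count as 0)."""
    means = [s / n if n > 0 else 0.0 for s, n in zip(sums, counts)]
    return max(range(len(means)), key=means.__getitem__)


def run_bandit(mus, T, choose, sigma=0.1, seed=0):
    """Play T rounds; `choose` picks an arm from the history so far."""
    rng = random.Random(seed)
    K = len(mus)
    sums, counts = [0.0] * K, [0] * K
    total = 0.0
    for t in range(T):
        k = choose(t, sums, counts, rng)
        r = rng.gauss(mus[k], sigma)  # reward r ~ N(mu_k, sigma^2)
        sums[k] += r
        counts[k] += 1
        total += r
    return total, counts


def epsilon_first(eps, T):
    """Explore for about eps * T rounds, then always exploit."""
    explore_rounds = round(eps * T)

    def choose(t, sums, counts, rng):
        if t < explore_rounds:
            return t % len(counts)  # assumption: explore by cycling through arms
        return empirical_best(sums, counts)

    return choose


def epsilon_greedy(eps):
    """In each round: random arm with probability eps, else the empirical best."""

    def choose(t, sums, counts, rng):
        if rng.random() < eps:
            return rng.randrange(len(counts))
        return empirical_best(sums, counts)

    return choose


if __name__ == "__main__":
    mus = [0.2, 0.5, 0.9]  # hypothetical mean rewards
    print(run_bandit(mus, 1000, epsilon_first(0.1, 1000)))
    print(run_bandit(mus, 1000, epsilon_greedy(0.1)))
```

In this sketch, setting `eps` to \(0\) in `epsilon_greedy` removes all exploration, so the agent can lock onto the first arm it tries even when a better arm exists.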

# Upper Confidence Bound

📗 The "best" action (based on current information) can also be defined based on the principle of optimism under uncertainty.
📗 An optimistic guess of the average reward (adjusted for uncertainty) in round \(t\) is \(\hat{\mu}_{k} + c \sqrt{\dfrac{2 \log\left(t\right)}{n_{k}}}\), where \(n_{k}\) is the number of rounds in which action \(k\) has been used so far and \(c\) is a constant that scales the uncertainty term.
➩ The term \(\sqrt{\dfrac{2 \log\left(t\right)}{n_{k}}}\) represents the amount of uncertainty in the estimate \(\hat{\mu}_{k}\): the more action \(k\) is explored (higher \(n_{k}\)), the smaller the value of this term.
📗 The algorithm that always uses the action with the highest UCB is called the UCB1 algorithm: Wikipedia.
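📗 A minimal sketch of the rule above: compute \(\hat{\mu}_{k} + c \sqrt{\dfrac{2 \log\left(t\right)}{n_{k}}}\) for every action and use the one with the highest value. The names `ucb_index` and `ucb1`, the normal reward model with a common `sigma`, and the initial step of trying each action once (so that \(n_{k} > 0\) before the formula is evaluated) are assumptions made for this example.

```python
import math
import random


def ucb_index(mean, n_k, t, c=1.0):
    """Optimistic estimate: mean + c * sqrt(2 * log(t) / n_k)."""
    return mean + c * math.sqrt(2.0 * math.log(t) / n_k)


def ucb1(mus, T, c=1.0, sigma=0.1, seed=0):
    """Play T rounds, always using the arm with the highest UCB.

    Returns the number of times each arm was used.
    """
    rng = random.Random(seed)
    K = len(mus)
    sums, counts = [0.0] * K, [0] * K
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1  # assumption: use each arm once first so n_k > 0
        else:
            k = max(
                range(K),
                key=lambda j: ucb_index(sums[j] / counts[j], counts[j], t, c),
            )
        r = rng.gauss(mus[k], sigma)  # reward r ~ N(mu_k, sigma^2)
        sums[k] += r
        counts[k] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical mean rewards; the arm with the highest mean
    # should end up with most of the rounds.
    print(ucb1([0.2, 0.5, 0.9], 2000))
```

Unlike epsilon-greedy, this rule has no explicit random exploration step: an arm that has been used rarely keeps a large uncertainty term, which eventually pushes its index above the others and gets it tried again.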
Math Note (Optional)
📗 Technically, \(\sqrt{\dfrac{2 \log\left(t\right)}{n_{k}}}\) is half the width of the confidence interval computed from Hoeffding's inequality: Wikipedia.
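📗 A short worked version of this step, under assumptions not stated in the notes: rewards are bounded in \([0, 1]\), the \(n_{k}\) rewards from action \(k\) are independent, and \(n_{k}\) is treated as fixed. The symbol \(a\) is introduced here only for illustration.

```latex
% Hoeffding's inequality for the average of n_k independent rewards in [0, 1]:
\[
P\left(\mu_{k} \geq \hat{\mu}_{k} + a\right) \leq e^{-2 n_{k} a^{2}}
\]
% Substituting the UCB term for a:
\[
a = \sqrt{\dfrac{2 \log\left(t\right)}{n_{k}}}
\quad \Rightarrow \quad
e^{-2 n_{k} a^{2}} = e^{-4 \log\left(t\right)} = t^{-4}
\]
```

Under these assumptions, the chance that the true mean \(\mu_{k}\) lies above the optimistic guess is at most \(t^{-4}\), which shrinks quickly as \(t\) grows. By symmetry the same bound holds on the lower side, which is why the term is half the width of a two-sided interval.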
