# Reinforcement Learning
📗 The environment is modeled by a Markov Decision Process (MDP): \(M = \left(S, A, R, P\right)\), where \(S\) is the set of states, \(A\) is the set of actions, \(R\) is the reward function where \(R\left(s, a\right)\) is the reward received for performing action \(a \in A\) in state \(s \in S\), and \(P\) is the transition function where \(P\left(s' | s, a\right)\) is the probability that the state becomes \(s' \in S\) when action \(a \in A\) is performed in state \(s \in S\), and \(P\left(s\right)\) is the probability that the initial state is \(s\).
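As a concrete illustration, a small finite MDP can be stored with plain Python dictionaries. The two states, two actions, rewards, and transition probabilities below are made-up toy values, not anything specified in these notes.

```python
# A toy MDP (S, A, R, P) stored as plain dictionaries.
# All states, actions, rewards, and probabilities are made up for illustration.

states = ["s0", "s1"]          # S
actions = ["stay", "move"]     # A

# R[(s, a)]: reward received for performing action a in state s.
R = {
    ("s0", "stay"): 0.0, ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0, ("s1", "move"): 0.0,
}

# P[(s, a)][s2]: probability that the state becomes s2 after action a in state s.
P = {
    ("s0", "stay"): {"s0": 1.0, "s1": 0.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# P0[s]: probability that the initial state is s.
P0 = {"s0": 1.0, "s1": 0.0}
```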
📗 The planning problem is when the agent is given the MDP and tries to find the optimal policy \(\pi\) where \(\pi\left(s\right) \in A\) is the optimal action in state \(s\).
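One standard way to solve the planning problem is value iteration. The sketch below reuses the toy `states`, `actions`, `R`, `P` dictionaries from above and assumes a discount factor `gamma` and a convergence tolerance, neither of which is part of the definitions in these notes.

```python
def value_iteration(states, actions, R, P, gamma=0.9, tol=1e-6):
    """Plan with a known MDP: compute the optimal policy pi via value iteration.

    gamma (discount factor) and tol are assumed here, not given in the notes.
    """
    V = {s: 0.0 for s in states}
    while True:
        # Bellman optimality update: best one-step reward plus discounted future value.
        V_new = {
            s: max(
                R[(s, a)] + gamma * sum(P[(s, a)][s2] * V[s2] for s2 in states)
                for a in actions
            )
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            V = V_new
            break
        V = V_new
    # Greedy policy with respect to the converged values.
    pi = {
        s: max(
            actions,
            key=lambda a: R[(s, a)] + gamma * sum(P[(s, a)][s2] * V[s2] for s2 in states),
        )
        for s in states
    }
    return pi, V
```

With the toy MDP above, `pi, V = value_iteration(states, actions, R, P)` returns the optimal action for each state and the corresponding state values.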
📗 The online learning problem is when the agent does not know \(R, P\) and tries to learn the optimal \(\pi\) through interaction with \(M\). In period \(t\), an agent in state \(s_{t}\) performs an action \(a_{t}\), obtains the reward feedback \(r_{t}\), and the environment transitions to \(s_{t+1}\).
➩ Model-based: the agent can learn \(R, P\) and then figure out \(\pi\) through planning.
➩ Model-free: the agent can learn \(\pi\) directly without estimating \(R, P\), as in the tabular Q-learning sketch below.
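Tabular Q-learning is one common model-free method for the online setting. The sketch below assumes an interface `env_step(s, a)` that interacts with \(M\) and returns \((r_{t}, s_{t+1})\), plus a learning rate, discount factor, and exploration rate; none of these appear in these notes and are only illustrative choices.

```python
import random

def q_learning(states, actions, env_step, s0, episodes=1000, steps=100,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Model-free online learning: learn pi from interaction with M,
    without estimating R or P.

    env_step(s, a) -> (r, s_next) is an assumed environment interface;
    alpha, gamma, epsilon, episodes, steps are assumed hyperparameters.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = s0
        for _ in range(steps):
            # Epsilon-greedy: mostly exploit the current Q, sometimes explore.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            r, s_next = env_step(s, a)
            # Q-learning update toward the one-step bootstrapped target.
            target = r + gamma * max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    # Greedy policy derived from the learned Q values.
    pi = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
    return pi, Q
```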
📗 The offline learning problem is when the agent cannot interact with \(M\) and is only given a training set of transitions \(\left\{\left(s_{t}, a_{t}, r_{t}, s_{t+1}\right)\right\}\). The policy used to collect the offline data can be unknown. Offline learning can also be model-based or model-free.
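As a minimal model-free illustration of the offline setting, the same Q-learning update can be swept repeatedly over the fixed batch of transitions instead of over online interactions. This is only a sketch under assumed hyperparameters; it ignores the distribution-shift issues that dedicated offline RL methods are designed to handle.

```python
def offline_q_learning(states, actions, dataset, sweeps=100, alpha=0.1, gamma=0.9):
    """Model-free offline learning from a fixed batch of (s, a, r, s_next) tuples.

    No interaction with M: only the given transitions are replayed.
    sweeps, alpha, gamma are assumed hyperparameters not given in the notes.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        for (s, a, r, s_next) in dataset:
            # Same bootstrapped update as online Q-learning, applied to logged data.
            target = r + gamma * max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
    # Greedy policy derived from the learned Q values.
    pi = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
    return pi, Q
```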