# Q Learning

📗 In the planning problem, the optimal policy \(\pi\) is defined as \(\pi\left(s\right) = \mathop{\mathrm{argmax}}_{a} Q\left(s, a\right)\), where the Q function satisfies the Bellman equation \(Q\left(s, a\right) = r\left(s, a\right) + \beta \displaystyle\sum_{s'} P\left(s' | s, a\right) V\left(s'\right)\), the value function is \(V\left(s\right) = \displaystyle\max_{a} Q\left(s, a\right)\), and \(\beta\) is the discount factor.
📗 For any fixed policy \(\pi\), the value function can also be defined as the expected discounted sum of rewards, \(V\left(s\right) = r\left(s, \pi\left(s\right)\right) + \beta \displaystyle\sum_{s'} P\left(s' | s, \pi\left(s\right)\right) \left(r\left(s', \pi\left(s'\right)\right) + \beta \displaystyle\sum_{s''} P\left(s'' | s', \pi\left(s'\right)\right) \left(r\left(s'', \pi\left(s''\right)\right) + \beta ...\right)\right)\), or, equivalently, \(V\left(s\right) = r\left(s, \pi\left(s\right)\right) + \beta \displaystyle\sum_{s'} P\left(s' | s, \pi\left(s\right)\right) V\left(s'\right)\).
📗 This definition is recursive, so the Q function can be computed iteratively: start from an initial guess and apply the Bellman equation repeatedly (value iteration).
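📗 The iterative computation can be sketched as follows. This is a minimal illustration, not the code behind this page: the two-state, two-action MDP (`P`, `r`), the discount factor `beta = 0.5`, and the function name `value_iteration` are all assumptions made up for the example.

```python
# Hypothetical 2-state, 2-action MDP, used only for illustration.
# P[s][a][s2] is the probability of moving from s to s2 under action a.
# Here the transitions are deterministic: action 0 leads to state 0,
# action 1 leads to state 1, regardless of the current state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # from state 0
    [[1.0, 0.0], [0.0, 1.0]],  # from state 1
]
# r[s][a] is the reward; only taking action 1 in state 1 pays off.
r = [
    [0.0, 0.0],
    [0.0, 1.0],
]
beta = 0.5  # discount factor (assumed value)


def value_iteration(P, r, beta, iterations=100):
    """Repeatedly apply Q(s,a) = r(s,a) + beta * sum_s' P(s'|s,a) V(s')."""
    n_states = len(P)
    n_actions = len(P[0])
    V = [0.0] * n_states  # initial guess
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iterations):
        Q = [
            [
                r[s][a] + beta * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
        V = [max(Q[s]) for s in range(n_states)]  # V(s) = max_a Q(s,a)
    # pi(s) = argmax_a Q(s,a)
    policy = [max(range(n_actions), key=lambda a: Q[s][a]) for s in range(n_states)]
    return Q, V, policy


Q, V, policy = value_iteration(P, r, beta)
print("Q:", Q)
print("V:", V)
print("policy:", policy)
```

📗 In this toy MDP the fixed point can be checked by hand: staying in state 1 forever earns \(1 + \beta + \beta^{2} + ... = 2\), so \(V\left(1\right) = 2\) and \(V\left(0\right) = \beta V\left(1\right) = 1\).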
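📗 In the learning problem, the transition probabilities and rewards are not known in advance, so the Q function is estimated through repeated interaction with the environment (choosing actions with, for example, epsilon-greedy or UCB-type exploration) using a sample-based version of value iteration: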
➩ Q learning: \(Q\left(s_{t}, a_{t}\right) = \left(1 - \alpha\right) Q\left(s_{t}, a_{t}\right) + \alpha \left(r_{t} + \beta \displaystyle\max_{a} Q\left(s_{t+1}, a\right)\right)\), where \(\alpha\) is the learning rate.
➩ State-Action-Reward-State-Action (SARSA): \(Q\left(s_{t}, a_{t}\right) = \left(1 - \alpha\right) Q\left(s_{t}, a_{t}\right) + \alpha \left(r_{t} + \beta Q\left(s_{t+1}, a_{t+1}\right)\right)\), where \(a_{t+1}\) is the action actually taken in state \(s_{t+1}\).
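
📗 The two update rules above can be sketched in code as follows. This is an illustrative sketch under stated assumptions, not the implementation behind this page: the helper names (`epsilon_greedy`, `q_learning_update`, `sarsa_update`, `step`, `run`), the toy two-state environment, and the parameter values are all hypothetical.

```python
import random


def epsilon_greedy(Q, s, epsilon, rng):
    """With probability epsilon explore uniformly, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[s]))
    return max(range(len(Q[s])), key=lambda a: Q[s][a])


def q_learning_update(Q, s, a, reward, s_next, alpha, beta):
    # The target uses the best action in the next state (off-policy).
    target = reward + beta * max(Q[s_next])
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target


def sarsa_update(Q, s, a, reward, s_next, a_next, alpha, beta):
    # The target uses the action actually taken in the next state (on-policy).
    target = reward + beta * Q[s_next][a_next]
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target


def step(s, a):
    """Hypothetical environment: action a leads to state a;
    only taking action 1 while in state 1 gives reward 1."""
    reward = 1.0 if (s == 1 and a == 1) else 0.0
    return reward, a


def run(algorithm, steps=5000, alpha=0.1, beta=0.5, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]
    s = 0
    a = epsilon_greedy(Q, s, epsilon, rng)
    for _ in range(steps):
        reward, s_next = step(s, a)
        # The next action is chosen before the update so that SARSA can use it;
        # Q learning ignores a_next in its target.
        a_next = epsilon_greedy(Q, s_next, epsilon, rng)
        if algorithm == "sarsa":
            sarsa_update(Q, s, a, reward, s_next, a_next, alpha, beta)
        else:
            q_learning_update(Q, s, a, reward, s_next, alpha, beta)
        s, a = s_next, a_next
    return Q


print("Q learning:", run("q_learning"))
print("SARSA:     ", run("sarsa"))
```

📗 The only difference between the two functions is the target: Q learning bootstraps from \(\displaystyle\max_{a} Q\left(s_{t+1}, a\right)\), while SARSA bootstraps from the value of the action the exploring agent actually takes next, so its estimates reflect the exploration policy.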

# Car

📗 Interactive demo (controls not reproduced here). Settings: reward function, discount factor, and number of value iterations. Output: Q function (columns are U, D, L, R, S), V function, and policy.

Last Updated: November 30, 2024 at 4:34 AM