
# Overview

📗 Readings: RL Chapter 4 and AI Chapter 5.

# Q Learning

📗 In the planning problem, the optimal policy \(\pi\) is defined as \(\pi\left(s\right) = \mathop{\mathrm{argmax}}_{a} Q\left(s, a\right)\), where the Q function can be computed based on the Bellman equation \(Q\left(s, a\right) = R\left(s, a\right) + \beta \displaystyle\sum_{s'} P\left(s' | s, a\right) V\left(s'\right)\) and the value function is \(V\left(s\right) = \displaystyle\max_{a} Q\left(s, a\right)\).
📗 The value function of a policy \(\pi\) can also be defined as the discounted sum of rewards obtained by following that policy, \(V\left(s\right) = R\left(s, \pi\left(s\right)\right) + \beta \displaystyle\sum_{s'} P\left(s' | s, \pi\left(s\right)\right) \left(R\left(s', \pi\left(s'\right)\right) + \beta \displaystyle\sum_{s''} P\left(s'' | s', \pi\left(s'\right)\right) \left(R\left(s'', \pi\left(s''\right)\right) + \beta ...\right)\right)\), or equivalently \(V\left(s\right) = R\left(s, \pi\left(s\right)\right) + \beta \displaystyle\sum_{s'} P\left(s' | s, \pi\left(s\right)\right) V\left(s'\right)\).
📗 This definition is recursive, so the value function and the Q function can be computed iteratively (value iteration).
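📗 As a worked example, the sketch below runs value iteration on a small made-up MDP; the two states, two actions, transition probabilities, rewards, and discount factor are all assumptions chosen for illustration, not values from the notes. It repeatedly applies the Bellman update above until \(V\) stops changing.
```python
import numpy as np

# A made-up MDP with 2 states and 2 actions (all numbers are assumptions for illustration).
# P[s, a, s'] = P(s' | s, a), R[s, a] = R(s, a).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
beta = 0.9  # discount factor

V = np.zeros(2)
for _ in range(1000):
    # Q(s, a) = R(s, a) + beta * sum_{s'} P(s' | s, a) V(s')
    Q = R + beta * (P @ V)
    # V(s) = max_a Q(s, a)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:  # stop once the values have converged
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)  # pi(s) = argmax_a Q(s, a)
print("Q:\n", Q)
print("V:", V)
print("policy:", policy)
```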
📗 In the learning problem, where \(P\) and \(R\) are unknown, the Q function can be estimated through repeated interaction with the environment (for example, choosing actions with epsilon-greedy or UCB-type algorithms) using one of the following update rules (a code sketch comparing the two follows the list):
➭ Q learning: \(Q\left(s_{t}, a_{t}\right) = \left(1 - \alpha\right) Q\left(s_{t}, a_{t}\right) + \alpha \left(r_{t} + \beta \displaystyle\max_{a} Q\left(s_{t+1}, a\right)\right)\).
➭ State-Action-Reward-State-Action (SARSA): \(Q\left(s_{t}, a_{t}\right) = \left(1 - \alpha\right) Q\left(s_{t}, a_{t}\right) + \alpha \left(r_{t} + \beta Q\left(s_{t+1}, a_{t+1}\right)\right)\).
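📗 The sketch below contrasts the two update rules in code, assuming a tabular Q table stored as a NumPy array; the epsilon-greedy helper and the function names are made up for illustration, and the step size \(\alpha\), discount \(\beta\), and exploration rate \(\epsilon\) would be supplied by the caller.
```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    # Explore uniformly with probability epsilon, otherwise act greedily on Q.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s_next, alpha, beta):
    # Off-policy target: bootstrap with the greedy value max_a' Q(s_{t+1}, a').
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + beta * np.max(Q[s_next]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, beta):
    # On-policy target: bootstrap with the action a_{t+1} actually chosen next.
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + beta * Q[s_next, a_next])
```
📗 Within an episode, Q learning would call `q_learning_update` right after observing \(\left(s_{t}, a_{t}, r_{t}, s_{t+1}\right)\), while SARSA would first pick \(a_{t+1}\) with the same epsilon-greedy rule and then call `sarsa_update`; bootstrapping on the greedy action makes Q learning off-policy, while bootstrapping on the action actually taken makes SARSA on-policy.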

# Car

📗 The course page includes an interactive demo (not reproduced here) in which a car moves on a grid with actions U, D, L, R, S (up, down, left, right, stay). Given a reward function, a discount factor, and a number of value iterations, it outputs the Q function (one column per action), the V function, and the policy.
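📗 As a rough stand-in for the demo, the sketch below runs value iteration on a made-up deterministic grid world with actions U, D, L, R, S and prints the Q function (one column per action), the V function, and the greedy policy; the grid size, reward values (here the reward depends only on the current cell), discount factor, and number of iterations are all assumptions for illustration.
```python
import numpy as np

# Assumed settings (the interactive demo lets you choose these).
rows, cols = 3, 4
actions = ["U", "D", "L", "R", "S"]  # up, down, left, right, stay
moves = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1), "S": (0, 0)}
reward = np.zeros((rows, cols))
reward[0, cols - 1] = 1.0            # assumed reward of 1 in the top-right cell
beta = 0.9                           # assumed discount factor
n_iterations = 50                    # assumed number of value iterations

def next_state(r, c, a):
    # Deterministic move; driving into a wall leaves the car where it is.
    dr, dc = moves[a]
    nr, nc = r + dr, c + dc
    return (nr, nc) if 0 <= nr < rows and 0 <= nc < cols else (r, c)

V = np.zeros((rows, cols))
Q = np.zeros((rows, cols, len(actions)))
for _ in range(n_iterations):
    for r in range(rows):
        for c in range(cols):
            for i, a in enumerate(actions):
                nr, nc = next_state(r, c, a)
                # Deterministic transitions, so Q(s, a) = R(s) + beta * V(s').
                Q[r, c, i] = reward[r, c] + beta * V[nr, nc]
    V = Q.max(axis=2)                # V(s) = max_a Q(s, a)

policy = np.array(actions)[Q.argmax(axis=2)]  # pi(s) = argmax_a Q(s, a)
print("Q Function (columns are U, D, L, R, S):\n", Q.reshape(rows * cols, len(actions)))
print("V Function:\n", V)
print("Policy:\n", policy)
```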








Last Updated: May 07, 2024 at 12:22 AM