My primary interest lies in reinforcement learning. For both research and learning purposes, I decided to write programs in Java to experiment with various reinforcement learning algorithms. This page includes records of these experiments and any additions to my Java library that resulted from them.
An infinite horizon Markov decision process is represented by a collection $(\mathcal{S}, \mathcal{A}, \mu, P, R, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mu$ is the initial state distribution, $P(s' \mid s, a)$ is the transition distribution, $R(s, a)$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor.
To interact with a Markov decision process, first sample the initial state $s_0 \sim \mu$. Then, repeat the following for all $t = 0, 1, 2, \ldots$: the agent selects an action $a_t$ based on the current state, receives the reward $r_t = R(s_t, a_t)$, and the environment transitions to the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.
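To make the interaction concrete, here is a minimal Java sketch of this loop. The interface and method names (`MarkovDecisionProcess`, `Policy`, `rollout`) are hypothetical and chosen for illustration; they are not the API of the library described on this page.

```java
import java.util.Random;

/** Minimal MDP interface (hypothetical; names are illustrative only). */
interface MarkovDecisionProcess {
    int sampleInitialState(Random rng);                      // s_0 ~ mu
    int sampleNextState(int state, int action, Random rng);  // s_{t+1} ~ P(. | s_t, a_t)
    double reward(int state, int action);                    // R(s_t, a_t)
}

/** A policy maps the current state to an action. */
interface Policy {
    int act(int state, Random rng);
}

final class Interaction {
    /** Runs the interaction loop for a fixed number of steps and returns the discounted return. */
    static double rollout(MarkovDecisionProcess mdp, Policy pi, double gamma, int steps, Random rng) {
        int state = mdp.sampleInitialState(rng);
        double ret = 0.0;
        double discount = 1.0;
        for (int t = 0; t < steps; t++) {
            int action = pi.act(state, rng);
            ret += discount * mdp.reward(state, action);
            discount *= gamma;
            state = mdp.sampleNextState(state, action, rng);
        }
        return ret;
    }
}
```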
The goal of reinforcement learning is to find a policy $\pi$ that maximizes the total expected discounted reward received during this interaction when actions are determined using $\pi$. This quantity is called the value and depends on the chosen policy. Note that the discount factor $\gamma < 1$ is required to make the following summation finite because the interaction never terminates.

$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$
The value function is a core concept in reinforcement learning. Some variations of the value function condition the expectation on additional information. The state value function $V^\pi(s)$ is the total expected discounted reward given the current state, and the action value function $Q^\pi(s, a)$ is the total expected discounted reward given the current state and the first action taken.
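Written out explicitly (standard definitions, using the notation introduced above), the two variations are:

$$V^\pi(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right], \qquad Q^\pi(s, a) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\; a_0 = a\right]$$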
Planning algorithms determine an optimal policy when given full information about the Markov decision process. Four standard algorithms exist for solving infinite horizon problems; the experiments here compare two of them, value iteration and policy iteration.
Although policy iteration appears to converge in far fewer iterations than value iteration, each iteration of policy iteration is more computationally expensive, since it requires a full policy evaluation step. Note also that even though the value function estimate produced by value iteration had not fully converged, the policy extracted from it was already optimal. As a result, value iteration generally runs faster overall and is used more commonly in practice.
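For reference, here is a minimal sketch of tabular value iteration with greedy policy extraction, assuming the transition probabilities and rewards are stored in plain arrays. It illustrates the algorithm, not the implementation used for the experiments on this page.

```java
/**
 * Tabular value iteration sketch (illustrative; not the library's actual implementation).
 * P[s][a][s'] is the transition probability, R[s][a] the reward, gamma the discount factor.
 */
final class ValueIteration {
    static int[] solve(double[][][] P, double[][] R, double gamma, double tolerance) {
        int numStates = R.length;
        int numActions = R[0].length;
        double[] V = new double[numStates];
        double delta;
        do {
            delta = 0.0;
            for (int s = 0; s < numStates; s++) {
                double best = Double.NEGATIVE_INFINITY;
                for (int a = 0; a < numActions; a++) {
                    // Bellman optimality backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s,a) V(s')
                    double q = R[s][a];
                    for (int next = 0; next < numStates; next++) {
                        q += gamma * P[s][a][next] * V[next];
                    }
                    best = Math.max(best, q);
                }
                delta = Math.max(delta, Math.abs(best - V[s]));
                V[s] = best;
            }
        } while (delta > tolerance);

        // Extract a greedy policy with respect to the (approximate) optimal value function.
        int[] policy = new int[numStates];
        for (int s = 0; s < numStates; s++) {
            double best = Double.NEGATIVE_INFINITY;
            for (int a = 0; a < numActions; a++) {
                double q = R[s][a];
                for (int next = 0; next < numStates; next++) {
                    q += gamma * P[s][a][next] * V[next];
                }
                if (q > best) {
                    best = q;
                    policy[s] = a;
                }
            }
        }
        return policy;
    }
}
```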
The visualization demonstrates the policy generated by value iteration. The environment consists of movement on a grid. The agent can choose to move up, down, left, or right. However, there is a chance for the agent to "slip," causing the resulting movement to be perpendicular to the desired movement. There is a single position in the center of the grid where the agent receives a reward, so the agent tends to move towards the center.
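A rough sketch of this environment's dynamics follows; the grid size and slip probability are chosen arbitrarily for illustration and may differ from the parameters used in the actual experiment.

```java
import java.util.Random;

/** Sketch of the slippery grid world described above (SIZE and SLIP_PROBABILITY are assumptions). */
final class SlipperyGridWorld {
    static final int SIZE = 9;                  // grid is SIZE x SIZE, reward at the center cell
    static final double SLIP_PROBABILITY = 0.2;
    // Actions: 0 = up, 1 = right, 2 = down, 3 = left (per-action offsets).
    static final int[] DX = {0, 1, 0, -1};
    static final int[] DY = {-1, 0, 1, 0};

    /** Applies an action, possibly slipping perpendicular to the intended direction. */
    static int[] step(int x, int y, int action, Random rng) {
        if (rng.nextDouble() < SLIP_PROBABILITY) {
            // Slip: rotate the intended direction by 90 degrees, either way.
            action = (action + (rng.nextBoolean() ? 1 : 3)) % 4;
        }
        int nx = Math.min(SIZE - 1, Math.max(0, x + DX[action]));
        int ny = Math.min(SIZE - 1, Math.max(0, y + DY[action]));
        return new int[] {nx, ny};
    }

    /** Reward is received only at the single center cell. */
    static double reward(int x, int y) {
        return (x == SIZE / 2 && y == SIZE / 2) ? 1.0 : 0.0;
    }
}
```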
Multi-armed bandits are a specific type of Markov decision process in which the state space is a singleton. However, in bandit settings the reward received after taking an action is usually stochastic, so the agent would like to select the action that maximizes the expected reward.
In a planning setting, the solution to this problem is trivial: simply choose the action with the highest expected reward. Instead, we care about the agent's behavior when it is given no information about the environment. In that case, the agent must balance exploration and exploitation.
The challenge is to balance exploration, which gathers information about the actions, with exploitation, which uses the current estimates to earn reward, since the rewards gained along the way all count toward the total. To quantify how well an agent manages this trade-off, we use the notion of regret.
The regret associated with taking an action is the difference between the expected reward of the optimal action and the expected reward of that action. If the regret for an action is zero, then that action is optimal. For online learning algorithms, we consider the total regret accumulated across all actions taken over the course of the episode. As more actions are taken, we would like the total regret to grow sublinearly with respect to the number of actions taken.
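In symbols, assuming action $a$ has expected reward $\mu_a$ and the best expected reward is $\mu^* = \max_a \mu_a$, the regret of playing $a$ is $\Delta_a = \mu^* - \mu_a$, and the total regret after $T$ plays is

$$\text{Regret}(T) = \sum_{t=1}^{T} \Delta_{a_t}.$$

Sublinear total regret means $\text{Regret}(T)/T \to 0$, i.e., the average regret per action vanishes as the interaction continues.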
For the bandit setting, we have the Upper Confidence Bound (UCB) algorithm. When choosing an action, UCB considers the empirical average reward of each action along with how many times each action has been played. It then forms a pseudo-reward that favors actions with either a high empirical average reward or a small sample size.
If the exploration constant is chosen properly, theory provides a guarantee that the total regret grows no faster than a square-root curve (up to logarithmic factors) in the number of actions taken.
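For reference, the usual form of the UCB index, assuming rewards in $[0, 1]$ and an exploration constant $c$, is $\hat{\mu}_a + c\sqrt{\ln t / n_a}$, where $\hat{\mu}_a$ is the empirical average reward of action $a$, $n_a$ is the number of times it has been played, and $t$ is the total number of plays so far. A minimal Java sketch of this rule follows; the class and method names are illustrative, not the library's API.

```java
/** UCB1-style bandit agent sketch (illustrative; class and method names are assumptions). */
final class UcbAgent {
    private final double[] totalReward;   // sum of observed rewards per action
    private final int[] playCount;        // n_a: number of times each action was played
    private final double c;               // exploration constant
    private int totalPlays = 0;           // t: total number of actions taken so far

    UcbAgent(int numActions, double explorationConstant) {
        this.totalReward = new double[numActions];
        this.playCount = new int[numActions];
        this.c = explorationConstant;
    }

    /** Chooses the action with the highest upper confidence bound. */
    int selectAction() {
        // Play every action once before relying on the confidence bounds.
        for (int a = 0; a < playCount.length; a++) {
            if (playCount[a] == 0) {
                return a;
            }
        }
        int best = 0;
        double bestIndex = Double.NEGATIVE_INFINITY;
        for (int a = 0; a < playCount.length; a++) {
            double mean = totalReward[a] / playCount[a];
            double bonus = c * Math.sqrt(Math.log(totalPlays) / playCount[a]);
            double index = mean + bonus;   // pseudo-reward: empirical mean plus exploration bonus
            if (index > bestIndex) {
                bestIndex = index;
                best = a;
            }
        }
        return best;
    }

    /** Records the reward observed after playing an action. */
    void observe(int action, double reward) {
        totalReward[action] += reward;
        playCount[action]++;
        totalPlays++;
    }
}
```

In a simulation, the agent simply alternates `selectAction()` and `observe(action, reward)` each round.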