My primary interest lies in reinforcement learning. For both research and learning purposes, I decided to write programs in Java to experiment with various reinforcement learning algorithms. This page includes records of these experiments and any additions to my Java library that resulted from them.
An infinite horizon Markov decision process is represented by a collection $(\mathcal{S}, \mathcal{A}, \mu, P, R, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mu$ is the initial state distribution, $P(s' \mid s, a)$ is the transition distribution, $R(s, a)$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor.
To interact with a Markov decision process, first sample the initial state $s_0 \sim \mu$. Then, repeat the following for all $t = 0, 1, 2, \ldots$: the agent selects an action $a_t$ based on the current state, receives the reward $r_t = R(s_t, a_t)$, and the environment transitions to the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.
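To make the interaction concrete, here is a minimal Java sketch of this loop. The interface and method names (`MarkovDecisionProcess`, `Policy`, `rollout`) are hypothetical and chosen for illustration; they are not the API of the library described on this page.

```java
import java.util.Random;

/** Minimal MDP interface (hypothetical; names are illustrative only). */
interface MarkovDecisionProcess {
    int sampleInitialState(Random rng);                      // s_0 ~ mu
    int sampleNextState(int state, int action, Random rng);  // s_{t+1} ~ P(. | s_t, a_t)
    double reward(int state, int action);                    // R(s_t, a_t)
}

/** A policy maps the current state to an action. */
interface Policy {
    int act(int state, Random rng);
}

final class Interaction {
    /** Runs the interaction loop for a fixed number of steps and returns the discounted return. */
    static double rollout(MarkovDecisionProcess mdp, Policy pi, double gamma, int steps, Random rng) {
        int state = mdp.sampleInitialState(rng);
        double ret = 0.0;
        double discount = 1.0;
        for (int t = 0; t < steps; t++) {
            int action = pi.act(state, rng);
            ret += discount * mdp.reward(state, action);
            discount *= gamma;
            state = mdp.sampleNextState(state, action, rng);
        }
        return ret;
    }
}
```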
The goal of reinforcement learning is to find a policy $\pi$ that maximizes the total expected discounted reward received during this interaction when actions are determined using $\pi$. This quantity is called the value and depends on the chosen policy. Note that the discount factor $\gamma < 1$ is required to make the following summation finite because the interaction never terminates.

$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$
The value function is a core concept in reinforcement learning. Some variations of the value function condition the expectation on additional information. The state value function $V^\pi(s)$ is the total expected discounted reward given the current state, and the action value function $Q^\pi(s, a)$ is the total expected discounted reward given the current state and the first action taken.
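Written out explicitly (standard definitions, using the notation introduced above), the two variations are:

$$V^\pi(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right], \qquad Q^\pi(s, a) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\; a_0 = a\right]$$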
Planning algorithms determine an optimal policy when given full information about the Markov decision process. Four standard algorithms exist for solving infinite horizon problems; the experiments here compare two of them, value iteration and policy iteration.
Although policy iteration appears to converge in far fewer iterations than value iteration, each iteration of policy iteration is more computationally expensive, since it requires a full policy evaluation step. Note also that even though the value function estimate produced by value iteration had not fully converged, the policy extracted from it was already optimal. As a result, value iteration generally runs faster overall and is used more commonly in practice.
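For reference, here is a minimal sketch of tabular value iteration with greedy policy extraction, assuming the transition probabilities and rewards are stored in plain arrays. It illustrates the algorithm, not the implementation used for the experiments on this page.

```java
/**
 * Tabular value iteration sketch (illustrative; not the library's actual implementation).
 * P[s][a][s'] is the transition probability, R[s][a] the reward, gamma the discount factor.
 */
final class ValueIteration {
    static int[] solve(double[][][] P, double[][] R, double gamma, double tolerance) {
        int numStates = R.length;
        int numActions = R[0].length;
        double[] V = new double[numStates];
        double delta;
        do {
            delta = 0.0;
            for (int s = 0; s < numStates; s++) {
                double best = Double.NEGATIVE_INFINITY;
                for (int a = 0; a < numActions; a++) {
                    // Bellman optimality backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s,a) V(s')
                    double q = R[s][a];
                    for (int next = 0; next < numStates; next++) {
                        q += gamma * P[s][a][next] * V[next];
                    }
                    best = Math.max(best, q);
                }
                delta = Math.max(delta, Math.abs(best - V[s]));
                V[s] = best;
            }
        } while (delta > tolerance);

        // Extract a greedy policy with respect to the (approximate) optimal value function.
        int[] policy = new int[numStates];
        for (int s = 0; s < numStates; s++) {
            double best = Double.NEGATIVE_INFINITY;
            for (int a = 0; a < numActions; a++) {
                double q = R[s][a];
                for (int next = 0; next < numStates; next++) {
                    q += gamma * P[s][a][next] * V[next];
                }
                if (q > best) {
                    best = q;
                    policy[s] = a;
                }
            }
        }
        return policy;
    }
}
```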
The visualization demonstrates the policy generated by value iteration. The environment consists of movement on a grid. The agent can choose to move up, down, left, or right. However, there is a chance for the agent to "slip," causing the resulting movement to be perpendicular to the desired movement. There is a single position in the center of the grid where the agent receives a reward, so the agent tends to move towards the center.
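A rough sketch of this environment's dynamics follows; the grid size and slip probability are chosen arbitrarily for illustration and may differ from the parameters used in the actual experiment.

```java
import java.util.Random;

/** Sketch of the slippery grid world described above (SIZE and SLIP_PROBABILITY are assumptions). */
final class SlipperyGridWorld {
    static final int SIZE = 9;                  // grid is SIZE x SIZE, reward at the center cell
    static final double SLIP_PROBABILITY = 0.2;
    // Actions: 0 = up, 1 = right, 2 = down, 3 = left (per-action offsets).
    static final int[] DX = {0, 1, 0, -1};
    static final int[] DY = {-1, 0, 1, 0};

    /** Applies an action, possibly slipping perpendicular to the intended direction. */
    static int[] step(int x, int y, int action, Random rng) {
        if (rng.nextDouble() < SLIP_PROBABILITY) {
            // Slip: rotate the intended direction by 90 degrees, either way.
            action = (action + (rng.nextBoolean() ? 1 : 3)) % 4;
        }
        int nx = Math.min(SIZE - 1, Math.max(0, x + DX[action]));
        int ny = Math.min(SIZE - 1, Math.max(0, y + DY[action]));
        return new int[] {nx, ny};
    }

    /** Reward is received only at the single center cell. */
    static double reward(int x, int y) {
        return (x == SIZE / 2 && y == SIZE / 2) ? 1.0 : 0.0;
    }
}
```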
Multi-armed bandits are a specific type of Markov decision process in which the state space is a singleton. However, in bandit settings the reward received after taking an action is usually stochastic, so the agent would like to select the action that maximizes the expected reward.
In a planning setting, the solution to this problem is trivial: simply choose the action with the highest expected reward. Instead, we care about the agent's behavior when it is given no information about the environment. In that case, the agent must balance exploration and exploitation.
The challenge is to balance exploration, which gathers information about the actions, with exploitation, which uses the current estimates to earn reward, since the rewards gained along the way all count toward the total. To quantify how well an agent manages this trade-off, we use the notion of regret.
The regret associated with taking an action is the difference between the expected reward of the optimal action and the expected reward of that action. If the regret for an action is zero, then that action is optimal. For online learning algorithms, we consider the total regret accumulated across all actions taken over the course of the episode. As more actions are taken, we would like the total regret to grow sublinearly with respect to the number of actions taken.
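In symbols, assuming action $a$ has expected reward $\mu_a$ and the best expected reward is $\mu^* = \max_a \mu_a$, the regret of playing $a$ is $\Delta_a = \mu^* - \mu_a$, and the total regret after $T$ plays is

$$\text{Regret}(T) = \sum_{t=1}^{T} \Delta_{a_t}.$$

Sublinear total regret means $\text{Regret}(T)/T \to 0$, i.e., the average regret per action vanishes as the interaction continues.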
For the bandit setting, we have the Upper Confidence Bound (UCB) algorithm. When choosing an action, UCB considers the empirical average reward of each action along with how many times each action has been played. It then forms a pseudo-reward that favors actions with either a high empirical average reward or a small sample size.
If the exploration constant is chosen properly, theory provides a guarantee that the total regret grows no faster than a square-root curve (up to logarithmic factors) in the number of actions taken.
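For reference, the usual form of the UCB index, assuming rewards in $[0, 1]$ and an exploration constant $c$, is $\hat{\mu}_a + c\sqrt{\ln t / n_a}$, where $\hat{\mu}_a$ is the empirical average reward of action $a$, $n_a$ is the number of times it has been played, and $t$ is the total number of plays so far. A minimal Java sketch of this rule follows; the class and method names are illustrative, not the library's API.

```java
/** UCB1-style bandit agent sketch (illustrative; class and method names are assumptions). */
final class UcbAgent {
    private final double[] totalReward;   // sum of observed rewards per action
    private final int[] playCount;        // n_a: number of times each action was played
    private final double c;               // exploration constant
    private int totalPlays = 0;           // t: total number of actions taken so far

    UcbAgent(int numActions, double explorationConstant) {
        this.totalReward = new double[numActions];
        this.playCount = new int[numActions];
        this.c = explorationConstant;
    }

    /** Chooses the action with the highest upper confidence bound. */
    int selectAction() {
        // Play every action once before relying on the confidence bounds.
        for (int a = 0; a < playCount.length; a++) {
            if (playCount[a] == 0) {
                return a;
            }
        }
        int best = 0;
        double bestIndex = Double.NEGATIVE_INFINITY;
        for (int a = 0; a < playCount.length; a++) {
            double mean = totalReward[a] / playCount[a];
            double bonus = c * Math.sqrt(Math.log(totalPlays) / playCount[a]);
            double index = mean + bonus;   // pseudo-reward: empirical mean plus exploration bonus
            if (index > bestIndex) {
                bestIndex = index;
                best = a;
            }
        }
        return best;
    }

    /** Records the reward observed after playing an action. */
    void observe(int action, double reward) {
        totalReward[action] += reward;
        playCount[action]++;
        totalPlays++;
    }
}
```

In a simulation, the agent simply alternates `selectAction()` and `observe(action, reward)` each round.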