My primary interest lies in reinforcement learning. For both research and learning purposes, I decided to write programs
in Java to experiment with various reinforcement learning algorithms. This page includes records of these experiments
and any additions to my Java library that resulted from them.
Markov Decision Processes
This section contains experiments related to algorithms taught in COMP SCI 839 - Theory of Reinforcement Learning. An
MDP is represented by an abstract class whose abstract methods define the reward function, the transition
probabilities, and the initial state distribution. This class serves as the mathematical model of the environment and
as the interface through which the algorithms interact with it. The experiments use a simple environment that
simulates movement on a grid.
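As a rough illustration, the abstract class might look something like the sketch below. The class and method names here are my own assumptions for the sake of the example, not necessarily those used in the actual library.

    // Sketch of an abstract MDP class (names and signatures are illustrative assumptions).
    public abstract class MDP {

        // Sizes of the finite state and action spaces.
        public abstract int numStates();
        public abstract int numActions();

        // Expected immediate reward for taking the given action in the given state.
        public abstract double reward(int state, int action);

        // Probability of moving to nextState after taking action in state.
        public abstract double transition(int state, int action, int nextState);

        // Probability that an episode starts in the given state.
        public abstract double initialProbability(int state);

        // Discount factor used by the planning algorithms.
        public abstract double discount();
    }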
The first topic is planning, where the optimal policy of an MDP is computed using full information about the
environment. Four algorithms are presented. The first is value iteration, which repeatedly applies a Bellman update to
a Q-function table; the implementation I created performs the Bellman update on the value function instead. The second
is policy iteration, which alternates between evaluating the current policy (computing its value function) and
improving it (taking the greedy policy with respect to that value function).
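A minimal sketch of one sweep of that Bellman update on the value function, written against the hypothetical MDP interface sketched above, could look like this:

    // One sweep of value iteration: apply the Bellman optimality update to every state.
    static double[] bellmanSweep(MDP mdp, double[] value) {
        double[] next = new double[mdp.numStates()];
        for (int s = 0; s < mdp.numStates(); s++) {
            double best = Double.NEGATIVE_INFINITY;
            for (int a = 0; a < mdp.numActions(); a++) {
                // Expected reward plus discounted value of the successor states.
                double q = mdp.reward(s, a);
                for (int t = 0; t < mdp.numStates(); t++) {
                    q += mdp.discount() * mdp.transition(s, a, t) * value[t];
                }
                best = Math.max(best, q);
            }
            next[s] = best;
        }
        return next;
    }

Repeating this sweep until the value function stops changing, and then acting greedily with respect to it, recovers the optimal policy.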
[Figure: plot of value at each step for both value iteration and policy iteration. Both converge to the optimal value, but value iteration takes more iterations to converge.]
The plot shows the value of the output policy for both algorithms after each iteration. What it does not show is that
a single step of policy iteration took much longer than a single step of value iteration, because of the computational
cost of the closed-form equation used to evaluate a policy. It is also worth noting that although value iteration
found an optimal policy rather quickly, its stored value function was still nowhere near convergence at that point.
The conclusion, at least on this grid environment, is that value iteration is the more efficient of the two.
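For context, the closed-form policy evaluation referred to above solves a linear system over all states. A standard way to write it (assuming a finite MDP with discount factor $\gamma$, transition matrix $P^\pi$ under policy $\pi$, and expected reward vector $r^\pi$) is

$$V^\pi = (I - \gamma P^\pi)^{-1} r^\pi,$$

and solving this system directly costs on the order of $|S|^3$ operations, which is why a single policy iteration step is so much more expensive than a single Bellman sweep over the value function.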