Ben Geisler
November 22,
1998
In this paper, we look at the task of
learning from reinforcements. One-step
connectionist
Q-Learning was implemented to act as a
controller for the agent “GeislerPlayer” in professor Shavlik’s AgentWorld
program. The backpropogation algorithm
is used as the policy function.
This paper summarizes the results of
two different input/ouput representations of the AgentWorld.
The first representation,
GeislerPlayer, contains four input units.
Each input units represents one of the cardinal directions
(N,S,E,W). The input units lead to five
hidden units, and then to four output units, each representing the Q-value of
the possible actions (move N, move S, move E, move W) on the first state
(called s0).
The second representation,
GeislerPlayer mem, contains eight input units.
The first four are the same as in GeislerPlayer, but the remaining four
represent the input units from the previous run of the algorithm. This is a memory approach, since the input
units (s0), of GeislerMem actually remember the previous sensor values. Each of these input units leads into the
five hidden units, and into the same four output units (the Q-values of each
cardinal direction). The policy
function is therefore able to be influenced by sensor values at the current
state, as well as one step previous to the current state.
We have used discounted sum of rewards
as a performance measure for the above policies. Results of ten games (each with 5000 steps), are presented in the
following table:

The GeislerPlayer with memory seems to
perform better across the board. What
follows is a justification for each method.
GeislerPlayer initially had thirty-six
inputs, it contained an input for every sensor. This approach achieved very poor results. I believe that the poor performance was due
to a poor handling of the high percentage of “nothing sensed” values. When an input is “nothing” on one step, it
will have the same argmax value as if it was “nothing” on the next step. In other words- if “nothing” is sensed, the
input value will always be the same.
This differs from any other sensor value, in that it does not vary. Also, “nothing” was the most likely sensor
value for each step. This seemed to
saturate the other hidden nodes, leaving less of a degree of variance than
before. Ultimately, over many steps-
the algorithm would choose a “favored” direction and almost never left this
path.
Because of this, GeislerPlayer was
narrowed to four input units- one for each cardinal direction.
With only four input units, the hidden
nodes will be less saturated by similarly values. If, on any step there is a single value which is not “nothing”,
it will stand out above the other values, and influence the network
accordingly.
However, this modification allows for
less dynamic decisions. Each sensor
reading is fairly limited. For this
reason, the four sensor readings of step t-1 have been added. This provides for a larger view of the
playing field. The hidden units treat
the extra four inputs in the same manner as the actual sensor readings. This provides for a larger picture of the
move space. For example, say our agent
takes sensor readings at state s0, and decides to move left. The new state s1
now has inputs of s1 and s0. If state
s0 had a very poor argmax value for moving down (Q(s0, down)), then this will
carry through to the value of Q(s1, down).
As another example, consider an agent
who determined that it would move left, since moving down would have produced a
punishment from bumping into a player.
The next state will “remember” this observation, and it’s Q value for
down will be effected by it. This is a
useful result since players and objects take up the space of two moves. GeislerPlayer with memory has a better
knowledge of where it came from.
Ten games were played, the algorithms
ran simultaneously in the agent world. The standard GeislerPlayer agent scored
an average of 741 points per game, and the GeislerPlayer w/memory agent scored
an average of 2,140 points per game.
These experimental results would indicate that, in this case, the memory
approach produces better results.
It would make sense that GeislerPlayer
w/memory should score at least 200% better than the memory-less approach, since
it can remember twice as many states.
Further testing could be done, for example- the number of inputs could
be raised to four (the current state, and the three previous). For reasons of CPU time, I was not able to
complete this extension. The
implementation with four state readings was run three times, with each game
producing better results than the two state approach[1].
[1] Results of current state, and three previous states: Game1: 3048, Game2: 2983, Game3: 3433. Each game took apx. 5 hours to run on my pentium 200mhz, for this reason I was unable to run 10 games.