Learning From Reinforcements: Q-Learning

Ben Geisler

November 22, 1998

In this paper, we look at the task of learning from reinforcements. One-step connectionist

Q-Learning was implemented to act as a controller for the agent “GeislerPlayer” in professor Shavlik’s AgentWorld program. The backpropogation algorithm is used as the policy function.

This paper summarizes the results of two different input/ouput representations of the AgentWorld.

The first representation, GeislerPlayer, contains four input units. Each input units represents one of the cardinal directions (N,S,E,W). The input units lead to five hidden units, and then to four output units, each representing the Q-value of the possible actions (move N, move S, move E, move W) on the first state (called s0).

The second representation, GeislerPlayer mem, contains eight input units. The first four are the same as in GeislerPlayer, but the remaining four represent the input units from the previous run of the algorithm. This is a memory approach, since the input units (s0), of GeislerMem actually remember the previous sensor values. Each of these input units leads into the five hidden units, and into the same four output units (the Q-values of each cardinal direction). The policy function is therefore able to be influenced by sensor values at the current state, as well as one step previous to the current state.

We have used discounted sum of rewards as a performance measure for the above policies. Results of ten games (each with 5000 steps), are presented in the following table:

Motivation

The GeislerPlayer with memory seems to perform better across the board. What follows is a justification for each method.

GeislerPlayer initially had thirty-six inputs, it contained an input for every sensor. This approach achieved very poor results. I believe that the poor performance was due to a poor handling of the high percentage of “nothing sensed” values. When an input is “nothing” on one step, it will have the same argmax value as if it was “nothing” on the next step. In other words- if “nothing” is sensed, the input value will always be the same. This differs from any other sensor value, in that it does not vary. Also, “nothing” was the most likely sensor value for each step. This seemed to saturate the other hidden nodes, leaving less of a degree of variance than before. Ultimately, over many steps- the algorithm would choose a “favored” direction and almost never left this path.

Because of this, GeislerPlayer was narrowed to four input units- one for each cardinal direction.

With only four input units, the hidden nodes will be less saturated by similarly values. If, on any step there is a single value which is not “nothing”, it will stand out above the other values, and influence the network accordingly.

However, this modification allows for less dynamic decisions. Each sensor reading is fairly limited. For this reason, the four sensor readings of step t-1 have been added. This provides for a larger view of the playing field. The hidden units treat the extra four inputs in the same manner as the actual sensor readings. This provides for a larger picture of the move space. For example, say our agent takes sensor readings at state s0, and decides to move left. The new state s1 now has inputs of s1 and s0. If state s0 had a very poor argmax value for moving down (Q(s0, down)), then this will carry through to the value of Q(s1, down).

As another example, consider an agent who determined that it would move left, since moving down would have produced a punishment from bumping into a player. The next state will “remember” this observation, and it’s Q value for down will be effected by it. This is a useful result since players and objects take up the space of two moves. GeislerPlayer with memory has a better knowledge of where it came from.

Results to Confirm this Hypothesis

Ten games were played, the algorithms ran simultaneously in the agent world. The standard GeislerPlayer agent scored an average of 741 points per game, and the GeislerPlayer w/memory agent scored an average of 2,140 points per game. These experimental results would indicate that, in this case, the memory approach produces better results.

It would make sense that GeislerPlayer w/memory should score at least 200% better than the memory-less approach, since it can remember twice as many states. Further testing could be done, for example- the number of inputs could be raised to four (the current state, and the three previous). For reasons of CPU time, I was not able to complete this extension. The implementation with four state readings was run three times, with each game producing better results than the two state approach[1].

[1] Results of current state, and three previous states: Game1: 3048, Game2: 2983, Game3: 3433. Each game took apx. 5 hours to run on my pentium 200mhz, for this reason I was unable to run 10 games.