Reinforcement Learning

In reinforcement learning problems the feedback is simply a scalar value which may be delayed in time. This reinforcement signal reflects the success or failure of the entire system after it has performed some sequence of actions. Hence the reinforcement signal does not assign credit or blame to any one action (the temporal credit assignment problem), or to any particular node or system element (the structural credit assignment problem).
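As a concrete illustration of the temporal credit assignment problem (this sketch and its discount factor are my own illustrative choices, not from the text), suppose the agent takes a sequence of actions and only the final outcome produces a reward; one simple way to spread that delayed scalar credit back over earlier actions is to discount it:

    # Minimal sketch of temporal credit assignment with a delayed scalar reward.
    # The discount factor gamma is an illustrative assumption.
    def discounted_credit(rewards, gamma=0.9):
        """Give each time step a discounted share of the delayed reward."""
        returns = []
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        return list(reversed(returns))

    # An episode of five actions; only the final outcome carries a reward.
    episode_rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
    print(discounted_credit(episode_rewards))
    # -> [0.6561, 0.729, 0.81, 0.9, 1.0]: earlier actions receive less credit.

The single number at the end says nothing about which of the five actions, or which part of the system, deserves the credit or blame; that is exactly the difficulty the reinforcement signal poses.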

In contrast, in supervised learning the feedback is available after each system action, removing the temporal credit assignment problem; in addition, it indicates the error of individual nodes instead of simply telling how good the outcome was. Supervised learning methods such as back-propagation, off-line clustering, mathematical optimization, and ID3 rely on error signals for the system's output nodes, and they typically train on a fixed set of examples known in advance. But not all learning problems fit this paradigm. Reinforcement learning methods are appropriate when the system must learn on-line, or when no teacher is available to furnish error signals or target outputs. Examples include the following (a small sketch contrasting the two kinds of feedback appears after these examples):

Game playing
If there is no teacher, the player must be able to determine which actions were critical to the outcome and then alter its heuristics accordingly.
Learning in a micro-world
The agent must develop the ability to categorize its perceptions, and to correlate its awareness of its environment with the satisfaction of primitive drives such as pleasure and pain.
On-line control
Controllers of automated processes such as gas pipelines or manufacturing systems must adapt to a dynamically changing environment, where the optimal heuristics are usually not known.
Autonomous robot exploration
Autonomous robots may make it feasible to explore hazardous environments such as the ocean and outer space, using on-line learning to adapt to changing and unforeseen conditions.
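The sketch below contrasts the two kinds of feedback on a single linear layer; the network shapes, learning rate, and update rules are my own minimal assumptions, not a method from the text. A supervised update receives a separate error for every output node, while a reinforcement update receives only one scalar evaluation of the action actually taken.

    import numpy as np

    def supervised_update(W, x, targets, lr=0.1):
        """Supervised feedback: a separate error signal for each output node."""
        y = W @ x
        errors = targets - y                      # per-node errors
        return W + lr * np.outer(errors, x)

    def reinforcement_update(W, x, chosen, reward, lr=0.1):
        """Reinforcement feedback: a single scalar evaluates the whole action,
        so only the weights of the action actually taken can be nudged."""
        W = W.copy()
        W[chosen] = W[chosen] + lr * reward * x
        return W

    # Example usage with a hypothetical 3-action, 4-input linear system.
    W = np.zeros((3, 4))
    x = np.array([1.0, 0.0, 0.5, 0.0])
    W_sup = supervised_update(W, x, targets=np.array([1.0, 0.0, 0.0]))
    W_rl = reinforcement_update(W, x, chosen=0, reward=-1.0)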


Feature extraction---an important subproblem

My work focuses on feature extraction, the development of the system's input representation. Since the reinforcement feedback is not an error signal for individual system elements, it gives little guidance for feature extraction. One reason is that if the system fails by choosing the wrong action, the feedback does not specify which of the output nodes was wrong. In a system which chooses its action by selecting the most active output node, an error can be caused either by a node being too active for a given input or by other nodes not being active enough.
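A small sketch of this ambiguity (the activation values are invented for illustration): the same failure signal is consistent both with the chosen node being too active and with a competing node being too weak.

    import numpy as np

    def choose_action(activations):
        """Select the action whose output node is most active."""
        return int(np.argmax(activations))

    activations = np.array([0.7, 0.6, 0.2])   # node 0 narrowly wins
    action = choose_action(activations)        # -> 0
    reward = -1.0                              # scalar feedback: the outcome was bad

    # The same scalar signal covers two different situations:
    #   (a) node 0 was too active for this input, or
    #   (b) node 1 (the right choice) was not active enough.
    # Nothing in the feedback distinguishes (a) from (b).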

If the system has a hidden layer of feature detectors, another reason for poor feature extraction is that acting properly depends both on identifying the current context and on selecting an action appropriate to that context. A scalar feedback signal does not indicate which of these processes is at fault. The feedback does not distinguish between the case where the system correctly identified its context but selected the wrong response, and the case where the system's learned responses are correct but its feature detectors misidentified the context. In terms of a typical neural network implementation, the system needs to know whether it should tune its feature detectors, the weights placed on the outputs of those feature detectors, or both. The sparse reinforcement signal does not furnish this information. Thus learning methods for reinforcement learning problems may need bottom-up information or some type of internal feedback to supplement the top-down feedback.
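A minimal two-layer sketch of this structural ambiguity (the layer sizes, nonlinearity, and candidate adjustments are my own illustrative choices, not the author's implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    W_hidden = rng.normal(size=(4, 8))   # feature detectors: identify the context
    W_out = rng.normal(size=(3, 4))      # response weights on the detector outputs

    def forward(x):
        h = np.tanh(W_hidden @ x)        # which context is this?
        y = W_out @ h                    # which action suits that context?
        return h, y

    x = rng.normal(size=8)
    h, y = forward(x)
    r = -1.0                             # scalar feedback: the outcome was bad

    # Either adjustment below is equally consistent with the same scalar signal;
    # the feedback alone cannot say whether to retune the feature detectors,
    # the weights placed on their outputs, or both.
    delta_out = r * np.outer(np.ones(3), h)       # blame the response weights
    delta_hidden = r * np.outer(np.ones(4), x)    # blame the feature detectors

The point of the sketch is only that both adjustments are compatible with the feedback; deciding how to apportion the change between the two layers requires additional, internal information of the kind described above.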


David J. Finton November 18, 1994