In reinforcement learning problems the feedback is simply a scalar
value which may be delayed in time. This reinforcement signal
reflects the success or failure of the entire system after it has
performed some sequence of actions. Hence the reinforcement signal
does not assign credit or blame to any one action (the temporal
credit assignment problem), or to any particular node or system
element (the structural credit assignment problem).
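This setting can be sketched in a few lines; the episode loop, the toy environment, and the terminal-reward rule below are all hypothetical illustrations, not any particular system from the text:

```python
import random

def run_episode(policy, env_step, horizon=10):
    """Run one sequence of actions; only a single scalar reward arrives
    at the end. That one number scores the whole sequence: it assigns
    no credit or blame to any individual action (temporal credit
    assignment) or to any individual system element (structural credit
    assignment)."""
    state = 0
    actions = []
    for _ in range(horizon):
        action = policy(state)
        state = env_step(state, action)
        actions.append(action)
    reward = 1.0 if state > 5 else 0.0  # one delayed scalar for the episode
    return actions, reward

# Toy instance: random binary actions; the state just counts the 1s chosen.
random.seed(0)
actions, reward = run_episode(lambda s: random.randint(0, 1),
                              lambda s, a: s + a)
```

Contrast this with supervised learning, where a target would be available for every `action` in the loop rather than one `reward` for the whole trajectory.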
In contrast, in supervised learning the feedback is available after
each system action, removing the temporal credit assignment problem;
in addition, it indicates the error of individual nodes instead of
simply telling how good the outcome was. Supervised learning methods,
for instance back-propagation, off-line clustering, mathematical
optimization, and ID3, rely on having error signals for the system's
output nodes, and typically train on a fixed set of examples which is
known in advance. But not all learning problems fit this paradigm.
Reinforcement learning methods are appropriate when the system is
required to learn on-line, or a teacher is not available to furnish
error signals or target outputs. Examples include:
- Game playing
  If there is no teacher, the player must be able to determine
  which actions were critical to the outcome and then alter its
  strategy accordingly.
- Learning in a micro-world
  The agent must develop the ability to categorize its perceptions,
  and to correlate its awareness of its environment with the
  satisfaction of primitive drives such as pleasure and pain.
- On-line control
  Controllers of automated processes such as gas pipelines or
  manufacturing systems must adapt to a dynamically changing
  environment, where the optimal heuristics are usually not known.
- Autonomous robot exploration
  Autonomous robots may make feasible the exploration of hazardous
  environments such as the ocean and outer space, using on-line
  learning to adapt to changing and unforeseen conditions.
Feature extraction---an important subproblem
My work focuses on feature extraction, the development of the system's
input representation. Since the reinforcement feedback is not an
error signal for individual system elements, it gives little guidance
for feature extraction. One reason is that if the system fails by
choosing the wrong action, the feedback does not specify which of the
output nodes was wrong. In a system which chooses its action by
selecting the most active output node, an error can be caused either
by one node being too active for a given input, or by other nodes not
being active enough.
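The ambiguity can be seen directly in argmax action selection; the activation values below are illustrative, and the helper function is a hypothetical sketch:

```python
def choose_action(activations):
    """Select the most active output node (argmax action selection)."""
    return max(range(len(activations)), key=activations.__getitem__)

# Suppose the correct action is 1, but the system picks 0 in both cases:
case_too_active   = [0.9, 0.4, 0.3]  # node 0 is too active for this input
case_too_inactive = [0.5, 0.2, 0.1]  # node 1 is simply not active enough

assert choose_action(case_too_active) == 0
assert choose_action(case_too_inactive) == 0
# The scalar failure feedback is identical in both cases, so it cannot
# say whether node 0 should be suppressed or node 1 strengthened.
```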
If the system has a hidden layer of feature detectors, another reason
for poor feature extraction is that acting properly depends both on
identifying the current context and on selecting an action appropriate
to that context. A scalar feedback signal does not
indicate which of these processes is at fault. The feedback does not
distinguish between the case where the system rightly identified its
context but selected the wrong response, and the case where the
system's learned responses are correct, but its feature detectors
misidentified the context. In terms of a typical neural network
implementation, the system needs to know whether it should tune its
feature detectors, or the weights placed on the outputs of those
feature detectors, or both. The sparse reinforcement signal does not
furnish this information. Thus learning methods for reinforcement
learning problems may need bottom-up information or some type of
internal feedback to supplement the top-down feedback.
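The two-stage ambiguity above can be made concrete with a toy two-layer network; all weights and the linear forward pass are illustrative assumptions, not a model from the text:

```python
def forward(x, hidden_w, out_w):
    """Tiny two-layer linear sketch: feature detectors, then the weights
    placed on those detectors' outputs; returns the argmax action."""
    features = [sum(wi * xi for wi, xi in zip(row, x)) for row in hidden_w]
    outputs = [sum(wi * fi for wi, fi in zip(row, features)) for row in out_w]
    return outputs.index(max(outputs))

x = [1.0, 0.0]  # suppose the correct action for this input is 0

# Case 1: the feature detectors misidentify the context (hidden weights
# swapped), while the learned responses are correct.
bad_features_net = forward(x, hidden_w=[[0.0, 1.0], [1.0, 0.0]],
                              out_w=[[1.0, 0.0], [0.0, 1.0]])
# Case 2: the context is identified correctly, but the response weights
# map it to the wrong action.
bad_response_net = forward(x, hidden_w=[[1.0, 0.0], [0.0, 1.0]],
                              out_w=[[0.0, 1.0], [1.0, 0.0]])

assert bad_features_net == bad_response_net == 1  # same wrong action
# The scalar feedback is the same failure in both cases; it gives no hint
# whether to retune the feature detectors or the output weights.
```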