University of Wisconsin - Madison | CS 540 Lecture Notes | C. R. Dyer
a = \begin{cases} 1, & \text{if } \sum_{i=1}^{n} w_i x_i \ge t \\ 0, & \text{otherwise} \end{cases}
a = \frac{1}{1 + e^{-\sum_{i=1}^{n} w_i x_i}}
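For concreteness, the two unit types can be sketched in a few lines of Python (the function names and example values here are illustrative, not from the notes):

```python
import math

def ltu_output(weights, inputs, threshold):
    """Step (linear threshold) unit: output 1 iff the weighted input sum reaches t."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else 0

def sigmoid_output(weights, inputs):
    """Sigmoid unit: a smooth, differentiable alternative to the hard threshold."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-s))

print(ltu_output([0.5, 0.7], [1, 0], 0.2))   # 1, since .5 >= .2
print(sigmoid_output([0.5, 0.7], [1, 0]))    # ~0.62
```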
So, from here on, learning the weights in a neural net will mean learning both the weights and the threshold values.
Note: Each pass through all of the training examples is called one epoch.
wi = wi + alpha * (T - O) * xi

where

xi is the input associated with the ith input unit, T is the teacher-specified (desired) output of the unit, O is the actual output of the unit, and alpha is a constant between 0 and 1 called the learning rate.
Notes about this update formula: the weights change only on examples the network gets wrong; when T = O, every delta_wi is 0 and the weights (including the threshold) stay the same.
The result of executing the learning algorithm on the OR function for 3 epochs, starting from w1 = .1, w2 = .5, t = .8 and using learning rate alpha = .2:
x1 | x2 | T | O | delta_w1 | w1 | delta_w2 | w2 | delta_w3 | w3 (=t) |
---|---|---|---|---|---|---|---|---|---|
- | - | - | - | - | .1 | - | .5 | - | .8 |
0 | 0 | 0 | 0 | 0 | .1 | 0 | .5 | 0 | .8 |
0 | 1 | 1 | 0 | 0 | .1 | .2 | .7 | -.2 | .6 |
1 | 0 | 1 | 0 | .2 | .3 | 0 | .7 | -.2 | .4 |
1 | 1 | 1 | 1 | 0 | .3 | 0 | .7 | 0 | .4 |
0 | 0 | 0 | 0 | 0 | .3 | 0 | .7 | 0 | .4 |
0 | 1 | 1 | 1 | 0 | .3 | 0 | .7 | 0 | .4 |
1 | 0 | 1 | 0 | .2 | .5 | 0 | .7 | -.2 | .2 |
1 | 1 | 1 | 1 | 0 | .5 | 0 | .7 | 0 | .2 |
0 | 0 | 0 | 0 | 0 | .5 | 0 | .7 | 0 | .2 |
0 | 1 | 1 | 1 | 0 | .5 | 0 | .7 | 0 | .2 |
1 | 0 | 1 | 1 | 0 | .5 | 0 | .7 | 0 | .2 |
1 | 1 | 1 | 1 | 0 | .5 | 0 | .7 | 0 | .2 |
So, the final learned network has w1 = .5, w2 = .7, and threshold t = .2, which correctly computes OR on all four training examples.
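The run in the table above can be reproduced with a short Python sketch (mine, not code from the notes); the threshold is folded in as a third weight on a constant input of -1, so the unit fires when the weighted sum reaches 0:

```python
# Perceptron learning of OR, reproducing the table above
alpha = 0.2
w = [0.1, 0.5, 0.8]   # [w1, w2, w3], where w3 is the threshold t
examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]   # the OR function

for epoch in range(3):
    for (x1, x2), T in examples:
        x = (x1, x2, -1)                           # constant -1 input carries the threshold weight
        O = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
        for i in range(3):
            w[i] += alpha * (T - O) * x[i]         # the update rule from above
    print(f"after epoch {epoch + 1}: w1={w[0]:.1f}, w2={w[1]:.1f}, t={w[2]:.1f}")
# prints w1=0.5, w2=0.7, t=0.2 after epochs 2 and 3, matching the table
```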
More specifically, consider a network containing one hidden layer. Output units are indexed by i, hidden units by j, and input units by k. y_i is the desired (teacher) output and a_i is the actual output of output unit i. Then the backprop algorithm is given by:
1. in_j = sum_k(w_k,j * a_k)                       // Forward pass starts: compute weighted input to all hidden units
2. a_j = sigmoid(in_j)                             // Compute outputs at all hidden units
3. in_i = sum_j(w_j,i * a_j)                       // Compute weighted inputs to all output units
4. a_i = sigmoid(in_i)                             // Compute outputs at all output units
5. del_i = a_i * (1 - a_i) * (y_i - a_i)           // Backward pass starts: compute "modified error" at output units
6. del_j = a_j * (1 - a_j) * sum_i(w_j,i * del_i)  // Compute "modified error" at all hidden units
7. w_j,i = w_j,i + (alpha * a_j * del_i)           // Update weights between hidden and output units
8. w_k,j = w_k,j + (alpha * a_k * del_j)           // Update weights between input and hidden units
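The eight steps translate almost line-for-line into Python. The sketch below is mine, not code from the notes; it omits the threshold weights for brevity (they can be folded in as weights on constant inputs, as before), and the weight-matrix layout (W_kj[k][j], W_ji[j][i]) is an assumption:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(W_kj, W_ji, x, y, alpha=0.5):
    """One forward/backward pass for a one-hidden-layer net (steps 1-8 above).
    W_kj[k][j] holds input->hidden weights, W_ji[j][i] hidden->output weights."""
    n_hidden, n_out = len(W_ji), len(W_ji[0])
    # Steps 1-2: weighted inputs and activations at the hidden units
    in_j = [sum(W_kj[k][j] * x[k] for k in range(len(x))) for j in range(n_hidden)]
    a_j = [sigmoid(v) for v in in_j]
    # Steps 3-4: weighted inputs and activations at the output units
    in_i = [sum(W_ji[j][i] * a_j[j] for j in range(n_hidden)) for i in range(n_out)]
    a_i = [sigmoid(v) for v in in_i]
    # Step 5: "modified error" del_i at the output units
    del_i = [a_i[i] * (1 - a_i[i]) * (y[i] - a_i[i]) for i in range(n_out)]
    # Step 6: "modified error" del_j at the hidden units
    del_j = [a_j[j] * (1 - a_j[j]) * sum(W_ji[j][i] * del_i[i] for i in range(n_out))
             for j in range(n_hidden)]
    # Step 7: update hidden -> output weights
    for j in range(n_hidden):
        for i in range(n_out):
            W_ji[j][i] += alpha * a_j[j] * del_i[i]
    # Step 8: update input -> hidden weights
    for k in range(len(x)):
        for j in range(n_hidden):
            W_kj[k][j] += alpha * x[k] * del_j[j]
    return a_i

# Illustrative use: one update on a tiny 2-2-1 net with random initial weights
random.seed(0)
W_kj = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]   # 2 inputs -> 2 hidden
W_ji = [[random.uniform(-1, 1)] for _ in range(2)]                     # 2 hidden -> 1 output
print(backprop_step(W_kj, W_ji, x=(1, 0), y=(1,)))
```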
where Ei is the mean squared error (MSE) of the network on the ith training example, computed as follows:

E_i = \frac{1}{2} \sum_{j=1}^{n} (T_j - O_j)^2

where Tj is the target value of the jth output unit for the ith example, Oj is the network's actual value at the jth output unit for the ith example, and there are n output units in the network.
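As a quick worked check of this formula (the target and output values here are illustrative, not from the notes):

```python
# MSE of one training example with n = 2 output units
T = [1.0, 0.0]                                    # illustrative target values
O = [0.9, 0.2]                                    # illustrative network outputs
E = 0.5 * sum((t - o) ** 2 for t, o in zip(T, O))
print(E)                                          # 0.5 * (0.01 + 0.04) = 0.025
```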
Now consider the (n+1)-dimensional space where n dimensions are the weights and the last dimension is the error E. The error is non-negative and defines a surface over this weight space.
So, the goal is to search for the point in weight space with (global) minimum mean squared error E. Gradient descent does this by computing, for each weight, the partial derivative of E with respect to that weight, and then moving the weight a small step in the direction that decreases E; for the weight from hidden unit j to output unit i this gives
delta_wji = alpha * aj * (Ti - Oi) * g'(in_i) = alpha * aj * (Ti - Oi) * Oi * (1 - Oi)
where weight wji connects hidden unit j to output unit i, alpha is the learning rate parameter, Ti is the teacher output associated with output unit i, Oi is the actual output of output unit i, aj is the output of hidden unit j, and g' is the derivative of the sigmoid activation function, which is known to be g' = g(1-g).
delta_wkj = alpha * Ik * aj * (1 - aj) * SUM_i [ Oi * (1 - Oi) * (Ti - Oi) * wji ]

where weight wkj connects input unit k to hidden unit j, Ik is the value of the kth input unit, and the sum is taken over all output units i.
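Both update rules are gradient descent steps on E, so delta_wji should agree with a finite-difference estimate of -alpha * dE/dwji. A minimal sketch of that check, with illustrative values (not from the notes):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One hidden activation a_j feeding one output unit through weight w_ji
a_j, T, w_ji, alpha = 0.6, 1.0, 0.4, 1.0

O = sigmoid(w_ji * a_j)
analytic = alpha * a_j * (T - O) * O * (1 - O)   # delta_wji from the formula above

def E(w):
    """Squared error E = 1/2 (T - O)^2 as a function of the weight."""
    o = sigmoid(w * a_j)
    return 0.5 * (T - o) ** 2

eps = 1e-6
numeric = -alpha * (E(w_ji + eps) - E(w_ji - eps)) / (2 * eps)
print(analytic, numeric)   # the two values agree to roughly 6 decimal places
```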
To solve both of these problems, ALVINN takes each input image and computes other views of the road by performing various perspective transformations (shift, rotate, and fill in missing pixels) so as to simulate what the vehicle would be seeing if its position and orientation on the road were not correct. For each of these synthesized views of the road, a "correct" steering direction is approximated. The real and the synthesized images are then used for training the network.
To avoid overfitting using just the most recent images captured, ALVINN maintains a buffer pool of 200 images (both real and synthetic). When a new image is obtained, it replaces one of the images in the buffer pool so that the average steering direction of all 200 examples is straight ahead. In this way, the buffer pool always keeps some images in many different steering directions.
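To make the replacement policy concrete, here is a toy sketch of one way it could work; this is my illustration, not ALVINN's actual code, and it reduces each buffered example to just its steering angle (0 = straight ahead):

```python
POOL_SIZE = 200

def replace_for_balance(pool, new_angle):
    """Insert a new example (represented by its steering angle) and evict the
    old example whose removal leaves the pool's mean steering closest to 0."""
    if len(pool) < POOL_SIZE:
        pool.append(new_angle)
        return
    total = sum(pool) + new_angle
    # After evicting pool[i], the pool's sum is total - pool[i]; choose the i
    # that brings the mean (and hence the sum) closest to straight ahead (0).
    i = min(range(len(pool)), key=lambda i: abs(total - pool[i]))
    pool[i] = new_angle
```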
Initially, a human driver controls the vehicle for about 5 minutes while the network learns, starting from random initial weights. After that, one epoch of training using the 200 examples in the buffer pool is performed approximately every 2 seconds.
Copyright © 1996-2003 by Charles R. Dyer. All rights reserved.