# Multi-Layer Perceptron

📗 A single perceptron (even with a non-linear activation function) still produces a linear decision boundary (the two classes are separated by a line in two dimensions, or a hyperplane in general).
📗 Multiple perceptrons can be combined so that the output of one perceptron becomes the input of another perceptron (see the sketch after this list):
➩ \(a^{\left(1\right)} = g\left(w^{\left(1\right)} x + b^{\left(1\right)}\right)\),
➩ \(a^{\left(2\right)} = g\left(w^{\left(2\right)} a^{\left(1\right)} + b^{\left(2\right)}\right)\),
➩ \(a^{\left(3\right)} = g\left(w^{\left(3\right)} a^{\left(2\right)} + b^{\left(3\right)}\right)\),
➩ \(\hat{y} = 1\) if \(a^{\left(3\right)} \geq 0.5\) (thresholding the output activation, viewed as the predicted probability of class \(1\)).
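📗 A minimal sketch (not from the notes) of this composition in NumPy, assuming the sigmoid activation for \(g\) and hypothetical weights; it shows the output of each perceptron layer becoming the input of the next.
```python
# A minimal sketch: chaining three perceptron layers, assuming g is the sigmoid.
import numpy as np

def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and biases: 2 inputs, 2 units in layers 1 and 2, 1 output unit.
w1, b1 = np.array([[1.0, -1.0], [2.0, 0.5]]), np.array([0.0, -1.0])
w2, b2 = np.array([[1.5, -0.5], [-1.0, 1.0]]), np.array([0.5, 0.0])
w3, b3 = np.array([1.0, -2.0]), 0.25

x = np.array([0.5, -0.3])       # one training item
a1 = g(w1 @ x + b1)             # output of layer 1 is the input of layer 2
a2 = g(w2 @ a1 + b2)            # output of layer 2 is the input of layer 3
a3 = g(w3 @ a2 + b3)            # final activation (a number between 0 and 1)
y_hat = 1 if a3 >= 0.5 else 0   # predicted label, thresholding the output
print(a3, y_hat)
```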



# Neural Networks

📗 Multi-Layer Perceptrons are also called Artificial Neural Networks (ANNs): Link, Wikipedia.
➩ Human brain: roughly 100,000,000,000 neurons, each receiving input from about 1,000 other neurons.
➩ An impulse can either increase or decrease the probability of a nerve pulse firing (the activation of a neuron).
📗 Universal Approximation Theorem: Wikipedia.
➩ A 2-layer network (1 hidden layer) can approximate any continuous function arbitrarily closely with enough hidden units.
➩ A 3-layer network (2 hidden layers) can approximate any function arbitrarily closely with enough hidden units.
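📗 The following is a minimal sketch (not from the notes) of the universal approximation idea: a network with one hidden layer of ReLU units (ReLU is an assumed choice of activation here) constructed to match the piecewise linear interpolant of a target function; the approximation error shrinks as the number of hidden units grows.
```python
# One hidden ReLU layer that interpolates f at n_units + 1 evenly spaced knots.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def build_approximator(f, lo, hi, n_units):
    """Return (knots, output weights, bias) for a 1-hidden-layer ReLU network."""
    knots = np.linspace(lo, hi, n_units + 1)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)                 # slope on each interval
    # Hidden unit k computes relu(x - knots[k]); its output weight is the change in slope.
    out_weights = np.concatenate(([slopes[0]], np.diff(slopes)))
    return knots[:-1], out_weights, values[0]

def network(x, knots, out_weights, bias):
    """Forward pass: hidden ReLU layer, then a linear output unit."""
    hidden = relu(x[:, None] - knots[None, :])                # hidden unit activations
    return hidden @ out_weights + bias

knots, w, b = build_approximator(np.sin, 0.0, 2 * np.pi, n_units=20)
x = np.linspace(0.0, 2 * np.pi, 200)
print(np.max(np.abs(network(x, knots, w, b) - np.sin(x))))    # error decreases as n_units grows
```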
TopHat Discussion
📗 Try different combinations of activation function and network architecture (number of layers and units in each layer), and compare which ones are good for the spiral dataset: Link.
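📗 If you want to experiment outside the demo, the following is a minimal sketch (not from the notes) that generates a two-class spiral dataset similar to the one in the linked demo; a single perceptron cannot separate it, which is why the architecture matters.
```python
# Generate two interleaved spirals, one per class (a common synthetic dataset).
import numpy as np

def make_spirals(n_per_class=100, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    t = np.linspace(0.25, 3.0, n_per_class) * np.pi           # angle along each spiral arm
    X, y = [], []
    for label, phase in ((0, 0.0), (1, np.pi)):                # second arm rotated by pi
        x1 = t * np.cos(t + phase) + noise * rng.normal(size=n_per_class)
        x2 = t * np.sin(t + phase) + noise * rng.normal(size=n_per_class)
        X.append(np.column_stack([x1, x2]))
        y.append(np.full(n_per_class, label))
    return np.vstack(X), np.concatenate(y)

X, y = make_spirals()
print(X.shape, y.shape)    # (200, 2) (200,): not linearly separable
```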



# Fully Connected Network

📗 The perceptrons are organized in layers \(l = 1, 2, ..., L\), with \(m^{\left(l\right)}\) units in layer \(l\).
➩ \(w_{j k}^{\left(l\right)}\) is the weight from unit \(j\) in layer \(l - 1\) to unit \(k\) in layer \(l\), and in the output layer, there is only one unit, so the weights are \(w_{j}^{\left(L\right)}\).
➩ \(b_{j}^{\left(l\right)}\) is the bias for unit \(j\) in layer \(l\), and in the output layer, there is only one unit, so the bias is \(b^{\left(L\right)}\).
➩ \(a_{i j}^{\left(l\right)}\) is the activation of unit \(j\) in layer \(l\) for training item \(i\); \(a_{i j}^{\left(0\right)} = x_{i j}\) can be viewed as unit \(j\) in layer \(0\) (alternatively, the \(a_{i j}^{\left(l\right)}\) can be viewed as internal features), and \(a_{i}^{\left(L\right)}\) is the output, representing the predicted probability that \(x_{i}\) belongs to class \(1\), or \(\mathbb{P}\left\{\hat{y}_{i} = 1\right\}\).
📗 The way the hidden (internal) units are connected is called the architecture of the network.
📗 In a fully connected network, all units in layer \(l\) are connected to every unit in layer \(l - 1\).
➩ \(a_{i j}^{\left(l\right)} = g\left(a_{i 1}^{\left(l - 1\right)} w_{1 j}^{\left(l\right)} + a_{i 2}^{\left(l - 1\right)} w_{2 j}^{\left(l\right)} + ... + a_{i m^{\left(l - 1\right)}}^{\left(l - 1\right)} w_{m^{\left(l - 1\right)} j}^{\left(l\right)} + b_{j}^{\left(l\right)}\right)\).
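📗 The following is a minimal sketch (not from the notes) of the forward pass in a fully connected network, written in NumPy with the notation above vectorized over training items and units: \(a^{\left(l\right)} = g\left(a^{\left(l - 1\right)} W^{\left(l\right)} + b^{\left(l\right)}\right)\); the architecture, weights, and sigmoid activation are hypothetical.
```python
# Vectorized fully connected forward pass: rows of X are training items.
import numpy as np

def g(z):
    """Sigmoid activation (an assumed choice for g)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, weights, biases):
    """weights[l] has shape (m^(l), m^(l+1)); biases[l] has shape (m^(l+1),)."""
    a = X                                # layer 0 activations are the features
    for W, b in zip(weights, biases):
        a = g(a @ W + b)                 # every unit sees every unit in the previous layer
    return a                             # last layer: predicted probability of class 1

rng = np.random.default_rng(0)
sizes = [4, 3, 2, 1]                     # hypothetical architecture (matches the diagram below)
weights = [rng.normal(size=(sizes[l], sizes[l + 1])) for l in range(len(sizes) - 1)]
biases = [rng.normal(size=sizes[l + 1]) for l in range(len(sizes) - 1)]
X = rng.normal(size=(5, 4))              # 5 training items with 4 features each
print(forward(X, weights, biases))       # shape (5, 1)
```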
Example
📗 [1 point] Interactive diagram of a fully connected network (highlighting an edge shows the name of the corresponding weight): 4 input units, 3 units in hidden layer 1, 2 units in hidden layer 2, and 1 output unit.


TopHat Quiz (Past Exam Question) ID:
📗 [4 points] Given the following neural network that classifies all the training instances correctly, what are the labels (0 or 1) of the training data? The activation functions are LTU for all units: \(1_{\left\{z \geq 0\right\}}\). The first layer weight matrix is , with bias vector , and the second layer weight vector is , with bias .
| \(x_{i1}\) | \(x_{i2}\) | \(y_{i}\) or \(a^{\left(2\right)}_{1}\) |
|---|---|---|
| 0 | 0 | ? |
| 0 | 1 | ? |
| 1 | 0 | ? |
| 1 | 1 | ? |


📗 Answer (comma separated vector): .
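📗 The weight matrices above are not shown in the text, so the following is a minimal sketch with hypothetical weights (here, a network that computes XOR); it shows how forward propagation with LTU activations fills in the table of labels.
```python
# Forward pass with LTU activations 1{z >= 0} on the four binary inputs.
import numpy as np

def ltu(z):
    """Linear threshold unit: 1 if z >= 0, else 0."""
    return (z >= 0).astype(int)

# Hypothetical weights: 2 hidden units, 1 output unit (this choice computes XOR).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])       # columns are the two hidden units
b1 = np.array([-0.5, -1.5])       # unit 1 fires if x1 + x2 >= 0.5, unit 2 if x1 + x2 >= 1.5
w2 = np.array([1.0, -1.0])        # output fires if (unit 1) - (unit 2) >= 0.5
b2 = -0.5

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    a1 = ltu(np.array(x) @ W1 + b1)   # first layer activations
    a2 = int(a1 @ w2 + b2 >= 0)       # output activation = predicted label
    print(x, "->", a2)                # prints 0, 1, 1, 0 for this XOR network
```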




📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Yudong Chen, Yingyu Liang, and Charles Dyer.
📗 Content from note blocks marked "optional" and content from Wikipedia and other demo links are helpful for understanding the materials, but will not be explicitly tested on the exams.






Last Updated: August 11, 2025 at 1:30 PM