Young Wu's Homepage

Course Links: Canvas, Piazza, TopHat (212925)
Zoom Links: MW 4:00, TR 1:00, TR 2:30.

# The materials covered in this lecture will not be on the exams.

# Hopfield Networks

📗 The Nobel Prize in Physics 2024 was awarded to John J. Hopfield and Geoffrey E. Hinton "for foundational discoveries and inventions that enable machine learning with artificial neural networks": Link

📗 A Hopfield network is a single-layer recurrent network, inspired by biology and physics, that mimics associative memory. Memories are stored as low points of an energy landscape determined by the network weights: PDF.

➩ Every unit is connected to every other unit, and the weights are symmetric (\(w_{ij} = w_{ji}\)).

➩ All units have binary activation values, usually \(-1\) and \(+1\).

📗 Activation: given the weights and an initial set of activations (noisy input), the activation values can be updated to minimize the energy \(E = - \dfrac{1}{2} \displaystyle\sum_{i j} w_{ij} a_{i} a_{j} - \displaystyle\sum_{i} b_{i} a_{i}\) to recover the memory: Link.

➩ In every iteration, some \(a_{i}\) is randomly chosen and updated to \(+1\) if \(\displaystyle\sum_{j \neq i} w_{ij} a_{j} + b_{i} > 0\) and \(-1\) otherwise (the "derivative" of the energy with respect to \(a_{i}\) is \(-\left(\displaystyle\sum_{j \neq i} w_{ij} a_{j} + b_{i}\right)\), so setting \(a_{i}\) to the sign of this sum never increases the energy).

➩ The process is similar to stochastic gradient descent to find a local minimum of the energy.
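
📗 Sketch (not from the lecture): the energy and the update rule above can be written in a few lines of Python. The 3-unit weights, the zero biases, the starting activations, and the names `energy`, `update_unit`, and `recall` are made up for illustration. The sketch assumes symmetric weights with zero diagonal.

```python
import random

def energy(w, b, a):
    """E = -1/2 * sum_ij w_ij a_i a_j - sum_i b_i a_i."""
    n = len(a)
    pair = sum(w[i][j] * a[i] * a[j] for i in range(n) for j in range(n))
    return -0.5 * pair - sum(b[i] * a[i] for i in range(n))

def update_unit(w, b, a, i):
    """Set a_i to +1 if sum_{j != i} w_ij a_j + b_i > 0, and to -1 otherwise."""
    total = sum(w[i][j] * a[j] for j in range(len(a)) if j != i) + b[i]
    a[i] = 1 if total > 0 else -1

def recall(w, b, a, steps=100, seed=0):
    """Repeatedly pick one unit at random and update it."""
    rng = random.Random(seed)
    a = list(a)  # work on a copy so the input is not modified
    for _ in range(steps):
        update_unit(w, b, a, rng.randrange(len(a)))
    return a

# Hypothetical network: symmetric weights, zero diagonal, zero biases.
w = [[0, 1, -1],
     [1, 0, -1],
     [-1, -1, 0]]
b = [0, 0, 0]
start = [1, -1, -1]
final = recall(w, b, start)
print(energy(w, b, start), energy(w, b, final))
```

With symmetric weights and no self-connections, each single-unit update leaves the energy unchanged or lowers it, so the second printed energy is never larger than the first.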

📗 Training: the weights \(w_{ij}\) are trained to minimize energy given one or more sets of activation values (representing the things to remember).

➩ When \(a_{i}\) and \(a_{j}\) have the same sign, \(w_{ij}\) should be positive.

➩ When \(a_{i}\) and \(a_{j}\) have opposite signs, \(w_{ij}\) should be negative.

➩ During training, \(w_{ij}\) is set to \(a_{i} a_{j}\) (or the sum of these products for multiple sets of activation values).

➩ The biases are not updated.
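
📗 Sketch (not from the lecture): the training rule above, followed by one recall sweep on a noisy copy of the stored pattern. The 6-unit pattern and the names `train` and `step_all` are hypothetical. Setting \(w_{ii} = 0\) and keeping all biases at zero are assumptions of this sketch.

```python
def train(patterns):
    """w_ij = sum over patterns of a_i * a_j, with no self-connections."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for a in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += a[i] * a[j]
    return w

def step_all(w, a):
    """Update every unit once, in index order, with zero biases."""
    a = list(a)
    for i in range(len(a)):
        total = sum(w[i][j] * a[j] for j in range(len(a)) if j != i)
        a[i] = 1 if total > 0 else -1
    return a

memory = [1, -1, 1, -1, 1, -1]
w = train([memory])
noisy = [1, -1, 1, -1, 1, 1]  # last unit flipped
recovered = step_all(w, noisy)
print(recovered == memory)  # True
```

Each unit sees five other units, at most one of which disagrees with the stored pattern, so the sign of its input matches the memory and a single sweep restores it.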

📗 A Boltzmann machine is another energy-based model, with a hidden layer: PDF, Wikipedia.

📗 A restricted Boltzmann machine does not have connections between units in the same layer: Wikipedia.

# Generative Adversarial Networks

📗 Generative adversarial networks (GAN) use two neural networks: Link, Wikipedia.

➩ The generator (deconvolutional neural network): input is noise, output is fake images (or text).

➩ The discriminator (convolutional neural network): input is a fake or real image, output is a binary class indicating whether the image is real.

➩ The two networks are trained together and compete in a zero-sum game: the generator tries to maximize the discriminator loss, and the discriminator tries to minimize the same loss: Wikipedia.

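➩ The solution to the zero-sum game, called the Nash equilibrium, is the solution to \(\displaystyle\max_{w^{\left(G\right)}} \displaystyle\min_{w^{\left(D\right)}} C\left(w^{\left(G\right)}, w^{\left(D\right)}\right)\), where \(C\) is the loss of the discriminator (this is the same problem as the more commonly written \(\displaystyle\min_{w^{\left(G\right)}} \displaystyle\max_{w^{\left(D\right)}}\) of \(-C\)).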
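
📗 Sketch (not from the lecture): one common choice for the discriminator loss \(C\) is binary cross-entropy. The notes do not say which loss is used, so this choice is an assumption. The function name `discriminator_loss` and the probabilities below are made up.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy of the discriminator (assumed form of C).

    d_real: discriminator outputs (probability "real") on real images.
    d_fake: discriminator outputs (probability "real") on generated images.
    """
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Discriminator that separates real from fake well: low loss.
confident = discriminator_loss([0.9, 0.8], [0.1, 0.2])
# Same discriminator facing a better generator, whose fakes look real: higher loss.
fooled = discriminator_loss([0.9, 0.8], [0.6, 0.7])
print(confident < fooled)  # True
```

The generator changes only `d_fake` (through the images it produces) and pushes the loss up, while the discriminator adjusts its own weights to push the same number down.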

📗 Example (deepfake) faces generated by GAN: Link, Link, Wikipedia.

📗 A variational auto-encoder is another generative model that maps the input into a lower-dimensional latent space (encoder) from which the original input can be reconstructed (decoder), similar to PCA.

📗 A diffusion model is a generative model that gradually adds or removes noise (adding noise is similar to an encoder, removing noise is similar to a decoder): Wikipedia.

# Transformers

📗 No recurrent units are used: "attention is all you need": Link, Link, Wikipedia.

📗 A transformer is a neural network with an attention mechanism and without recurrent units: Wikipedia.

📗 Attention units keep track of which parts of the sentence are important and should be paid attention to, for example, scaled dot product attention units: Wikipedia.

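➩ The hidden units \(a^{\left(h\right)}_{t} = g\left(w^{\left(x\right)} \cdot x_{t} + b^{\left(x\right)}\right)\) are not recurrent: they depend only on the current input \(x_{t}\), not on \(a^{\left(h\right)}_{t-1}\).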

➩ There are value units \(a^{\left(v\right)}_{t} = w^{\left(v\right)} \cdot a^{\left(h\right)}_{t}\), key units \(a^{\left(k\right)}_{t} = w^{\left(k\right)} \cdot a^{\left(h\right)}_{t}\), and query units \(a^{\left(q\right)}_{t} = w^{\left(q\right)} \cdot a^{\left(h\right)}_{t}\), and the attention context for position \(s\) can be computed as \(\displaystyle\sum_{t} g\left(\dfrac{a^{\left(q\right)}_{s} \cdot a^{\left(k\right)}_{t}}{\sqrt{m}}\right) a^{\left(v\right)}_{t}\), where \(g\) is the softmax activation (taken over \(t\)) and \(m\) is the dimension of the key vectors: here \(a^{\left(q\right)}_{s}\) represents the first word, \(a^{\left(k\right)}_{t}\) represents the second word, \(a^{\left(q\right)}_{s} \cdot a^{\left(k\right)}_{t}\) is the dot product (which is proportional to the cosine of the angle between the two vectors, i.e. it measures how similar or related the two words are), and \(a^{\left(v\right)}_{t}\) is the value of the second word to send to the next layer.

➩ The attention matrix is usually masked so that a unit \(a_{s}\) cannot pay attention to another unit in the future \(a_{s+1}, a_{s+2}, a_{s+3}, ...\) by setting \(a^{\left(q\right)}_{s} \cdot a^{\left(k\right)}_{t} = -\infty\) when \(t > s\) so that \(e^{a^{\left(q\right)}_{s} \cdot a^{\left(k\right)}_{t}} = 0\) when \(t > s\).

➩ There can be multiple parallel attention units called multi-head attention.
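
📗 Sketch (not from the lecture): a single scaled dot product attention unit with a causal mask, written with plain Python lists. The query, key, and value vectors are taken as given (the projections from \(a^{\left(h\right)}_{t}\) are skipped), the numbers are made up, and the names `softmax` and `attention` are hypothetical. The mask convention assumed here is that the query at position \(s\) only attends to keys at positions \(t \leq s\).

```python
import math

def softmax(scores):
    """Softmax over a list; a score of -inf gets weight exactly 0."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=True):
    """Return one context vector per query position."""
    m = len(keys[0])  # dimension of the key vectors
    contexts = []
    for s, q in enumerate(queries):
        scores = []
        for t, k in enumerate(keys):
            if causal and t > s:
                scores.append(-math.inf)  # future position: masked out
            else:
                dot = sum(qi * ki for qi, ki in zip(q, k))
                scores.append(dot / math.sqrt(m))
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        contexts.append([sum(wt * v[d] for wt, v in zip(weights, values))
                         for d in range(len(values[0]))])
    return contexts

queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
out = attention(queries, keys, values)
print(out[0])  # [1.0, 2.0]: position 0 can only see itself
```

Multi-head attention would run several such units in parallel, each with its own query, key, and value weights, and combine their outputs.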

📗 Positional encodings are used so that the features contain information about both the word token and its position.

📗 Layer normalization is used so that the means and variances of the units passed between attention layers and fully connected layers stay the same.
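
📗 Sketch (not from the lecture): one common positional encoding is the sinusoidal encoding from the "attention is all you need" paper, which is added to the token embedding. The notes do not say which encoding is meant, so this choice is an assumption. The function names and the toy embeddings are hypothetical.

```python
import math

def positional_encoding(position, dim):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    enc = []
    for k in range(dim):
        # Indices 2i and 2i+1 share the frequency 1 / 10000^(2i / dim).
        angle = position / (10000 ** ((k - k % 2) / dim))
        enc.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return enc

def add_position(embeddings):
    """Add the encoding of position t to the token embedding at position t."""
    dim = len(embeddings[0])
    return [[e + p for e, p in zip(emb, positional_encoding(t, dim))]
            for t, emb in enumerate(embeddings)]

# The same token embedding at two positions gives two different feature vectors.
features = add_position([[1.0, 1.0], [1.0, 1.0]])
print(features[0])  # [1.0, 2.0]
print(features[0] != features[1])  # True
```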
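
📗 Sketch (not from the lecture): layer normalization rescales the units of one layer, for one input, to mean 0 and variance 1. The optional `gain` and `shift` parameters and the small constant `eps` follow the usual formulation and are assumptions here. The numbers are made up.

```python
import math

def layer_norm(units, gain=1.0, shift=0.0, eps=1e-5):
    """Normalize a list of unit values to mean 0 and variance close to 1."""
    mean = sum(units) / len(units)
    var = sum((u - mean) ** 2 for u in units) / len(units)
    return [gain * (u - mean) / math.sqrt(var + eps) + shift for u in units]

small = layer_norm([1.0, 2.0, 3.0, 4.0])
large = layer_norm([10.0, 20.0, 30.0, 40.0])
# Both inputs give almost identical outputs after normalization.
print(max(abs(x - y) for x, y in zip(small, large)) < 1e-4)  # True
```

Because the output scale no longer depends on the input scale, the next layer sees units with the same mean and variance regardless of how large the previous layer's values were.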

# Large Language Models

📗 GPT Series (Generative Pre-trained Transformer): Link, Wikipedia.

➩ Fine-tuning a pre-trained model often freezes a subset of the weights (usually in the earlier layers) and updates the other weights (usually in the later layers) based on new datasets: Wikipedia.

➩ Prompt engineering does not alter the weights, and only provides more context (in the form of examples): Link, Wikipedia.

➩ Reinforcement learning from human feedback (RLHF) uses reinforcement learning techniques to update the model based on human feedback (in the form of rewards, and the model optimizes the reward instead of some form of costs): Wikipedia.
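
📗 Sketch (not from the lecture): fine-tuning with frozen layers amounts to a gradient step that skips the frozen weights. The layer names, weights, gradients, and the function `fine_tune_step` are made up for illustration. A real model would compute the gradients from the new dataset.

```python
def fine_tune_step(layers, grads, frozen, lr=0.1):
    """One gradient-descent step that leaves frozen layers unchanged."""
    updated = {}
    for name, weights in layers.items():
        if name in frozen:
            updated[name] = list(weights)  # copy without changes
        else:
            updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
    return updated

layers = {"early": [0.5, -0.2], "late": [1.0, 2.0]}
grads = {"early": [1.0, 1.0], "late": [1.0, -1.0]}
new_layers = fine_tune_step(layers, grads, frozen={"early"})
print(new_layers["early"] == layers["early"])  # True
print(new_layers["late"] != layers["late"])    # True
```

Prompt engineering, by contrast, would leave every entry of `layers` unchanged and only alter the input.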

📗 BERT (Bidirectional Encoder Representations from Transformers): Wikipedia.

📗 Many other large language models have been developed by different companies: Link.

📗 Adversarial attacks on LLMs can be in the form of prompt injection attacks: Link.

📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Yingyu Liang, and Charles Dyer.

📗 Content from note blocks marked "optional" and content from Wikipedia and other demo links are helpful for understanding the materials, but will not be explicitly tested on the exams.

📗 If there is an issue with TopHat during the lectures, please submit your answers on paper (include your Wisc ID and answers) or this Google form Link at the end of the lecture.

📗 Anonymous feedback can be submitted to: Form.

Last Updated: July 01, 2025 at 1:47 AM