Young Wu's Homepage


# Warning: this is a replica of the homework page for testing purposes, please use M7 for homework submission.

# M7 Written (Math) Problems

📗 Enter your ID (the wisc email ID without @wisc.edu), then click the button (or hit the Enter key).

📗 You can also load your answers from a saved file.

📗 If the questions are not generated correctly, try refreshing the page using the button at the top left corner.

📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could print the page, solve the problems, then enter all your answers at the end.

📗 Please do not refresh the page: your answers will not be saved.

# Warning: please enter your ID before you start!

# Question 1

# Question 2

# Question 3

# Question 4

# Question 5

# Question 6

# Question 7

# Question 8

# Question 9

# Question 10

# Question 11

📗 [2 points] Suppose the scaled dot-product attention function is used. Given the query vector \(q\) = , and the key vectors \(k_{1}\) = , \(k_{2}\) = , calculate the attention weights of \(q\) to \(k_{1}\) and \(q\) to \(k_{2}\), and separate them with a comma in the answer.

Hint

The scaled dot-product attention score for vectors \(q, k \in \mathbb{R}^{d}\) is \(\dfrac{q^\top k}{\sqrt{d}}\). The softmax function \(\sigma\) over a vector of \(n\) scores \(z \in \mathbb{R}^{n}\) (here, one score per key) is written as: \(\sigma\left(z_{i}\right) = \dfrac{e^{z_{i}}}{\displaystyle\sum_{j=1}^{n} e^{z_{j}}}\).
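The hint above can be sketched in a few lines of Python. This is only an illustration: the vectors `q`, `k1`, and `k2` below are made-up values, not the ones generated for your ID, and the helper name `attention_weights` is hypothetical.

```python
import math

def attention_weights(q, keys):
    """Softmax over the scaled dot-product scores q.k / sqrt(d)."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    # Subtracting the maximum score does not change the softmax result,
    # but it avoids overflow in exp for large scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical example values (assumptions, not from the question).
q = [1.0, 2.0]
k1 = [1.0, 0.0]
k2 = [0.0, 1.0]
weights = attention_weights(q, [k1, k2])
print(weights)  # two weights that sum to 1, the larger one for k2
```

The softmax is taken over the scores of all keys, so the number of weights equals the number of keys, not the dimension of \(q\).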

📗 Answer (comma separated vector): .

📗 [2 points] Given the attention weights from \(q\) to \(k_{1}\), \(w_{1}\) = , and from \(q\) to \(k_{2}\), \(w_{2}\) = , and the values \(v_{1}\) = , \(v_{2}\) = , calculate the output vector.

Hint

For the weight vector \(w = \begin{bmatrix} w_{1} \\ w_{2} \end{bmatrix}\) and the value matrix \(V = \begin{bmatrix} v_{11} & v_{12} \\ v_{21} & v_{22} \end{bmatrix}\) (row \(i\) is the value vector \(v_{i}\)), the output is calculated as \(w^\top V\).
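As a rough sketch of the product \(w^\top V\), the output is the weighted sum of the value vectors. The weights and values below are made-up numbers for illustration only, and `attention_output` is a hypothetical helper name.

```python
def attention_output(weights, values):
    """Compute w^T V, where row i of V is the value vector v_i."""
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

# Hypothetical example values (assumptions, not from the question).
w = [0.25, 0.75]
v1 = [4.0, 0.0]
v2 = [0.0, 8.0]
output = attention_output(w, [v1, v2])
print(output)  # [1.0, 6.0]
```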

📗 Answer (comma separated vector): .

📗 [3 points] Assume the tokenization rule uses whitespace between words as the separator. Batch the following sentences, in their original order, into a matrix as input to the encoder stack during training. Write down the attention mask, where \(1\) = attended, \(0\) = masked.

Sentence: \(s_{1}\) = "", \(s_{2}\) = "", \(s_{3}\) = "".

Hint

A batch of \(b\) sentences, in which the longest sentence contains \(m\) tokens, can be batched into a \(b \times m\) matrix of tokens (each token is then embedded with embedding dimension \(d\)). For each sentence shorter than \(m\), the padding token [PAD] is added to the end of the sentence until its length is \(m\). In the attention mask \(A \in \mathbb{R}^{b \times m}\), \(A_{i j} = 1\) if word \(j\) in sentence \(i\) is an actual word, and \(A_{i j} = 0\) if word \(j\) in sentence \(i\) is [PAD].
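The padding rule in the hint could be sketched as follows. The sentences are made-up placeholders, and `attention_mask` is a hypothetical helper, not part of any library.

```python
def attention_mask(sentences):
    """Return a b x m mask: 1 for real tokens, 0 for [PAD] positions."""
    tokenized = [s.split() for s in sentences]  # split on whitespace
    m = max(len(tokens) for tokens in tokenized)
    return [[1] * len(tokens) + [0] * (m - len(tokens)) for tokens in tokenized]

# Hypothetical example sentences (assumptions, not from the question).
sentences = ["I like cats", "Hello", "See you later today"]
mask = attention_mask(sentences)
for row in mask:
    print(row)
# [1, 1, 1, 0]
# [1, 0, 0, 0]
# [1, 1, 1, 1]
```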

📗 Answer (matrix with multiple lines, each line is a comma separated vector):

📗 [4 points] From the following options, write down which vectors are used in the computation of the following matrices?

Hint

Encoder self-attention performs attention over the source embedding: Q, K, and V are all calculated from the source embedding. Decoder self-attention performs attention over the target embedding: Q, K, and V are all calculated from the target embedding. Encoder-decoder attention in the decoder performs cross-attention over both the source and the target embedding, where Q is computed from the target embedding, and K and V are computed from the source embedding.
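To make the three cases concrete, here is a minimal single-head sketch. It is a simplification under stated assumptions: the learned projections for Q, K, and V are replaced by the identity, the causal mask of decoder self-attention is omitted, and the embeddings are made-up numbers. The function name `attend` is hypothetical.

```python
import math

def attend(query_source, key_value_source):
    """Single-head attention; Q comes from query_source, K and V from key_value_source.

    Assumption for brevity: identity projections, no masking.
    """
    d = len(query_source[0])
    outputs = []
    for q in query_source:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in key_value_source]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, key_value_source))
                        for j in range(d)])
    return outputs

# Hypothetical embeddings: 3 source tokens and 2 target tokens, dimension 2.
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [[0.5, 0.5], [1.0, -1.0]]

encoder_self = attend(source, source)  # Q, K, V all from the source embedding
decoder_self = attend(target, target)  # Q, K, V all from the target embedding
cross = attend(target, source)         # Q from target; K, V from source
print(len(encoder_self), len(decoder_self), len(cross))  # 3 2 2
```

The number of output rows always matches the number of queries, which is why cross-attention produces one output per target token.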

📗 Answer:

uses
uses
uses
uses

📗 [4 points] For the following models, what are their basic structures? Select from the options.

Hint

See slides.

📗 Answer:

is
is
is
is

📗 [4 points] Perform hierarchical clustering with linkage in one-dimensional space on the following points: , , , , , . Break ties in distances by first combining the instances with the smallest index (appears earliest in the list). Draw the cluster tree.

📗 Note: the node \(C_{1}\) should be the first cluster formed, \(C_{2}\) should be the second and so on. All edges should point to the instances (or other clusters) that belong to the cluster.

Hint

See Fall 2019 Midterm Q20, Spring 2018 Midterm Q5, Fall 2017 Final Q17, Fall 2016 Midterm Q10, Fall 2016 Final Q8, Fall 2014 Midterm Q1, Fall 2012 Final Q2, Fall 2010 Final Q12, Fall 2006 Midterm Q4. Start with 6 clusters with one point each. (1) Find the two clusters that are closest to each other, measuring the distance between two clusters by either the smallest pairwise distance between points in the clusters (single linkage) or the largest pairwise distance (complete linkage). Draw edges from a new cluster node \(C_{i}\) to these two existing clusters (or instances). (2) Repeat (1) until all instances are in one cluster.
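The two steps in the hint could be sketched as below. The points are made-up values, and the tie-breaking shown (scan cluster pairs in order of their smallest member index and keep the first minimum) is one reading of the rule in the question, stated here as an assumption. The function names are hypothetical.

```python
def linkage_distance(a, b, points, linkage):
    """Distance between clusters a and b (lists of point indices)."""
    dists = [abs(points[i] - points[j]) for i in a for j in b]
    return min(dists) if linkage == "single" else max(dists)

def hierarchical_clustering(points, linkage="single"):
    """Return the merges in order: C_1 is merges[0], C_2 is merges[1], ..."""
    clusters = [[i] for i in range(len(points))]  # kept ordered by smallest index
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage_distance(clusters[i], clusters[j], points, linkage)
                # Strict "<" keeps the earliest pair when distances tie.
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return merges

# Hypothetical points (assumptions, not from the question).
points = [0, 1, 3, 6, 7, 12]
single = hierarchical_clustering(points, "single")
complete = hierarchical_clustering(points, "complete")
for k, (a, b, d) in enumerate(complete, start=1):
    print(f"C{k}: merge {a} and {b} at distance {d}")
```

Each printed line corresponds to one cluster node \(C_{k}\) with edges to the two listed groups of point indices (0-based).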

📗 Answer:

📗 Note: to erase an edge, draw the same edge again.


📗 [4 points] You are given the distance table. Consider the next iteration of hierarchical clustering using linkage. What will the new values be in the resulting distance table corresponding to the new clusters? If you merge two columns (rows), put the new distances in the column (row) with the smaller index. For example, if you merge columns 2 and 4, the new column 2 should contain the new distances and column 4 should be removed, i.e. the columns and rows should be in the order (1), (2 and 4), (3).

\(d\) =

Hint

See Spring 2017 Midterm Q4. The resulting matrix should have 4 columns and 4 rows. Find the smallest non-zero number in the pairwise distance matrix, say in row \(i\) and column \(j\), then merge columns \(i\) and \(j\) and rows \(i\) and \(j\) at the same time: for single linkage, take the element-wise minimum of the two rows (columns); for complete linkage, take the element-wise maximum.
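One way to sketch this update step is shown below. The 5 by 5 distance table is made up for illustration, `merge_closest` is a hypothetical helper, and the sketch assumes all off-diagonal distances are positive so that the smallest off-diagonal entry is the smallest non-zero number.

```python
def merge_closest(dist, linkage="single"):
    """Merge the two closest clusters and return the smaller distance table."""
    n = len(dist)
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            if best is None or dist[i][j] < best[0]:
                best = (dist[i][j], i, j)
    _, i, j = best  # i < j, so the merged cluster stays at index i
    combine = min if linkage == "single" else max
    merged = [combine(dist[i][k], dist[j][k]) for k in range(n)]
    keep = [k for k in range(n) if k != j]  # row and column j are removed
    new = []
    for a in keep:
        row = []
        for b in keep:
            if a == b:
                row.append(0)
            elif a == i:
                row.append(merged[b])
            elif b == i:
                row.append(merged[a])
            else:
                row.append(dist[a][b])
        new.append(row)
    return new

# Hypothetical distance table (an assumption, not from the question).
d = [
    [0, 9, 3, 6, 11],
    [9, 0, 7, 5, 10],
    [3, 7, 0, 9, 2],
    [6, 5, 9, 0, 8],
    [11, 10, 2, 8, 0],
]
single_table = merge_closest(d, "single")
complete_table = merge_closest(d, "complete")
for row in single_table:
    print(row)
```

In this made-up table the smallest distance is 2, between the third and fifth clusters, so the merged distances end up in the third row and column and the fifth is removed.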

📗 Answer (matrix with multiple lines, each line is a comma separated vector): .

📗 [3 points] Given three clusters, \(A\) = {, }, \(B\) = {\(x\)}, \(C\) = {, }. Find a value of \(x\) so that \(A\) and \(B\) will be merged in the next iteration of single linkage hierarchical clustering, and \(B\) and \(C\) will be merged in the next iteration of complete linkage hierarchical clustering. Break ties by merging with the cluster with the smaller index (i.e. \(A\), then \(B\), then \(C\)).

📗 Note: there can be multiple answers, including non-integer answers, enter one of them. If there are none, enter 0.

Hint

Compute the single linkage and complete linkage distances between the three clusters as a function of x, then write down an inequality that describes when B is closer to A vs when B is closer to C.
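A candidate \(x\) can be checked numerically with a sketch like the one below. The cluster values for \(A\) and \(C\) are made up, the helper names are hypothetical, and ties are assumed to go to the earliest pair in the order \((A, B)\), \((A, C)\), \((B, C)\).

```python
from itertools import combinations

def next_merge(clusters, linkage):
    """Return the names of the two clusters merged next.

    clusters maps a name to a list of 1D points, in index order (A, B, C).
    """
    combine = min if linkage == "single" else max
    best = None
    for (name_a, a), (name_b, b) in combinations(clusters.items(), 2):
        d = combine(abs(p - r) for p in a for r in b)
        # Strict "<" keeps the earliest pair when distances tie.
        if best is None or d < best[0]:
            best = (d, name_a, name_b)
    return best[1], best[2]

def satisfies(x, A, C):
    """True if single linkage merges A, B and complete linkage merges B, C."""
    clusters = {"A": A, "B": [x], "C": C}
    return (next_merge(clusters, "single") == ("A", "B")
            and next_merge(clusters, "complete") == ("B", "C"))

# Hypothetical clusters (assumptions, not from the question).
A = [0, 4]
C = [10, 12]
print(satisfies(6.5, A, C))  # True
print(satisfies(5, A, C))    # False
```

For these made-up clusters, the inequalities work out to \(6 < x \leq 7\): single linkage needs \(x - 4 \leq 10 - x\), and complete linkage needs \(12 - x < x\).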

📗 Answer: .

📗 [1 point] Please enter any comments and suggestions, including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the questions that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.

📗 Answer: .

# Grade

* * * * *

* * * * *

📗 You could save the text in the above text box to a file using the button, or copy and paste it into a file yourself.

📗 You could load your answers from the text (or txt file) in the text box below using the button . The first two lines should be "##m: 7" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.

Last Updated: November 21, 2025 at 11:40 PM