

# M7 Written (Math) Problems

📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click (or hit the "Enter" key)
📗 You can also load from your saved file
and click .
📗 If the questions are not generated correctly, try refreshing the page using the button at the top left corner.
📗 The official deadline is July 10. Late submissions within a week will be accepted without penalty, but please submit a regrade request form: Link.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could print the page: , solve the problems, then enter all your answers at the end. 
📗 Please do not refresh the page: your answers will not be saved.
📗 Please report any bugs on Piazza: Link

# Warning: please enter your ID before you start!


# Question 1



# Question 2



# Question 3



# Question 4



# Question 5



# Question 6



# Question 7



# Question 8



# Question 9



# Question 10



# Question 11



📗 [2 points] Suppose the scaled dot-product attention function is used. Given the query vector \(q\) = , and the key vectors \(k_{1}\) = , \(k_{2}\) = , calculate the attention weights of \(q\) to \(k_{1}\) and of \(q\) to \(k_{2}\). Separate them with a comma in the answer.
Hint Scaled dot-product attention function for vector \(q, k \in \mathbb{R}^{d}\) is \(\dfrac{q^\top k}{\sqrt{d}}\). Softmax function \(\sigma\) over a vector \(z \in \mathbb{R}^{d}\) is written as: \(\sigma\left(z_{i}\right) = \dfrac{e^{z_{i}}}{\displaystyle\sum_{j=1}^{d} e^{z_{j}}}\).
📗 Answer (comma separated vector): .
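The formulas in the hint can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the vectors `q`, `k1`, `k2` and the helper name `attention_weights` are hypothetical and are not the values generated for any ID.

```python
import math

def attention_weights(q, keys):
    """Softmax over the scaled dot-product scores q.k / sqrt(d)."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)  # subtracting the max does not change the softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical inputs chosen for illustration only.
q = [2.0, 0.0]
k1 = [1.0, 0.0]
k2 = [0.0, 1.0]
weights = attention_weights(q, [k1, k2])
print(weights)  # two non-negative weights that sum to 1
```

With these assumed vectors the scores are \(\sqrt{2}\) and \(0\), so the first key receives the larger weight.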
📗 [2 points] Given the attention weight from \(q\) to \(k_{1}\), \(w_{1}\) = , and from \(q\) to \(k_{2}\), \(w_{2}\) = , and the values \(v_{1}\) = , \(v_{2}\) = , calculate the output vector.
Hint For weight vector \(w = \begin{bmatrix} w_{1} \\ w_{2} \end{bmatrix}\) and value matrix \(V = \begin{bmatrix} v_{11} & v_{12} \\ v_{21} & v_{22} \end{bmatrix}\), output is calculated as \(w^\top V\).
📗 Answer (comma separated vector): .
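The product \(w^\top V\) from the hint is a weighted sum of the value vectors. A minimal sketch, assuming hypothetical weights and values (the numbers and the helper name `attention_output` are made up for illustration):

```python
def attention_output(weights, values):
    """Compute w^T V, where each row of V is one value vector."""
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

# Hypothetical weights and values chosen for illustration only.
w = [0.25, 0.75]
v1 = [4.0, 0.0]
v2 = [0.0, 8.0]
output = attention_output(w, [v1, v2])
print(output)  # [1.0, 6.0]
```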
📗 [3 points] What are the components of an -block in transformer model?
Hint See slides.
📗 Choices:




None of the above
📗 [3 points] Assume the tokenization rule uses whitespace between words as the separator. Batch the following sentences, in the original order, into a matrix as input to the encoder stack during training. Write down the attention mask, where \(1\) = attended, \(0\) = masked.
Sentence: \(s_{1}\) = "", \(s_{2}\) = "", \(s_{3}\) = "".
Hint \(b\) sentences, of which the longest contains \(m\) tokens, can be batched into a matrix \(\in \mathbb{R}^{b \times m}\) with embedding dimension \(d\). For each sentence shorter than \(m\), the padding token [PAD] is appended to the end of the sentence until its length is \(m\). In the attention mask matrix \(A \in \mathbb{R}^{b \times m}\), \(A_{i j} = 1\) if word \(j\) in sentence \(i\) is an actual word, and \(A_{i j} = 0\) if word \(j\) in sentence \(i\) is [PAD].
📗 Answer (matrix with multiple lines, each line is a comma separated vector):
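The padding rule in the hint can be sketched as follows. The example sentences and the helper name `batch_with_mask` are assumptions made for illustration; they are not the sentences generated by the question.

```python
def batch_with_mask(sentences, pad_token="[PAD]"):
    """Tokenize on whitespace, pad to the longest sentence, build a 0/1 mask."""
    tokenized = [s.split() for s in sentences]
    m = max(len(tokens) for tokens in tokenized)
    batch = [tokens + [pad_token] * (m - len(tokens)) for tokens in tokenized]
    mask = [[1] * len(tokens) + [0] * (m - len(tokens)) for tokens in tokenized]
    return batch, mask

# Hypothetical sentences chosen for illustration only.
sentences = ["I like cats", "Hello", "We like dogs too"]
batch, mask = batch_with_mask(sentences)
for row in mask:
    print(",".join(str(bit) for bit in row))
# 1,1,1,0
# 1,0,0,0
# 1,1,1,1
```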
📗 [4 points] From the following options, write down which vectors are used in the computation of the following matrices.
Hint Encoder self-attention performs attention over the source embedding: Q, K and V are all calculated from the source embedding. Decoder self-attention performs attention over the target embedding: Q, K and V are all calculated from the target embedding. Decoder encoder-decoder attention performs cross-attention over both the source and target embeddings, where Q is computed from the target embedding, and K and V are computed from the source embedding.
📗 Answer:
uses
uses
uses
uses
📗 [4 points] For the following models, what are their basic structures? Select from the options.
Hint See slides.
📗 Answer:
is
is
is
is
📗 [4 points] Perform hierarchical clustering with linkage in one-dimensional space on the following points: , , , , , . Break ties in distances by first combining the instances with the smallest index (appears earliest in the list). Draw the cluster tree.
📗 Note: the node \(C_{1}\) should be the first cluster formed, \(C_{2}\) should be the second and so on. All edges should point to the instances (or other clusters) that belong to the cluster.
Hint See Fall 2019 Midterm Q20, Spring 2018 Midterm Q5, Fall 2017 Final Q17, Fall 2016 Midterm Q10, Fall 2016 Final Q8, Fall 2014 Midterm Q1, Fall 2012 Final Q2, Fall 2010 Final Q12, Fall 2006 Midterm Q4. Start with 6 clusters with one point each. (1) Find the two clusters that are the closest to each other (measure the distance between two clusters by either the smallest pairwise distance of points in the clusters (single linkage) or the largest pairwise distance (complete linkage)). Draw edges from a new cluster node \(C_{i}\) to these two existing clusters (or instances). (2) Repeat (1) until all instances are in one cluster.
📗 Answer: 



📗 Note: to erase an edge, draw the same edge again.
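The merging procedure described in the hint can be sketched in Python. This is an illustrative sketch, not the auto-grader's implementation: the points, the function name `agglomerate`, and the tie-breaking choice (clusters are kept ordered by their smallest point index, and the first closest pair found wins) are assumptions.

```python
def agglomerate(points, linkage="single"):
    """Return the merges in order as (members_a, members_b, distance)."""
    pick = min if linkage == "single" else max
    clusters = [[i] for i in range(len(points))]  # clusters hold point indices
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = pick(abs(points[i] - points[j])
                            for i in clusters[a] for j in clusters[b])
                # Strict "<" keeps the earliest pair when distances tie.
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], dist))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        clusters.sort(key=lambda c: c[0])  # order by smallest member index
    return merges

# Hypothetical points chosen for illustration only.
points = [0, 1, 3, 6, 10, 15]
for name in ("single", "complete"):
    print(name, [dist for _, _, dist in agglomerate(points, name)])
# single [1, 2, 3, 4, 5]
# complete [1, 3, 4, 9, 15]
```

The k-th entry of the returned list corresponds to the cluster node \(C_{k}\) in the tree.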
📗 [4 points] Perform hierarchical clustering with linkage in one-dimensional space on the following points: , , , , , . Break ties in distances by first combining the instances with the smallest index (appears earliest in the list). Draw the cluster tree.
📗 Note: the node \(C_{1}\) should be the first cluster formed, \(C_{2}\) should be the second and so on. All edges should point to the instances (or other clusters) that belong to the cluster.
Hint See Fall 2019 Midterm Q20, Spring 2018 Midterm Q5, Fall 2017 Final Q17, Fall 2016 Midterm Q10, Fall 2016 Final Q8, Fall 2014 Midterm Q1, Fall 2012 Final Q2, Fall 2010 Final Q12, Fall 2006 Midterm Q4. Start with 6 clusters with one point each. (1) Find the two clusters that are the closest to each other (measure the distance between two clusters by either the smallest pairwise distance of points in the clusters (single linkage) or the largest pairwise distance (complete linkage)). Draw edges from a new cluster node \(C_{i}\) to these two existing clusters (or instances). (2) Repeat (1) until all instances are in one cluster.
📗 Answer: 



📗 Note: to erase an edge, draw the same edge again.
📗 [4 points] You are given the distance table. Consider the next iteration of hierarchical clustering using linkage. What will the new values be in the resulting distance table corresponding to the new clusters? If you merge two columns (rows), put the new distances in the column (row) with the smaller index. For example, if you merge columns 2 and 4, the new column 2 should contain the new distances and column 4 should be removed, i.e. the columns and rows should be in the order (1), (2 and 4), (3).
\(d\) =
Hint See Spring 2017 Midterm Q4. The resulting matrix should have 4 columns and 4 rows. Find the smallest non-zero number in the pair-wise distance matrix, suppose row \(i\) and column \(j\), merge columns \(i\) and \(j\) and rows \(i\) and \(j\) at the same time: for single linkage, take the minimum of the numbers in the two rows and columns; for complete linkage, take the maximum.
📗 Answer (matrix with multiple lines, each line is a comma separated vector): .
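One iteration of the table update described in the hint can be sketched as below. The 5-by-5 distance table and the function name `merge_closest` are hypothetical, and ties are assumed to go to the first closest pair in row-major order.

```python
def merge_closest(d, linkage="single"):
    """Merge the closest pair (i < j) into index i and drop row/column j."""
    pick = min if linkage == "single" else max
    n = len(d)
    # min() returns the first minimal pair, so ties go to the earliest pair.
    i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
               key=lambda pair: d[pair[0]][pair[1]])
    keep = [k for k in range(n) if k != j]
    new = []
    for r in keep:
        row = []
        for c in keep:
            if r == c:
                row.append(0)
            elif r == i:
                row.append(pick(d[i][c], d[j][c]))
            elif c == i:
                row.append(pick(d[r][i], d[r][j]))
            else:
                row.append(d[r][c])
        new.append(row)
    return new

# Hypothetical symmetric distance table chosen for illustration only.
d = [[0, 4, 9, 7, 3],
     [4, 0, 2, 8, 6],
     [9, 2, 0, 5, 10],
     [7, 8, 5, 0, 1],
     [3, 6, 10, 1, 0]]
print(merge_closest(d, "single"))    # last row: [3, 6, 5, 0]
print(merge_closest(d, "complete"))  # last row: [7, 8, 10, 0]
```

Here the smallest non-zero entry is 1 (rows 4 and 5, counting from 1), so those two rows and columns are merged into the fourth position.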
📗 [3 points] Given three clusters, \(A\) = {, }, \(B\) = {\(x\)}, \(C\) = {, }. Find a value of \(x\) so that \(A\) and \(B\) will be merged in the next iteration of single linkage hierarchical clustering, and \(B\) and \(C\) will be merged in the next iteration of complete linkage hierarchical clustering. Break ties by merging with the cluster with the smaller index (i.e. \(A\), then \(B\), then \(C\)).
📗 Note: there can be multiple answers, including non-integer answers, enter one of them. If there are none, enter 0.
Hint Compute the single linkage and complete linkage distances between the three clusters as a function of x, then write down an inequality that describes when B is closer to A vs when B is closer to C.
📗 Answer: .
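The comparison suggested in the hint can be checked numerically by trying candidate values of \(x\). The cluster contents below, the candidate grid, and the helper name `next_merge` are assumptions for illustration; ties are assumed to go to the earlier pair in the order (A, B), (A, C), (B, C).

```python
def next_merge(clusters, linkage):
    """Return the names of the two clusters merged next under the linkage."""
    pick = min if linkage == "single" else max
    names = list(clusters)
    pairs = [(names[a], names[b])
             for a in range(len(names)) for b in range(a + 1, len(names))]

    def dist(pair):
        return pick(abs(u - v)
                    for u in clusters[pair[0]] for v in clusters[pair[1]])

    return min(pairs, key=dist)  # the first minimal pair wins ties

# Hypothetical clusters chosen for illustration only.
A, C = [0, 4], [7, 8]
candidates = [k / 2 for k in range(21)]  # 0.0, 0.5, ..., 10.0
valid = [x for x in candidates
         if next_merge({"A": A, "B": [x], "C": C}, "single") == ("A", "B")
         and next_merge({"A": A, "B": [x], "C": C}, "complete") == ("B", "C")]
print(valid)  # [4.5, 5.0, 5.5]
```

For these assumed clusters, single linkage favors A when \(x - 4 \leq 7 - x\), and complete linkage favors C when \(8 - x < x\), which together give \(4 < x \leq 5.5\).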
📗 [1 point] Please enter any comments and suggestions, including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the questions that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.
📗 Answer: .

# Grade



# Submission


📗 Please do not modify the content in the above text field: use the "Grade" button to update.


📗 Please wait for the message "Successful submission." to appear after the "Submit" button. If there is an error message or no message appears after 10 seconds, please save the text in the above text box to a file using the button or copy and paste it into a file yourself and submit it to Canvas Assignment M7. You could submit multiple times (but please do not submit too often): only the latest submission will be counted.
📗 You could load your answers from the text (or txt file) in the text box below using the button . The first two lines should be "##m: 7" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.




# Solutions

📗 Some of the past exams referenced in the Hints can be found on Professor Zhu, Professor Liang and Professor Dyer's websites: Link, and Link.
📗 Some of the questions are from last year, and I recorded videos going through them. The links are at the bottom of the Week 1 to Week 8 pages, for example: W4 and W8.





Last Updated: January 20, 2025 at 3:12 AM