Young Wu's Homepage

# M7 Written (Math) Problems

📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click (or hit the "Enter" key)

📗 The official deadline is Jul 18. Late submissions within a week will be accepted without penalty, but please submit a regrade request form: Link.

📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could print the page, solve the problems, then enter all your answers at the end.

📗 Please do not refresh the page: your answers will not be saved.

📗 Please report any bugs on Piazza: Link

# Warning: please enter your ID before you start!

📗 [4 points] Consider a classification problem with \(n\) = classes \(y \in \left\{1, 2, ..., n\right\}\), and two binary features \(x_{1}, x_{2} \in \left\{0, 1\right\}\). Suppose \(\mathbb{P}\left\{Y = y\right\}\) = , \(\mathbb{P}\left\{X_{1} = 1 | Y = y\right\}\) = , \(\mathbb{P}\left\{X_{2} = 1 | Y = y\right\}\) = . Which class will the naive Bayes classifier produce on a test item with \(X_{1}\) = and \(X_{2}\) = ?

Hint

See Fall 2016 Final Q18, Fall 2011 Midterm Q20. Use Bayes' rule: \(\mathbb{P}\left\{Y = y | X_{1} = x_{1}, X_{2} = x_{2}\right\} = \dfrac{\mathbb{P}\left\{X_{1} = x_{1}, X_{2} = x_{2} | Y = y\right\} \mathbb{P}\left\{Y = y\right\}}{\displaystyle\sum_{y'=1}^{n} \mathbb{P}\left\{X_{1} = x_{1}, X_{2} = x_{2} | Y = y'\right\} \mathbb{P}\left\{Y = y'\right\}}\), which is equal to \(\dfrac{\mathbb{P}\left\{X_{1} = x_{1} | Y = y\right\} \mathbb{P}\left\{X_{2} = x_{2} | Y = y\right\} \mathbb{P}\left\{Y = y\right\}}{\displaystyle\sum_{y'=1}^{n} \mathbb{P}\left\{X_{1} = x_{1} | Y = y'\right\} \mathbb{P}\left\{X_{2} = x_{2} | Y = y'\right\} \mathbb{P}\left\{Y = y'\right\}}\), due to the conditional independence assumption of naive Bayes. For Bayesian networks that are not naive, the second equality does not hold in general. The naive Bayes classifier selects the \(y\) that maximizes \(\mathbb{P}\left\{Y = y | X_{1} = x_{1}, X_{2} = x_{2}\right\}\): since the denominator is the same for every \(y\), and the prior probability here is constant across classes, the classifier is effectively selecting the \(y\) that maximizes \(\mathbb{P}\left\{X_{1} = x_{1} | Y = y\right\} \mathbb{P}\left\{X_{2} = x_{2} | Y = y\right\}\), which is a function of \(y\). You can try different values of \(y\) to find the maximizer, or use the first derivative condition if the number of classes is large (i.e. compare the integers near the places where the first derivative is zero, together with the end points).
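The maximization described in the hint can be sketched in a few lines of Python. This is only an illustration: the function name `naive_bayes_predict` and all the numbers are made up, and they are not the values generated for any ID. The sketch keeps the prior in the score so it also works when the prior is not constant.

```python
def naive_bayes_predict(prior, p_x1, p_x2, x1, x2):
    """Return (best class, scores), with classes numbered from 1.

    prior[k] = P(Y = k+1)
    p_x1[k]  = P(X1 = 1 | Y = k+1)
    p_x2[k]  = P(X2 = 1 | Y = k+1)
    """
    def bernoulli(p, x):
        # P(X = x) when P(X = 1) = p
        return p if x == 1 else 1.0 - p

    # Unnormalized posteriors: the shared denominator is dropped.
    scores = [
        prior[k] * bernoulli(p_x1[k], x1) * bernoulli(p_x2[k], x2)
        for k in range(len(prior))
    ]
    best = max(range(len(scores)), key=lambda k: scores[k])
    return best + 1, scores


# Hypothetical inputs for illustration only.
prior = [1 / 3, 1 / 3, 1 / 3]
p_x1 = [0.2, 0.5, 0.9]
p_x2 = [0.7, 0.4, 0.6]
label, scores = naive_bayes_predict(prior, p_x1, p_x2, x1=1, x2=0)
print(label, scores)
```

With these made-up numbers the scores are about 0.02, 0.10 and 0.12, so class 3 is selected.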

📗 Answer: .

📗 [4 points] Consider the problem of detecting if an email message contains a virus. Say we use four random variables to model this problem: Boolean (binary) class variable \(V\) indicates if the message contains a virus or not, and three Boolean feature variables: \(A, B, C\). We decide to use a Naive Bayes Classifier to solve this problem so we create a Bayesian network with arcs from \(V\) to each of \(A, B, C\). Their associated CPTs (Conditional Probability Table) are created from the following data: \(\mathbb{P}\left\{V = 1\right\}\) = , \(\mathbb{P}\left\{A = 1 | V = 1\right\}\) = , \(\mathbb{P}\left\{A = 1 | V = 0\right\}\) = , \(\mathbb{P}\left\{B = 1 | V = 1\right\}\) = , \(\mathbb{P}\left\{B = 1 | V = 0\right\}\) = , \(\mathbb{P}\left\{C = 1 | V = 1\right\}\) = , \(\mathbb{P}\left\{C = 1 | V = 0\right\}\) = . Compute \(\mathbb{P}\){ \(A\) = , \(B\) = , \(C\) = }.

Hint

See Spring 2017 Final Q7. Naive Bayes is a simple special case of a Bayesian network, so the joint probabilities are computed the same way (as the product of the conditional probabilities given the parents): \(\mathbb{P}\left\{A = a, B = b, C = c\right\} = \mathbb{P}\left\{A = a, B = b, C = c, V = 0\right\} + \mathbb{P}\left\{A = a, B = b, C = c, V = 1\right\}\) and \(\mathbb{P}\left\{A = a, B = b, C = c, V = v\right\} = \mathbb{P}\left\{A = a | V = v\right\} \mathbb{P}\left\{B = b | V = v\right\} \mathbb{P}\left\{C = c | V = v\right\} \mathbb{P}\left\{V = v\right\}\).
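As a rough check of the two formulas in the hint, the marginalization over \(V\) can be written as a short loop. The function name `joint_abc` and the CPT values below are hypothetical and chosen only for illustration.

```python
def bern(p, x):
    # P(X = x) when P(X = 1) = p
    return p if x == 1 else 1.0 - p


def joint_abc(p_v, p_a, p_b, p_c, a, b, c):
    """P(A=a, B=b, C=c) with the class variable V summed out.

    p_v    = P(V = 1)
    p_a[v] = P(A = 1 | V = v), and likewise for p_b and p_c.
    """
    total = 0.0
    for v in (0, 1):
        total += (
            bern(p_v, v)
            * bern(p_a[v], a)
            * bern(p_b[v], b)
            * bern(p_c[v], c)
        )
    return total


# Hypothetical CPT values (index 0 is V = 0, index 1 is V = 1).
p_v = 0.1
p_a = [0.2, 0.8]
p_b = [0.3, 0.6]
p_c = [0.5, 0.9]
result = joint_abc(p_v, p_a, p_b, p_c, a=1, b=0, c=1)
print(result)  # 0.9*0.2*0.7*0.5 + 0.1*0.8*0.4*0.9, about 0.0918
```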

📗 Answer: .

📗 [3 points] Consider the following directed graphical model over binary variables: \(A \to B \leftarrow C\). Given the CPTs (conditional probability tables):

Variable	Probability	Variable	Probability
\(\mathbb{P}\left\{A = 1\right\}\)
\(\mathbb{P}\left\{C = 1\right\}\)
\(\mathbb{P}\left\{B = 1 | A = C = 1\right\}\)		\(\mathbb{P}\left\{B = 1 | A = 0, C = 1\right\}\)
\(\mathbb{P}\left\{B = 1 | A = 1, C = 0\right\}\)		\(\mathbb{P}\left\{B = 1 | A = C = 0\right\}\)

What is the probability \(\mathbb{P}\){ \(A\) = , \(B\) = , \(C\) = }?

Hint

See Fall 2019 Final Q22 Q23 Q24 Q25, Spring 2018 Final Q24 Q25, Fall 2014 Final Q9, Fall 2006 Final Q20, Fall 2005 Final Q20. For any Bayes net, the joint probability can always be computed as the product of the conditional probabilities (conditioned on the parent node variable). For a causal chain \(A \to B \to C\), \(\mathbb{P}\left\{A = a, B = b, C = c\right\} = \mathbb{P}\left\{A = a\right\} \mathbb{P}\left\{B = b | A = a\right\} \mathbb{P}\left\{C = c | B = b\right\}\). For a common cause \(A \leftarrow B \to C\), \(\mathbb{P}\left\{A = a, B = b, C = c\right\} = \mathbb{P}\left\{A = a | B = b\right\} \mathbb{P}\left\{B = b\right\} \mathbb{P}\left\{C = c | B = b\right\}\). For a common effect \(A \to B \leftarrow C\), \(\mathbb{P}\left\{A = a, B = b, C = c\right\} = \mathbb{P}\left\{A = a\right\} \mathbb{P}\left\{B = b | A = a, C = c\right\} \mathbb{P}\left\{C = c\right\}\).
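For the common effect case \(A \to B \leftarrow C\), the product formula above translates directly into code. The following is a minimal sketch with hypothetical CPT values; the dictionary layout `p_b_given[(a, c)]` is an assumption made for this example.

```python
def bern(p, x):
    # P(X = x) when P(X = 1) = p
    return p if x == 1 else 1.0 - p


def joint_common_effect(p_a, p_c, p_b_given, a, b, c):
    """P(A=a, B=b, C=c) for A -> B <- C.

    p_a = P(A = 1), p_c = P(C = 1),
    p_b_given[(a, c)] = P(B = 1 | A = a, C = c).
    """
    return bern(p_a, a) * bern(p_c, c) * bern(p_b_given[(a, c)], b)


# Hypothetical CPT values for illustration only.
p_a = 0.3
p_c = 0.6
p_b_given = {(1, 1): 0.9, (0, 1): 0.5, (1, 0): 0.4, (0, 0): 0.1}
value = joint_common_effect(p_a, p_c, p_b_given, a=1, b=0, c=1)
print(value)  # 0.3 * 0.6 * (1 - 0.9), about 0.018
```

The chain and common cause cases differ only in which conditional probabilities are multiplied.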

📗 Answer: .

📗 [5 points] Consider the following directed graphical model over binary variables: \(A \to B \to C\). Given the CPTs (conditional probability tables):

Variable	Probability	Variable	Probability
\(\mathbb{P}\left\{A = 1\right\}\)
\(\mathbb{P}\left\{B = 1 | A = 1\right\}\)		\(\mathbb{P}\left\{B = 1 | A = 0\right\}\)
\(\mathbb{P}\left\{C = 1 | B = 1\right\}\)		\(\mathbb{P}\left\{C = 1 | B = 0\right\}\)

What is the probability \(\mathbb{P}\){ \(A\) = \(|\) \(C\) = }?

Hint

See Fall 2019 Final Q22 Q23 Q24 Q25, Spring 2018 Final Q24 Q25, Fall 2014 Final Q9, Fall 2006 Final Q20, Fall 2005 Final Q20. For any type of network, one approach (brute force, not very efficient) is to use the marginal distributions: \(\mathbb{P}\left\{A = a | C = c\right\} = \dfrac{\mathbb{P}\left\{A = a, C = c\right\}}{\mathbb{P}\left\{C = c\right\}} = \dfrac{\displaystyle\sum_{b'} \mathbb{P}\left\{A = a, B = b', C = c\right\}}{\displaystyle\sum_{a', b'} \mathbb{P}\left\{A = a', B = b', C = c\right\}}\). The joint probabilities can be calculated the same way as in the previous question.
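The brute-force enumeration in this hint can be sketched as follows for the chain \(A \to B \to C\). The function names and CPT values are hypothetical; only the structure of the sums follows the formula above.

```python
from itertools import product


def bern(p, x):
    # P(X = x) when P(X = 1) = p
    return p if x == 1 else 1.0 - p


def joint_chain(p_a, p_b_given_a, p_c_given_b, a, b, c):
    """P(A=a, B=b, C=c) for the chain A -> B -> C.

    p_b_given_a[a] = P(B = 1 | A = a), p_c_given_b[b] = P(C = 1 | B = b).
    """
    return bern(p_a, a) * bern(p_b_given_a[a], b) * bern(p_c_given_b[b], c)


def cond_a_given_c(p_a, p_b_given_a, p_c_given_b, a, c):
    """P(A=a | C=c): sum out B in the numerator, A and B in the denominator."""
    num = sum(
        joint_chain(p_a, p_b_given_a, p_c_given_b, a, b, c) for b in (0, 1)
    )
    den = sum(
        joint_chain(p_a, p_b_given_a, p_c_given_b, a2, b, c)
        for a2, b in product((0, 1), repeat=2)
    )
    return num / den


# Hypothetical CPT values for illustration only.
p_a = 0.5
p_b_given_a = [0.2, 0.8]  # index 0: A = 0, index 1: A = 1
p_c_given_b = [0.1, 0.7]  # index 0: B = 0, index 1: B = 1
answer = cond_a_given_c(p_a, p_b_given_a, p_c_given_b, a=1, c=1)
print(answer)  # 0.29 / 0.40, about 0.725
```

Swapping `joint_chain` for a function that encodes a different factorization (for example a common cause) leaves the enumeration unchanged.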

📗 Answer: .

📗 [5 points] Consider the following directed graphical model over binary variables: \(A \leftarrow B \to C\). Given the CPTs (conditional probability tables):

Variable	Probability	Variable	Probability
\(\mathbb{P}\left\{B = 1\right\}\)
\(\mathbb{P}\left\{C = 1 | B = 1\right\}\)		\(\mathbb{P}\left\{C = 1 | B = 0\right\}\)
\(\mathbb{P}\left\{A = 1 | B = 1\right\}\)		\(\mathbb{P}\left\{A = 1 | B = 0\right\}\)

What is the probability \(\mathbb{P}\){ \(A\) = \(|\) \(C\) = }?

Hint

📗 Answer: .

📗 [4 points] Consider the following Bayesian Network containing 5 Boolean random variables. How many numbers must be stored in total in all CPTs (Conditional Probability Table) associated with this network (excluding the numbers that can be calculated from other numbers)?

Hint

See Fall 2019 Final Q21, Spring 2017 Final Q8, Fall 2011 Midterm Q15. A node with one parent \(\mathbb{P}\left\{B | A\right\}\) requires storing \(\left(n_{B} - 1\right) n_{A}\) probabilities; a node with no parent \(\mathbb{P}\left\{A\right\}\) requires storing \(n_{A} - 1\) probabilities, and a node with two parents \(\mathbb{P}\left\{C | A, B\right\}\) requires storing \(\left(n_{C} - 1\right) n_{A} n_{B}\) probabilities, ...
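The counting rule in this hint generalizes to any number of parents: multiply (number of values of the node minus one) by the number of parent value combinations, then add over nodes. The sketch below applies it to a made-up five-node network; the graph is hypothetical and is not the network from the question.

```python
def cpt_parameter_count(parents, num_values):
    """Count the free parameters in all CPTs of a Bayesian network.

    parents[node]    = list of that node's parents
    num_values[node] = number of values the node can take
    """
    total = 0
    for node, node_parents in parents.items():
        combos = 1
        for parent in node_parents:
            combos *= num_values[parent]
        # One probability per parent combination can be derived from the rest.
        total += (num_values[node] - 1) * combos
    return total


# Hypothetical network: A -> C <- B, C -> D, C -> E, all Boolean.
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"]}
num_values = {name: 2 for name in parents}
count = cpt_parameter_count(parents, num_values)
print(count)  # 1 + 1 + 4 + 2 + 2 = 10
```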

📗 Answer: .

📗 [2 points] An n-gram language model computes the probability \(\mathbb{P}\left\{w_{n} | w_{1}, w_{2}, ..., w_{n-1}\right\}\). How many parameters need to be estimated for a -gram language model given a vocabulary size of ?

Hint

See Fall 2019 Final Q8. The idea is similar to counting the conditional probabilities that must be stored for a Bayesian network: \(\mathbb{P}\left\{A | B, C, D, ...\right\}\) requires storing \(\left(n_{A} - 1\right) n_{B} n_{C} n_{D} ...\) probabilities. The \(-1\) in the first factor appears because \(\mathbb{P}\left\{A = n_{A} | B = b, C = c, D = d, ...\right\} = 1 - \displaystyle\sum_{i=1}^{n_{A} - 1} \mathbb{P}\left\{A = i | B = b, C = c, D = d, ...\right\}\) can be computed from the other \(n_{A} - 1\) probabilities. Note that this is true for every combination of \(b, c, d, ...\), and there are \(n_{B} n_{C} n_{D} ...\) such combinations.
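Applying the formula in this hint to an n-gram model, the predicted word and each of the \(n - 1\) context words all range over the same vocabulary of size \(V\), which gives \(\left(V - 1\right) V^{n-1}\). A small sketch, with a hypothetical function name and example values:

```python
def ngram_parameter_count(n, vocab_size):
    """Free parameters of P(w_n | w_1, ..., w_{n-1}) under the hint's formula.

    There are vocab_size ** (n - 1) contexts, and each context needs
    vocab_size - 1 probabilities (the last one follows from the others).
    """
    return (vocab_size - 1) * vocab_size ** (n - 1)


# Hypothetical example: a trigram model over a 10-word vocabulary.
params = ngram_parameter_count(n=3, vocab_size=10)
print(params)  # 9 * 10**2 = 900
```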

📗 Answer: .

📗 [4 points] Given the following transition matrix for a bigram model with words "", "" and "": . Row \(i\) column \(j\) is \(\mathbb{P}\left\{w_{t} = j | w_{t-1} = i\right\}\). What is the probability that the third word is "" given the first word is ""?

Hint

See Fall 2019 Final Q30. Sum over all possible values of the second word: \(\mathbb{P}\left\{w_{3} = j | w_{1} = i\right\} = \displaystyle\sum_{k=1}^{3} \mathbb{P}\left\{w_{3} = j | w_{2} = k\right\} \mathbb{P}\left\{w_{2} = k | w_{1} = i\right\}\), where the \(\mathbb{P}\left\{w_{t} | w_{t-1}\right\}\) probabilities are given by the transition matrix.
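The sum over the second word is the \((i, j)\) entry of the transition matrix multiplied by itself. A minimal sketch in plain Python, using a made-up 3 by 3 matrix and 0-based indices (both are assumptions for this example):

```python
def two_step_probability(T, i, j):
    """P(w_3 = j | w_1 = i) for a bigram model with transition matrix T.

    T[i][k] = P(w_t = k | w_{t-1} = i); indices are 0-based here.
    """
    return sum(T[i][k] * T[k][j] for k in range(len(T)))


# Hypothetical transition matrix (each row sums to 1).
T = [
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
]
p = two_step_probability(T, i=0, j=2)
print(p)  # 0.5*0.2 + 0.3*0.3 + 0.2*0.2, about 0.23
```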

📗 Answer: .

📗 [3 points] Welcome to the Terrible-Three-Day-Tour! We will visit New York on Day 1. The rules for Day 2 and Day 3 are:

(a) If we were in New York the day before, with probability we will stay in New York, and with probability we will go to Baltimore.
(b) If we were in Baltimore the day before, with probability we will stay in Baltimore, and with probability we will go to Washington D.C.

On average, before you start the tour, what is your chance of visiting (at least on one of the two days)?

Hint

See Fall 2009 Final Q13. W is visited only if the sequence is NBW, and B is not visited only if the sequence is NNN. Compute the sequence probabilities \(\mathbb{P}\left\{w_{3} = X, w_{2} = Y, w_{1} = Z\right\} = \mathbb{P}\left\{w_{3} = X | w_{2} = Y\right\} \mathbb{P}\left\{w_{2} = Y | w_{1} = Z\right\} \mathbb{P}\left\{w_{1} = Z\right\}\) using the transition probabilities.
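One way to double-check the sequence argument is to enumerate every three-day itinerary and add up the probabilities of those that include the target city on Day 2 or Day 3. The transition probabilities below are hypothetical, and the row for Washington D.C. is an added assumption (the rules do not say what happens after it, and it is never used in a three-day tour).

```python
from itertools import product


def visit_probability(transition, start, target, days=3):
    """Probability that `target` is visited on at least one day after Day 1."""
    states = list(transition)
    total = 0.0
    for rest in product(states, repeat=days - 1):
        seq = (start,) + rest
        prob = 1.0
        for prev, cur in zip(seq, seq[1:]):
            # Moves that the rules do not allow have probability 0.
            prob *= transition[prev].get(cur, 0.0)
        if target in rest:
            total += prob
    return total


# Hypothetical probabilities: N = New York, B = Baltimore, W = Washington D.C.
transition = {
    "N": {"N": 0.6, "B": 0.4},
    "B": {"B": 0.7, "W": 0.3},
    "W": {"W": 1.0},  # assumption, unused within three days
}
p_w = visit_probability(transition, "N", "W")
p_b = visit_probability(transition, "N", "B")
print(p_w, p_b)  # about 0.12 (only NBW) and 0.64 (1 minus NNN)
```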

📗 Answer: .

📗 [5 points] Andy is a three-month-old baby. He can be happy (state 0), hungry (state 1), or have a wet diaper (state 2). Initially when he wakes up from his nap at 1pm, he is happy. If he is happy, there is a chance that he will remain happy one hour later, a chance to be hungry by then, and a chance to have a wet diaper. Similarly, if he is hungry, one hour later he will be happy with chance, hungry with chance, and wet diaper with chance. If he has a wet diaper, one hour later he will be happy with chance, hungry with chance, and wet diaper with chance. He can smile (observation 0) or cry (observation 1). When he is happy, he smiles of the time and cries of the time; when he is hungry, he smiles of the time and cries of the time; when he has a wet diaper, he smiles of the time and cries of the time.

What is the probability that the particular observed sequence (or \(Y_{1}, Y_{2}\) = ) happens (in the first two periods)?

Note: if the weights are not shown clearly, you could move the nodes around with mouse or touch.

Hint

See Fall 2019 Final Q28 Q29, Spring 2018 Final Q28 Q29, Spring 2017 Final Q9, Fall 2008 Final Q4. The question is asking for \(\mathbb{P}\left\{Y_{1} = y_{1}, Y_{2} = y_{2}\right\}\) which is equal to \(\displaystyle\sum_{x_{1}, x_{2} \in \left\{s_{0}, s_{1}, s_{2}\right\}} \mathbb{P}\left\{Y_{1} = y_{1}, Y_{2} = y_{2}, X_{1} = x_{1}, X_{2} = x_{2}\right\}\) using the definition of the marginal distribution, which is equal to \(\displaystyle\sum_{x_{1}, x_{2} \in \left\{s_{0}, s_{1}, s_{2}\right\}} \mathbb{P}\left\{Y_{1} = y_{1} | X_{1} = x_{1}\right\} \mathbb{P}\left\{Y_{2} = y_{2} | X_{2} = x_{2}\right\} \mathbb{P}\left\{X_{2} = x_{2} | X_{1} = x_{1}\right\} \mathbb{P}\left\{ X_{1} = x_{1}\right\}\) using the Bayesian network joint distribution formula (i.e. the joint probability is the product of the conditional probabilities given the parents), and all of these conditional probabilities are given in the diagram (or in the description of the question).
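The double sum in this hint can be written as two nested loops over the hidden states. The sketch below uses made-up transition and emission probabilities, and it assumes that the first period is the hour Andy wakes up, so that \(X_{1}\) is happy with probability 1.

```python
def observation_sequence_probability(initial, transition, emission, y1, y2):
    """P(Y1 = y1, Y2 = y2) for a hidden Markov model, by summing over X1, X2.

    initial[x]        = P(X1 = x)
    transition[x][x2] = P(X2 = x2 | X1 = x)
    emission[x][y]    = P(Y = y | X = x)
    """
    states = range(len(initial))
    total = 0.0
    for x1 in states:
        for x2 in states:
            total += (
                initial[x1]
                * emission[x1][y1]
                * transition[x1][x2]
                * emission[x2][y2]
            )
    return total


# Hypothetical values; states are 0 = happy, 1 = hungry, 2 = wet diaper.
initial = [1.0, 0.0, 0.0]  # assumption: happy in the first period
transition = [
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
]
emission = [
    [0.8, 0.2],  # happy: smile, cry
    [0.3, 0.7],  # hungry: smile, cry
    [0.4, 0.6],  # wet diaper: smile, cry
]
prob = observation_sequence_probability(initial, transition, emission, y1=0, y2=1)
print(prob)  # 0.8 * (0.5*0.2 + 0.3*0.7 + 0.2*0.6), about 0.344
```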

📗 Answer: .

📗 [1 point] Please enter any comments and suggestions, including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the questions that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.

📗 Answer: .

# Grade

# Submission

📗 Please do not modify the content in the above text field: use the "Grade" button to update.

Check the box to confirm submission.

📗 Please wait for the message "Successful submission." to appear after the "Submit" button. If there is an error message or no message appears after 10 seconds, please save the text in the above text box to a file using the button or copy and paste it into a file yourself and submit it to Canvas Assignment M7. You could submit multiple times (but please do not submit too often): only the latest submission will be counted.

📗 You could load your answers from the text (or txt file) in the text box below using the button . The first two lines should be "##m: 7" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.

# Solutions

📗 Some of the past exams referenced in the Hints can be found on Professor Zhu, Professor Liang and Professor Dyer's websites: Link, and Link.

📗 Some of the questions are from last year, and I recorded videos going through them. The links are at the bottom of the Week 1 to Week 8 pages, for example: W4 and W8.

📗 The links to the solutions the students volunteered to share on Piazza will be collected in this post around the official due date: Link.

Last Updated: June 27, 2026 at 9:06 PM