

# P3 Programming Problem Instruction

📗 Enter your ID (your wisc email ID without @wisc.edu) and click the button (or hit the "Enter" key).
📗 You can also load your answers from a saved file.
📗 If the questions are not generated correctly, try refreshing the page using the button at the top left corner.
📗 The official deadline is July 10. Late submissions within two weeks will be accepted without penalty, but please submit a regrade request form: Link.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You can either copy and paste (or load) your program outputs into the text boxes for individual questions, or print all your outputs to a single text file and load it using the button at the bottom of the page.
📗 Please do not refresh the page: your answers will not be saved.
📗 You should implement the algorithms using the mathematical formulas from the slides. You can use packages and libraries to preprocess and read the data and format the outputs. It is not recommended that you use machine learning packages or libraries, but you will not lose points for doing so.
📗 Please report any bugs on Piazza: Link

# Warning: please enter your ID before you start!



📗 (Introduction) In this programming homework, you will build and simulate a simple bigram (Markov chain) model based on a movie script. You will use it to generate new sentences that hopefully contain sensible words, maybe even phrases. In addition, you will build a Naive Bayes classifier to distinguish sentences from the script from sentences from another, fake script. Because the English vocabulary is so large, you will use characters as tokens (and features) instead of words. In practice, you could replace the 26 letters with the (more than 170,000) English words when training these models.

📗 (Part 1) Download the script of one of the following movies: "?", "?", "?" from IMSDb: Link. If you have another movie you really like, you can use that script instead. Go to the website, use "Search IMSDb" on the left to search for this movie, click on the link: "Read ... Script", and copy and paste the script into a text file.

📗 (Part 1) Make everything lower case, then remove all characters except for letters and spaces. Replace consecutive spaces by one single space and make sure that there are no consecutive spaces.
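The cleaning step above could be sketched as follows. This is only an illustration, not the official solution: the function name `clean_script` is hypothetical, and treating newlines and tabs as spaces and trimming leading and trailing spaces are assumptions that the instructions do not spell out.

```python
import re


def clean_script(raw: str) -> str:
    """Lower-case the text, keep only a-z and spaces, collapse repeated spaces."""
    text = raw.lower()
    # Assumption: newlines and tabs count as spaces rather than being deleted.
    text = re.sub(r"\s", " ", text)
    # Drop everything that is not a lower-case letter or a space.
    text = re.sub(r"[^a-z ]", "", text)
    # Removing punctuation can leave runs of spaces, so collapse them last.
    text = re.sub(r" +", " ", text)
    # Assumption: leading and trailing spaces are trimmed.
    return text.strip()
```

Collapsing spaces after removing the other characters matters: a fragment such as "a - b" would otherwise keep two consecutive spaces.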

📗 (Part 1) Construct the bigram character (letters + space) transition probability table. Put "space" first then "a", "b", "c", ..., "z". It should be a 27 by 27 matrix.

📗 (Part 1) Construct the trigram transition probability table. It could be a 27 by 27 by 27 array or a 729 by 27 matrix. You do not have to submit this table.
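One possible way to build the trigram table as a 27 by 27 by 27 nested list is sketched below. The names `trigram_probs` and `seen` are hypothetical; the `seen` flags are an added convenience (not required by the assignment) that record which two-character contexts have at least one observed successor, which is useful when deciding whether to fall back to the bigram model.

```python
ALPHABET = " abcdefghijklmnopqrstuvwxyz"  # space first, then a..z
INDEX = {c: i for i, c in enumerate(ALPHABET)}


def trigram_probs(text):
    """Return (probs, seen) where probs[i][j][k] estimates Pr{k | i j}."""
    n = len(ALPHABET)
    counts = [[[0] * n for _ in range(n)] for _ in range(n)]
    for a, b, c in zip(text, text[1:], text[2:]):
        counts[INDEX[a]][INDEX[b]][INDEX[c]] += 1
    probs = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    seen = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = sum(counts[i][j])
            if total:
                # Only contexts with at least one successor get a distribution.
                seen[i][j] = True
                probs[i][j] = [c / total for c in counts[i][j]]
    return probs, seen
```

This sketch applies no smoothing; contexts that never appear keep an all-zero row and are meant to be handled by the bigram fallback.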

📗 (Part 1) Generate 26 sentences consisting of 1000 characters each using the trigram model, starting from "a", "b", "c", ..., "z". You should use the bigram model to generate the second character, and switch to the bigram model whenever the current two-character sequence never appeared in the script. For example, when you see "xz", instead of using the trigram model for the probabilities Pr{? | xz}, switch to the bigram model for the probabilities Pr{? | z}. Find and share some interesting sentences.

📗 (Part 2) The following is my randomly generated fake script written according to a non-English language model.

You can either use the button to download a text file, or copy and paste from the text box to a text file. Please do not change the content of the text box.

📗 (Part 2) Train a Naive Bayes classifier based on the characters. You should use the prior that is biased: Pr{Document = your script} = ? and Pr{Document = fake script} = ?, compute the likelihood Pr{Letter | Document} based on your script and the fake script, and compute the posterior probabilities Pr{Document | Letter}, and test your classifier on the 26 random sentences you generated.

# Question 1 (Part 1)

📗 [1 point] Please enter the name of the movie script you used.
📗 Answer:

# Question 2 (Part 1)

📗 [5 points] Input the unigram probabilities (27 numbers, comma-separated, rounded to 4 decimal places, "space" first, then "a", "b", ...). Note: "0" should be rounded up to "0.0001" and the probabilities should sum up to "1.0000".
Hint
📗 Please make sure all probabilities are strictly larger than 0 after rounding to 4 decimal places. For example, 0.00002 should be rounded up to 0.0001 instead of 0.0. Also, the numbers should add up to exactly 1. You can do this by adding or subtracting the difference between the sum of the rounded probabilities and 1 to one of the probabilities.
📗 If there are \(n\) characters and \(n_{a}\) "a"s, then the unigram probability of "a" is \(p_{a} = \dfrac{n_{a}}{n}\).
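A minimal sketch of the unigram computation and the rounding rule from the hint is given below. The function names are hypothetical, and the choice to push the rounding difference onto the largest entry is an assumption: the hint only says to adjust "one of the probabilities".

```python
from collections import Counter

ALPHABET = " abcdefghijklmnopqrstuvwxyz"  # space first, then a..z


def unigram_probs(text):
    """p_c = n_c / n for each of the 27 characters."""
    counts = Counter(text)
    n = sum(counts[c] for c in ALPHABET)
    return [counts[c] / n for c in ALPHABET]


def round_row(probs, digits=4):
    """Round to `digits` places, keep every entry > 0, and make the sum 1."""
    floor = 10 ** -digits  # 0.0001 for 4 decimal places
    rounded = [max(round(p, digits), floor) for p in probs]
    diff = round(1.0 - sum(rounded), digits)
    # Assumption: absorb the rounding error in the largest entry.
    i = rounded.index(max(rounded))
    rounded[i] = round(rounded[i] + diff, digits)
    return rounded


# Example formatting as a comma-separated line:
# print(",".join(f"{p:.4f}" for p in round_row(unigram_probs(text))))
```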






# Question 3 (Part 1)

📗 [5 points] Input the bigram transition probabilities without Laplace smoothing (27 lines, each line containing 27 numbers, comma-separated, rounded to 4 decimal places, "space" first, then "a", "b", ...). Note: "0" should be rounded up to "0.0001" and the probabilities on each row should sum up to "1.0000".
Hint
📗 If there are \(n_{a}\) "a"s and \(n_{a b}\) "ab"s, then the bigram transition probability from "a" to "b" (row "a" column "b") is \(p_{a b} = \dfrac{n_{a b}}{n_{a}}\).
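The counting behind this formula could look like the sketch below (function names are hypothetical). One assumption to note: each row is normalized by the number of times the character is followed by another character, which can differ from \(n_{a}\) by one for the very last character of the script.

```python
ALPHABET = " abcdefghijklmnopqrstuvwxyz"  # space first, then a..z
INDEX = {c: i for i, c in enumerate(ALPHABET)}


def bigram_counts(text):
    """counts[i][j] = number of times character i is followed by character j."""
    counts = [[0] * 27 for _ in range(27)]
    for prev, nxt in zip(text, text[1:]):
        counts[INDEX[prev]][INDEX[nxt]] += 1
    return counts


def bigram_probs(counts):
    """Row-normalize the counts without smoothing."""
    probs = []
    for row in counts:
        total = sum(row)
        # Assumption: a character that never occurs keeps an all-zero row here;
        # the rounding rule in the question still has to be applied afterwards.
        probs.append([c / total if total else 0.0 for c in row])
    return probs
```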






# Question 4 (Part 1)

📗 [5 points] Input the bigram transition probabilities with Laplace smoothing (27 lines, each line containing 27 numbers, comma-separated, rounded to 4 decimal places, "space" first, then "a", "b", ...). Note: "0" should be rounded up to "0.0001" and the probabilities on each row should sum up to "1.0000".
Hint
📗 Again, please make sure all probabilities are strictly larger than 0 after rounding to 4 decimal places. For example, 0.00002 should be rounded up to 0.0001 instead of 0.0. Also, the numbers should add up to exactly 1. You can do this by adding or subtracting the difference between the sum of the rounded probabilities and 1 to one of the probabilities.
📗 If there are \(n_{a}\) "a"s and \(n_{a b}\) "ab"s, then the bigram transition probability from "a" to "b" (row "a" column "b") is \(p_{a b} = \dfrac{n_{a b} + 1}{n_{a} + 27}\) with Laplace smoothing.
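As a rough sketch, the Laplace-smoothed formula can be applied directly to a 27 by 27 matrix of bigram counts. The function name and the `alpha` parameter are hypothetical; `alpha=1` corresponds to the "+1" and "+27" in the formula above, with the row sum of the counts standing in for \(n_{a}\).

```python
def laplace_bigram_probs(counts, alpha=1):
    """p_ab = (n_ab + alpha) / (n_a + alpha * 27) for a square count matrix."""
    size = len(counts)  # 27 for letters plus space
    return [
        [(c + alpha) / (sum(row) + alpha * size) for c in row]
        for row in counts
    ]
```

With smoothing, every row sums to 1 and has no zero entries, even for characters that never appear in the script.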






# Question 5 (Part 1)

📗 [10 points] Input the 26 sentences generated by the trigram (or N-gram with \(N \geq 3\)) and bigram models (Laplace smoothed). (26 lines, each line containing 1000 characters, line 1 starts with "a", line 2 starts with "b" ...).
Hint
📗 Suppose you start with "a", then you draw a random character according to the distribution specified by row "a" of the bigram transition matrix, i.e. \(p_{a \text{\;space\;}}, p_{a a}, p_{a b}, ..., p_{a z}\).
📗 Suppose you start with "ab" and \(p_{a b} > 0\) without Laplace smoothing, then you draw a random character according to the distribution specified by row "a" column "b" of the trigram transition array, i.e. \(p_{a b \text{\;space\;}}, p_{a b a}, p_{a b b}, ..., p_{a b z}\).
📗 Suppose you start with "yz" and \(p_{y z} = 0\) without Laplace smoothing, i.e. "yz" never appeared in your script, then you draw a random character according to the distribution specified by row "z" of the bigram transition matrix again, i.e. \(p_{z \text{\;space\;}}, p_{z a}, p_{z b}, ..., p_{z z}\).
📗 To generate a random character using CDF inversion according to a distribution, for example \(p_{1}, p_{2}, p_{3}\): compute the CDF \(p_{1}, p_{1} + p_{2}, p_{1} + p_{2} + p_{3}\), generate a random number \(u\) between 0 and 1, if \(0 \leq u < p_{1}\), output 1; if \(p_{1} \leq u < p_{1} + p_{2}\), output 2; if \(p_{1} + p_{2} \leq u < p_{1} + p_{2} + p_{3} = 1\), output 3.
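The hints above can be put together roughly as follows. This is a sketch under assumptions, not the reference solution: `sample_index` and `generate` are hypothetical names, `bigram[i][j]` is taken to be Pr{j | i}, `trigram[i][j][k]` is taken to be Pr{k | i j}, and `seen_pairs[i][j]` is assumed to be True exactly when the pair appeared in the script.

```python
import random

ALPHABET = " abcdefghijklmnopqrstuvwxyz"  # space first, then a..z
INDEX = {c: i for i, c in enumerate(ALPHABET)}


def sample_index(probs, u=None):
    """CDF inversion: return the first index whose cumulative sum exceeds u."""
    if u is None:
        u = random.random()  # uniform in [0, 1)
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return i
    # Guard against the CDF ending slightly below 1 due to rounding.
    return len(probs) - 1


def generate(start, bigram, trigram, seen_pairs, length=1000):
    """Generate `length` characters starting from the character `start`."""
    out = [INDEX[start]]
    # The second character always comes from the bigram model.
    out.append(sample_index(bigram[out[-1]]))
    while len(out) < length:
        i, j = out[-2], out[-1]
        # Fall back to the bigram row when the pair was never observed.
        row = trigram[i][j] if seen_pairs[i][j] else bigram[j]
        out.append(sample_index(row))
    return "".join(ALPHABET[k] for k in out)
```

Passing `u` explicitly to `sample_index` makes the sampler easy to check by hand against the three-outcome example in the hint.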




# Question 6 (Part 1)

📗 [2 points] Find one interesting sentence that contains at least some English words.
📗 Answer:


# Question 7 (Part 2)

📗 [5 points] Enter likelihood probabilities of the Naive Bayes estimator for the fake script. (27 numbers, comma separated, rounded to 4 decimal places, Pr{"space" | D = fake script}, Pr{"a" | D = fake script}, Pr{"b" | D = fake script}, ...). The likelihood probabilities for your script should be the same as your answer to Question 2.
Hint
📗 If there are \(n'\) characters and \(n'_{a}\) "a"s, then the likelihood probability of "a" is \(p'_{a} = \dfrac{n'_{a}}{n'}\).






# Question 8 (Part 2)

📗 [5 points] Enter posterior probabilities of the Naive Bayes estimator for the fake script. (27 numbers, comma separated, rounded to 4 decimal places, Pr{D = fake script | "space"}, Pr{D = fake script | "a"}, Pr{D = fake script | "b"}, ...).
Hint
📗 Suppose your prior probability that the script is fake is \(p\) (you can find this prior value under "Instruction" (Part 2), it is generated randomly based on your ID), then the posterior probability that the script is fake given there is a character "a" is,
\(\dfrac{p \cdot p'_{a}}{p \cdot p'_{a} + \left(1 - p\right) \cdot p_{a}}\).
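The posterior formula above is Bayes' rule applied character by character, and could be sketched as below. The function and argument names are hypothetical: `lik_fake` holds the \(p'_{a}\) values (Question 7) and `lik_real` holds the \(p_{a}\) values (Question 2), both in the order "space", "a", ..., "z".

```python
def posteriors_fake(prior_fake, lik_fake, lik_real):
    """Pr{D = fake | character} for each character, via Bayes' rule."""
    post = []
    for p_fake, p_real in zip(lik_fake, lik_real):
        numerator = prior_fake * p_fake
        # Assumption: both likelihoods are strictly positive (as the rounding
        # rule requires), so the denominator is never zero.
        denominator = numerator + (1 - prior_fake) * p_real
        post.append(numerator / denominator)
    return post
```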




# Question 9 (Part 2)

📗 [5 points] Use the Naive Bayes model to predict which document the 26 sentences you generated in Question 5 came from. Remember to compare the sum of log probabilities instead of the direct product of probabilities. (26 numbers, either 0 or 1, 0 is the label for your script, 1 is the label for the fake script)
Hint
📗 Your predictions are supposed to be all 0s. A few 1s are okay, but make sure you generated the sentences in Part 1 correctly according to the instructions (especially switching from trigram to bigram when necessary).
📗 (Correction: this is not the Naive Bayes assumption covered during the lecture. It assumes independence of the posteriors instead of independence of the likelihoods, but you can also use it for this assignment if you like.) The log-likelihood that a sentence "abbccc" came from the fake script is the sum of the log of the posterior probability of each character in the sentence,
\(\log\left(\mathbb{P}\left\{M | a\right\}\right) + 2 \log\left(\mathbb{P}\left\{M | b\right\}\right) + 3 \log\left(\mathbb{P}\left\{M | c\right\}\right)\), \(M\) is the event "D = fake script".
📗 The log-likelihood that a sentence "abbccc" came from your script is similar, 
\(\log\left(1 - \mathbb{P}\left\{M | a\right\}\right) + 2 \log\left(1 - \mathbb{P}\left\{M | b\right\}\right) + 3 \log\left(1 - \mathbb{P}\left\{M | c\right\}\right)\), the probability in the log is the probability of the event "D = your script".
📗 Based on the Naive Bayes assumption from the lecture, the log-likelihood that a sentence "abbccc" came from the fake script is, up to an additive constant shared by both documents, the sum of the logs of the likelihood probabilities of the characters in the sentence plus the log of the prior probability,
(1) \(\log\left(\mathbb{P}\left\{a | M\right\}\right) + 2 \log\left(\mathbb{P}\left\{b | M\right\}\right) + 3 \log\left(\mathbb{P}\left\{c | M\right\}\right) + \log\left(\mathbb{P}\left\{M\right\}\right)\), \(M\) is the event "D = fake script".
📗 The log-likelihood that a sentence "abbccc" came from your script is similar, 
(2) \(\log\left(\mathbb{P}\left\{a | \neg M\right\}\right) + 2 \log\left(\mathbb{P}\left\{b | \neg M\right\}\right) + 3 \log\left(\mathbb{P}\left\{c | \neg M\right\}\right) + \log\left(\mathbb{P}\left\{\neg M\right\}\right)\), \(\neg M\) is the event "D = your script".
📗 The prediction will be 1 if expression (1) is larger than expression (2), and 0 otherwise.
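A small sketch of the comparison between expressions (1) and (2) is shown below, using the likelihood-based Naive Bayes version from the lecture. The function name `predict` and its argument names are hypothetical; the likelihood lists are assumed to be ordered "space", "a", ..., "z" and to contain no zeros.

```python
import math

ALPHABET = " abcdefghijklmnopqrstuvwxyz"  # space first, then a..z
INDEX = {c: i for i, c in enumerate(ALPHABET)}


def predict(sentence, lik_fake, lik_real, prior_fake):
    """Return 1 if the fake script scores higher, and 0 otherwise."""
    score_fake = math.log(prior_fake)      # log Pr{M}
    score_real = math.log(1 - prior_fake)  # log Pr{not M}
    for ch in sentence:
        k = INDEX[ch]
        # Summing logs avoids underflow from multiplying 1000 small numbers.
        score_fake += math.log(lik_fake[k])
        score_real += math.log(lik_real[k])
    return 1 if score_fake > score_real else 0
```

Multiplying 1000 probabilities directly would underflow to 0 for both documents, which is why the scores are accumulated in log space.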




# Question 10

📗 [1 point] Please confirm that you are going to submit the code on Canvas under Assignment P3, and make sure you give attribution for all blocks of code you did not write yourself (see bottom of the page for details and examples).
I will submit the code on Canvas.

# Question 11

📗 [1 point] Please enter any comments and suggestions, including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the questions that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.
📗 Answer:

# Grade



# Submission


📗 Please do not modify the content in the above text field: use the "Grade" button to update.
📗 Warning: grading may take around 10 to 20 seconds. Please be patient and do not click "Grade" multiple times.


📗 You could submit multiple times (but please do not submit too often): only the latest submission will be counted. 
📗 Please also save the text in the above text box to a file using the button, or copy and paste it into a file yourself. You can also include the resulting file with your code on Canvas Assignment P3.
📗 You can load your answers from the text (or txt file) in the text box below using the button. The first two lines should be "##p: 3" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.
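If you print all outputs from your program, the file format described above could be produced with something like the sketch below. The function name `format_answers` and the dictionary layout are hypothetical. Each answer string is passed through unchanged, so how multi-line answers (such as the 27-line matrices) are laid out is left to you; the page does not specify it here.

```python
def format_answers(student_id, answers, problem=3):
    """Build the '##p', '##id', '##1', '##2', ... lines as one string.

    `answers` maps question numbers to answer strings (assumed layout).
    """
    lines = [f"##p: {problem}", f"##id: {student_id}"]
    for number in sorted(answers):
        lines.append(f"##{number}: {answers[number]}")
    return "\n".join(lines)


# Example: write the result to a text file that can be loaded on the page.
# with open("p3_answers.txt", "w") as f:
#     f.write(format_answers("your_id", {1: "movie name", 2: "0.1500,0.0700"}))
```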



📗 Saving and loading may take around 10 to 20 seconds. Please be patient and do not click "Load" multiple times.

# Solutions

📗 The sample solutions in Java and Python will be posted around the deadline. You are allowed to copy and use parts of the solution with attribution. You are allowed to use code from other people (with their permission) and from the Internet, but you must give attribution at the beginning of your code. You are allowed to use large language models such as GPT4 to write parts of the code for you, but you have to include the prompts you used in the code submission. For example, you can put the following comments at the beginning of your code:
% Code attribution: (TA's name)'s P3 example solution.
% Code attribution: (student name)'s P3 solution.
% Code attribution: (student name)'s answer on Piazza: (link to Piazza post).
% Code attribution: (person or account name)'s answer on Stack Overflow: (link to page).
% Code attribution: (large language model name e.g. GPT4): (include the prompts you used).
📗 Solution:
📗 Java: Link
📗 Python: Link
📗 Sample solution from last year: 2022 P3. The homework is slightly different, so please use it with caution.
📗 You can get help on understanding the algorithm from any of the office hours; to get help with debugging, please go to the TA's office hours. For times and locations see the Home page. You are encouraged to work with other students, but if you use their code, you must give attribution at the beginning of your code.





Last Updated: November 30, 2024 at 4:34 AM