📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click (or hit enter key)
📗 The official deadline is July 18, but you can submit or resubmit without penalty until August 15.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You can either copy and paste (or load) your program outputs into the text boxes for individual questions, or print all your outputs to a single text file and load it using the button at the bottom of the page.
📗 Please do not refresh the page: your answers will not be saved.
📗 You should implement the algorithms using the mathematical formulas from the slides. You can use packages and libraries to preprocess and read the data and format the outputs. It is not recommended that you use machine learning packages or libraries, but you will not lose points for doing so.
📗 (Introduction) In this programming homework, you will build and simulate a simple Markov chain model based on a movie script. You will use it to generate new sentences that hopefully contain sensible words, maybe even phrases. In addition, you will build a Naive Bayes classifier to distinguish sentences from the script and sentences from another fake script. Due to the size of the English vocabulary, you will use characters as features instead of words. In practice, you could replace the 26 letters with the (more than 170,000) English words when training these models.
📗 (Part 1) Download the script of one of the following movies: "", "", "" from IMSDb: Link. If you have another movie you really like, you can use that script instead. Go to the website, use "Search IMSDb" on the left to search for this movie, click on the link: "Read ... Script", and copy and paste the script into a text file.
📗 (Part 1) Make everything lower case, then remove all characters except for letters and spaces. Replace each run of consecutive spaces with a single space, and check that no consecutive spaces remain.
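The preprocessing step above could be sketched as follows (a minimal Python sketch; the function name `preprocess` is just illustrative, and you can use any equivalent approach):

```python
import re

def preprocess(text):
    """Lower-case the script, keep only letters and spaces, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z ]", "", text)  # drop everything except a-z and space
    text = re.sub(r" +", " ", text)      # replace runs of spaces with one space
    return text.strip()
```

For example, `preprocess("Hello,  World! 123")` returns `"hello world"`.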
📗 (Part 1) Construct the bigram character (letters + space) transition probability table. Put "space" first then "a", "b", "c", ..., "z". It should be a 27 by 27 matrix.
📗 (Part 1) Construct the trigram transition probability table. It could be a 27 by 27 by 27 array or a 729 by 27 matrix. You do not have to submit this table.
📗 (Part 1) Generate 26 sentences consisting of 1000 characters each using the trigram model starting from "a", "b", "c", ..., "z". You should use the bigram model to generate the second character and switch to the bigram model when the current two-character sequence never appeared in the script. For example, when you see "xz", instead of using the trigram model for the probabilities of Pr{? | xz}, switch to the bigram model for the probabilities of Pr{? | z}. Find and share some interesting sentences.
📗 (Part 2) The following is my randomly generated script written according to a non-English language model.
You can either use the button to download a text file, or copy and paste from the text box to a text file. Please do not change the content of the text box.
📗 (Part 2) Train a Naive Bayes classifier based on the characters. You should use the prior that is biased against your script: Pr{Document = your script} = ? and Pr{Document = my script} = ?, compute the likelihood Pr{Letter | Document} based on your script and my script, and compute the posterior probabilities Pr{Document | Letter}, and test your classifier on the 26 random sentences you generated.
📗 [5 points] Input the unigram probabilities (27 numbers, comma-separated, rounded to 4 decimal places, "space" first, then "a", "b", ...).
Hint
📗 Please make sure all probabilities are strictly larger than 0 after rounding to 4 decimal places; for example, 0.00002 should be rounded up to 0.0001 instead of 0.0. Also, the numbers should add up to exactly 1; you can ensure this by adding or subtracting the difference between the sum of the rounded probabilities and 1 to one of the probabilities.
📗 If there are \(n\) characters and \(n_{a}\) "a"s, then the unigram probability of "a" is \(p_{a} = \dfrac{n_{a}}{n}\).
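The unigram counting and the rounding adjustment described above could look like this (a Python sketch; `unigram_probs` and `ALPHABET` are illustrative names, not required by the assignment):

```python
from collections import Counter

ALPHABET = " abcdefghijklmnopqrstuvwxyz"  # "space" first, then "a", ..., "z"

def unigram_probs(text):
    """Unigram probabilities p_a = n_a / n over the 27 characters,
    rounded to 4 decimals, with seen characters kept strictly positive
    and the total adjusted to sum to exactly 1."""
    counts = Counter(text)
    n = len(text)
    probs = []
    for c in ALPHABET:
        p = round(counts[c] / n, 4)
        if counts[c] > 0 and p == 0.0:
            p = 0.0001  # round up so characters that occur stay strictly positive
        probs.append(p)
    # dump the rounding difference into one of the probabilities
    probs[0] = round(probs[0] + (1.0 - sum(probs)), 4)
    return probs
```

For example, `unigram_probs("aa b")` gives 0.5 for "a" and the 27 numbers sum to exactly 1.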
📗 [5 points] Input the bigram transition probabilities without Laplace smoothing (27 lines, each line containing 27 numbers, comma-separated, rounded to 4 decimal places, "space" first, then "a", "b", ...).
Hint
📗 If there are \(n_{a}\) "a"s and \(n_{a b}\) "ab"s, then the bigram transition probability from "a" to "b" (row "a" column "b") is \(p_{a b} = \dfrac{n_{a b}}{n_{a}}\).
📗 [5 points] Input the bigram transition probabilities with Laplace smoothing (27 lines, each line containing 27 numbers, comma-separated, rounded to 4 decimal places, "space" first, then "a", "b", ...).
Hint
📗 Again, please make sure all probabilities are strictly larger than 0 after rounding to 4 decimal places; for example, 0.00002 should be rounded up to 0.0001 instead of 0.0. Also, the numbers should add up to exactly 1; you can ensure this by adding or subtracting the difference between the sum of the rounded probabilities and 1 to one of the probabilities.
📗 If there are \(n_{a}\) "a"s and \(n_{a b}\) "ab"s, then the bigram transition probability from "a" to "b" (row "a" column "b") is \(p_{a b} = \dfrac{n_{a b} + 1}{n_{a} + 27}\) with Laplace smoothing.
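Both bigram formulas (without and with Laplace smoothing) could be computed from the pair counts in one pass; the sketch below is illustrative (`bigram_matrix` is not an assignment-mandated name), and the 4-decimal rounding adjustment from the earlier hint would still need to be applied to each row before submitting:

```python
ALPHABET = " abcdefghijklmnopqrstuvwxyz"  # "space" first, then "a", ..., "z"

def bigram_matrix(text, laplace=False):
    """27x27 transition matrix: row = current character, column = next character.
    p_ab = n_ab / n_a, or (n_ab + 1) / (n_a + 27) with Laplace smoothing."""
    idx = {c: i for i, c in enumerate(ALPHABET)}
    counts = [[0] * 27 for _ in range(27)]
    for prev, nxt in zip(text, text[1:]):
        counts[idx[prev]][idx[nxt]] += 1
    matrix = []
    for row in counts:
        n_a = sum(row)  # number of times this character is followed by anything
        if laplace:
            matrix.append([(c + 1) / (n_a + 27) for c in row])
        else:
            matrix.append([c / n_a if n_a > 0 else 0.0 for c in row])
    return matrix
```

On the toy text `"abab"`, the unsmoothed probability of "b" after "a" is 1.0, while the Laplace-smoothed value is (2 + 1) / (2 + 27).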
📗 [10 points] Input the 26 sentences generated by the trigram and bigram models (Laplace smoothed). (26 lines, each line containing 1000 characters, line 1 starts with "a", line 2 starts with "b" ...).
Hint
📗 Suppose you start with "a", then you draw a random character according to the distribution specified by row "a" of the bigram transition matrix, i.e. \(p_{a, \text{\;space\;}}, p_{a a}, p_{a, b}, ..., p_{a, z}\).
📗 Suppose you start with "ab" and \(p_{a b} > 0\) without Laplace smoothing, then you draw a random character according to the distribution specified by row "a" column "b" of trigram transition array, i.e. \(p_{a b \text{\;space\;}}, p_{a b a}, p_{a b b}, ..., p_{a b z}\).
📗 Suppose you start with "yz" and \(p_{y z} = 0\) without Laplace smoothing, i.e. "yz" never appeared in your script, then you draw a random character according to the distribution specified by row "z" of the bigram transition matrix again, i.e. \(p_{z \text{\;space\;}}, p_{z a}, p_{z b}, ..., p_{z z}\).
📗 To generate a random character using CDF inversion according to a distribution, for example \(p_{1}, p_{2}, p_{3}\): compute the CDF \(p_{1}, p_{1} + p_{2}, p_{1} + p_{2} + p_{3}\), generate a random number \(u\) between 0 and 1, if \(0 \leq u < p_{1}\), output 1; if \(p_{1} \leq u < p_{1} + p_{2}\), output 2; if \(p_{1} + p_{2} \leq u < p_{1} + p_{2} + p_{3} = 1\), output 3.
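The generation procedure in the three hints above could be sketched as follows. This is only an illustration under some assumptions: `bigram` is the 27-row transition matrix, `trigram[i][j]` is the next-character distribution after the pair (i, j), and `bigram_raw` is the unsmoothed bigram matrix used only to detect pairs that never appeared in the script; all parameter names are hypothetical:

```python
import random

ALPHABET = " abcdefghijklmnopqrstuvwxyz"

def sample(dist, rng):
    """CDF inversion: draw an index i with probability dist[i]."""
    u = rng.random()
    cdf = 0.0
    for i, p in enumerate(dist):
        cdf += p
        if u < cdf:
            return i
    return len(dist) - 1  # guard against floating-point round-off in the CDF

def generate(start, bigram, trigram, bigram_raw, length=1000, seed=0):
    """Generate `length` characters: the second from the bigram model,
    the rest from the trigram model, falling back to the bigram model
    whenever the current pair never appeared in the script."""
    idx = {c: i for i, c in enumerate(ALPHABET)}
    rng = random.Random(seed)
    out = [start]
    out.append(ALPHABET[sample(bigram[idx[start]], rng)])
    while len(out) < length:
        i, j = idx[out[-2]], idx[out[-1]]
        if bigram_raw[i][j] > 0:       # pair was seen: use the trigram row
            out.append(ALPHABET[sample(trigram[i][j], rng)])
        else:                          # unseen pair like "xz": fall back to bigram
            out.append(ALPHABET[sample(bigram[j], rng)])
    return "".join(out)
```

Looping `generate(c, ...)` over `"abcdefghijklmnopqrstuvwxyz"` then yields the 26 sentences, one per starting letter.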
📗 [5 points] Enter likelihood probabilities of the Naive Bayes estimator for my script. (27 numbers, comma separated, rounded to 4 decimal places, Pr{"space" | D = my script}, Pr{"a" | D = my script}, Pr{"b" | D = my script}, ...). The likelihood probabilities for your script should be the same as your answer to Question 2.
Hint
📗 If there are \(n'\) characters and \(n'_{a}\) "a"s, then the likelihood probability of "a" is \(p'_{a} = \dfrac{n'_{a}}{n'}\).
📗 [5 points] Enter posterior probabilities of the Naive Bayes estimator for my script. (27 numbers, comma separated, rounded to 4 decimal places, Pr{D = my script | "space"}, Pr{D = my script | "a"}, Pr{D = my script | "b"}, ...).
Hint
📗 Suppose your prior probability that the script is mine is \(p\) (you can find this prior value under "Instruction" (Part 2); it is generated randomly based on your ID), then the posterior probability that the script is mine given there is a character "a" is \(\mathbb{P}\left\{M | a\right\} = \dfrac{p \cdot p'_{a}}{p \cdot p'_{a} + \left(1 - p\right) p_{a}}\), where \(p'_{a}\) is the likelihood of "a" given my script and \(p_{a}\) is the likelihood of "a" given your script.
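The Bayes-rule computation above, applied to all 27 characters at once, could look like this (a sketch; `posterior_mine` is an illustrative name, and `lik_mine` / `lik_yours` are the two likelihood vectors from the previous questions):

```python
def posterior_mine(p_prior, lik_mine, lik_yours):
    """Pr{D = my script | letter} for each letter, by Bayes rule:
    p * p'_a / (p * p'_a + (1 - p) * p_a)."""
    return [p_prior * m / (p_prior * m + (1.0 - p_prior) * y)
            for m, y in zip(lik_mine, lik_yours)]
```

As a sanity check, with a prior of 0.5 and equal likelihoods the posterior is 0.5 for every character.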
📗 [5 points] Use the Naive Bayes model to predict which document the 26 sentences you generated in Question 5 came from. Remember to compare the sum of log probabilities instead of the direct product of probabilities. (26 numbers, either 0 or 1, 0 is the label for your script, 1 is the label for mine)
Hint
📗 Your prediction is supposed to be all 0s, but if there are a few 1s, that is okay; just make sure you generated the sentences in Part 1 correctly according to the instructions (especially switching from the trigram to the bigram model when necessary).
📗 The log-likelihood that a sentence "abbccc" came from my script is the sum of the log of the posterior probability of each character in the sentence,
\(\log\left(\mathbb{P}\left\{M | a\right\}\right) + 2 \log\left(\mathbb{P}\left\{M | b\right\}\right) + 3 \log\left(\mathbb{P}\left\{M | c\right\}\right)\), \(M\) is the event "D = my script".
📗 The log-likelihood that a sentence "abbccc" came from your script is similar,
\(\log\left(1 - \mathbb{P}\left\{M | a\right\}\right) + 2 \log\left(1 - \mathbb{P}\left\{M | b\right\}\right) + 3 \log\left(1 - \mathbb{P}\left\{M | c\right\}\right)\), the probability in the log is the probability of the event "D = your script".
📗 [1 point] Please enter any comments and suggestions, including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the questions that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.
📗 Please do not modify the content in the above text field: use the "Grade" button to update.
📗 Warning: grading may take around 10 to 20 seconds. Please be patient and do not click "Grade" multiple times.
📗 You could submit multiple times (but please do not submit too often): only the latest submission will be counted.
📗 Please also save the text in the above text box to a file using the button, or copy and paste it into a file yourself. Please submit the resulting file with your code on Canvas Assignment P3.
📗 You could load your answers from the text (or txt file) in the text box below using the button. The first two lines should be "##p: 3" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.
📗 Saving and loading may take around 10 to 20 seconds. Please be patient and do not click "Load" multiple times.
📗 The sample solution in Java and Python will be posted around the deadline. You are allowed to copy and use parts of the solution without attribution. You are allowed to use code from other people and from the Internet, but you must give proper attribution at the beginning of your code. MOSS will be used for code plagiarism check: blocks of copied code without attribution will result in a zero for the whole assignment.
📗 Sample solution from last year: 2020 P3. The homework is slightly different, please use with caution.
📗 Sample solution:
Java: File
Python: File
For part 1, the solution does not use Laplace smoothing: you need to modify the formulas computing all probabilities.
For part 2, the solution assumes the prior is 0.5, so the formula is different. Make sure you change it.
📗 You can get help on understanding the algorithm from any of the office hours. To get help with debugging code in Java, please come during the Monday to Friday 2:00 to 3:00 Zoom office hours or Saturday to Sunday 2:00 to 3:00 (I can stay for a few hours after 3:00 by appointment) in-person office hours. To get help with debugging code in Python, please come during the Tuesday 3:00 to 5:00 in-person office hours or the Thursday 3:00 to 5:00 Zoom office hours. You are encouraged to work with other students, but if you use their code, you must give attribution at the beginning of your code.