


# P1 Programming Problem Instruction

📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click (or hit enter key)
📗 The official deadline is July 4, but you can submit or resubmit without penalty until August 15.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could either copy and paste or load your program outputs into the text boxes for individual questions or print all your outputs to a single text file and load it using the button at the bottom of the page.
📗 Please do not refresh the page: your answers will not be saved.
📗 You should implement the algorithms using the mathematical formulas from the slides. You can use packages and libraries to preprocess and read the data and format the outputs. It is not recommended that you use machine learning packages or libraries, but you will not lose points for doing so.
📗 Please report any bugs on Piazza.

# Warning: please enter your ID before you start!



📗 (Introduction) In this project, you will build a logistic regression model and a neural network to classify hand-written digits. Your models should take pixel intensities of images as inputs and output which digits the images display.

📗 (Part 1) Read and download the training set images and labels from MNIST or CSV Files (easier to read) or the same dataset in another format from other places.

📗 (Part 1) Extract the training set data of the digits ? (label it 0) and ? (label it 1). If there are \(n\) images in your training set, you should create an \(n \times 784\) feature matrix \(x\) and an \(n \times 1\) vector of labels \(y\). Please rescale so that the feature vectors contain only numbers between 0 and 1. You can do this by dividing all the numbers by 255. (The training images contain \(28 \times 28 = 784\) pixels, and each pixel corresponds to an input unit.)

📗 (Part 1) Train a logistic regression on the dataset and plot the weights in a 28 by 28 grid.

📗 (Part 1) Predict the new images in the following test set. The predictions should be one of 0 or 1.

Note: this field may take a few seconds to load. You can either use the button to download a text file, or copy and paste from the text box into Excel or a CSV file. Please do not change the content of the text box.

📗 (Part 2) Train a neural network with one hidden layer. The number of hidden units should be the square root of the number of input units (here, the number of input units is 784, so the number of hidden units should be 28). Use the logistic activation function in both layers.

📗 (Part 2) Predict the new images in the same test set. The predictions should be either 0 or 1.

# Question 1

📗 [1 points] Enter the feature vector of any one training image (784 numbers, rounded to 2 decimal places, in one line, comma separated):
Hint 📗 When reading the CSV file, you should read only the lines corresponding to the digits you are classifying and create an n by 1 array \(y\) to store the first column (either 0 or 1), and create an n by m array \(x\) to store the remaining columns.
📗 Make sure that you rescale (by dividing by 255) so that the feature vectors contain only numbers between 0 and 1.
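📗 The reading and rescaling steps above can be sketched as follows. This is only a minimal sketch: it assumes a CSV layout with the label in the first column and the pixels in the remaining columns, uses a tiny 3-pixel stand-in instead of real 784-pixel rows, and the digits 4 and 7 are placeholders for the two digits assigned to you. The helper names `load_two_digits` and `format_vector` are made up for illustration.

```python
import csv
import io

def load_two_digits(csv_text, digit0, digit1):
    """Keep only rows labeled digit0 (relabeled 0) or digit1 (relabeled 1)."""
    x, y = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        label = int(row[0])
        if label == digit0 or label == digit1:
            y.append(0 if label == digit0 else 1)
            # Rescale pixel intensities from 0..255 to 0..1.
            x.append([int(v) / 255.0 for v in row[1:]])
    return x, y

def format_vector(values, places=2):
    """One line, comma separated, rounded to the given number of decimals."""
    return ",".join(f"{v:.{places}f}" for v in values)

# Stand-in data: label followed by 3 pixels (real rows have 784 pixels).
sample = "4,0,255,51\n7,255,0,0\n9,1,2,3\n"
x, y = load_two_digits(sample, digit0=4, digit1=7)  # placeholder digits
print(format_vector(x[0]))  # prints 0.00,1.00,0.20
```

For a real file, you could pass the file contents (or adapt the loop to read from an open file) instead of the `sample` string.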


Plot the image to make sure you entered the vector correctly:



# Question 2

📗 [5 points] Enter the logistic regression weights and biases (784 + 1 numbers, rounded to 4 decimal places, in one line, comma separated), the bias term should be the last number:
Hint 📗 Create an m by 1 array \(w\) to store the weights and a number \(b\). Initialize them with random numbers between -1 and 1. (Initializing them with 0 is okay too, but it can make convergence slower.)
📗 Pick a learning rate \(\alpha\): the choice depends on the digits you are classifying, but you could start by trying \(\alpha = 0.1, 0.01, 0.001\) and look at how quickly the weights converge.
📗 Update \(w\) and \(b\) according to the formula in the Lecture 2 slides:
\(a_{i} = \dfrac{1}{1 + \exp\left(-\left(\left(\displaystyle\sum_{j=1}^{m} w_{j} x_{ij}\right) + b\right)\right)}\) for i = 1, ..., n,
\(w_{j} = w_{j} - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right) x_{ij}\) for j = 1, ..., m,
\(b = b - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)\).
📗 Note: if you want to use stochastic gradient descent, you should shuffle the dataset before going through \(i\) and update:
\(w_{j} = w_{j} - \alpha \left(a_{i} - y_{i}\right) x_{ij}\) for j = 1, ..., m,
\(b = b - \alpha \left(a_{i} - y_{i}\right)\).
See Wikipedia for one algorithm to shuffle.
📗 Remember to compute the cost at each step and store the cost from the previous step:
\(C = -\displaystyle\sum_{i=1}^{n} \left(y_{i} \log a_{i} + \left(1 - y_{i}\right) \log \left(1 - a_{i}\right)\right)\).
📗 Because \(0 \log 0\) evaluates to NaN in many programming languages, the cost function should be computed as \(C = \displaystyle\sum_{i} C_{i}\), where:
\(C_{i} = -\log\left(1 - a_{i}\right)\) if \(y_{i} = 0\),
and \(C_{i} = -\log a_{i}\) if \(y_{i} = 1\),
and \(C_{i}\) = something very large (like 10000) if \(y_{i} = 0, a_{i} \approx 1\) or \(y_{i} = 1, a_{i} \approx 0\).
📗 Repeat the previous two steps until the decrease in \(C\) is smaller than something like 0.0001, or \(C\) is smaller than a fixed number like 1, or the number of iterations (epochs) is too large (say 1000).
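📗 The batch update, the guarded cost, and the stopping rule above can be put together roughly as follows. This is a sketch under stated assumptions, not the official solution: the function names, the `tiny` threshold used to decide when \(a_{i}\) is "approximately" 0 or 1, and the 4-image toy dataset are all made up for illustration.

```python
import math
import random

def sigmoid(z):
    # Split on the sign of z so that exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def activations(x, w, b):
    return [sigmoid(sum(wj * xij for wj, xij in zip(w, xi)) + b) for xi in x]

def cost(a, y, big=10000.0, tiny=1e-12):
    # Piecewise cost: only the term matching y_i is evaluated, and a
    # large constant replaces -log(0) when the prediction is saturated.
    total = 0.0
    for ai, yi in zip(a, y):
        p = ai if yi == 1 else 1.0 - ai
        total += big if p < tiny else -math.log(p)
    return total

def train_logistic(x, y, alpha=0.1, tol=1e-4, c_min=1.0, max_epochs=1000, seed=0):
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in range(len(x[0]))]
    b = rng.uniform(-1.0, 1.0)
    prev = c = cost(activations(x, w, b), y)
    for _ in range(max_epochs):
        err = [ai - yi for ai, yi in zip(activations(x, w, b), y)]
        # Batch update: sum over all training items for each weight.
        w = [wj - alpha * sum(e * xi[j] for e, xi in zip(err, x))
             for j, wj in enumerate(w)]
        b -= alpha * sum(err)
        c = cost(activations(x, w, b), y)
        if abs(prev - c) < tol or c < c_min:
            break
        prev = c
    return w, b, c

# Toy data with 2 "pixels" per image (real images have 784).
x = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
y = [0, 0, 1, 1]
w, b, c = train_logistic(x, y, alpha=0.3, c_min=0.1)
```

On the real dataset, the learning rate and thresholds will likely need tuning, as the hints above describe.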







# Question 3

📗 [10 points] Enter the activation values on the test set (200 numbers between 0 and 1, rounded to 2 decimal places, in one line, comma separated):
Hint 📗 Read the test file: the first column is not \(y\), so just store everything in an array \(\hat{x}\). Remember to divide \(\hat{x}\) by 255.
📗 For each line \(\hat{x}_{i}\), compute the activation value by:
\(a_{i} = \dfrac{1}{1 + \exp\left(-\left(\left(\displaystyle\sum_{j=1}^{m} w_{j} \hat{x}_{ij}\right) + b\right)\right)}\).
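📗 A minimal sketch of this step, assuming the test rows hold raw 0 to 255 pixel values and that \(w\) and \(b\) come from your trained model. The weights and the 2-pixel rows below are made-up stand-ins, and `activations_on_test_set` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    # Split on the sign of z so that exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def activations_on_test_set(x_hat_raw, w, b):
    """Rescale each raw test row by 255, then apply the logistic formula."""
    out = []
    for row in x_hat_raw:
        z = b + sum(wj * (v / 255.0) for wj, v in zip(w, row))
        out.append(sigmoid(z))
    return out

w, b = [2.0, -2.0], 0.0              # stand-in weights, not trained values
rows = [[255, 0], [0, 255], [0, 0]]  # stand-in test rows
a = activations_on_test_set(rows, w, b)
print(",".join(f"{ai:.2f}" for ai in a))  # prints 0.88,0.12,0.50
```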




# Question 4

📗 [10 points] Enter the predicted values on the test set (200 integers, 0 or 1, prediction, in one line):
Hint 📗 If \(a_{i} < 0.5\), the label is 0, and if \(a_{i} \geq 0.5\), the label is 1.
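📗 The thresholding rule can be written in one line. The sketch below uses made-up activation values, and the comma separator in the printed line is an assumption, since the question only asks for the predictions "in one line".

```python
def predict_labels(a, threshold=0.5):
    """Label 1 when the activation is at least the threshold, else 0."""
    return [1 if ai >= threshold else 0 for ai in a]

a = [0.12, 0.5, 0.88, 0.499]  # stand-in activations
labels = predict_labels(a)
print(",".join(str(v) for v in labels))  # prints 0,1,1,0
```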




# Question 5

📗 [5 points] Enter the first layer weights and biases (784 + 1 lines, each line containing 28 numbers, rounded to 4 decimal places, comma separated). The biases should be on the last line: for the first 784 lines, line i element j represents the weight from input unit i to hidden unit j, and for the last line, element j represents the bias for the hidden unit j:
Hint 📗 Let h be the number of hidden units. Create an m by h array \(w^{\left(1\right)}\) to store the weights for layer 1 and an h by 1 array \(b^{\left(1\right)}\) to store the biases for layer 1. Create an h by 1 array \(w^{\left(2\right)}\) to store the weights for layer 2 and a number \(b^{\left(2\right)}\) to store the bias for layer 2. Initialize them with random numbers between -1 and 1. (Initializing them with 0 is okay too, but it can make convergence slower.)
📗 Pick a learning rate \(\alpha\): the choice depends on the digits you are classifying, but you could start by trying \(\alpha = 0.1, 0.01, 0.001\) and look at how quickly the weights converge.
📗 Batch gradient descent is too slow, so use stochastic gradient descent instead. Shuffle the dataset before going through \(i\), and update \(w\) and \(b\) according to the formulas in the Lecture 3 or 4 slides (please check to make sure they are correct!):
\(a^{\left(1\right)}_{ij} = \dfrac{1}{1 + \exp\left(- \left(\left(\displaystyle\sum_{j'=1}^{m} x_{ij'} w^{\left(1\right)}_{j'j}\right) + b^{\left(1\right)}_{j}\right)\right)}\) for j = 1, ..., h,
\(a^{\left(2\right)}_{i} = \dfrac{1}{1 + \exp\left(- \left(\left(\displaystyle\sum_{j=1}^{h} a^{\left(1\right)}_{ij} w^{\left(2\right)}_{j}\right) + b^{\left(2\right)}\right)\right)}\),
\(\dfrac{\partial C}{\partial w^{\left(1\right)}_{j'j}} = \left(a^{\left(2\right)}_{i} - y_{i}\right) a^{\left(2\right)}_{i} \left(1 - a^{\left(2\right)}_{i}\right) w_{j}^{\left(2\right)} a_{ij}^{\left(1\right)} \left(1 - a_{ij}^{\left(1\right)}\right) x_{ij'}\) for j' = 1, ..., m, j = 1, ..., h,
\(\dfrac{\partial C}{\partial b^{\left(1\right)}_{j}} = \left(a^{\left(2\right)}_{i} - y_{i}\right) a^{\left(2\right)}_{i} \left(1 - a^{\left(2\right)}_{i}\right) w_{j}^{\left(2\right)} a_{ij}^{\left(1\right)} \left(1 - a_{ij}^{\left(1\right)}\right)\) for j = 1, ..., h,
\(\dfrac{\partial C}{\partial w^{\left(2\right)}_{j}} = \left(a^{\left(2\right)}_{i} - y_{i}\right) a^{\left(2\right)}_{i} \left(1 - a^{\left(2\right)}_{i}\right) a_{ij}^{\left(1\right)}\) for j = 1, ..., h,
\(\dfrac{\partial C}{\partial b^{\left(2\right)}} = \left(a^{\left(2\right)}_{i} - y_{i}\right) a^{\left(2\right)}_{i} \left(1 - a^{\left(2\right)}_{i}\right)\),
\(w^{\left(1\right)}_{j' j} \leftarrow w^{\left(1\right)}_{j' j} - \alpha \dfrac{\partial C}{\partial w^{\left(1\right)}_{j' j}}\) for j' = 1, ..., m, j = 1, ..., h,
\(b^{\left(1\right)}_{j} \leftarrow b^{\left(1\right)}_{j} - \alpha \dfrac{\partial C}{\partial b^{\left(1\right)}_{j}}\) for j = 1, ..., h,
\(w^{\left(2\right)}_{j} \leftarrow w^{\left(2\right)}_{j} - \alpha \dfrac{\partial C}{\partial w^{\left(2\right)}_{j}}\) for j = 1, ..., h,
\(b^{\left(2\right)} \leftarrow b^{\left(2\right)} - \alpha \dfrac{\partial C}{\partial b^{\left(2\right)}}\).
📗 It is very easy to make a mistake here. You can check whether your gradient computation is correct by computing the numerical gradient using finite differences and comparing it with your gradient:
\(\dfrac{\partial C}{\partial v} \approx \dfrac{C\left(v + \varepsilon\right) - C\left(v\right)}{\varepsilon}, \varepsilon = 0.0001\).
Here, \(v\) is any single entry of \(w^{\left(1\right)}\), \(b^{\left(1\right)}\), \(w^{\left(2\right)}\), or \(b^{\left(2\right)}\). For details, see Wikipedia.
📗 Remember to compute the cost at each step and store the cost from the previous step:
\(C = \dfrac{1}{2} \displaystyle\sum_{i=1}^{n} \left(y_{i} - a^{\left(2\right)}_{i}\right)^{2}\).
📗 Repeat the previous two steps until the decrease in \(C\) is smaller than something like 0.0001, or \(C\) is smaller than a fixed number like 1, or the number of iterations (epochs) is too large (say 1000).
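📗 The stochastic gradient step and the finite-difference check above can be sketched as follows. This is an illustration under assumptions, not the posted solution: the parameter dictionary, the function names, and the toy sizes (3 inputs, 2 hidden units, 4 training items) are made up so that the example runs quickly. For the assignment, \(m = 784\) and \(h = 28\).

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def init_params(m, h, rng):
    return {
        "w1": [[rng.uniform(-1, 1) for _ in range(h)] for _ in range(m)],
        "b1": [rng.uniform(-1, 1) for _ in range(h)],
        "w2": [rng.uniform(-1, 1) for _ in range(h)],
        "b2": rng.uniform(-1, 1),
    }

def forward(xi, p):
    h = len(p["b1"])
    a1 = [sigmoid(sum(xk * row[j] for xk, row in zip(xi, p["w1"])) + p["b1"][j])
          for j in range(h)]
    a2 = sigmoid(sum(a1[j] * p["w2"][j] for j in range(h)) + p["b2"])
    return a1, a2

def gradients(xi, yi, p):
    # Gradients of the single-item squared error, following the formulas above.
    a1, a2 = forward(xi, p)
    d2 = (a2 - yi) * a2 * (1.0 - a2)                       # dC/db2
    d1 = [d2 * w2j * a1j * (1.0 - a1j)
          for w2j, a1j in zip(p["w2"], a1)]                # dC/db1
    return {
        "w1": [[d1j * xk for d1j in d1] for xk in xi],
        "b1": d1,
        "w2": [d2 * a1j for a1j in a1],
        "b2": d2,
    }

def sgd_epoch(x, y, p, alpha, rng):
    order = list(range(len(x)))
    rng.shuffle(order)                   # visit items in a fresh random order
    for i in order:
        g = gradients(x[i], y[i], p)     # all gradients use the old weights
        for row, g_row in zip(p["w1"], g["w1"]):
            for j in range(len(row)):
                row[j] -= alpha * g_row[j]
        for j in range(len(p["b1"])):
            p["b1"][j] -= alpha * g["b1"][j]
            p["w2"][j] -= alpha * g["w2"][j]
        p["b2"] -= alpha * g["b2"]

def total_cost(x, y, p):
    return 0.5 * sum((yi - forward(xi, p)[1]) ** 2 for xi, yi in zip(x, y))

rng = random.Random(0)
x = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.2], [0.9, 1.0, 0.8], [1.0, 0.9, 1.0]]
y = [0, 0, 1, 1]
p = init_params(m=3, h=2, rng=rng)

# Finite-difference check on one entry of w1, using one training item.
eps = 1e-4
analytic = gradients(x[2], y[2], p)["w1"][0][1]
base = total_cost([x[2]], [y[2]], p)
p["w1"][0][1] += eps
numeric = (total_cost([x[2]], [y[2]], p) - base) / eps
p["w1"][0][1] -= eps

cost_before = total_cost(x, y, p)
for _ in range(200):
    sgd_epoch(x, y, p, alpha=0.1, rng=rng)
cost_after = total_cost(x, y, p)
```

If `analytic` and `numeric` disagree by much more than \(\varepsilon\), the gradient code most likely has a bug. The same check can be repeated for entries of the other three parameter arrays.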




# Question 6

📗 [5 points] Enter the second layer weights and bias (28 + 1 numbers, rounded to 4 decimal places, in one line, comma separated). The bias should be the last number:
Hint 📗 For each line \(\hat{x}_{i}\) in the test set, compute the activation value by:
\(a'^{\left(1\right)}_{ij} = \dfrac{1}{1 + \exp\left(- \left(\left(\displaystyle\sum_{j'=1}^{m} \hat{x}_{ij'} w^{\left(1\right)}_{j'j}\right) + b^{\left(1\right)}_{j}\right)\right)}\) for j = 1, ..., h.
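📗 The output layouts requested in Questions 5 and 6 (first layer: one line per input unit with the biases on the last line; second layer: one line with the bias last) can be produced with a sketch like the one below. The helper names and the tiny 2 by 2 weights are made up for illustration.

```python
def format_layer1(w1, b1, places=4):
    """m lines of h weights each, followed by one line of h biases."""
    rows = w1 + [b1]
    return "\n".join(",".join(f"{v:.{places}f}" for v in row) for row in rows)

def format_layer2(w2, b2, places=4):
    """h weights followed by the bias, all on one line."""
    return ",".join(f"{v:.{places}f}" for v in w2 + [b2])

# Stand-in values: 2 input units and 2 hidden units.
w1 = [[0.5, -0.25], [1.0, 0.0]]
b1 = [0.125, 2.0]
w2 = [0.5, -1.0]
b2 = 0.25
print(format_layer1(w1, b1))
print(format_layer2(w2, b2))  # prints 0.5000,-1.0000,0.2500
```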





# Question 7

📗 [10 points] Enter the second layer activation values on the test set (200 numbers between 0 and 1, rounded to 2 decimal places, in one line, comma separated):
Hint 📗 For each line \(\hat{x}_{i}\) in the test set, compute the activation value by:
\(a'^{\left(2\right)}_{i} = \dfrac{1}{1 + \exp\left(-\left(\left(\displaystyle\sum_{j=1}^{h} a'^{\left(1\right)}_{ij} w^{\left(2\right)}_{j}\right) + b^{\left(2\right)}\right)\right)}\).




# Question 8

📗 [10 points] Enter the predicted values on the test set (200 integers, 0 or 1, prediction, in one line):
Hint 📗 If \(a_{i} < 0.5\), the label is 0, and if \(a_{i} \geq 0.5\), the label is 1.




# Question 9

📗 [1 points] Enter the feature vector of one test image that is labeled incorrectly by your network (784 numbers in one line, rounded to 2 decimal places, comma separated). You can look at a few images that your network is uncertain of (the second layer activation is the closest to 0.5): if you cannot find any, you can use one that you draw yourself too.
Hint 📗 *spoiler* The first 100 images are for the first digit and the next 100 images are for the second digit. Please do not use this information and include the test images in your training set.
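📗 One way to find the images your network is least certain about is to sort the test items by how close the second layer activation is to 0.5. The sketch below uses made-up activation values, and `most_uncertain` is a hypothetical helper name.

```python
def most_uncertain(a, k=3):
    """Indices of the k activations closest to 0.5, most uncertain first."""
    return sorted(range(len(a)), key=lambda i: abs(a[i] - 0.5))[:k]

a = [0.99, 0.45, 0.02, 0.61, 0.5]  # stand-in second layer activations
print(most_uncertain(a))           # prints [4, 1, 3]
```

You can then plot the images at those indices and check by eye whether the predicted label is wrong.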


Plot the image:


# Question 10

📗 [1 points] Please enter any comments and suggestions including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the question that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.
📗 Answer: .

# Grade



# Submission


📗 Please do not modify the content in the above text field: use the "Grade" button to update.
📗 Warning: grading may take around 10 to 20 seconds. Please be patient and do not click "Grade" multiple times.


📗 You could submit multiple times (but please do not submit too often): only the latest submission will be counted. 
📗 Please also save the text in the above text box to a file using the button, or copy and paste it into a file yourself. Please submit the resulting file with your code on Canvas Assignment P1.
📗 You could load your answers from the text (or txt file) in the text box below using the button. The first two lines should be "##p: 1" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.
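📗 If you print all your outputs from your program, the file layout described above can be assembled with a sketch like this one. The function name is made up, the ID and answers are placeholders, and how multi-line answers (such as the Question 5 weights) should be laid out is not specified here, so please check that the file loads correctly before submitting.

```python
def build_answer_file(student_id, answers, project=1):
    """answers maps a question number to its answer text."""
    lines = [f"##p: {project}", f"##id: {student_id}"]
    for q in sorted(answers):
        lines.append(f"##{q}: {answers[q]}")
    return "\n".join(lines) + "\n"

# Placeholder ID and answers for illustration only.
text = build_answer_file("abc", {2: "0.1,0.2", 1: "0.00,1.00"})
print(text, end="")
```

The returned string can then be written to a text file with the usual `open(..., "w")` call.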


📗 Saving and loading may take around 10 to 20 seconds. Please be patient and do not click "Load" multiple times.

# Solutions

📗 The sample solution in Java and Python will be posted around the deadline. You are allowed to copy and use parts of the solution without attribution. You are allowed to use code from other people and from the Internet, but you must give proper attribution at the beginning of your code. MOSS will be used to check for code plagiarism: blocks of copied code without attribution will result in a zero for the whole assignment.
📗 Sample solution from last year: 2020 P1. The homework is slightly different, please use with caution.
📗 Sample solution:
Java Part 1: File
Python Part 1: File
Java Part 2: File
Python Part 2: File
In part 2, the solution uses the ReLU activation function in the first layer, so you have to change the activation and the gradient so that the code solves your version of the problem.
In both parts, you have to change the hyper-parameters including the learning rate, the maximum number of iterations, and the stopping criterion based on your version of the training set and test set. You also have to figure out which variables to output: at the moment, the solution does not output the correct variables.
📗 You can get help on understanding the algorithm from any of the office hours. To get help with debugging code in Java, please come during the Monday to Friday 2:00 to 3:00 Zoom office hours or Saturday to Sunday 2:00 to 3:00 (I can stay for a few hours after 3:00 by appointment) in-person office hours. To get help with debugging code in Python, please come during the Tuesday 3:00 to 5:00 in-person office hours or the Thursday 3:00 to 5:00 Zoom office hours. You are encouraged to work with other students, but if you use their code, you must give attribution at the beginning of your code.





Last Updated: April 29, 2024 at 1:11 AM