Young Wu's Homepage

# Warning: this is a draft, please do not start until the homework is announced on Canvas

# P1 Programming Problem Instruction

📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click (or hit the "Enter" key)

📗 The official deadline is June 13. Late submissions within a week will be accepted without penalty, but please submit a regrade request form: Link.

📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You can either copy and paste (or load) your program outputs into the text boxes for individual questions, or print all your outputs to a single text file and load it using the button at the bottom of the page.

📗 Please do not refresh the page: your answers will not be saved.

📗 You should implement the algorithms using the mathematical formulas from the slides. You can use packages and libraries to preprocess and read the data and format the outputs. It is not recommended that you use machine learning packages or libraries, but you will not lose points for doing so.

📗 Please report any bugs on Piazza.

# Warning: please enter your ID before you start!

📗 (Introduction) In this project, you will build a logistic regression model and a neural network to classify hand-written digits. Your models should take pixel intensities of images as inputs and output which digits the images display.

📗 (Part 1) Download and read the training set images and labels from the CSV Files (easier to read), or get the same dataset in another format from other places. The original MNIST site has restricted access now, so please use the CSV files instead. You can also use a smaller subset of the training set to speed up training.

📗 (Part 1) Extract the training set data of the digits ? (label it 0) and ? (label it 1). Suppose there are \(n\) images in your training set: you should create an \(n \times 784\) feature matrix \(x\) and an \(n \times 1\) vector of labels \(y\). Please rescale so that the feature vectors contain only numbers between 0 and 1. You can do this by dividing all the numbers by 255. (The training images contain \(28 \times 28 = 784\) pixels, and each pixel corresponds to an input unit.)

📗 (Part 1) Train a logistic regression on the dataset and plot the weights in a 28 by 28 grid.

📗 (Part 1) Predict the new images in the following test set. The predictions should be one of 0 or 1.

Note: this field may take a few seconds to load. You can either use the button to download a text file, or copy and paste from the text box into Excel or a CSV file. Please do not change the content of the text box.

📗 (Part 2) Train a neural network with one hidden layer. The number of hidden units should be the square root of the number of input units (here, the number of input units is 784, so the number of hidden units should be 28). The activation function you should use is logistic in both layers.

📗 (Part 2) Predict the new images in the same test set. The predictions should be either 0 or 1.

# Question 1 (Part 1)

📗 [1 point] Enter the feature vector of any one training image (784 numbers, rounded to 2 decimal places, in one line, comma separated):

Hint

📗 When reading the CSV file, you should read only the lines corresponding to the digits you are classifying and create an n by 1 array \(y\) to store the first column (either 0 or 1), and create an n by m array \(x\) to store the remaining columns.

📗 Make sure that you rescale (by dividing by 255) so that the feature vectors contain only numbers between 0 and 1.
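
📗 A minimal Python sketch of this reading step, using only the standard library. It assumes the CSV has no header row and that each row is the label followed by the pixel values in 0 to 255. The function name `load_two_digits`, the digits 3 and 7, and the 4-pixel sample rows are made up for illustration (real rows have 784 pixels).

```python
import csv
import io

def load_two_digits(csv_text, digit0, digit1):
    """Keep only rows labeled digit0 or digit1 and return (x, y).

    Assumption: each row is label, pixel_1, ..., pixel_m with no header row.
    """
    x, y = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        label = int(row[0])
        if label == digit0:
            y.append(0)
        elif label == digit1:
            y.append(1)
        else:
            continue  # skip every other digit
        x.append([int(v) / 255 for v in row[1:]])  # rescale to [0, 1]
    return x, y

# Made-up sample: 4 pixels per image instead of 784.
sample = "3,0,255,51,0\n7,255,0,0,102\n5,1,2,3,4\n"
x, y = load_two_digits(sample, 3, 7)

# One feature vector, rounded to 2 decimal places, comma separated.
line = ",".join(f"{v:.2f}" for v in x[0])
print(line)
```

For a real file, the text could be read with `open(path).read()` first, where `path` is wherever you saved the CSV.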

Plot the image to make sure you entered the vector correctly:

# Question 2 (Part 1)

📗 [5 points] Enter the logistic regression weights and biases (784 + 1 numbers, rounded to 4 decimal places, in one line, comma separated), the bias term should be the last number:

Hint

📗 Create an m by 1 array \(w\) to store the weights and a number \(b\). Initialize them with random numbers between -1 and 1. (Initializing them with 0 is okay too, but it can make convergence slower.)

📗 Pick a learning rate \(\alpha\): the choice depends on the digits you are classifying, but you could start by trying \(\alpha = 0.1, 0.01, 0.001\) and look at how quickly the weights converge. You can also try a decreasing learning rate such as \(\alpha\left(t\right) = \dfrac{1}{\sqrt{t}}, \dfrac{0.1}{\sqrt{t}}\) in iteration \(t\).

📗 Update \(w\) and \(b\) according to the formula in the Lecture 2 slides:

\(a_{i} = \dfrac{1}{1 + \exp\left(-\left(\left(\displaystyle\sum_{j=1}^{m} w_{j} x_{ij}\right) + b\right)\right)}\) for i = 1, ..., n,
\(w_{j} = w_{j} - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right) x_{ij}\) for j = 1, ..., m,
\(b = b - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)\).
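
📗 The three formulas above can be sketched in plain Python as one batch update. This is an illustration under assumptions, not the official solution: the names `sigmoid` and `batch_step` are made up, and the data has 2 features instead of 784. Plain lists are used to stay close to the formulas; a vectorized version would be much faster on the full dataset.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow in exp for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def batch_step(x, y, w, b, alpha):
    """One batch gradient descent update; returns the new (w, b)."""
    n, m = len(x), len(w)
    # a_i for every training example, using the current w and b
    a = [sigmoid(sum(w[j] * x[i][j] for j in range(m)) + b) for i in range(n)]
    new_w = [w[j] - alpha * sum((a[i] - y[i]) * x[i][j] for i in range(n))
             for j in range(m)]
    new_b = b - alpha * sum(a[i] - y[i] for i in range(n))
    return new_w, new_b

# Made-up data: 2 examples with 2 features each.
x = [[0.0, 1.0], [1.0, 0.0]]
y = [0, 1]
w, b = batch_step(x, y, [0.0, 0.0], 0.0, alpha=0.1)
print(w, b)
```

Starting from zero weights, both activations are 0.5, so the first update moves \(w_{1}\) up and \(w_{2}\) down by the same amount and leaves \(b\) unchanged.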

📗 Note: if you want to use stochastic gradient descent, you should shuffle the dataset before going through \(i\) and update:

\(w_{j} = w_{j} - \alpha \left(a_{i} - y_{i}\right) x_{ij}\) for j = 1, ..., m,
\(b = b - \alpha \left(a_{i} - y_{i}\right)\).
Both Java and Python should have functions that can be used to shuffle an array.
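
📗 A sketch of one stochastic gradient descent epoch in Python, assuming `random.Random.shuffle` from the standard library for the shuffling. The function name `sgd_epoch`, the seed, the learning rate, and the tiny 2-feature dataset are illustrative choices only.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow in exp for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sgd_epoch(x, y, w, b, alpha, rng):
    """One pass over the data in shuffled order.

    Updates the list w in place and returns the new bias b.
    """
    order = list(range(len(x)))
    rng.shuffle(order)  # shuffle indices so that x and y stay aligned
    for i in order:
        a_i = sigmoid(sum(wj * xj for wj, xj in zip(w, x[i])) + b)
        err = a_i - y[i]
        for j in range(len(w)):
            w[j] -= alpha * err * x[i][j]
        b -= alpha * err
    return b

rng = random.Random(0)  # fixed seed so that runs are repeatable
x = [[0.0, 1.0], [1.0, 0.0]]
y = [0, 1]
w, b = [0.0, 0.0], 0.0
for _ in range(200):
    b = sgd_epoch(x, y, w, b, 0.5, rng)
print(w, b)
```

Shuffling a list of indices, rather than `x` itself, avoids having to shuffle `x` and `y` together.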

📗 Remember to compute the cost at each step and store the cost from the previous step:

\(C = -\displaystyle\sum_{i=1}^{n} \left(\left(y_{i}\right) \log \left(a_{i}\right) + \left(1 - y_{i}\right) \log \left(1 - a_{i}\right)\right)\).

📗 Note 1: You can use other cost functions, but if you do, remember to change the gradient descent formula: the above formulas for updating \(w_{j}\) and \(b\) only work for the cross-entropy cost.

📗 Note 2: Due to the problem that \(\left(0\right) \log \left(0\right)\) = NaN in many programming languages, the cost function should be written as \(C = \displaystyle\sum_{i} C_{i}\):

\(C_{i} = -\log\left(1 - a_{i}\right)\) if \(y_{i} = 0\),
and \(C_{i} = -\log\left(a_{i}\right)\) if \(y_{i} = 1\),
and \(C_{i}\) = something very large (like 10000) if \(y_{i} = 0, a_{i} \approx 1\) or \(y_{i} = 1, a_{i} \approx 0\).

📗 Note 3: To avoid computing \(\log\left(0\right)\), you can also bound \(a_{i}\) by \(0.01\) and \(0.99\), for example, \(a_{i} = \displaystyle\max\left\{0.01, \displaystyle\min\left\{0.99, a_{i}\right\}\right\}\).
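
📗 One way to combine Notes 2 and 3 in Python: pick the branch by \(y_{i}\) and clip \(a_{i}\) before taking the logarithm. The function name `cross_entropy` and the sample activations are made up; the bounds 0.01 and 0.99 are the ones suggested in Note 3.

```python
import math

def cross_entropy(a, y, lo=0.01, hi=0.99):
    """Cross-entropy cost with each activation clipped to [lo, hi]."""
    total = 0.0
    for a_i, y_i in zip(a, y):
        a_i = max(lo, min(hi, a_i))  # keeps log() away from 0
        if y_i == 1:
            total += -math.log(a_i)
        else:
            total += -math.log(1.0 - a_i)
    return total

# Made-up activations, including the exact 0 and 1 that would break log().
cost = cross_entropy([0.0, 1.0, 0.5], [0, 1, 1])
worst = cross_entropy([1.0], [0])  # confidently wrong, but still finite
print(cost, worst)
```

With clipping, a confidently wrong prediction costs at most \(-\log\left(0.01\right) \approx 4.6\) per example instead of producing an infinite or NaN value.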

📗 Repeat the previous two steps until the decrease in \(C\) is smaller than something like 0.0001, or \(C\) is smaller than a fixed number like 1, or the number of iterations (epochs) is too large (say 1000).
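
📗 The three stopping rules can be sketched as a small loop. Everything named here (`train`, `toy_step`, `toy_cost`) is hypothetical, and the toy quadratic cost only stands in for the logistic regression cost so that the example stays short. In your program, `step` would be one gradient descent update and `cost` would be \(C\).

```python
def train(step, cost, params, tol=1e-4, min_cost=1.0, max_epochs=1000):
    """Repeat step() until one of the three stopping rules fires.

    Returns (params, number of epochs run, last cost).
    """
    prev = cost(params)
    for epoch in range(1, max_epochs + 1):
        params = step(params, epoch)
        c = cost(params)
        # Rule 1: the decrease in cost is tiny. Rule 2: the cost is small enough.
        if prev - c < tol or c < min_cost:
            return params, epoch, c
        prev = c
    # Rule 3: too many epochs.
    return params, max_epochs, prev

def toy_cost(p):
    return (p - 3.0) ** 2

def toy_step(p, epoch):
    alpha = 0.1  # a decreasing rate such as 0.1 / epoch ** 0.5 would also fit here
    return p - alpha * 2.0 * (p - 3.0)

p, epochs, final_cost = train(toy_step, toy_cost, 0.0)
print(p, epochs, final_cost)
```

Note that in this sketch the first rule also stops the loop if the cost goes up, which can happen with a learning rate that is too large.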

# Question 3 (Part 1)

📗 [10 points] Enter the activation values on the test set (200 numbers between 0 and 1, rounded to 2 decimal places, in one line, comma separated). Please use the test set provided in Part 1 of the Instruction, not the test set from the MNIST website:

Hint

📗 Read the test set file: the first column is not \(y\), so just store everything in an array \(\hat{x}\). Remember to divide \(\hat{x}\) by 255.

📗 For each line \(\hat{x}_{i}\), compute the activation value by:

\(a_{i} = \dfrac{1}{1 + \exp\left(-\left(\left(\displaystyle\sum_{j=1}^{m} w_{j} \hat{x}_{ij}\right) + b\right)\right)}\).
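
📗 A short Python sketch of this step, including the rounding asked for in the question. The weights, bias, and the two 2-pixel test rows are made-up values, and the names `sigmoid` and `activations` are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow in exp for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def activations(x_hat, w, b):
    """Activation a_i for every row of the rescaled test set x_hat."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) for row in x_hat]

raw = [[0, 255], [255, 0]]                       # made-up pixel rows
x_hat = [[v / 255 for v in row] for row in raw]  # remember to divide by 255
w, b = [2.0, -2.0], 0.0                          # made-up trained parameters
a = activations(x_hat, w, b)
line = ",".join(f"{v:.2f}" for v in a)
print(line)
```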

# Question 4 (Part 1)

📗 [10 points] Enter the predicted values on the test set (200 integers, each 0 or 1, in one line):

Hint

📗 If \(a_{i} < 0.5\), the label is 0, and if \(a_{i} \geq 0.5\), the label is 1.
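
📗 The thresholding rule as a one-line Python function. The name `predict_labels` and the sample activations are made up, and joining the labels with commas is an assumption about the expected answer format.

```python
def predict_labels(activations, threshold=0.5):
    """Label 1 when the activation is at least the threshold, else 0."""
    return [1 if a >= threshold else 0 for a in activations]

labels = predict_labels([0.03, 0.5, 0.97, 0.49])
line = ",".join(str(v) for v in labels)  # assumed: comma separated on one line
print(line)
```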

# Question 5 (Part 2)

📗 [5 points] Enter the first layer weights and biases (784 + 1 lines, each line containing 28 numbers, rounded to 4 decimal places, comma separated). The biases should be on the last line: for the first 784 lines, line i element j represents the weight from input unit i to hidden unit j, and for the last line, element j represents the bias for the hidden unit j:

Hint

📗 Let h be the number of hidden units. Create an m by h array \(w^{\left(1\right)}\) to store the weights for layer 1 and an h by 1 array \(b^{\left(1\right)}\) to store the biases for layer 1. Create an h by 1 array \(w^{\left(2\right)}\) to store the weights for layer 2 and a number \(b^{\left(2\right)}\) to store the bias for layer 2. Initialize them with random numbers between -1 and 1. (Initializing them with 0 is okay too, but it can make convergence slower.)
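
📗 A sketch of this initialization in Python with nested lists, so that `w1[k][j]` is the weight from input unit k to hidden unit j. The function name `init_network` and the seed are illustrative choices.

```python
import random

def init_network(m, h, rng):
    """Random weights and biases in [-1, 1] with the shapes described above."""
    w1 = [[rng.uniform(-1, 1) for _ in range(h)] for _ in range(m)]  # m by h
    b1 = [rng.uniform(-1, 1) for _ in range(h)]                      # h biases
    w2 = [rng.uniform(-1, 1) for _ in range(h)]                      # h weights
    b2 = rng.uniform(-1, 1)                                          # one bias
    return w1, b1, w2, b2

w1, b1, w2, b2 = init_network(784, 28, random.Random(0))
print(len(w1), len(w1[0]), len(b1), len(w2))
```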

📗 Pick a learning rate \(\alpha\): see the Hints for Question 2.

📗 Batch gradient descent is too slow here, so use stochastic gradient descent instead: shuffle the dataset before going through \(i\), and update \(w\) and \(b\) according to the formulas in the Lecture 3 or 4 slides (please check to make sure they are correct!):

\(a^{\left(1\right)}_{ij} = \dfrac{1}{1 + \exp\left(- \left(\left(\displaystyle\sum_{j'=1}^{m} x_{ij'} w^{\left(1\right)}_{j'j}\right) + b^{\left(1\right)}_{j}\right)\right)}\) for j = 1, ..., h,
\(a^{\left(2\right)}_{i} = \dfrac{1}{1 + \exp\left(- \left(\left(\displaystyle\sum_{j=1}^{h} a^{\left(1\right)}_{ij} w^{\left(2\right)}_{j}\right) + b^{\left(2\right)}\right)\right)}\),
\(\dfrac{\partial C}{\partial w^{\left(1\right)}_{j'j}} = \left(a^{\left(2\right)}_{i} - y_{i}\right) a^{\left(2\right)}_{i} \left(1 - a^{\left(2\right)}_{i}\right) w_{j}^{\left(2\right)} a_{ij}^{\left(1\right)} \left(1 - a_{ij}^{\left(1\right)}\right) x_{ij'}\) for j' = 1, ..., m, j = 1, ..., h,
\(\dfrac{\partial C}{\partial b^{\left(1\right)}_{j}} = \left(a^{\left(2\right)}_{i} - y_{i}\right) a^{\left(2\right)}_{i} \left(1 - a^{\left(2\right)}_{i}\right) w_{j}^{\left(2\right)} a_{ij}^{\left(1\right)} \left(1 - a_{ij}^{\left(1\right)}\right)\) for j = 1, ..., h,
\(\dfrac{\partial C}{\partial w^{\left(2\right)}_{j}} = \left(a^{\left(2\right)}_{i} - y_{i}\right) a^{\left(2\right)}_{i} \left(1 - a^{\left(2\right)}_{i}\right) a_{ij}^{\left(1\right)}\) for j = 1, ..., h,
\(\dfrac{\partial C}{\partial b^{\left(2\right)}} = \left(a^{\left(2\right)}_{i} - y_{i}\right) a^{\left(2\right)}_{i} \left(1 - a^{\left(2\right)}_{i}\right)\),
\(w^{\left(1\right)}_{j' j} \leftarrow w^{\left(1\right)}_{j' j} - \alpha \dfrac{\partial C}{\partial w^{\left(1\right)}_{j' j}}\) for j' = 1, ..., m, j = 1, ..., h,
\(b^{\left(1\right)}_{j} \leftarrow b^{\left(1\right)}_{j} - \alpha \dfrac{\partial C}{\partial b^{\left(1\right)}_{j}}\) for j = 1, ..., h,
\(w^{\left(2\right)}_{j} \leftarrow w^{\left(2\right)}_{j} - \alpha \dfrac{\partial C}{\partial w^{\left(2\right)}_{j}}\) for j = 1, ..., h,
\(b^{\left(2\right)} \leftarrow b^{\left(2\right)} - \alpha \dfrac{\partial C}{\partial b^{\left(2\right)}}\).
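
📗 The update for a single training example \(i\) could be sketched in Python as follows. This is an illustration of the formulas above, not the official solution: the names `forward` and `sgd_step` and the tiny 2-input, 2-hidden-unit network are made up.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow in exp for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def forward(x_i, w1, b1, w2, b2):
    """Hidden activations a1 (list of length h) and output activation a2."""
    h = len(b1)
    a1 = [sigmoid(sum(x_i[k] * w1[k][j] for k in range(len(x_i))) + b1[j])
          for j in range(h)]
    a2 = sigmoid(sum(a1[j] * w2[j] for j in range(h)) + b2)
    return a1, a2

def sgd_step(x_i, y_i, w1, b1, w2, b2, alpha):
    """Update w1, b1, w2 in place for one example; returns the new b2."""
    a1, a2 = forward(x_i, w1, b1, w2, b2)
    d2 = (a2 - y_i) * a2 * (1 - a2)            # dC/db2
    for j in range(len(b1)):
        d1 = d2 * w2[j] * a1[j] * (1 - a1[j])  # dC/db1_j, uses the old w2_j
        for k in range(len(x_i)):
            w1[k][j] -= alpha * d1 * x_i[k]    # dC/dw1_kj = d1 * x_ik
        b1[j] -= alpha * d1
        w2[j] -= alpha * d2 * a1[j]            # dC/dw2_j = d2 * a1_j
    return b2 - alpha * d2

# Made-up example and parameters.
x_i, y_i = [1.0, 0.0], 1
w1 = [[0.1, -0.2], [0.3, 0.4]]
b1 = [0.0, 0.0]
w2 = [0.5, -0.5]
b2 = 0.0
before = forward(x_i, w1, b1, w2, b2)[1]
b2 = sgd_step(x_i, y_i, w1, b1, w2, b2, alpha=0.5)
after = forward(x_i, w1, b1, w2, b2)[1]
print(before, after)
```

The first layer gradient is computed before \(w^{\left(2\right)}_{j}\) is overwritten, so every gradient in the step uses the parameter values from the forward pass.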

📗 Note 1: It is very easy to make a mistake here. You can check if the gradient computation is correct by computing the numerical gradient using finite differences and comparing it with your gradient:

\(\dfrac{\partial C}{\partial v} \approx \dfrac{C\left(v + \varepsilon\right) - C\left(v\right)}{\varepsilon}, \varepsilon = 0.0001\).
Here, \(v\) is one entry of \(w^{\left(1\right)}\), \(b^{\left(1\right)}\), \(w^{\left(2\right)}\), or \(b^{\left(2\right)}\). For details, see: Wikipedia.
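
📗 A sketch of such a gradient check in Python for a single example, comparing the formulas for \(w^{\left(2\right)}\) and \(b^{\left(1\right)}\) against forward differences. The helper names (`forward`, `cost`, `numeric_grad`) and all numbers are made up; the cost used is the squared error for one example, \(\dfrac{1}{2} \left(y_{i} - a^{\left(2\right)}_{i}\right)^{2}\).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))  # fine for the small z used here

def forward(x_i, w1, b1, w2, b2):
    h = len(b1)
    a1 = [sigmoid(sum(x_i[k] * w1[k][j] for k in range(len(x_i))) + b1[j])
          for j in range(h)]
    a2 = sigmoid(sum(a1[j] * w2[j] for j in range(h)) + b2)
    return a1, a2

def cost(x_i, y_i, w1, b1, w2, b2):
    """Squared error cost for one example."""
    return 0.5 * (y_i - forward(x_i, w1, b1, w2, b2)[1]) ** 2

def numeric_grad(f, v, eps=1e-4):
    """Forward-difference estimate of df/dv[k] for each entry of the list v."""
    base = f(v)
    grads = []
    for k in range(len(v)):
        bumped = list(v)
        bumped[k] += eps
        grads.append((f(bumped) - base) / eps)
    return grads

# Made-up example and parameters (2 inputs, 2 hidden units).
x_i, y_i = [0.2, 0.9], 1
w1 = [[0.1, -0.2], [0.3, 0.4]]
b1 = [0.05, -0.05]
w2 = [0.5, -0.5]
b2 = 0.1

a1, a2 = forward(x_i, w1, b1, w2, b2)
d2 = (a2 - y_i) * a2 * (1 - a2)
analytic_w2 = [d2 * a1[j] for j in range(2)]
analytic_b1 = [d2 * w2[j] * a1[j] * (1 - a1[j]) for j in range(2)]

numeric_w2 = numeric_grad(lambda v: cost(x_i, y_i, w1, b1, v, b2), w2)
numeric_b1 = numeric_grad(lambda v: cost(x_i, y_i, w1, v, w2, b2), b1)

max_diff = max(abs(p - q) for p, q in
               zip(analytic_w2 + analytic_b1, numeric_w2 + numeric_b1))
print(max_diff)
```

A difference far larger than \(\varepsilon\) usually points to a mistake in the analytic gradient.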

📗 Note 2: Remember to compute the cost at each step and store the cost from the previous step:

\(C = \dfrac{1}{2} \displaystyle\sum_{i=1}^{n} \left(y_{i} - a^{\left(2\right)}_{i}\right)^{2}\).

📗 Note 3: See Hints for Question 2 for how to handle numerical problems when \(a_{i}, y_{i} \approx 0, 1\).
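
📗 For the answer layout of this question (one line per input unit, then the biases on the last line), a small formatting sketch in Python. The name `format_layer1` is hypothetical, and the 2 by 2 weights stand in for the real 784 by 28 array.

```python
def format_layer1(w1, b1):
    """One comma-separated line per input unit, then the bias line last."""
    rows = w1 + [b1]
    return "\n".join(",".join(f"{v:.4f}" for v in row) for row in rows)

text = format_layer1([[0.5, -1.0], [0.25, 0.125]], [0.0, 1.0])
print(text)
```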

# Question 6 (Part 2)

📗 [5 points] Enter the second layer weights (28 + 1 numbers, rounded to 4 decimal places, in one line, comma separated). The bias should be the last number:

Hint

📗 See hint for the previous question.

# Question 7 (Part 2)

📗 [10 points] Enter the output layer activation values on the test set (200 numbers between 0 and 1, rounded to 2 decimal places, in one line, comma separated). Please use the test set provided in Part 1 of the Instruction, not the test set from the MNIST website:

Hint

📗 For each line \(\hat{x}_{i}\) in the test set, compute the activation value by:

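\(a'^{\left(1\right)}_{ij} = \dfrac{1}{1 + \exp\left(- \left(\left(\displaystyle\sum_{j'=1}^{m} \hat{x}_{ij'} w^{\left(1\right)}_{j'j}\right) + b^{\left(1\right)}_{j}\right)\right)}\) for j = 1, ..., h,
\(a'^{\left(2\right)}_{i} = \dfrac{1}{1 + \exp\left(-\left(\left(\displaystyle\sum_{j=1}^{h} a'^{\left(1\right)}_{ij} w^{\left(2\right)}_{j}\right) + b^{\left(2\right)}\right)\right)}\).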
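
📗 A Python sketch of this forward pass on one test row. It computes the hidden layer activations from \(\hat{x}_{i}\) with the trained first layer, then the output activation. The function name `network_activation` and all parameter values are made up.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow in exp for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def network_activation(x_hat_i, w1, b1, w2, b2):
    """Output layer activation for one rescaled test row."""
    h = len(b1)
    # Hidden layer activations computed from the test row.
    a1 = [sigmoid(sum(x_hat_i[k] * w1[k][j] for k in range(len(x_hat_i))) + b1[j])
          for j in range(h)]
    return sigmoid(sum(a1[j] * w2[j] for j in range(h)) + b2)

# Made-up parameters: with zero first layer weights, both hidden units are 0.5.
x_hat = [[0.0, 1.0]]
w1 = [[0.0, 0.0], [0.0, 0.0]]
b1 = [0.0, 0.0]
w2 = [2.0, 2.0]
b2 = 0.0
a = [network_activation(row, w1, b1, w2, b2) for row in x_hat]
line = ",".join(f"{v:.2f}" for v in a)
print(line)
```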

# Question 8 (Part 2)

📗 [10 points] Enter the predicted values on the test set (200 integers, each 0 or 1, in one line):

Hint

📗 If \(a'^{\left(2\right)}_{i} < 0.5\), the label is 0, and if \(a'^{\left(2\right)}_{i} \geq 0.5\), the label is 1.

# Question 9 (Part 2)

📗 [1 point] Enter the feature vector of one test image that is labeled incorrectly by your network (784 numbers in one line, rounded to 2 decimal places, comma separated). You can look at a few images that your network is uncertain about (those whose second layer activations are closest to 0.5). If you cannot find any, you can use one that you draw yourself.
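
📗 One way to list the test images the network is least certain about, sketched in Python. The function name `most_uncertain` and the five sample activations are made up.

```python
def most_uncertain(activations, k=5):
    """Indices of the k activations closest to 0.5, most uncertain first."""
    order = sorted(range(len(activations)),
                   key=lambda i: abs(activations[i] - 0.5))
    return order[:k]

idx = most_uncertain([0.02, 0.47, 0.99, 0.61, 0.50], k=3)
print(idx)
```

The returned indices refer to rows of the test set, so the corresponding images can then be plotted and inspected by eye.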

Hint

📗 *Spoiler* The first 100 images are for the first digit and the next 100 images are for the second digit. Please do not use this information, and do not include the test images in your training set.

Plot the image:

# Question 10

📗 [1 point] Please confirm that you are going to submit the code on Canvas under Assignment P1, and make sure you give attribution for all blocks of code you did not write yourself (see bottom of the page for details and examples).

I will submit the code on Canvas.

# Question 11

📗 [1 point] Please enter any comments and suggestions, including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the questions that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.

📗 Answer: .

# Grade

* * * * *

* * * * *

# Submission

📗 Please do not modify the content in the above text field: use the "Grade" button to update.

📗 Warning: grading may take around 10 to 20 seconds. Please be patient and do not click "Grade" multiple times.

Check the box to confirm submission.

📗 You could submit multiple times (but please do not submit too often): only the latest submission will be counted.

📗 Please also save the text in the above text box to a file, either using the button or by copying and pasting it into a file yourself. You can also include the resulting file with your code on Canvas Assignment P1.

📗 You can load your answers from the text (or a txt file) in the text box below using the button. The first two lines should be "##p: 1" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.
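
📗 A sketch of building that text in Python from a dictionary of answers. The function name `build_answer_file` and the sample answers are made up. How a multi-line answer (such as the first layer weights) should be written in this file is not specified here, so this sketch only handles one-line answers.

```python
def build_answer_file(student_id, answers, project=1):
    """Text in the '##p', '##id', '##1', '##2', ... layout described above."""
    lines = [f"##p: {project}", f"##id: {student_id}"]
    for q in sorted(answers):
        lines.append(f"##{q}: {answers[q]}")
    return "\n".join(lines)

# Made-up ID and answers, keyed by question number.
text = build_answer_file("your_id", {1: "0.00,1.00", 4: "0,1,1,0"})
print(text)
```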

📗 Saving and loading may take around 10 to 20 seconds. Please be patient and do not click "Load" multiple times.

# Solutions

📗 The sample solutions in Java and Python will be posted around the deadline. You are allowed to copy and use parts of the solution with attribution. You are allowed to use code from other people (with their permission) and from the Internet, but you must give attribution at the beginning of your code. MOSS will be used to check for code plagiarism: blocks of copied code without attribution will result in a zero for the whole assignment. For example, you can put the following comments at the beginning of your code:

% Code attribution: (TA's name)'s P1 example solution.
% Code attribution: (student name)'s P1 solution.
% Code attribution: (student name)'s answer on Piazza: (link to Piazza post).
% Code attribution: (person or account name)'s answer on Stack Overflow: (link to page).

📗 Sample solution from last year: 2020 P1. The homework is slightly different, so please use it with caution.

📗 Sample solution for 2022:

Java Part 1: File
Python Part 1: File
Java Part 2: File
Python Part 2: File

📗 You can get help on understanding the algorithm from any of the office hours; to get help with Python, please go to the TA's office hours; to get help with Java, please go to the instructor's office hours. For times and locations see: Home Page. You are encouraged to work with other students, but if you use their code, you must give attribution at the beginning of your code.

Last Updated: June 27, 2026 at 9:06 PM