

# A2 Assignment Instruction

📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click (or hit the "Enter" key)
📗 You can also load your answers from your saved file.
📗 If the questions are not generated correctly, try refreshing the page using the button at the top left corner.
📗 The official deadline is June 17; late submissions within one week will be accepted without penalty.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could either copy and paste or load your program outputs into the text boxes for individual questions or print all your outputs to a single text file and load it using the button at the bottom of the page.
📗 Please do not refresh the page: your answers will not be saved.
📗 You should implement the algorithms using the mathematical formulas from the slides. You can use packages and libraries to preprocess and read the data and format the outputs. It is not recommended that you use machine learning packages or libraries, but you will not lose points for doing so.
📗 Please report any bugs on Piazza.

# Warning: please enter your ID before you start!



📗 (Introduction) In this programming homework, you will cluster countries into groups based on their economic performance, for example, inflation rate. You will estimate parametric models of this economic variable and cluster the countries using these parameters as features.

📗 (Part 1) Download the "Inflation, consumer prices" dataset (CSV or some other file format you prefer) from "the World Bank" website: Link. It is okay if you use another similar panel dataset (data points tracking the same variable over time for multiple countries), another data set on "the World Bank" website or from another website.
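As a minimal sketch of reading such a file, the function below assumes a layout with a few metadata lines followed by a header row whose first cell is "Country Name" and whose year columns are named by digits. That layout is an assumption about the download, not something guaranteed by these instructions, and `read_panel` is a hypothetical helper name. Missing years are simply dropped, which shortens the series; you may prefer a different treatment.

```python
import csv
import io


def read_panel(text):
    """Parse CSV text into {country name: [values in year order]}.

    Assumption: the header row starts with "Country Name" and year columns
    have all-digit names. Empty cells (missing years) are skipped.
    """
    rows = list(csv.reader(io.StringIO(text)))
    # Locate the header row; any metadata lines before it are ignored.
    header_at = next(i for i, r in enumerate(rows) if r and r[0] == "Country Name")
    header = rows[header_at]
    year_cols = [j for j, name in enumerate(header) if name.strip().isdigit()]
    panel = {}
    for r in rows[header_at + 1:]:
        values = [float(r[j]) for j in year_cols if j < len(r) and r[j].strip() != ""]
        if values:  # drop countries with no data at all
            panel[r[0]] = values
    return panel
```

To use it on a downloaded file, read the file contents into a string first (for example with `open(path, encoding="utf-8-sig").read()`) and pass that string in.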

📗 (Part 1) Estimate a parametric model that describes the trend. There should be at least 5 parameters; you can add more of your own, but preferably fewer than 10 in total. Remember to rescale the parameters so that they have the same range, for example, \(\left[0, 1\right]\). The list of parameters you should include (\(x\) is the time series data with length \(T\)):
(1) Mean: \(\hat{\mu} = \dfrac{1}{T} \left(x_{1} + x_{2} + ... + x_{T}\right)\).
(2) Standard deviation: \(\hat{\sigma} = \sqrt{\dfrac{1}{T - 1} \left(\left(x_{1} - \hat{\mu}\right)^{2} + \left(x_{2} - \hat{\mu}\right)^{2} + ... + \left(x_{T} - \hat{\mu}\right)^{2}\right)}\)
(3) Median: the one item (or the average of the two items) in the middle of the sorted list of \(x_{1}, x_{2}, ..., x_{T}\).
(4) Linear trend coefficient: \(\hat{\beta} = \dfrac{\left(x_{1} - \hat{\mu}\right) \left(1 - \hat{t}\right) + \left(x_{2} - \hat{\mu}\right) \left(2 - \hat{t}\right) + ... + \left(x_{T} - \hat{\mu}\right) \left(T - \hat{t}\right)}{\left(1 - \hat{t}\right)^{2} + \left(2 - \hat{t}\right)^{2} + ... + \left(T - \hat{t}\right)^{2} }\) where \(\hat{t} = \dfrac{1}{T} \left(1 + 2 + ... + T\right) = \dfrac{1}{2} \left(T + 1\right)\) (please do not use gradient descent to compute this coefficient).
(5) Auto-correlation of the data: \(\hat{\rho} = \dfrac{\left(x_{2} - \hat{\mu}\right)\left(x_{1} - \hat{\mu}\right) + \left(x_{3} - \hat{\mu}\right)\left(x_{2} - \hat{\mu}\right) + ... + \left(x_{T} - \hat{\mu}\right)\left(x_{T-1} - \hat{\mu}\right)}{\left(x_{1} - \hat{\mu}\right)^{2} + \left(x_{2} - \hat{\mu}\right)^{2} + ... + \left(x_{T} - \hat{\mu}\right)^{2}}\).
You can include other variables such as counting the number of years to double, quadruple, and octuple, etc. You can also use estimation of trends using curve-fitting algorithms, for example, linear regression coefficients for squared time, cubed time, etc. You can also use other auto-correlation measures, for example, ones with different time-lags.
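The five required formulas can be sketched in plain Python as follows. This is one possible transcription, not the reference solution; `describe_series` is a hypothetical name, and the sketch assumes the series has at least two points and is not constant (otherwise the denominators are zero).

```python
import math


def describe_series(x):
    """Return (mean, std, median, trend, autocorr) for one time series x.

    Assumes len(x) >= 2 and that x is not constant.
    """
    T = len(x)
    mu = sum(x) / T
    dev = [v - mu for v in x]
    ss = sum(d * d for d in dev)          # sum of squared deviations
    sigma = math.sqrt(ss / (T - 1))       # sample standard deviation
    s = sorted(x)
    mid = T // 2
    median = s[mid] if T % 2 == 1 else (s[mid - 1] + s[mid]) / 2
    t_bar = (T + 1) / 2                   # mean of 1, 2, ..., T
    t_dev = [t - t_bar for t in range(1, T + 1)]
    beta = sum(d * td for d, td in zip(dev, t_dev)) / sum(td * td for td in t_dev)
    # Lag-1 products (x_t - mu)(x_{t-1} - mu) for t = 2, ..., T.
    rho = sum(dev[t] * dev[t - 1] for t in range(1, T)) / ss
    return mu, sigma, median, beta, rho
```

For the series 1, 2, 3, 4, 5 this gives mean 3, standard deviation \(\sqrt{2.5}\), median 3, trend coefficient 1, and auto-correlation 0.4, which can be checked by hand against the formulas.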

📗 (Part 2) Use hierarchical clustering with single and complete linkages to cluster the countries into \(k\) = ? clusters based on their parameter values. Use Euclidean distance. The clustering is graded by checking the similarity between your clustering and mine using the adjusted Rand index; for details see: Wikipedia.
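If you want to compare two of your own clusterings before submitting, the adjusted Rand index can be sketched from its standard pair-counting formula. The grader's exact implementation is not shown here, so treat this only as a self-check; `adjusted_rand_index` is a hypothetical name, the sketch assumes at least two items, and the value returned in the degenerate case is an assumption.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    # Contingency counts: how many items carry each (label_a, label_b) pair.
    pairs = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case (e.g. everything in one cluster); returning 1.0
        # here is an assumption, since the formula is 0/0.
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

The index does not depend on how clusters are numbered: the labelings `[0, 1, 0, 1, 0]` and `[1, 0, 1, 0, 1]` describe the same partition and score 1.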

📗 (Part 2) Use k-means clustering to cluster the countries into the same number of clusters based on their parameter values. Compute the final cluster centers and the total distortion. Use Euclidean distance. The clustering is graded by checking whether the k-means algorithm converged, so it is okay if you stop at a local minimum (but try not to stop at an obviously bad one).

# Question 1 (Part 1)

📗 [1 point] Enter the time series for United States and another country of your choice (two lines, comma separated numbers, no rounding).
Hint Copy two rows from the dataset and reformat them correctly.




# Question 2 (Part 1)

📗 [2 points] Enter the parameter estimates for (1) to (5) (two lines, five numbers rounded to 4 decimal places per line).
Hint Use the formula in the instructions to compute the mean, standard deviation, median, slope, and auto-correlation. If you are unable to replicate the values for this question, you can use different ones for the next few questions.




# Question 3 (Part 1)

📗 [1 point] Enter the list of parameters other than (1) to (5) you estimated and a brief description of each. If you did not use other parameters, enter "None".


# Question 4 (Part 1)

📗 [5 points] Input the parameter estimates as a matrix, one row for each country (more than 50 lines, each line contains at least 5 numbers, comma separated, rounded to 4 to 8 decimal places). Call the number of rows \(n\) and the number of columns \(m\) for later questions. Make sure you rescale the parameters so that they have the same range \(\left[0, 1\right]\) (or you can use something else if most of the numbers in the same column are 0 or 1), and please do not include an index column or a column with country names etc.
Hint
📗 Compute \(\hat{\mu}, \hat{\sigma}, \hat{m}, \hat{\beta}, \hat{\rho}\) and your other parameters (a total of \(m\) parameters) for each line. You can omit the countries with too few data points. For \(n\) countries, you should have \(n\) lines with \(m\) numbers each.
📗 Let \(p_{i j}\) be the number in row \(i\) column \(j\), then the rescaled entry in row \(i\) column \(j\) would be \(x_{i j} = \dfrac{p_{i j} - \displaystyle\min_{i \in \left\{1, 2, ..., n\right\}} p_{i j}}{\displaystyle\max_{i \in \left\{1, 2, ..., n\right\}} p_{i j} - \displaystyle\min_{i \in \left\{1, 2, ..., n\right\}} p_{i j}}\). You can rescale your parameters non-linearly too.
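The linear rescaling above can be sketched as a column-wise min-max transform. `rescale_columns` is a hypothetical name, and mapping a constant column to 0 is an assumption, since the formula is undefined when the maximum equals the minimum.

```python
def rescale_columns(rows):
    """Min-max rescale every column of a list-of-lists matrix to [0, 1]."""
    n_cols = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_cols)]
    highs = [max(r[j] for r in rows) for j in range(n_cols)]
    scaled = []
    for r in rows:
        scaled.append([
            # Assumption: a constant column (max == min) is mapped to 0.0.
            (r[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(n_cols)
        ])
    return scaled
```

For example, the matrix with rows (1, 10), (3, 20), (2, 30) becomes (0, 0), (1, 0.5), (0.5, 1).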




# Question 5 (Part 2)

📗 [10 points] Input the clusters from single linkage hierarchical clustering (one line containing \(n\) integers from \(0\) to \(k-1\), comma separated, where \(k\) is the one given in the Instructions based on your ID). For example, if your clusters are {1, 3, 5} in cluster 1 and {2, 4} in cluster 2, you should enter "0, 1, 0, 1, 0" or "1, 0, 1, 0, 1".
Hint
📗 Start with \(n\) clusters.
📗 Compute the distances between each pair of clusters, \(C_{i}\), \(C_{j}\) by,
\(d\left(C_{i}, C_{j}\right) = \displaystyle\min\left\{ \left\|x_{i} - x_{j}\right\| : x_{i} \in C_{i}, x_{j} \in C_{j}\right\}\).
📗 Merge the pair with the smallest distance between them.
📗 Continue until there are only \(k\) clusters.
📗 In case your Rand index is low: try breaking ties by giving preference to the cluster containing the item with the smallest index.
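The hint steps above can be sketched directly, recomputing the closest-pair distance between clusters at every merge. This is a deliberately simple version, not the reference solution; `single_linkage` and `euclidean` are hypothetical names. Ties keep the first pair found, which in this layout favours clusters containing smaller item indices; that is one reading of the tie-breaking hint, not a guaranteed match with the grader.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def single_linkage(points, k):
    """Return one label in 0..k-1 per point using single linkage."""
    # Start with n singleton clusters, ordered by item index.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None  # (distance, position a, position b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance of the closest pair of members.
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:  # strict <: first pair wins ties
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

On the one-dimensional points 0, 2, 4, 7 with \(k = 2\), the chain 0, 2, 4 is merged first and 7 stays alone, giving labels 0, 0, 0, 1.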




# Question 6 (Part 2)

📗 [10 points] Input the clusters from complete linkage hierarchical clustering (one line containing \(n\) integers from \(0\) to \(k-1\), comma separated, where \(k\) is the one given in the Instructions based on your ID). For example, if your clusters are {1, 3, 5} in cluster 1 and {2, 4} in cluster 2, you should enter "0, 1, 0, 1, 0" or "1, 0, 1, 0, 1".
Hint
📗 Start with \(n\) clusters.
📗 Compute the distances between each pair of clusters, \(C_{i}\), \(C_{j}\) by,
\(d\left(C_{i}, C_{j}\right) = \displaystyle\max\left\{ \left\|x_{i} - x_{j}\right\| : x_{i} \in C_{i}, x_{j} \in C_{j}\right\}\).
📗 Merge the pair with the smallest distance between them.
📗 Continue until there are only \(k\) clusters.
📗 In case your Rand index is low: try breaking ties by giving preference to the cluster containing the item with the smallest index.
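For complete linkage, one convenient sketch keeps a cluster-to-cluster distance table and updates it after each merge: the farthest-pair distance from a merged cluster to any other cluster is the larger of the two old distances. `complete_linkage` is a hypothetical name, the sketch relies on `math.dist` (Python 3.8 or later), and ties keep the first pair found when scanning clusters in order of their smallest item index, which is only one reading of the tie-breaking hint.

```python
import math


def complete_linkage(points, k):
    """Return one label in 0..k-1 per point using complete linkage."""
    n = len(points)
    # dist[a][b] is the complete-linkage distance between active clusters a and b.
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Active clusters, keyed by the smallest item index they contain.
    members = {i: [i] for i in range(n)}
    while len(members) > k:
        active = sorted(members)
        best = None  # (distance, key a, key b)
        for pos, a in enumerate(active):
            for b in active[pos + 1:]:
                if best is None or dist[a][b] < best[0]:  # first pair wins ties
                    best = (dist[a][b], a, b)
        _, a, b = best
        # Farthest-pair distance to the merged cluster = max of the old two.
        for c in members:
            if c != a and c != b:
                dist[a][c] = dist[c][a] = max(dist[a][c], dist[b][c])
        members[a].extend(members.pop(b))
    labels = [0] * n
    for label, a in enumerate(sorted(members)):
        for i in members[a]:
            labels[i] = label
    return labels
```

On the one-dimensional points 0, 2, 4, 7 with \(k = 2\), points 0 and 2 merge first, then 4 and 7, giving labels 0, 0, 1, 1. Single linkage on the same points would instead chain 0, 2, 4 together.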




# Question 7 (Part 2)

📗 [10 points] Input the clusters from k-means clustering (one line containing \(n\) integers from \(0\) to \(k-1\), comma separated, where \(k\) is the one given in the Instructions based on your ID). For example, if your clusters are {1, 3, 5} in cluster 1 and {2, 4} in cluster 2, you should enter "0, 1, 0, 1, 0" or "1, 0, 1, 0, 1".
Hint
📗 Start with \(k\) random points in the dataset as the cluster centers \(c_{1}, c_{2}, ..., c_{k}\).
📗 Assign each point \(x_{i}\) to a cluster by finding the index \(j\) such that \(c_{j}\) is the closest (among the \(k\) centers) to \(x_{i}\).
📗 Recompute the cluster centers,
\(c_{j} = \dfrac{1}{\left| C_{j} \right|} \displaystyle\sum_{x_{i} \in C_{j}} x_{i}\), where \(C_{j}\) is the set of points in cluster \(j\).
📗 Repeat until the cluster centers do not change.
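A minimal sketch of this loop follows. The name `k_means` and the `seed` and `max_iter` parameters are illustrative additions, not part of the assignment. The loop stops when the assignments stop changing, which is equivalent to the centers no longer changing; keeping the old center when a cluster becomes empty is an assumption, and duplicate data points chosen as initial centers are not handled specially.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=1000):
    """Return (labels, centers) after running k-means from random data points."""
    rng = random.Random(seed)
    # Initial centers: k distinct positions sampled from the dataset.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the closest center for every point.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments unchanged, so the centers would not move
        labels = new_labels
        # Update step: each center becomes the mean of its cluster.
        for j in range(k):
            group = [p for p, lab in zip(points, labels) if lab == j]
            if group:  # assumption: an empty cluster keeps its old center
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return labels, centers
```

Because the result depends on the random start, it can be worth running the function with several seeds and keeping the run with the smallest total distortion.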




# Question 8 (Part 2)

📗 [5 points] Enter the cluster centers from k-means clustering (\(k\) lines, where \(k\) is the one given in the Instructions based on your ID, each line containing \(m\) numbers, comma separated, rounded to 4 decimal places).
Hint See the hints for the previous question.




# Question 9 (Part 2)

📗 [5 points] Enter the total distortion (use sum of squared distances) of the clustering from the previous two questions.
Hint
📗 Total distortion is the sum of the squares of distances from the points to the cluster center,
\(D = \displaystyle\sum_{i=1}^{n} \left\|x_{i} - c_{k_{i}}\right\|^{2}\), where \(x_{i}\) belongs to cluster \(k_{i}\) and \(c_{k_{i}}\) is the center of that cluster.
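The sum can be sketched in a few lines, assuming the points, their cluster labels, and the cluster centers are already available as Python lists; `total_distortion` is a hypothetical name.

```python
def total_distortion(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its own center."""
    return sum(
        sum((u - v) ** 2 for u, v in zip(p, centers[lab]))
        for p, lab in zip(points, labels)
    )
```

For the points (0, 0), (0, 1), (10, 10), (10, 11) with labels 0, 0, 1, 1 and centers (0, 0.5) and (10, 10.5), every point is 0.5 away from its center, so the total distortion is \(4 \times 0.25 = 1\).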


# Question 10

📗 [1 point] Please confirm that you are going to submit the code on Canvas under Assignment A2, and make sure you give attribution for all blocks of code you did not write yourself (see bottom of the page for details and examples).
I will submit the code on Canvas.

# Question 11

📗 [1 point] Please enter any comments and suggestions including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the question that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.

# Grade



# Submission


📗 Please do not modify the content in the above text field: use the "Grade" button to update.
📗 Warning: grading may take around 10 to 20 seconds. Please be patient and do not click "Grade" multiple times.


📗 You could submit multiple times (but please do not submit too often): only the latest submission will be counted. 
📗 Please also save the text in the above text box to a file using the button, or copy and paste it into a file yourself. You can also include the resulting file with your code on Canvas Assignment A2.
📗 You could load your answers from the text (or txt file) in the text box below using the button. The first two lines should be "##a: 2" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.
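A small sketch of producing text in this format from a program is shown below. `format_answers` is a hypothetical helper, and writing multi-line answers as-is after the "##number: " prefix is an assumption about how the loader treats them, so check the loaded answers on the page before submitting.

```python
def format_answers(student_id, answers):
    """Build the answer text from {question number: answer string}."""
    lines = ["##a: 2", "##id: " + student_id]
    for number in sorted(answers):
        # Assumption: a multi-line answer is written unchanged after the prefix.
        lines.append("##" + str(number) + ": " + answers[number])
    return "\n".join(lines)
```

The returned string can then be written to a text file with `open(path, "w").write(...)` and loaded on this page.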



📗 Saving and loading may take around 10 to 20 seconds. Please be patient and do not click "Load" multiple times.

# Solutions

📗 The sample solution in Java and Python will be posted on Piazza around the deadline. You are allowed to copy and use parts of the solution with attribution. You are allowed to use code from other people (with their permission) and from the Internet, but you must give attribution at the beginning of your code. You are allowed to use large language models such as GPT4 to write parts of the code for you, but you have to include the prompts you used in the code submission. For example, you can put the following comments at the beginning of your code:
% Code attribution: (TA's name)'s A2 example solution.
% Code attribution: (student name)'s A2 solution.
% Code attribution: (student name)'s answer on Piazza: (link to Piazza post).
% Code attribution: (person or account name)'s answer on Stack Overflow: (link to page).
% Code attribution: (large language model name e.g. GPT4): (include the prompts you used).
📗 You can get help on understanding the algorithm from any of the office hours; to get help with debugging, please go to the TA's office hours. For times and locations see the Home page. You are encouraged to work with other students, but if you use their code, you must give attribution at the beginning of your code.





Last Updated: November 18, 2024 at 11:43 PM