📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click (or hit enter key) 1,2,3,4,5,6,7,8,9,10p2
📗 The official deadline is July 11, but you can submit or resubmit without penalty until August 15.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could either copy and paste or load your program outputs into the text boxes for individual questions or print all your outputs to a single text file and load it using the button at the bottom of the page.
📗 Please do not refresh the page: your answers will not be saved.
📗 You should implement the algorithms using the mathematical formulas from the slides. You can use packages and libraries to preprocess and read the data and format the outputs. It is not recommended that you use machine learning packages or libraries, but you will not lose points for doing so.
📗 (Introduction) In this programming homework, you will build decision stumps and a decision tree to diagnose whether a patient has some disease based on their symptoms and medical test results. Unfortunately, we do not have a nice dataset on COVID-19, so we will use the Wisconsin Breast Cancer dataset. Your models will read in integer-valued patient data and output a diagnosis of whether the patient has breast cancer.
📗 (Part 1) Go to the website: Dataset, click on "Data Folder" and download "breast-cancer-wisconsin.data". Read the dataset description to figure out which variables are features and which variable is the label. There are lines containing "?", you could either remove them or replace them by a constant when parsing them into the feature matrix.
The list of variables is copied below:
1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant)
📗 (Part 1) Train a binary decision stump (decision tree with depth 1) using the following feature: ? (indexed according to the above list). Report the counts and the information gain. Since the features are integer-valued, you could either try all binary splits and find the one with the maximum information gain, or you could use the real-valued decision tree learning algorithm discussed in the lecture.
📗 (Part 2) Train a binary decision tree using the following features: ? (indexed according to the same list). Report the tree using the following format:
📗 Note: make sure you only use "x? <= integer" as the condition and you only return "2" or "4". Spaces and indentations do not matter. You can use any tie breaking rule you like for comparing information gain and finding the majority label.
📗 (Part 2) Classify the following patients using your tree. This is the test set.
You can either use the button to download a text file, or copy and paste from the text box into Excel or a CSV file. Please do not change the content of the text box.
📗 (Part 2) Prune the tree so that the maximum depth is ?. The root is at depth 0. You could do this with or without a validation set.
📗 [1 points] Enter the total number of positive and negative instances in the training set (two integers, comma-separated, in the order, benign, malignant).
Hint
📗 Count the number of 2's and 4' in the training set.
📗 [2 points] Enter the initial entropy at the root before the split (one number, rounded to 4 decimal places).
Hint
📗 See Lecture 6 slides for the formula for entropy. Say the integer you entered in the previous question is \(n_{2}, n_{4}\) and let \(n = n_{2} + n_{4}\), then the entropy is,
📗 [1 points] For the decision stump (Part 1), enter the number of positive and negative instances in the training set above and below the threshold (four integers, comma-separated, in the order: below-benign, above-benign, below-malignant, above-malignant).
Hint
📗 Suppose you split your variable \(x_{i}\) at \(t \in \left\{1, 2, ..., 9\right\}\), then the four numbers you enter should be:
(1) \(n_{2-}\) the number of 2's with \(x_{i} \leq t\).
(2) \(n_{2+}\) the number of 2's with \(x_{i} > t\).
(3) \(n_{4-}\) the number of 4's with \(x_{i} \leq t\).
(4) \(n_{4+}\) the number of 4's with \(x_{i} > t\).
📗 For this question, you do not have to find the optimal split, but you will need to write the code for Part 2 anyways. One simple brute force way to do it is to compute the information gain for each of \(t = 1, 2, ..., 9\) and find the \(t\) corresponding to the largest information gain. A more efficient way is to list the unique values of \(x_{i}\) and only compute the information gain for those values of \(t\).
📗 [5 points] Input the binary decision tree in the format described previously.
Hint
📗 You should not split according to the order in the list of features, you still have to find the feature (in the list) corresponding to the max information gain at each split.
📗 You can split the same feature multiple times.
📗 Basically, you should solve Part 1 for each \(x_{i}\) in your list, and find the \(i\) corresponding to the largest information gain.
📗 Given the optimal feature-split pair \(i, t\), split the dataset into two subsets, one containing all instances with \(x_{i} \leq t\) and the other containing all items with \(x_{i} > t\), and repeat the same process for the two subsets.
📗 Stop splitting when all instances in the subset has the same label.
📗 Stop splitting and use the majority label in the current subset if maximum information gain is 0.
Now you can use your tree to classify the following patient:
Feature vector (10 numbers, comma separated):
OR (make sure you leave the above text field blank):
1. Sample code number:
2. Clump Thickness:
3. Uniformity of Cell Size:
4. Uniformity of Cell Shape:
5. Marginal Adhesion:
6. Single Epithelial Cell Size:
7. Bare Nuclei:
8. Bland Chromatin:
9. Normal Nucleoli:
10. Mitoses:
Label: ?.
Corresponding feature vector: .
📗 [15 points] Input the class labels on the test set (200 integers, either 2 or 4, comma separated, in one line).
Hint
📗 Start from the root, if \(x_{i} \leq t\), go to the left subtree, otherwise, go the right subtree. Stop if the subtree is a leaf and return the label.
📗 [5 points] Input the pruned binary decision tree in the format described previously.
Hint
📗 It is possible that you can get high accuracy after just cutting the tree at the desired depth, but you should still try some pruning.
📗 In practice, you should not use the test set in training and validation, but for the purpose of this homework, you can prune the tree using the test set: *spoiler* the first 100 instances has label 2 and the last 100 instances has label 4.
📗 To prune the tree using the test set, you evaluate the accuracy of each subtree: if the accuracy is lower than classifying all instances as 2 (or classifying all instances as 4), you should replace this subtree by a leaf and use the majority label; if the accuracy is higher, continue pruning the subtrees.
Now you can use your pruned tree to classify the following patient:
Feature vector (10 numbers, comma separated):
OR (make sure you leave the above text field blank):
1. Sample code number:
2. Clump Thickness:
3. Uniformity of Cell Size:
4. Uniformity of Cell Shape:
5. Marginal Adhesion:
6. Single Epithelial Cell Size:
7. Bare Nuclei:
8. Bland Chromatin:
9. Normal Nucleoli:
10. Mitoses:
Label: ?.
Corresponding feature vector: .
📗 [15 points] Input the class labels on the test set (200 integers, either 2 or 4, comma separated, in one line).
Hint
📗 Start from the root, if \(x_{i} \leq t\), go to the left subtree, otherwise, go the right subtree. Stop if the subtree is a leaf and return the label.
📗 [1 points] Please enter any comments and suggestions including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the question that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.
📗 Please do not modify the content in the above text field: use the "Grade" button to update.
📗 Warning: grading may take around 10 to 20 seconds. Please be patient and do not click "Grade" multiple times.
📗 You could submit multiple times (but please do not submit too often): only the latest submission will be counted.
📗 Please also save the text in the above text box to a file using the button or copy and paste it into a file yourself . Please submit the resulting file with your code on Canvas Assignment P2.
📗 You could load your answers from the text (or txt file) in the text box below using the button . The first two lines should be "##p: 2" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.
📗 Saving and loading may take around 10 to 20 seconds. Please be patient and do not click "Load" multiple times.
📗 The sample solution in Java and Python will be posted around the deadline. You are allowed to copy and use parts of the solution without attribution. You are allowed to use code from other people and from the Internet, but you must give proper attribution at the beginning of the your code. MOSS will be used for code plagiarism check: blocks of copied code without attribution will result in a zero for the whole assignment.
📗 Sample solution from last year: 2020 P2. The homework is slightly different, please use with caution.
📗 Sample solution:
Java: File
Python: File
For both parts, you need to restrict to using a subset of the features based your version of the training set and test set.
For pruning, you can either implement error reduction pruning (optional), or cut the tree at the required depth (if you cannot get the desired accuracy, you can prune or train using the test set).
You need to write the code to print the tree in the format specified in the question.
📗 You can get help on understanding the algorithm from any of the office hours. To get help with debugging code in Java, please come during the Monday to Friday 2:00 to 3:00 Zoom office hours or Saturday to Sunday 2:00 to 3:00 (I can stay for a few hours after 3:00 by appointment) in-person office hours. To get help with debugging code in Python, please come during the Tuesday 3:00 to 5:00 in-person office hours or the Thursday 3:00 to 5:00 Zoom office hours. You are encouraged to work with other students, but if you use their code, you must give attribution at the beginning of your code.