📗 Enter your ID here, then click one of the question buttons (1 to 10).
📗 The same ID should generate the same set of parameters. Your answers are not saved when you close the browser. You could either copy and paste your console output into the text boxes or print your output to text files (.txt) and load them using the button above the text boxes.
📗 (Introduction) In this programming homework, you will build decision stumps and a decision tree to diagnose whether a patient has some disease based on their symptoms and medical test results. Unfortunately, we do not have a nice dataset on COVID-19, so we will use the Wisconsin Breast Cancer dataset. Your models will read in integer-valued patient data and output a diagnosis of whether the patient has breast cancer.
📗 (Part 1) Go to the website: Dataset, click on "Data Folder" and download "breast-cancer-wisconsin.data". Read the dataset description to figure out which variables are features and which variable is the label.
📗 Hint: some lines contain "?" marking missing values; be careful when parsing them into the feature matrix (a parsing sketch follows the variable list below).
The list of variables is copied below:
1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant)
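📗 A minimal parsing sketch in Python (an illustration only; the filename and the choice to skip rows with missing values are assumptions — you may impute the "?" values instead):
```python
# Read the dataset, skipping blank lines and rows with missing values ("?").
import csv

features, labels = [], []
with open("breast-cancer-wisconsin.data") as f:
    for row in csv.reader(f):
        if not row or "?" in row:
            continue  # skip blank lines and rows containing "?"
        values = [int(v) for v in row]
        features.append(values[1:10])  # items 2-10 of the list: the nine features
        labels.append(values[10])      # item 11: the class label (2 or 4)
```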
📗 (Part 1) Train a binary decision stump (decision tree with depth 1) using the following feature: (indexed according to the above list). Report the counts and the information gain.
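📗 For reference, the split quality here is measured by information gain. Under the standard definitions (confirm against the Lecture 6 slides for the exact form used in class), for a split "$x_j \le t$":

$H(Y) = -\sum_y \hat{p}_y \log_2 \hat{p}_y, \qquad \text{Gain} = H(Y) - \frac{n_{\le}}{n} H(Y_{\le}) - \frac{n_{>}}{n} H(Y_{>}),$

where $n_{\le}$ and $n_{>}$ count the training instances with $x_j \le t$ and $x_j > t$, and $Y_{\le}$, $Y_{>}$ are the labels on each side of the split.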
📗 Hint: since the features are integer-valued, you can either try all binary splits and find the one with the maximum information gain, or use the real-valued decision tree learning algorithm discussed in the lecture. A sketch of the first approach follows.
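📗 A sketch of decision-stump training, assuming the `features` and `labels` lists from the parsing sketch above (the feature column shown is an arbitrary example — use the feature assigned to you):
```python
import math

def entropy(labels):
    # empirical entropy of a binary label list (labels are 2 or 4)
    if not labels:
        return 0.0
    p = labels.count(2) / len(labels)
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

def info_gain(column, labels, t):
    # information gain of the binary split "feature value <= t"
    below = [y for x, y in zip(column, labels) if x <= t]
    above = [y for x, y in zip(column, labels) if x > t]
    n = len(labels)
    return (entropy(labels) - len(below) / n * entropy(below)
                            - len(above) / n * entropy(above))

# Try every threshold t = 1..9 for one feature column and keep the best split.
# As an example, feature 2 in the list (Clump Thickness) is column 0 of `features`.
column = [row[0] for row in features]
gain, t = max((info_gain(column, labels, t), t) for t in range(1, 10))
below = [y for x, y in zip(column, labels) if x <= t]
above = [y for x, y in zip(column, labels) if x > t]
print("threshold:", t, "information gain:", gain)
# Counts in the order asked for further down: above-benign, below-benign,
# above-malignant, below-malignant (benign = 2, malignant = 4).
print(above.count(2), below.count(2), above.count(4), below.count(4))
```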
📗 (Part 2) Train a binary decision tree using the following features: (indexed according to the same list). Report the tree using the following format:
📗 Note: make sure you only use "x? <= integer" as the condition and you only return "2" or "4". Spaces do not matter.
📗 Hint: do not split according to the order of the features in the list; at each split, you still have to find the feature (in the list) with the maximum information gain.
📗 Hint: use any tie breaking rule you like for comparing information gain and finding the majority label.
📗 Important hint: you should stop splitting and use the majority label if the maximum information gain is 0. A tree-building sketch incorporating this stopping rule follows.
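📗 A recursive tree-building sketch (assuming `info_gain` from the stump sketch above is in scope; `feature_idx` stands for the 0-based columns of the features assigned to you):
```python
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature = feature      # 0-based column index into a feature row
        self.threshold = threshold  # condition is x[feature] <= threshold
        self.left = left            # subtree for x[feature] <= threshold
        self.right = right          # subtree for x[feature] > threshold
        self.label = label          # 2 or 4 at a leaf, None at internal nodes

def majority(labels):
    # break ties however you like (any tie-breaking rule is fine, per the hint above)
    return max(set(labels), key=labels.count)

def build(features, labels, feature_idx):
    # best (gain, feature, threshold) over all assigned features and thresholds
    gain, j, t = max((info_gain([row[j] for row in features], labels, t), j, t)
                     for j in feature_idx for t in range(1, 10))
    if gain < 1e-12:  # maximum information gain is 0: stop, use the majority label
        return Node(label=majority(labels))
    below = [(x, y) for x, y in zip(features, labels) if x[j] <= t]
    above = [(x, y) for x, y in zip(features, labels) if x[j] > t]
    return Node(feature=j, threshold=t,
                left=build([x for x, _ in below], [y for _, y in below], feature_idx),
                right=build([x for x, _ in above], [y for _, y in above], feature_idx))
```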
📗 (Part 2) Classify the following patients using your tree. This is the test set; a classification sketch follows the note below.
You can either use the button to download a text file, or copy and paste from the text box into Excel or a CSV file. Please do not change the content of the text box.
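📗 A sketch of classifying test patients and of printing the tree in the "x? <= integer" format (assuming the `Node` class above; printing 0-based column j as feature j + 2 assumes the x-numbers follow the variable list, which you should verify):
```python
def classify(node, x):
    # walk from the root down to a leaf and return its label (2 or 4)
    if node.label is not None:
        return node.label
    subtree = node.left if x[node.feature] <= node.threshold else node.right
    return classify(subtree, x)

def show(node, indent=""):
    # print nested "if x? <= t ... else ..." lines ending in "return 2" or "return 4"
    if node.label is not None:
        print(indent + "return " + str(node.label))
        return
    print(indent + "if x" + str(node.feature + 2) + " <= " + str(node.threshold))
    show(node.left, indent + " ")
    print(indent + "else")
    show(node.right, indent + " ")
```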
📗 (Part 2) Prune the tree so that the maximum depth is . The root is at depth 0. You could do this with or without a validation set; a sketch of the no-validation-set approach follows.
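📗 One simple way to enforce the depth limit without a validation set is to rebuild the tree with a depth cutoff, emitting a majority-label leaf once the cutoff is reached (a sketch reusing `info_gain`, `majority`, and `Node` from above):
```python
def build_pruned(features, labels, feature_idx, depth, max_depth):
    gain, j, t = max((info_gain([row[j] for row in features], labels, t), j, t)
                     for j in feature_idx for t in range(1, 10))
    if gain < 1e-12 or depth == max_depth:  # stop on zero gain or at the depth limit
        return Node(label=majority(labels))
    below = [(x, y) for x, y in zip(features, labels) if x[j] <= t]
    above = [(x, y) for x, y in zip(features, labels) if x[j] > t]
    return Node(feature=j, threshold=t,
                left=build_pruned([x for x, _ in below], [y for _, y in below],
                                  feature_idx, depth + 1, max_depth),
                right=build_pruned([x for x, _ in above], [y for _, y in above],
                                   feature_idx, depth + 1, max_depth))

# Example: a tree whose maximum depth is max_depth, with the root at depth 0
# tree = build_pruned(features, labels, feature_idx, 0, max_depth)
```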
📗 For the decision stump (Part 1), enter the number of positive and negative instances in the training set above and below the threshold (four integers, comma-separated, in the order: above-benign, below-benign, above-malignant, below-malignant).
📗 Please enter any comments and suggestions including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the question that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.
📗 Warning: grading may take around 5 seconds. Please be patient and do not click "Grade" multiple times.
📗 Please copy and paste the text between the *s (not including the *s) and submit it on Canvas, P2.
📗 Please submit your code and outputs on Canvas, P2S.
📗 You could also save your output as a single text file using the button and submit this to P2S (with your code).
📗 Warning: the load button does not function properly for all questions; please recheck everything after you load. You could load your answers using the button from the text field:
📗 Saving and loading may take around 5 seconds. Please be patient and do not click the buttons multiple times.
📗 Questions 1 to 4 correspond to Part 1 and Questions 5 to 9 correspond to Part 2.
📗 You may have to split on the same feature multiple times with different thresholds.
📗 You should not store the tree in an array (i.e., not what I did in JavaScript); instead, define a node class that stores the variable names and thresholds, and build the tree out of such nodes, as in the Part 2 sketch above. If this explanation is not clear, you could read the notes from Professor Caraza-Harter's CS320: Link. In general, his course webpage is an excellent resource for all the programming tricks we are going to use in this course: Link.
📗 The slides containing the main algorithm and the important formulas are Lecture 6, Slides 15, 16, 17, and 19.
📗 The homework instructions from last year (not great, but possibly useful): Link.
📗 Sample solutions in Java and Python are posted below.
Important notes:
(1) Pruning is not implemented; you may need to prune the tree to get a high enough accuracy to pass the auto-grading.
(2) You need to figure out which variables to output yourself; the outputs from the sample solutions are for debugging purposes only.
(3) You are allowed to copy and use parts of the TA's solution without attribution. You are also allowed to use code from other people and from the Internet, but you must state clearly in the comments where it comes from!
Java code by Ainur Ainabekova: Link.
Python code by Hugh Liu: Link.