
# Lecture Note

📗 Slides
Lecture 5: Slides, With Quiz
Lecture 6: Slides, With Quiz
Annotated Lecture 5 Section 1: Slides
Annotated Lecture 6 Section 1: Slides
Annotated Week 3 Section 2: Part I, Part II

📗 Typos and Mistakes
Lecture 5, Examples of Kernel Matrix -- Discussion, last line: added a square root of 2 in front of x_1 and x_2.

📗 Websites
Support Vector Machine: Link
Kernel Trick Video: Link
RBF Kernel SVM Demo: Link
Favorite Fruit: Link
Decision Tree: Link
K Nearest Neighbor: Link
Map of Manhattan: Link
Voronoi Diagram: Link
KD Tree: Link

📗 YouTube Videos
How to find the margin expression for SVM? Link
Why does the kernel trick work? Link
Example (Quiz): Compute SVM classifier Link
Example (Quiz): Kernel SVM for XOR operator Link
Example (Quiz): Kernel matrix to feature vector Link
Example (Quiz): Entropy computation Link
Example (Quiz): Decision tree for implication operator Link
Example (Quiz): Three nearest neighbors Link

# Written (Math) Problems

Submit on Canvas: PDF

# Programming Problem

📗 Short Instruction
(1) Download the IMDB 5000 data set from Kaggle or Data World. You can also use Google Dataset Search to find similar data sets: Google.
(2) Extract the columns budget, genres, runtime (duration), vote_average (imdb_score), and vote_count (num_voted_users) as features, and convert them into categorical data. For continuous or counting features, split using a uniform partition over the range; the number of categories can be any number between 2 and 10, chosen arbitrarily. Remove all data points with missing features. (Note: please use the first genre in the list and combine some of the categories so that the total number of categories is at most 10.)
(3) Extract the revenue (gross) column to use as y, and convert it into a binary variable according to your wisc ID (type your ID into the box on the course page to see how to split the column).
(4) Train a decision stump (a decision tree with a single split) for each feature and compare the features using information gain; a sketch of steps (2) to (4) follows this list.
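
A minimal sketch of steps (2) to (4) in Python, assuming the data file is named movies.csv and uses the Kaggle column names; the file name, the number of bins (k = 5), and the revenue threshold are placeholders, not the split rule generated from your wisc ID. The genres column (take the first genre, merge rare ones) is omitted for brevity.

```python
import csv
import math

def entropy(pos, neg):
    """Binary entropy of a set with pos positive and neg negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h

def info_gain(y, x):
    """Information gain of binary labels y when split by categorical feature x."""
    branches = {}  # category -> (positive count, negative count)
    for label, cat in zip(y, x):
        p, n = branches.get(cat, (0, 0))
        branches[cat] = (p + label, n + 1 - label)
    cond = sum((p + n) / len(y) * entropy(p, n) for p, n in branches.values())
    return entropy(sum(y), len(y) - sum(y)) - cond

def bin_uniform(values, k):
    """Discretize continuous values into k uniform bins over their range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant column
    return [min(int((v - lo) / width), k - 1) for v in values]

# Step (2): load the rows, dropping data points with missing features.
columns = ("budget", "runtime", "vote_average", "vote_count", "revenue")
with open("movies.csv") as f:  # hypothetical file name
    rows = [r for r in csv.DictReader(f) if all(r.get(c) for c in columns)]

# Step (3): binarize revenue; THRESHOLD is a placeholder for your ID's split rule.
THRESHOLD = 25_000_000
y = [1 if float(r["revenue"]) > THRESHOLD else 0 for r in rows]

# Step (4): one decision stump per feature, compared by information gain.
for name in ("budget", "runtime", "vote_average", "vote_count"):
    x = bin_uniform([float(r[name]) for r in rows], k=5)
    print(name, round(info_gain(y, x), 4))
```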

📗 Files to submit
(1) output.txt contains 5 lines. Each line starts with the number of categories and the information gain due to the split, then lists the number of positive and negative examples for each branch. (A positive example is an instance with label 1, and a negative example is an instance with label 0.) For example, if originally there are 1250 positive and 1250 negative examples, and 5 categories split the positive and negative examples evenly, then the line should be 5, 0, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250. (The information gain is 0 here because every branch has the same label distribution as the whole data set. A formatting sketch follows this list.)
(2) comments.txt contains information on how to run your program; in particular, the names of the data files are required.
(3) code.
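
To produce one line of output.txt in the required format, a small continuation of the sketch above (output_line is a hypothetical helper that reuses info_gain):

```python
def output_line(y, x):
    """Format one output.txt line: the number of categories, the information
    gain, then the positive and negative counts for branches 0, 1, ..., k."""
    cats = sorted(set(x))
    counts = []
    for c in cats:
        counts.append(sum(1 for label, v in zip(y, x) if v == c and label == 1))
        counts.append(sum(1 for label, v in zip(y, x) if v == c and label == 0))
    fields = [len(cats), round(info_gain(y, x), 4)] + counts
    return ", ".join(str(v) for v in fields)
```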

📗 Things to try
(1) Experiment with dividing the continuous variables into different numbers of categories; try numbers between 2 and 10.
(2) (Not required) You can also try binary division for continuous variables at each data point (as in the lectures); see the sketch after this list.
(3) (Not required) Build the decision tree.
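
For the optional binary division in item (2), a sketch that reuses the info_gain helper from the sketch above: try a threshold at each distinct data point and keep the one with the highest information gain.

```python
def best_binary_split(y, values):
    """Return the threshold (taken at a data point, as in lecture) whose
    two-branch split of values maximizes information gain, with that gain."""
    best_t, best_gain = None, -1.0
    for t in sorted(set(values)):
        x = [1 if v > t else 0 for v in values]
        g = info_gain(y, x)
        if g > best_gain:
            best_t, best_gain = t, g
    return best_t, best_gain
```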

📗 Longer Instruction
The five lines should have the format:
Number of categories for feature j, Information Gain for feature j, #(y = 1 and X_j = 0), #(y = 0 and X_j = 0), #(y = 1 and X_j = 1), #(y = 0 and X_j = 1), ..., #(y = 1 and X_j = k), #(y = 0 and X_j = k)
The order should be:
j = 1: budget
j = 2: genres
j = 3: runtime
j = 4: vote_average
j = 5: vote_count
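
In terms of these counts, with n total examples and categories c = 0, ..., k, the information gain for feature j is computed from the standard entropy definitions used in lecture:

```latex
H(Y) = -\sum_{y \in \{0,1\}} \frac{\#(Y = y)}{n} \log_2 \frac{\#(Y = y)}{n}

H(Y \mid X_j) = \sum_{c=0}^{k} \frac{\#(X_j = c)}{n}
  \left( -\sum_{y \in \{0,1\}} \frac{\#(Y = y, X_j = c)}{\#(X_j = c)}
         \log_2 \frac{\#(Y = y, X_j = c)}{\#(X_j = c)} \right)

I(Y; X_j) = H(Y) - H(Y \mid X_j)
```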

More (nonessential) details and hints: PDF.

📗 TAs' Solution
(1) Java: Link written by Tan
(2) Python: Link written by Dandi
Important note: you are not allowed to copy any code from the solutions. MOSS will be used to check for code similarity; changing just the variable names, spacing, etc. is still considered cheating. You can read the solutions to learn what they are doing, but you MUST write all the code yourself. The deadline for resubmission without the 50 percent penalty is June 30.

Last Updated: November 09, 2021 at 12:05 AM