CS540 HW0: Hand-in Practice and Parsing ML Datasets

Assigned: Tuesday 9/6/16
Due: Tuesday 9/13/16 11:55pm (and no later than 11:55pm on Tuesday 9/20/16 so we can release a solution)
Value: 50 points

Note: this is for Professor Shavlik's CS 540 section.


The purpose of this assignment is for you to familiarize yourself with the hand-in procedures that we plan to use for all assignments this semester. In addition, you'll get to refresh your memory of Java programming by writing a simple program that parses a real-world machine learning dataset; you will need this code for HW1, when we will learn decision trees from such datasets. Remember that you can attend Professor Shavlik's or Sam's (the TA's) office hours if you have any questions. These hours, along with an FAQ for each assignment, can be found via the course webpage.

What To Turn In

This course is using a hand-in service called Moodle. If you haven't already, log in using your NetID and explore the course Moodle page. Under the HW0 Assignment you should be able to upload your PDF and Java source file.

Note that you should not include your login name in your files; Moodle will add that information when we download your Java code.

Dataset Parser Details

In machine learning, a dataset often contains examples that we want to classify with labels. For instance, here is a sample dataset which describes the fate of many passengers who were in the Titanic disaster. The task would be to predict which of the passengers (the examples) survived or died (the 'output' labels) based on a set of features, such as their gender, age, and socioeconomic status. For simplicity, we will only consider binary-valued features and outputs, i.e., those that have only two possible values (for instance, adult or child).

The datasets used in HW0, HW1, and possibly HW2 will follow a very specific format so that they can be parsed more easily (in CS 540 we will not be testing robustness to erroneous input). All comment lines (lines beginning with double forward slashes //) and blank lines should be ignored. The significant lines have the following order:

Please refer to the Titanic dataset to see a concrete case. We have included several comments to explain each section. Your code should be able to parse any correctly formatted dataset, not just the Titanic one. I suggest you create a small (and possibly meaningless) dataset of your own to aid in debugging; that is what we will be doing when grading your solution. Note that your code should handle arbitrary numbers of spaces and tabs between items, not just a single space.
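The two line-level rules above (ignore comment and blank lines, and allow any run of spaces and tabs between items) can be sketched as follows. This is only a minimal illustration, not a required design: the class name LineTokenizer and the method tokenize are hypothetical, the sample strings are arbitrary rather than real dataset lines, and trimming leading whitespace before checking for // is an assumption on our part.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; neither the class name nor the method name is part of the assignment.
class LineTokenizer {

    // Returns the whitespace-separated items on a line, or an empty list
    // if the line is blank or is a comment (begins with "//").
    // Assumption: leading whitespace is trimmed before the "//" check.
    static List<String> tokenize(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("//")) {
            return new ArrayList<>();
        }
        // "\\s+" matches any run of one or more spaces and tabs.
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("// a comment line"));       // prints []
        System.out.println(tokenize("   \t "));                   // prints []
        System.out.println(tokenize("alpha \t  beta\tgamma"));   // prints [alpha, beta, gamma]
    }
}
```

Splitting on the regular expression \s+ rather than on a single space is what makes the number of spaces and tabs between items irrelevant.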

In HW1 we will also use this dataset: red-wine-quality-train.data. This data is a simplified version of this item in the ML repository at UC-Irvine. You should also try your HW0 code on this dataset, once you have things working on the Titanic dataset.

Your Java program needs to be able to read a dataset file in the format described above and output some statistics about the data. Specifically, your program's output should look something like:

There are 4 features in the dataset.
There are 120 examples. 
40 have output label 'survived', 80 have output label 'died'.

Feature 'type':
  In the examples with output label 'survived', 85.2% have value 'passenger'
  In the examples with output label 'died', 43.2% have value 'passenger'

Feature 'accomodations':

... and so on for each feature in the file 

Your output does not have to match the format above exactly, but it must include the number of features, the number of examples, the number of examples with each output label, and, for every feature in the file and for both possible output labels, the percentage of examples that have a given value. Note that the numbers above are just illustrative and not the actual numbers from the file. Again, your code should work with any dataset in this format. Recall that we will be grading your code with a different dataset than the one supplied.
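One detail worth care when computing the percentages is Java's integer division: with int operands, count / total * 100 evaluates to 0 whenever count is less than total. The sketch below shows one way to avoid that and to print one decimal place. The class name PercentSketch and the method percent are hypothetical, the counts are made up, and returning "n/a" for an empty group is our assumption rather than something the assignment specifies.

```java
import java.util.Locale;

// Hypothetical helper; not part of the assignment's required structure.
class PercentSketch {

    // Formats count/total as a percentage with one decimal place.
    // Assumption: an empty group (total == 0) is reported as "n/a".
    static String percent(int count, int total) {
        if (total == 0) {
            return "n/a";
        }
        // Multiplying by 100.0 first forces floating-point arithmetic.
        double pct = 100.0 * count / total;
        return String.format(Locale.US, "%.1f%%", pct);
    }

    public static void main(String[] args) {
        System.out.println(percent(1, 4));  // prints 25.0%
        System.out.println(percent(2, 3));  // prints 66.7%
        System.out.println(percent(0, 0));  // prints n/a
    }
}
```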

Your program should be runnable via the following commands:

javac HW0.java
java HW0 <dataset filename>

Be sure to include your full name and NetID in a header at the top of your HW0.java file.

To give you a jump start, we have provided the sample program ScannerSample.java, which shows how to use the Scanner class in Java. You may remember Scanners from CS 302 or 367. Scanners are useful for reading input from a file. The sample program reads in any text file and prints out each word from each line. Feel free to use parts or all of this program in your dataset parser.
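For readers who have not used Scanners recently, the behavior described above (read a text file and print each word from each line) might look roughly like the sketch below. This is not the provided ScannerSample.java; the class name ScannerSketch and the method words are hypothetical, and the usage message is our own wording.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical illustration; this is NOT the course's ScannerSample.java.
class ScannerSketch {

    // Collects every whitespace-separated word, reading the input line by line.
    static List<String> words(Scanner in) {
        List<String> result = new ArrayList<>();
        while (in.hasNextLine()) {
            // A second Scanner splits the current line into words.
            Scanner lineScanner = new Scanner(in.nextLine());
            while (lineScanner.hasNext()) {
                result.add(lineScanner.next());
            }
            lineScanner.close();
        }
        return result;
    }

    public static void main(String[] args) throws FileNotFoundException {
        if (args.length != 1) {
            System.err.println("Usage: java ScannerSketch <text filename>");
            return;
        }
        // try-with-resources closes the file when reading is done.
        try (Scanner in = new Scanner(new File(args[0]))) {
            for (String word : words(in)) {
                System.out.println(word);
            }
        }
    }
}
```

Reading line by line, instead of calling next() on the file Scanner directly, keeps line boundaries visible, which matters when each line of a file has its own meaning.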