CS540 HW0: Hand-in Practice and Parsing ML Datasets

Assigned: Tuesday 9/6/16
Due: Tuesday 9/13/16 11:55pm (and no later than 11:55pm on Tuesday 9/20/16 so we can release a solution)
Value: 50 points

Note: this is for Professor Shavlik's CS 540 section.

Overview

The purpose of this assignment is for you to familiarize yourself with the hand-in procedures that we plan to use for all assignments this semester. In addition, you'll get to refresh your memory of Java programming by writing a simple program that parses a real-world machine learning dataset; you will need this code for HW1, when we will learn decision trees from such datasets. Remember that you can attend Professor Shavlik's or Sam's (the TA's) office hours if you have any questions. These hours, along with an FAQ for each assignment, can be found via the course webpage.

What To Turn In

(10 points) A handwritten page that lists the Computer Science courses you have previously taken; for courses taken outside UW-Madison, please include the NAME of the course and not just its course number (also list the name of the school where you took the class). Use a scanner and convert the page to a PDF named HW0.pdf. We will have several handwritten assignments this semester, which you will scan in a similar fashion.
NOTE: listing previous CS classes could easily be done in a word processor (such as Microsoft Word) and then converted to PDF, but writing the solutions to later HWs by hand will likely be much easier, so please write your answer to this part of HW0 by hand so we all get experience turning in PDF versions of handwritten solutions.
It is your responsibility to make sure the image quality and handwriting of your turned-in HWs are easily readable, or you risk losing many if not all points. Simply taking a photo with your phone may not work well. Campus libraries have copiers where you can scan your handwritten HWs for free. A previous CS 540 student recommended the app called CamScanner, so you might look into that or other apps.
(40 points) A Java source file named HW0.java which implements a simple dataset parser. Details for the program are described in the next section. There will be several programming projects included in this semester's homework assignments, which you will hand in like this.

This course is using a hand-in service called Moodle. If you haven't already, log in using your NetID and explore the course Moodle page. Under the HW0 Assignment you should be able to upload your PDF and Java source file.

Note that you should not include your login name in your files; Moodle will add that information when we download your Java code.

Dataset Parser Details

In machine learning, a dataset often contains examples that we want to classify with labels. For instance, here is a sample dataset which describes the fate of many passengers who were in the Titanic disaster. The task would be to predict which of the passengers (the examples) survived or died (the 'output' labels) based on a set of features, such as their gender, age, and socioeconomic status. For simplicity, we will only consider binary-valued features and outputs, i.e., those that have only two possible values (for instance, adult or child).

The datasets used in HW0, HW1, and possibly HW2 will follow a very specific format so that they can be parsed more easily (in CS 540 we will not be testing robustness to erroneous input). All comment lines (lines beginning with double forward slashes //) and blank lines should be ignored. The significant lines have the following order:

First, the number of features per example is specified as a single integer on its own line.
Next, each feature is listed on its own line followed by a dash and the two possible values (ex: age - adult child).
The two possible labels are listed next, one per line (ex: survived died).
Next, the number of examples in the file is listed on its own line.
Finally, the examples are listed one per line. The format is <example name> <label> <feature values>.
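
As a minimal sketch of the rule that comment lines and blank lines are ignored, the helper below keeps only the significant lines. The class and method names (SignificantLines, isSignificant, filter) are hypothetical and not part of any provided course code. Trimming each line before checking for // is an assumption; the format only states that comment lines begin with //.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the provided course code.
final class SignificantLines {
    // A line is significant unless it is blank or (after trimming) starts with "//".
    // Trimming before the check is an assumption about leading whitespace.
    static boolean isSignificant(String line) {
        String trimmed = line.trim();
        return !trimmed.isEmpty() && !trimmed.startsWith("//");
    }

    // Keeps only the significant lines, trimmed, in their original order.
    static List<String> filter(List<String> rawLines) {
        List<String> kept = new ArrayList<>();
        for (String line : rawLines) {
            if (isSignificant(line)) {
                kept.add(line.trim());
            }
        }
        return kept;
    }
}
```

With the significant lines in hand, the remaining parsing can follow the fixed order above: feature count, feature lines, the two labels, example count, then the examples.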

Please refer to the Titanic dataset to see a concrete case. We have included several comments to explain each section. Your code should be able to parse any correctly formatted dataset, not just the Titanic one. I suggest you create a small (and possibly meaningless) dataset of your own to aid in debugging; that is what we will be doing when grading your solution. Note that your code should handle arbitrary numbers of spaces and tabs between items and not just one space.
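
One possible way to handle arbitrary runs of spaces and tabs between items is to split each line on the regular expression \s+ after trimming it. The sketch below is only an illustration; the class name WhitespaceSplit, the method tokens, and the example line are made up and do not come from the Titanic file.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only.
final class WhitespaceSplit {
    // Splits on runs of whitespace (spaces and tabs).
    // trim() first avoids an empty leading token when the line starts with whitespace.
    static String[] tokens(String line) {
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Made-up example line: example name, label, then feature values.
        String line = "ex1 \t survived   adult\tmale";
        System.out.println(Arrays.toString(tokens(line)));
        // prints [ex1, survived, adult, male]
    }
}
```

The same call applied to a feature line such as "age - adult child" yields four tokens, with the dash as the second one.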

In HW1 we will also use this dataset: red-wine-quality-train.data. This data is a simplified version of this item in the ML repository at UC-Irvine. You should also try your HW0 code on this dataset, once you have things working on the Titanic dataset.

Your Java program needs to be able to read a dataset file in the format described above and output some statistics about the data. Specifically, your program's output should look something like:

There are 4 features in the dataset.
There are 120 examples. 
40 have output label 'survived', 80 have output label 'died'.

Feature 'type':
  In the examples with output label 'survived', 85.2% have value 'passenger'
  In the examples with output label 'died', 43.2% have value 'passenger'

Feature 'accomodations':
  ...

... and so on for each feature in the file

Your output does not have to match the format above exactly, but it must include the number of features, the number of examples, the number of examples with each output label, and, for every feature in the file and each of the two labels, the percentage of examples with that label that have a given feature value. Note that the numbers above are just illustrative and not the actual numbers from the file. Again, your code should work with any dataset in this format. Recall that we will be grading your code with a different dataset than the one supplied.
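
The per-label percentages can be computed from two counts: how many examples have the label, and how many of those have the feature value of interest. The sketch below shows one way to do the arithmetic and formatting. The names LabelStats, percent, and describe are hypothetical, and returning 0.0 for a label with no examples is an assumption rather than a stated requirement.

```java
import java.util.Locale;

// Hypothetical helper for illustration only.
final class LabelStats {
    // Percentage of `matching` out of `total`.
    // Multiplying by 100.0 first forces floating-point division.
    // Returning 0.0 when total is 0 is an assumption made to avoid dividing by zero.
    static double percent(int matching, int total) {
        return total == 0 ? 0.0 : 100.0 * matching / total;
    }

    // Formats one report line in the style of the sample output, with one decimal place.
    static String describe(String label, String value, int matching, int total) {
        return String.format(Locale.ROOT,
                "  In the examples with output label '%s', %.1f%% have value '%s'",
                label, percent(matching, total), value);
    }
}
```

For example, describe("died", "adult", 1, 4) reports 25.0% for the made-up counts 1 out of 4.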

Your program should be runnable via the following commands:

javac HW0.java
java HW0 <dataset filename>

Be sure to include your full name and NetID at the top of your HW0.java in a header.

To give you a jump start, we have provided the sample program ScannerSample.java, which shows how to use the Scanner class in Java. You may remember Scanners from CS 302 or 367. Scanners are useful for reading input from a file. The sample program reads in any text file and prints out each word from each line. Feel free to use parts or all of this program in your dataset parser.
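
For reference, here is a minimal sketch of reading a file line by line with java.util.Scanner. It is not the provided ScannerSample.java; the class name LineReaderSketch and the method readLines are hypothetical, and the usage message is only an illustration.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical class for illustration; this is not the provided ScannerSample.java.
final class LineReaderSketch {
    // Reads every line of the file into a list.
    // try-with-resources closes the Scanner (and the underlying file) when done.
    static List<String> readLines(File file) throws FileNotFoundException {
        List<String> lines = new ArrayList<>();
        try (Scanner in = new Scanner(file)) {
            while (in.hasNextLine()) {
                lines.add(in.nextLine());
            }
        }
        return lines;
    }

    public static void main(String[] args) throws FileNotFoundException {
        // Expect exactly one argument: the dataset filename.
        if (args.length != 1) {
            System.err.println("Usage: java LineReaderSketch <dataset filename>");
            return;
        }
        for (String line : readLines(new File(args[0]))) {
            System.out.println(line);
        }
    }
}
```

Reading whole lines first, rather than word by word, makes it straightforward to skip comment and blank lines before looking at individual items.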