Note: this is for Professor Shavlik's CS 540 section.
The purpose of this assignment is for you to familiarize yourself with the hand-in procedures that we plan to use for all assignments this semester. In addition, you'll get to refresh your memory of Java programming by writing a simple program that parses a real-world machine learning dataset; you will need this code for HW1, when we will learn decision trees from such datasets. Remember that you can attend Professor Shavlik's or Sam's (the TA's) office hours if you have any questions. These hours, along with an FAQ for each assignment, can be found via the course webpage.
HW0.pdf. We will have several handwritten assignments this semester, which you will scan in a similar fashion.
It is your responsibility to make sure the image quality and handwriting of your turned-in HWs are reasonably easy to read, or you risk losing many if not all points. Simply taking a photo with your phone may not work well. Campus libraries have copiers where you can scan your handwritten HWs for free. A previous CS 540 student recommended the app CamScanner, so you might look into that or other apps.
HW0.java, which implements a simple dataset parser. Details for the program are described in the next section. There will be several programming projects included in this semester's homework assignments, which you will hand in like this.
This course is using a hand-in service called Moodle. If you haven't already, log in using your NetID and explore the course Moodle page. Under the HW0 Assignment you should be able to upload your PDF and Java source file.
Note that you should not include your login name in your files; Moodle will add that information when we download your Java code.
In machine learning, a dataset often contains examples that we want to classify with labels. For instance, here is a sample dataset which describes the fate of many passengers who were in the Titanic disaster. The task would be to predict which of the passengers (the examples) survived or died (the 'output' labels) based on a set of features, such as their gender, age, and socioeconomic status. For simplicity, we will only consider binary-valued features and outputs, i.e., those that have only two possible values (for instance, adult or child).
The datasets used in HW0, HW1, and possibly HW2 will follow a very specific format so that they can be parsed more easily (in CS 540 we will not be testing robustness to erroneous input). All comment lines (lines beginning with double forward slashes, //) and blank lines should be ignored. The significant lines have the following order:
In HW1 we will also use this dataset: red-wine-quality-train.data. This data is a simplified version of this item in the ML repository at UC-Irvine. You should also try your HW0 code on this dataset, once you have things working on the Titanic dataset.
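The rule above about ignoring comment lines and blank lines can be sketched as follows. This is only a minimal illustration, not part of the provided course code: the class name SignificantLines and both method names are hypothetical, and treating a line with leading whitespace before the // as a comment is an assumption the handout does not state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper for illustration; not part of the provided course code.
class SignificantLines {

    // Returns true for lines a parser should skip: blank lines and
    // lines beginning with "//". Trimming first is an assumption, so an
    // indented "//" is also treated as a comment.
    static boolean isIgnorable(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() || trimmed.startsWith("//");
    }

    // Collects the significant (non-comment, non-blank) lines, in order.
    static List<String> significantLines(Scanner in) {
        List<String> kept = new ArrayList<>();
        while (in.hasNextLine()) {
            String line = in.nextLine();
            if (!isIgnorable(line)) {
                kept.add(line.trim());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up input, used only to exercise the filter.
        String sample = "// a comment\n\nfirst significant line\n"
                + "  // indented comment\nsecond significant line\n";
        System.out.println(significantLines(new Scanner(sample)));
    }
}
```

Filtering first and interpreting the remaining lines afterwards keeps the rest of the parser free of special cases for comments.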
Your Java program needs to be able to read a dataset file in the format described above and output some statistics about the data. Specifically, your program's output should look something like:
There are 4 features in the dataset.
There are 120 examples.
40 have output label 'survived', 80 have output label 'died'.

Feature 'type':
In the examples with output label 'survived', 85.2% have value 'passenger'
In the examples with output label 'died', 43.2% have value 'passenger'

Feature 'accomodations':
...

... and so on for each feature in the file
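The per-label percentages in this sample output are conditional frequencies: among the examples with a given output label, the fraction that have a given feature value. A minimal sketch of that computation is below. The class name LabelStats, the method names, the parallel-array representation, and the four toy examples are all assumptions made for illustration; they are not taken from the course code or from the Titanic dataset.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of the provided course code.
class LabelStats {

    // Percentage of the examples carrying `label` whose feature equals `value`.
    // featureValues[i] and labels[i] describe the same example.
    static double percentWithValue(String[] featureValues, String[] labels,
                                   String label, String value) {
        int withLabel = 0;
        int withBoth = 0;
        for (int i = 0; i < labels.length; i++) {
            if (labels[i].equals(label)) {
                withLabel++;
                if (featureValues[i].equals(value)) {
                    withBoth++;
                }
            }
        }
        // Avoid dividing by zero when no example has this label.
        return withLabel == 0 ? 0.0 : 100.0 * withBoth / withLabel;
    }

    // Formats one line in the style of the sample output, with one decimal place.
    static String describe(String label, double percent, String value) {
        return String.format(Locale.US,
                "In the examples with output label '%s', %.1f%% have value '%s'",
                label, percent, value);
    }

    public static void main(String[] args) {
        // Toy data, invented for this sketch.
        String[] type = {"passenger", "passenger", "crew", "passenger"};
        String[] fate = {"survived", "survived", "survived", "died"};
        double pct = percentWithValue(type, fate, "survived", "passenger");
        System.out.println(describe("survived", pct, "passenger"));
    }
}
```

With the toy data, two of the three 'survived' examples have value 'passenger', so the printed figure is 66.7%.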
Your program should be runnable via the following commands:
javac HW0.java
java HW0 <dataset filename>
Be sure to include your full name and NetID in a header at the top of your HW0.java.
To give you a jump start, we have provided the sample program ScannerSample.java, which shows how to use the Scanner class in Java. You may remember Scanners from CS 302 or 367. Scanners are useful for reading input from a file. The sample program reads in any text file and prints out each word from each line. Feel free to use parts or all of this program in your dataset parser.
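For readers who want a reminder of the pattern before opening the provided file, here is a rough sketch of reading a text file line by line and printing each word. It is not the course's ScannerSample.java; the class name WordEcho and the helper wordsOf are hypothetical, and the actual sample program may be organized differently.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical sketch; this is NOT the provided ScannerSample.java.
class WordEcho {

    // Splits one line into whitespace-separated words using a Scanner.
    static List<String> wordsOf(String line) {
        List<String> words = new ArrayList<>();
        Scanner lineScanner = new Scanner(line);
        while (lineScanner.hasNext()) {
            words.add(lineScanner.next());
        }
        lineScanner.close();
        return words;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: java WordEcho <filename>");
            return;
        }
        // try-with-resources closes the file Scanner automatically.
        try (Scanner fileScanner = new Scanner(new File(args[0]))) {
            while (fileScanner.hasNextLine()) {
                for (String word : wordsOf(fileScanner.nextLine())) {
                    System.out.println(word);
                }
            }
        } catch (FileNotFoundException e) {
            System.err.println("Could not open file: " + args[0]);
        }
    }
}
```

Using one Scanner for lines and a second for the words within a line keeps line boundaries visible, which matters when each line of a dataset file has its own meaning.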