University of Wisconsin Computer Sciences

CS 760: Homework 0
Creating and parsing your personal dataset

Assigned: Friday, January 22, 2010.
Due: Monday, February 1, 2010, 4pm.

Homework 0: 50 points

PLEASE INCLUDE A PHOTO OF YOURSELF, AT LEAST 3x5 INCHES, WHEN YOU TURN IN YOUR HW0. PLEASE WRITE YOUR NAME, DEPARTMENT, AND YEAR (e.g., 2nd-year graduate student in CS) ON THE BACK OF YOUR PHOTO; ALSO LIST AI CLASSES YOU HAVE PREVIOUSLY TAKEN. MAKE SURE YOUR PHOTO IS DETACHABLE SO I CAN KEEP IT WHEN I RETURN HW0.

Creating Your Personal Dataset

Many learning systems work by processing a "training" set of labeled examples, where the examples are members and non-members of the concept the system is supposed to learn. Through the first half of the semester you will use the same database of examples, so that the relative performance of the various systems you develop can be empirically judged.

This assignment serves to introduce you to the problems of representing data from the "real-world" in a fixed-length feature vector. You will use your dataset and data-file reader in subsequent homeworks.

Your first assignment is to:

Create your own database from any "real-world" dataset except those in the UC-Irvine archive (http://www.ics.uci.edu/~mlearn/MLRepository.html). You can use your own dataset or one from the sites listed below. Use the UC-Irvine archive as an example of the type of dataset you are to build.

Convert your chosen dataset into the required format (see Requirements and Creating the Feature Names & Values File section) for CS 760.
Write some code to read in the examples.names and examples.data files, making sure your code conforms to the required format specifications (see point #7 of Requirements and the Creating the Feature Names & Values File section).

Requirements

Choose a dataset that is of interest to you, hopefully one where you have a reasonable idea of the meaning of the features used to describe the data. You will need to reformulate your dataset to meet the following specifications:

Take a "real-world" dataset where the function being learned isn't "obvious" (i.e., data that cannot easily be classified into the appropriate class or category without any error).
The function being learned should be binary-valued. It is OK to choose a multi-valued task, but you'll then have to decide how to reformulate it into two categories.
Your dataset should have approximately 500 examples for category A and 500 for category B (i.e., you should have about 1000 training examples; randomly delete examples if the dataset you choose has more than 1000).
The dataset you choose should contain at least three numeric-valued features and at least three discrete-valued features. If necessary, convert some numeric features into discrete features (e.g., a numeric size value into one of {small, medium, large}); do not worry too much just where to make the "cut points."
Any other features you include should have discrete or continuous values, for example, any of the following types: boolean (binary values, such as yes/no or 0/1), nominal (the possible values of the feature have no relationship among them, e.g. color={black, white, purple}), or ordered (the possible values of the feature are totally ordered, e.g. size={small, medium, large} in the discrete case or salary=[0..1000000] in the continuous case).
For your discrete features, aim to have no more than a dozen possible values. For example, instead of using the names of the 50 US states as the value for, say, birthplace, use something like {Northeast, MidAtlantic, South, Midwest, MountainWest, Pacific, AlaskaAndHawaii}.
You should delete all examples containing "missing" feature values (you should have 1000 examples after discarding examples with missing values). If after deleting you no longer have 1000 examples, then quasi-randomly replace some of the missing values - see the TA or me if you need to do this.
You should use the UC-Irvine data-file format for your CS 760 dataset, which describes each example as a comma-separated list of feature values, whose ordering matches that in the *.names file described below. We'll be creating our own datasets for use when grading subsequent homeworks, and your code should be able to read any data file formatted in the UC-Irvine format. (Be sure your code allows any type of white space between the values of the example's features. Also allow the Java "//" type of comments, but you needn't worry about in-line, i.e., "/* */", comments.)
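
As a rough illustration of the last requirement, the following Java sketch splits one line of a data file into feature values. It assumes one example per line and that "//" starts a comment running to the end of the line. The class name DataLineParser and its parse method are hypothetical and are not part of any course-provided code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (illustrative only): splits one line of a data file. */
final class DataLineParser {

    /**
     * Drops a trailing "//" comment, splits the rest on commas, and trims
     * whitespace around each value. Blank and comment-only lines yield an
     * empty list. Assumption: each example sits on a single line.
     */
    static List<String> parse(String line) {
        int comment = line.indexOf("//");
        String content = (comment >= 0 ? line.substring(0, comment) : line).trim();
        List<String> values = new ArrayList<>();
        if (content.isEmpty()) {
            return values;
        }
        // A limit of -1 keeps empty fields, so "a,,b" still yields three
        // values and the empty one can later be reported as illegal.
        for (String token : content.split(",", -1)) {
            values.add(token.trim());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parse("  red ,\t3.5,  small   // a comment")); // [red, 3.5, small]
        System.out.println(parse("// comment only"));                     // []
    }
}
```

Keeping empty fields rather than silently dropping them is a design choice made here so that a stray comma shows up as an illegal value instead of shifting the remaining features.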

Creating the Feature Names & Values File

Most of the UC-Irvine databases have a *.names file that describes the features used to represent examples. However, these "names" files aren't all formatted in the same manner, though usually the information about the feature names and possible values is item #7 in these files. In CS 760 we'll need to standardize the *.names file. The format of this file should be as follows:

featureName featureType possibleValues

where

featureName is a token (i.e., a character string without quotes or internal spaces)
featureType is either discrete or continuous (for the output category, use output as the feature type)
possibleValues is a comma-separated list of tokens specifying (a) the possible values for a discrete-valued feature or (b) the minimum and maximum values for a continuous-valued feature

Java-style "//" comments can appear anywhere in the *.names file. You need not deal with "/* */" style of comments. See examples.names for an example.
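
One possible way to read a single line in this format is sketched below in Java. The Feature class, its fields, and the sample lines are hypothetical; the sample lines reuse the size and salary examples from the Requirements section and are not taken from the course's examples.names file. The sketch assumes each feature is described on one line.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one line of a names file (illustrative only). */
final class Feature {
    final String name;
    final String type;         // "discrete", "continuous", or "output"
    final List<String> values; // legal values, or [min, max] if continuous

    Feature(String name, String type, List<String> values) {
        this.name = name;
        this.type = type;
        this.values = values;
    }

    /**
     * Parses "featureName featureType possibleValues".
     * Returns null for blank or comment-only lines.
     */
    static Feature parse(String line) {
        int comment = line.indexOf("//");
        String content = (comment >= 0 ? line.substring(0, comment) : line).trim();
        if (content.isEmpty()) {
            return null;
        }
        // Split into at most three parts so that whitespace inside the
        // comma-separated value list stays in the third part.
        String[] parts = content.split("\\s+", 3);
        if (parts.length < 3) {
            throw new IllegalArgumentException("Expected name, type, values: " + line);
        }
        List<String> values = new ArrayList<>();
        for (String v : parts[2].split(",")) {
            values.add(v.trim());
        }
        return new Feature(parts[0], parts[1], values);
    }

    @Override
    public String toString() {
        return name + " " + type + " " + values;
    }

    public static void main(String[] args) {
        System.out.println(parse("size discrete small, medium, large // ordered"));
        System.out.println(parse("salary continuous 0, 1000000"));
    }
}
```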

Reading in Data Files

You need to write some code that reads the examples.names and examples.data files, making sure your code conforms to the required format specifications (see point #7 of Requirements and the Creating the Feature Names & Values File section). Specifically, your code should contain a main function that takes two arguments:

HW0 examples.names examples.data

Notice that the names of these two files are passed in; this will allow us to use different files to test your code during grading.
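
A minimal sketch of such an entry point follows. It only checks the argument count and loads the lines of both files; the checkArgs helper and the usage message are hypothetical, and the course's own sample I/O code may take a different approach.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Skeleton entry point (illustrative only): java HW0 examples.names examples.data */
final class HW0 {

    /** Returns null when exactly two arguments are given, else a usage message. */
    static String checkArgs(String[] args) {
        return args.length == 2 ? null : "Usage: java HW0 <names file> <data file>";
    }

    public static void main(String[] args) throws IOException {
        String problem = checkArgs(args);
        if (problem != null) {
            System.err.println(problem);
            System.exit(1);
        }
        // The file names come from the command line, so different files
        // can be supplied without changing the code.
        List<String> namesLines = Files.readAllLines(Paths.get(args[0]));
        List<String> dataLines = Files.readAllLines(Paths.get(args[1]));
        System.out.println(namesLines.size() + " lines read from " + args[0]);
        System.out.println(dataLines.size() + " lines read from " + args[1]);
    }
}
```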

Your HW0 code should read the names file (you can assume it is legally formatted), and write out (one line per feature) the name, type, and legal values for each feature in the names file. Finally, it should read the examples file, report any illegal feature values, and report how many examples are in the file for each of the two categories.
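
The checking and counting steps described above might look roughly like the Java sketch below. It assumes each example has already been split into a list of string values, that a continuous feature's possibleValues holds its minimum followed by its maximum, and that the output category sits at a known position in each example. ExampleChecker and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (illustrative only): validates values and counts categories. */
final class ExampleChecker {

    /**
     * For a continuous feature, the value must parse as a number within
     * [min, max]; for any other type, it must be one of the listed values.
     */
    static boolean isLegal(String type, List<String> possibleValues, String value) {
        if (type.equals("continuous")) {
            try {
                double x = Double.parseDouble(value);
                double min = Double.parseDouble(possibleValues.get(0));
                double max = Double.parseDouble(possibleValues.get(1));
                return x >= min && x <= max;
            } catch (NumberFormatException e) {
                return false; // not a number at all
            }
        }
        return possibleValues.contains(value);
    }

    /** Counts examples per category; outputIndex is the category's position. */
    static Map<String, Integer> countCategories(List<List<String>> examples, int outputIndex) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> example : examples) {
            counts.merge(example.get(outputIndex), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sizes = Arrays.asList("small", "medium", "large");
        List<String> salaryRange = Arrays.asList("0", "1000000");
        System.out.println(isLegal("discrete", sizes, "huge"));          // false
        System.out.println(isLegal("continuous", salaryRange, "52000")); // true

        List<List<String>> examples = Arrays.asList(
                Arrays.asList("small", "52000", "yes"),
                Arrays.asList("large", "9000", "no"),
                Arrays.asList("medium", "100", "yes"));
        System.out.println(countCategories(examples, 2)); // {no=1, yes=2}
    }
}
```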

The exact format of your output is not terribly important. The main points of this exercise are to give you some "refresher" experience with Java and to make sure you can read in the data files for subsequent homeworks.

Here are a sample names file (examples.names) and data file (examples.data), as well as sample code that shows how to do I/O in Java.

What to Turn In

To hand in your work, copy your dataset and "names" file, HW0.java, HW0.class and Makefile to:

~cs760-1/handin/{your login name}/HW0

using your actual CS login name in place of {your login name}. Put any comments you need to make in a file named README, and copy it into the hand-in directory as well.

Please note that you must place your dataset and "names" file as well as your *.java and *.class files in this directory in order for the TA to execute and grade your code.