CS 760: Homework 0
Creating and parsing your personal dataset


Creating Your Personal Dataset

Many learning systems work by processing a "training" set of labeled examples, where the examples are members and non-members of the concept the system is supposed to learn. Through the first half of the semester you will use the same database of examples, so that the relative performance of the various systems you develop can be empirically judged.

This assignment introduces you to the problems of representing "real-world" data as a fixed-length feature vector. You will use your dataset and data-file reader in subsequent homeworks.

Your first assignment is to:

  1. Create your own database from any "real-world" dataset outside of the UC-Irvine archive (http://www.ics.uci.edu/~mlearn/MLRepository.html). You can use your own dataset or one from the sites listed below; treat the UC-Irvine archive as an example of the type of dataset you are to build:
  2. Convert your chosen dataset into the format required for CS 760 (see the Requirements and Creating the Feature Names & Values File sections).
  3. Write some code to read in the examples.names and examples.data files, making sure your code conforms to the required format (see point #7 of the Requirements section and the Creating the Feature Names & Values File section).

Requirements

Choose a dataset that is of interest to you, ideally one where you have a reasonable idea of the meaning of the features used to describe the data. You will need to reformulate your dataset to meet the following specifications:
  1. Take a "real-world" dataset where the function being learned isn't "obvious" (i.e., data that cannot easily be classified into the appropriate class or category without any error).
  2. The function being learned should be binary-valued. It is OK to choose a multi-valued task, but you'll then have to decide how to reformulate it into two categories.
  3. Your dataset should have at least 25 examples for category A and 25 for category B (i.e., you should have at least 50 training examples), although more is better.
  4. The dataset you choose should contain at least three numeric-valued features and at least three discrete-valued features. If necessary, convert some numeric features into discrete features (e.g., a numeric size value into one of {small, medium, large}); do not worry too much just where to make the "cut points."
  5. Any other features you include should have discrete or continuous values, for example, any of the following types: boolean (binary values, such as yes/no or 0/1), nominal (the possible values of the feature have no relationship among them, e.g. color={black, white, purple}), or ordered (the possible values of the feature are totally ordered, e.g. size={small, medium, large} in the discrete case or salary=[0..1000000] in the continuous case).
  6. You should delete all examples containing "missing" feature values (you should still have at least 50 examples after discarding them). If after deleting you no longer have 50 examples, then quasi-randomly replace some of the missing values; see the TA or me if you need to do this.
  7. You should use the UC-Irvine data-file format for your CS 760 dataset, which describes each example as a comma-separated list of feature values whose ordering matches that in the *.names file described below. We'll be creating our own datasets for use when grading subsequent homeworks, so your code should be able to read any data file in the UC-Irvine format. (Be sure your code allows any type of white space between the values of an example's features. Also allow Java-style "//" comments; you needn't worry about in-line comments, i.e., "/* */".)
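
As an illustration of point #7, here is a minimal Java sketch of how one line of a data file might be split into feature values. The class and method names (DataLineParser, stripComment, parseExample) are hypothetical, and the sketch assumes that feature values never contain white space, commas, or the characters "//"; it is one possible approach, not the required implementation.

```java
import java.util.ArrayList;
import java.util.List;

class DataLineParser {
    /** Removes a trailing "//" comment, if any, and trims surrounding white space. */
    static String stripComment(String line) {
        int idx = line.indexOf("//");
        return (idx >= 0 ? line.substring(0, idx) : line).trim();
    }

    /** Splits one example into its feature values; blank or comment-only lines yield an empty list. */
    static List<String> parseExample(String line) {
        List<String> values = new ArrayList<>();
        String body = stripComment(line);
        if (body.isEmpty()) {
            return values;
        }
        // Assumption: commas and any run of white space both act as separators.
        for (String token : body.split("[,\\s]+")) {
            if (!token.isEmpty()) {
                values.add(token);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up example line with mixed white space and a trailing comment.
        System.out.println(parseExample("5.1,  red ,\tlarge, yes  // first example"));
        // prints [5.1, red, large, yes]
    }
}
```

Treating white space as a separator is a design choice made here for simplicity; a stricter reader could split on commas only and trim each value.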

Creating the Feature Names & Values File

Most of the UC-Irvine databases have a *.names file that describes the features used to represent examples. However, these "names" files aren't all formatted in the same manner, though the information about the feature names and possible values is usually item #7 in these files. In CS 760 we'll need to standardize the *.names file. The format of this file should be as follows:

featureName featureType possibleValues

where

Java-style "//" comments can appear anywhere in the *.names file. You need not deal with "/* */" style of comments. See examples.names for an example.
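
The following Java sketch shows one way a single line of the *.names file could be parsed into the three parts listed above (featureName, featureType, possibleValues). It rests on assumptions not stated on this page: that the parts are separated by white space or commas, and that a type keyword such as "nominal" is legal. The class name FeatureSpec is hypothetical; consult examples.names for the actual conventions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class FeatureSpec {
    final String name;
    final String type;
    final List<String> possibleValues;

    FeatureSpec(String name, String type, List<String> possibleValues) {
        this.name = name;
        this.type = type;
        this.possibleValues = possibleValues;
    }

    /** Parses one names-file line; blank or comment-only lines yield an empty Optional. */
    static Optional<FeatureSpec> parse(String line) {
        // Drop a Java-style "//" comment, which may appear anywhere in the file.
        int idx = line.indexOf("//");
        String body = (idx >= 0 ? line.substring(0, idx) : line).trim();
        if (body.isEmpty()) {
            return Optional.empty();
        }
        // Assumption: tokens are separated by white space and/or commas.
        String[] tokens = body.split("[,\\s]+");
        if (tokens.length < 2) {
            throw new IllegalArgumentException("Expected at least a name and a type: " + line);
        }
        List<String> values = new ArrayList<>(Arrays.asList(tokens).subList(2, tokens.length));
        return Optional.of(new FeatureSpec(tokens[0], tokens[1], values));
    }

    @Override
    public String toString() {
        return name + " " + type + " " + possibleValues;
    }

    public static void main(String[] args) {
        // Made-up line; the "nominal" keyword is an assumption for illustration.
        parse("color nominal black white purple // a made-up feature")
            .ifPresent(System.out::println);
        // prints color nominal [black, white, purple]
    }
}
```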

Reading in Data Files

You need to write some code that reads the examples.names and examples.data files and conforms to the required format (see point #7 of the Requirements section and the Creating the Feature Names & Values File section). Specifically, your code should contain a main function that takes two arguments:
HW0 examples.names examples.data
Notice that the names of these two files are passed in; this will allow us to use different files to test your code during grading.

Your HW0 code should read the names file (you can assume it is legally formatted), and write out (one line per feature) the name, type, and legal values for each feature in the names file. Finally, it should read the examples file, report any illegal feature values, and report how many examples are in the file for each of the two categories.
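
The checking and counting steps described above might be sketched in Java as follows. This is only an illustration under stated assumptions: the examples have already been split into lists of strings, the category label is the last value of each example (this page does not say where the label appears), and the names ExampleChecker, findIllegalValues, and countCategories are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ExampleChecker {
    /** Returns one message per example whose value in the given column is not a legal value. */
    static List<String> findIllegalValues(List<List<String>> examples, int column, Set<String> legalValues) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < examples.size(); i++) {
            String value = examples.get(i).get(column);
            if (!legalValues.contains(value)) {
                problems.add("example " + (i + 1) + ": illegal value '" + value + "' in column " + column);
            }
        }
        return problems;
    }

    /** Counts examples per category. Assumption: the label is the last value of each example. */
    static Map<String, Integer> countCategories(List<List<String>> examples) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> example : examples) {
            String label = example.get(example.size() - 1);
            counts.merge(label, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up examples: color, size, category label.
        List<List<String>> examples = List.of(
            List.of("red", "small", "yes"),
            List.of("blue", "large", "no"),
            List.of("teal", "small", "yes"));
        System.out.println(findIllegalValues(examples, 0, Set.of("red", "blue")));
        // prints [example 3: illegal value 'teal' in column 0]
        System.out.println(countCategories(examples));
        // prints {yes=2, no=1}
    }
}
```

A numeric feature would need a different check (for example, parsing the value as a number), which this sketch leaves out.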

The exact format of your output is not terribly important. The main points of this exercise are to give students some "refresher" experience with Java and to make sure you can read in the data files for subsequent homeworks.

Here are a sample names file (examples.names) and data file (examples.data), as well as sample code that shows how to do I/O in Java.

What to Turn In

  1. To hand in your work, copy your dataset and "names" file, HW0.java, HW0.class and Makefile to:

    ~cs760-1/handin/{your login name}/HW0

    using your actual CS login name in place of {your login name}. Put any comments you need to make in a file named README, and copy it into the same directory.

    Please note that you must place your dataset and "names" file as well as your *.java and *.class files in this directory in order for the TA to execute and grade your code.

  2. Turn in a written description that briefly explains the dataset you chose and describes how you modified it. Include a description of each feature and its possible values, but don't turn in a printout of your entire dataset.