CS 760: Homework 0
Creating and parsing your personal dataset
-
Assigned: Friday, January 22, 2010.
-
Due: Monday, February 1, 2010, 4pm.
Homework 0: 50 points
PLEASE INCLUDE A PHOTO OF YOURSELF, AT LEAST 3x5 INCHES, WHEN YOU TURN IN YOUR HW 0.
PLEASE WRITE YOUR NAME, DEPARTMENT, AND YEAR
(e.g., 2nd-year graduate student in CS) ON THE BACK OF YOUR PHOTO; ALSO
LIST AI CLASSES YOU HAVE PREVIOUSLY TAKEN. MAKE SURE YOUR
PHOTO IS DETACHABLE SO I CAN KEEP IT WHEN I RETURN HW0.
Creating Your Personal Dataset
Many learning systems work by processing a "training" set of labeled
examples,
where the examples are members and non-members of the concept the system
is supposed to learn. Through the first half of the semester you will use
the same database of examples, so that the relative performance of the
various systems you develop can be empirically judged.
This assignment serves to introduce you to the problems of representing
data from the "real-world" in a fixed-length feature vector. You
will use your dataset and data-file reader in subsequent homeworks.
Your first assignment is to:
-
Create your own database from any "real-world" dataset except
those in the
UC-Irvine
archive (http://www.ics.uci.edu/~mlearn/MLRepository.html). You can
use your own dataset or one from the sites listed below, using
the
UC-Irvine
archive as an example of the type of dataset you are to build:
-
Convert your chosen dataset into the format required for CS 760 (see the
Requirements and Creating the Feature Names & Values File
sections).
-
Write some code to read in the examples.names
and examples.data
files, making sure your code conforms to the required format
specifications (see the data-file format point under Requirements and the
Creating the Feature Names & Values File section).
Requirements
Choose a dataset that is of interest to you, hopefully one where you have
a reasonable idea of the meaning of the features used to describe the data.
You will need to reformulate your dataset to meet the following specifications:
-
Take a "real-world" dataset where the function being learned isn't "obvious"
(i.e., data that cannot easily be classified into the appropriate class
or category without error).
-
The function being learned should be binary-valued. It is OK to choose
a multi-valued task, but you'll then have to decide how to reformulate
it into two categories.
-
Your dataset should have approximately 500 examples for category A and
500 for category B (i.e., you should have about 1000 training examples;
randomly delete examples if the dataset you choose has more than 1000).
-
The dataset you choose should contain at least three numeric-valued
features and at least three discrete-valued features. If necessary,
convert some numeric features into discrete features (e.g., a numeric size
value into one of {small, medium, large}); do not worry too much about exactly
where to place the "cut points."
-
Any other features you include should have discrete or continuous values,
for example, any of the following types: boolean (binary values, such as yes/no
or 0/1), nominal (the possible values of the feature have no relationship among
them, e.g. color={black, white, purple}), or ordered (the possible values of
the feature are totally ordered, e.g. size={small, medium, large}
in the discrete case or salary=[0..1000000] in the continuous case).
-
For your discrete features, aim to have no more than a dozen possible values.
For example, instead of using the names of the 50 US states as the
value for, say, birthplace, use something like {Northeast, MidAtlantic, South,
Midwest, MountainWest, Pacific, AlaskaAndHawaii}.
-
You should delete all examples containing "missing" feature values
(you should have 1000 examples after discarding examples with missing
values). If you no longer have 1000 examples after deleting, then
quasi-randomly replace some of the missing values instead; see the TA or me
if you need to do this.
-
You should use the UC-Irvine data-file format for your CS 760 dataset,
which describes each example as a comma-separated list of feature values,
whose ordering matches that in the *.names file described below.
We'll be creating our own datasets for use when grading subsequent homeworks,
and your code should be able to read any data file formatted in the
UC-Irvine format. (Be sure your code allows any type of white space
between the values of the example's features. Also allow the Java "//"
type of comments, but you needn't worry about in-line, i.e., "/* */", comments.)
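As a rough illustration of the last point, here is one way a single line of a data file might be split into feature values. This is only a sketch under stated assumptions: the class and method names (DataLineSketch, stripComment, parseExample) are hypothetical and are not part of the course's sample code, and the sample line in main is made up. It assumes that values are separated by commas with optional white space around them, and that no feature value itself contains "//".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of the course's sample code. */
final class DataLineSketch {

    /** Removes a trailing "//" comment, if any, and trims surrounding white space. */
    static String stripComment(String line) {
        int start = line.indexOf("//");
        String content = (start >= 0) ? line.substring(0, start) : line;
        return content.trim();
    }

    /**
     * Splits one example into its feature values.
     * Returns an empty list for blank or comment-only lines.
     */
    static List<String> parseExample(String line) {
        List<String> values = new ArrayList<>();
        String content = stripComment(line);
        if (content.isEmpty()) {
            return values;
        }
        for (String token : content.split(",")) {
            values.add(token.trim()); // tolerate spaces and tabs around each value
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up line mixing spaces, a tab, and a trailing comment.
        System.out.println(parseExample("  small ,\t37.5,  yes   // a made-up example"));
        // Comment-only lines yield no values.
        System.out.println(parseExample("// comment-only line"));
    }
}
```

Run as written, main prints [small, 37.5, yes] followed by []. Trimming each token after splitting on commas keeps the white-space handling in one place.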
Creating the Feature Names & Values File
Most of the UC-Irvine databases have a *.names file that describes
the features used to represent examples. However, these "names" files aren't
all formatted in the same manner, though the information about
feature names and possible values is usually item #7 in these files. In CS
760 we'll need to standardize the
*.names file. The format of
this file should be as follows:
featureName featureType possibleValues
where
-
featureName is a token (i.e., a character string without
quotes or internal spaces)
-
featureType is either discrete or continuous (for
the output category, use output as the feature type)
-
possibleValues is a comma-separated list of tokens specifying (a)
the possible values for a discrete-valued feature or (b) the minimum
and maximum values for a continuous-valued feature
Java-style "//" comments can appear anywhere in the *.names file.
You need not deal with the "/* */" style of comments. See
examples.names
for an example.
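The three-field format above could be parsed along the following lines. This is a minimal sketch, not the course's reference code: the FeatureSpec class and its parse method are hypothetical names, and the two lines parsed in main are made up (they reuse the size and salary features mentioned under Requirements) rather than taken from examples.names. The sketch assumes each feature occupies one line and that white space may follow the commas in possibleValues.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical container for one line of a *.names file. */
final class FeatureSpec {
    final String name;
    final String type;          // "discrete", "continuous", or "output"
    final List<String> values;  // legal values, or {min, max} when continuous

    FeatureSpec(String name, String type, List<String> values) {
        this.name = name;
        this.type = type;
        this.values = values;
    }

    /** Parses one line; returns null for blank or comment-only lines. */
    static FeatureSpec parse(String line) {
        int comment = line.indexOf("//");
        String content = (comment >= 0 ? line.substring(0, comment) : line).trim();
        if (content.isEmpty()) {
            return null;
        }
        // Split into at most three parts so the value list stays in one piece.
        String[] parts = content.split("\\s+", 3);
        if (parts.length < 3) {
            throw new IllegalArgumentException("Expected name, type, and values: " + line);
        }
        List<String> values = new ArrayList<>();
        for (String value : parts[2].split(",")) {
            values.add(value.trim());
        }
        return new FeatureSpec(parts[0], parts[1], values);
    }

    @Override
    public String toString() {
        return name + " " + type + " " + values;
    }

    public static void main(String[] args) {
        System.out.println(parse("size discrete small, medium, large // made-up feature"));
        System.out.println(parse("salary continuous 0, 1000000"));
    }
}
```

Run as written, main prints "size discrete [small, medium, large]" and then "salary continuous [0, 1000000]".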
Reading in Data Files
You need to write some code that reads the examples.names and
examples.data files, making sure your code conforms to the required
format specifications (see the data-file format point under Requirements
and the Creating the Feature Names & Values File
section). Specifically, your code should contain a main
function that takes two arguments:
HW0 examples.names examples.data
Notice that the names of these two files are passed in; this will allow
us to use different files to test your code during grading.
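A possible shape for such an entry point is sketched below. The class is named HW0Sketch only to mark it as an illustration (your submission's class must be HW0), and readContentLines is a hypothetical helper, not part of the course's sample code. The sketch assumes "//" never appears inside a feature value, and it simply reports how many non-comment lines each file contains.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Illustrative skeleton only; the real submission's class is HW0. */
final class HW0Sketch {

    /** Returns the non-blank lines of the input with "//" comments removed. */
    static List<String> readContentLines(BufferedReader reader) throws IOException {
        List<String> lines = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            int comment = line.indexOf("//");
            if (comment >= 0) {
                line = line.substring(0, comment);
            }
            line = line.trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // The two file names come from the command line, never hard-coded.
        if (args.length != 2) {
            System.err.println("Usage: java HW0Sketch <names-file> <data-file>");
            return;
        }
        try (BufferedReader names = Files.newBufferedReader(Paths.get(args[0]));
             BufferedReader data = Files.newBufferedReader(Paths.get(args[1]))) {
            System.out.println(readContentLines(names).size() + " feature lines");
            System.out.println(readContentLines(data).size() + " example lines");
        }
    }
}
```

Having the helper take a BufferedReader rather than a file name makes it easy to exercise with a StringReader before pointing it at real files.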
Your HW0 code should read the names file (you can assume it is legally
formatted), and write out (one line per feature) the name, type, and legal
values for each feature in the names file. Finally, it should read the
examples file, report any illegal feature values, and report how many examples
are in the file for each of the two categories.
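The checking and counting steps described above might look roughly like this. It is a sketch under assumptions, not a required design: ExampleCheckSketch, isLegal, and countCategories are hypothetical names, and the examples and the category labels A and B in main are made up. It assumes a continuous value is legal when it parses as a number within the inclusive [minimum, maximum] range from the names file, and that the caller knows which column holds the output category.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of the course's sample code. */
final class ExampleCheckSketch {

    /** Returns true if value is legal for a feature of the given type. */
    static boolean isLegal(String type, List<String> possibleValues, String value) {
        if (type.equals("continuous")) {
            try {
                double v = Double.parseDouble(value);
                double min = Double.parseDouble(possibleValues.get(0));
                double max = Double.parseDouble(possibleValues.get(1));
                return v >= min && v <= max; // assumption: range is inclusive
            } catch (NumberFormatException e) {
                return false; // not a number at all
            }
        }
        // "discrete" and "output" features must match a listed value exactly.
        return possibleValues.contains(value);
    }

    /** Counts the examples per output label; outputIndex is the label's column. */
    static Map<String, Integer> countCategories(List<List<String>> examples, int outputIndex) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> example : examples) {
            counts.merge(example.get(outputIndex), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sizes = Arrays.asList("small", "medium", "large");
        List<String> salaryRange = Arrays.asList("0", "1000000");
        System.out.println(isLegal("discrete", sizes, "huge"));          // false
        System.out.println(isLegal("continuous", salaryRange, "52000")); // true

        // Made-up examples: size, salary, output category.
        List<List<String>> examples = new ArrayList<>();
        examples.add(Arrays.asList("small", "52000", "A"));
        examples.add(Arrays.asList("large", "7000", "B"));
        examples.add(Arrays.asList("medium", "99000", "A"));
        System.out.println(countCategories(examples, 2));                // {A=2, B=1}
    }
}
```

A LinkedHashMap is used so the categories are reported in the order they first appear in the data file.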
The exact format of your output is not terribly important. The main
points of this exercise are to give students some "refresher" experience
with Java and to make sure you can read in the data files for subsequent
homeworks.
Here are a sample names file (examples.names)
and a sample data file (examples.data),
as well as sample code that shows how to do I/O
in Java.
What to Turn In
-
To hand in your work, copy your dataset and "names" file, HW0.java, HW0.class
and Makefile to:
~cs760-1/handin/{your login name}/HW0
using your actual CS login name in place of {your login name}. Put any
comments you need to make in a file named README, and copy that file
into the handin directory as well.
Please note that you must place your dataset and "names" file as well
as your *.java and *.class files in this directory in order for the TA
to execute and grade your code.
-
Turn in a
written description that briefly explains the dataset you chose and describes
how you modified it. Include a description
of each feature and its possible values,
but don't turn in a printout of your entire dataset.
Also remember to attach the annotated photo mentioned above.