CS 760: Homework 0
Creating and parsing your personal dataset
-
Assigned: Friday, September 2, 2011.
-
Due: Friday, September 9, 2011, 11am.
Creating Your Personal Dataset
Many learning systems work by processing a "training" set of labeled
examples, where the examples are members and non-members of the concept the system
is supposed to learn. Through the first half of the semester you will use
the same database of examples, so that the relative performance of the
various systems you develop can be empirically judged.
This assignment introduces you to the problems of representing
"real-world" data as a fixed-length feature vector. You
will use your dataset and data-file reader in subsequent homeworks.
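As a rough illustration of what "fixed-length" means here, the sketch below maps a loosely structured record onto a fixed, ordered list of features. The record, the feature names, and the to_vector helper are all hypothetical; the question mark for an absent value follows the missing-value convention this assignment asks for.

```python
# Hypothetical raw record, as it might arrive from a real-world source.
raw = {"fuel": "diesel", "mileage_km": "120500", "notes": "one owner"}

# Hypothetical schema: every example gets exactly these slots, in this order.
FEATURES = ["fuel", "mileage_km", "age_years"]

def to_vector(record):
    """Return one value per feature; a feature absent from the record becomes '?'."""
    return [record.get(name, "?") for name in FEATURES]

vector = to_vector(raw)
print(vector)  # ['diesel', '120500', '?']
```

Note that the free-text "notes" field has no slot in the schema and is simply dropped, which is one of the representation problems you will need to make decisions about.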
Your first assignment is to:
-
Create your own database from any "real-world" dataset outside of the
UC-Irvine archive (http://www.ics.uci.edu/~mlearn/MLRepository.html). You can
use your own dataset or one from the sites listed below, using the UC-Irvine
archive as an example of the type of dataset you are to build. The last
feature in your dataset should be the one you would like to predict; this feature
should be binary. Your dataset should contain a mix of nominal and continuous
(numerical) features. Missing values, if any, should be denoted by a question mark.
-
Read about the machine learning toolbox Weka.
Convert your chosen dataset into Weka's .arff format.
Download Weka onto a computer of your choice, and start the Weka Explorer GUI.
Try doing classification using the supervised machine learning program J48 within Weka.
Run J48 on two of the example datasets that come with Weka, as well as on your own dataset.
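The dataset requirements above (nominal and numeric features, a binary feature to predict in the last position, question marks for missing values) can be sketched as a small script that writes an .arff file. This is only a minimal sketch: the relation name, attribute names, and rows are made up for illustration, and it assumes names and values contain no spaces or commas, which would otherwise need quoting in ARFF.

```python
# Hypothetical schema: a list of strings marks a nominal feature with those
# allowed values; the string "numeric" marks a continuous feature.
ATTRIBUTES = [
    ("fuel", ["gas", "diesel"]),
    ("mileage_km", "numeric"),
    ("age_years", "numeric"),
    ("sold", ["yes", "no"]),  # binary feature to predict, listed last
]

# Made-up examples; None marks a missing value.
ROWS = [
    ("gas", 42000, 6.5, "yes"),
    ("diesel", 120500, None, "no"),
    (None, 88000, 9.0, "no"),
]

def to_arff(relation, attributes, rows):
    """Build ARFF text: a header of @attribute lines, then one CSV line per row."""
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append("@attribute %s numeric" % name)
        else:
            lines.append("@attribute %s {%s}" % (name, ",".join(kind)))
    lines += ["", "@data"]
    for row in rows:
        # Missing values are written as a question mark.
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"

arff_text = to_arff("used_cars", ATTRIBUTES, ROWS)
print(arff_text)
```

Saving arff_text to a file with an .arff extension should give something the Weka Explorer can open, though you should check the result against the ARFF description in the Weka documentation rather than rely on this sketch.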