CS 760: Homework 0
Creating and parsing your personal dataset

Assigned: Friday, September 2, 2011.
Due: Friday, September 9, 2011, 11am.

Creating Your Personal Dataset

Many learning systems work by processing a "training" set of labeled examples, where the examples are members and non-members of the concept the system is supposed to learn. Throughout the first half of the semester you will use the same dataset of examples, so that the relative performance of the various systems you develop can be judged empirically.

This assignment introduces you to the problems of representing "real-world" data as a fixed-length feature vector. You will use your dataset and data-file reader in subsequent homeworks.

Your first assignment is to:

Create your own database from any "real-world" dataset outside of the UC-Irvine archive (http://www.ics.uci.edu/~mlearn/MLRepository.html). You can use your own dataset or one from the sites listed below; use the UC-Irvine archive as an example of the type of dataset you are to build. The last feature in your dataset should be the one you would like to predict, and this feature should be binary. Your dataset should contain a mix of nominal and continuous (numerical) features. If there are missing values, denote them with a question mark.
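The requirements above can be checked mechanically before you go any further. The following is a minimal Python sketch, not part of the assignment: it assumes the dataset has already been loaded as a list of rows of strings, and the helper names (is_number, infer_feature_kinds, check_dataset) are hypothetical. It treats a column as continuous when every non-missing value parses as a number, which is only a heuristic.

```python
MISSING = "?"  # the assignment's marker for a missing value


def is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def infer_feature_kinds(rows):
    """Label each column 'continuous' or 'nominal', ignoring missing values."""
    kinds = []
    for col in range(len(rows[0])):
        present = [row[col] for row in rows if row[col] != MISSING]
        numeric = bool(present) and all(is_number(v) for v in present)
        kinds.append("continuous" if numeric else "nominal")
    return kinds


def check_dataset(rows):
    """Return a list of problems; an empty list means the checks passed."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        return ["rows are not all the same length"]
    problems = []
    # The last column is the feature to predict and must be binary.
    labels = {row[-1] for row in rows if row[-1] != MISSING}
    if len(labels) != 2:
        problems.append("last feature has %d distinct values, expected 2" % len(labels))
    # The remaining columns should mix nominal and continuous features.
    if set(infer_feature_kinds(rows)[:-1]) != {"nominal", "continuous"}:
        problems.append("features are not a mix of nominal and continuous")
    return problems


if __name__ == "__main__":
    # Hypothetical toy rows: age (continuous), color (nominal), label (binary).
    toy = [["23", "red", "yes"], ["?", "blue", "no"], ["41.5", "red", "yes"]]
    print(check_dataset(toy))
```

A numeric-looking code such as a zip code would be classified as continuous by this heuristic, so the inferred kinds are worth reviewing by hand.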

Read about the machine learning toolbox Weka. Convert your chosen dataset into Weka's .arff format. Download Weka onto a computer of your choice and start the Weka Explorer GUI. Try classification using the supervised machine learning program J48 within Weka. Run J48 on two of the example datasets that come with Weka, as well as on your own dataset.
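As an illustration of the conversion step, here is a minimal Python sketch that writes a small table in the basic .arff layout: an @relation line, one @attribute line per feature (numeric, or a brace-enclosed list of nominal values), then @data followed by comma-separated rows. The function name to_arff and the toy dataset are hypothetical, and the sketch assumes that names and values contain no spaces, commas, or quotes, which would otherwise need quoting.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, spec) pairs, where spec is the string
    "numeric" or a list of the allowed nominal values.
    rows: list of lists; None marks a missing value and is written as "?".
    """
    lines = ["@relation " + relation, ""]
    for name, spec in attributes:
        if spec == "numeric":
            lines.append("@attribute " + name + " numeric")
        else:
            lines.append("@attribute " + name + " {" + ",".join(spec) + "}")
    lines.append("")
    lines.append("@data")
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical toy dataset; the binary feature to predict comes last.
    attributes = [
        ("age", "numeric"),
        ("color", ["red", "blue"]),
        ("label", ["yes", "no"]),
    ]
    rows = [[23, "red", "yes"], [None, "blue", "no"]]
    with open("toy.arff", "w") as handle:
        handle.write(to_arff("toy", attributes, rows))
```

The resulting file can then be opened from the Weka Explorer GUI to confirm that it loads before running a classifier on it.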
