CS 760 - Machine Learning
Homework 3
Assigned: March 15, 2010
Due: 4pm April 7, 2010
125 points
Perceptron Learning
Create a neural-network system that can learn from data formatted in
the same manner as used in HWs 1 and 2. For simplicity, you should
only create perceptrons - that is, neural networks with no hidden units
(see Section 4.4 of Mitchell, especially Table 4.1, but use
the sgn function to discretize outputs before comparing them
to the teacher's outputs).
Discuss how you will represent the data types in CS 760 "*.names" files
for use by a numeric optimization method such as a perceptron.
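One possible representation, sketched here only as a starting point for your discussion: rescale numeric features to [0,1] and give each value of a nominal feature its own input unit. The `feature_specs` structure is a hypothetical in-memory form of a "*.names" file, not the actual file format.

```python
def encode_example(values, feature_specs):
    """Turn one example into a list of floats for a perceptron.

    feature_specs is an assumed structure (not the real *.names syntax):
    each entry is ("numeric", lo, hi) or ("nominal", [allowed values]).
    """
    encoded = []
    for value, spec in zip(values, feature_specs):
        if spec[0] == "numeric":
            _, lo, hi = spec
            # Rescale to [0, 1]; a constant feature maps to 0.
            encoded.append((float(value) - lo) / (hi - lo) if hi > lo else 0.0)
        else:
            # One input unit per allowed value (one-hot encoding).
            encoded.extend(1.0 if value == v else 0.0 for v in spec[1])
    return encoded


specs = [("nominal", ["red", "green", "blue"]), ("numeric", 0.0, 10.0)]
print(encode_example(["green", 2.5], specs))
```

Other encodings (for example, a single unit for a two-valued feature) are equally defensible; say which one you chose and why.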
Your code should automatically adjust the learning rate (eta) using
the method discussed in Lecture 19; let k=100 but feel free to only do eta-adjustment
once per epoch if doing it every k examples makes your code run too slow (i.e.,
adjust eta after the first 100 examples per epoch, then hold eta constant for the remainder of the epoch). You should also update
the weights after each training example (i.e., perform
stochastic gradient descent - see Equation 4.10 and the footnote
to Table 4.1), and don't forget to adjust the
bias (i.e., threshold) by treating it as another weight.
Have your code report the current value of eta every 10 epochs and include a plot of these values in your report.
Your code should also use 20% of the training set as a tuning set,
for use in "early stopping" (see Lecture 19) to prevent overfitting.
Limit your runs to 1000 epochs (feel free to use a lower number if
runtime is an issue; in that case report eta more frequently).
Your code should also perform "weight decay" (Lecture 19).
We really should tune the parameter lambda, but for simplicity
set lambda to be 0.01.
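As a rough illustration, a per-example update with a thresholded output, the bias folded in as an extra weight, and a decay term might look like the sketch below. The decay form used here (subtracting lambda times the weight inside the update) is an assumption; use the form given in Lecture 19 if it differs. The eta-adjustment method is not shown, and all function names are illustrative.

```python
def sgn(value):
    """Return +1 for a non-negative net input, otherwise -1."""
    return 1 if value >= 0 else -1


def predict(weights, inputs):
    """inputs must already end with the constant -1 threshold input."""
    return sgn(sum(w * x for w, x in zip(weights, inputs)))


def sgd_step(weights, inputs, target, eta, lam=0.01):
    """Update the weights in place after a single training example."""
    error = target - predict(weights, inputs)  # 0, +2, or -2
    for i, x in enumerate(inputs):
        # Error-driven term plus a decay term that shrinks every weight,
        # including the bias weight (assumed form of weight decay).
        weights[i] += eta * (error * x - lam * weights[i])
    return weights


# Usage: run a few epochs on logical OR with targets in {-1, +1}.
data = [([0, 0, -1], -1), ([0, 1, -1], 1), ([1, 0, -1], 1), ([1, 1, -1], 1)]
weights = [0.0, 0.0, 0.0]
for epoch in range(50):
    for inputs, target in data:
        sgd_step(weights, inputs, target, eta=0.1)
```

Because the last input is always -1, the last weight plays the role of the threshold and is updated like any other weight.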
Have your code report the epoch chosen by early stopping
and report for the chosen perceptron state (i.e., the weight and bias values)
all those weights whose magnitude is larger than 0.1.
Be sure to also report the feature associated with the weights printed out.
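The tuning-set split, early stopping, and weight report described above could be organized roughly as follows. This is a sketch under assumptions: `train_epoch` and `tune_accuracy` are hypothetical callbacks standing in for your own training and evaluation code, and ties are broken in favor of the earliest epoch.

```python
import copy
import random


def split_tuning_set(examples, fraction=0.2, seed=0):
    """Hold out `fraction` of the training examples as a tuning set."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_tune = int(round(fraction * len(shuffled)))
    return shuffled[n_tune:], shuffled[:n_tune]


def train_with_early_stopping(weights, train_epoch, tune_accuracy, max_epochs=1000):
    """Return (best_epoch, best_weights) judged by tuning-set accuracy.

    train_epoch(weights) updates the weights in place for one epoch;
    tune_accuracy(weights) returns the accuracy on the tuning set.
    """
    best_epoch, best_acc = 0, tune_accuracy(weights)
    best_weights = copy.copy(weights)
    for epoch in range(1, max_epochs + 1):
        train_epoch(weights)
        acc = tune_accuracy(weights)
        if acc > best_acc:  # strict '>' keeps the earliest best epoch
            best_epoch, best_acc = epoch, acc
            best_weights = copy.copy(weights)
    return best_epoch, best_weights


def report_large_weights(weights, feature_names, threshold=0.1):
    """Return (name, weight) pairs whose magnitude exceeds the threshold."""
    return [(n, w) for n, w in zip(feature_names, weights) if abs(w) > threshold]
```

Saving a copy of the weights at the best epoch is what lets you report the chosen perceptron state after all epochs have run.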
10-fold Cross Validation
Run your perceptron code on your personal dataset using the same ten folds
you used in previous homeworks. Do a t-test comparison to the best method
from your previous homeworks. Report and discuss the results.
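Assuming a paired, two-tailed t-test over the ten per-fold test-set accuracies, the statistic can be computed as in this sketch. The function name is illustrative, and the input lists are whatever accuracies your own runs produce.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples, e.g. per-fold accuracies of two methods.

    Raises ZeroDivisionError if every paired difference is identical.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (std_diff / math.sqrt(len(diffs)))
```

With ten folds there are 9 degrees of freedom, so a difference is significant at the two-tailed 95% level when the absolute value of t exceeds 2.262.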
Using a Gaussian Kernel
In this part of the homework, you will use a Gaussian kernel to get a non-linear separating
surface. Instead of using linear or quadratic programming to solve the resulting optimization
task, you will simply use the gradient-descent method you developed for the first part of this
homework; that is, a perceptron with weight decay.
To accomplish this, you will use a Gaussian kernel to create the "features"
for a perceptron (see lectures 21-22). For each fold in cross validation,
do this as follows (this is only one of many valid experimental designs):
- Normalize all the features of your examples to be in [0,1].
- Randomly choose 10% of the training examples. Call these "exemplars."
- Randomly select another (disjoint set of) 10% of the training examples for a tuning set.
Call this set kernelTune.
- The similarity to each exemplar will be the features given to the perceptron.
(Often in kernel-based approaches, all training examples
are used as these "exemplars," but to reduce runtime we will be using
the "reduced SVM" idea of Lee and Mangasarian mentioned in class.)
- We will use the Gaussian kernel as the similarity between two examples, A and B, where Ai and Bi are the ith feature of the examples:
kernel(A, B) = exp {- [ SUM (Ai - Bi)^2] / sigma^2}
- We will need to tune the value for sigma. We will simply
try {0.03, 0.1, 0.3, 1, 3, 10}.
For each candidate value of sigma, (1) create a dataset using the resulting kernel
and all examples except the tuning examples in kernelTune,
(2) have your perceptron code learn on it, and (3) evaluate the perceptron
state chosen by "early stopping" on the set kernelTune.
(Note that we are using two tuning sets in this design, one for choosing sigma
and one for deciding when to stop training.)
Give your "features" meaningful names like "similarityToPosExample5."
Report the tune-set accuracies for each of the possible values for sigma and
report the perceptron (only those kernel-features for which the absolute value of the weight is greater than 0.1) that has the highest accuracy on kernelTune.
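The per-fold steps above can be sketched as follows. The function names are illustrative, examples are assumed to be lists of features already normalized to [0,1], and the perceptron training call for each sigma is left to your own code.

```python
import math
import random


def gaussian_kernel(a, b, sigma):
    """kernel(A, B) = exp(-sum_i (A_i - B_i)^2 / sigma^2)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / sigma ** 2)


def split_for_kernel(train, fraction=0.1, seed=0):
    """Return (exemplars, kernel_tune, learn_set) for one fold.

    The two samples are disjoint; learn_set is everything except
    kernel_tune, so the exemplars stay in it.
    """
    shuffled = train[:]
    random.Random(seed).shuffle(shuffled)
    k = max(1, int(round(fraction * len(shuffled))))
    exemplars, kernel_tune = shuffled[:k], shuffled[k:2 * k]
    learn_set = shuffled[:k] + shuffled[2 * k:]
    return exemplars, kernel_tune, learn_set


def kernel_features(example, exemplars, sigma):
    """One perceptron input per exemplar: the similarity to that exemplar."""
    return [gaussian_kernel(example, e, sigma) for e in exemplars]


SIGMAS = [0.03, 0.1, 0.3, 1, 3, 10]  # candidate values to try on kernelTune
```

For each value in `SIGMAS`, you would map every example in `learn_set` and `kernel_tune` through `kernel_features`, train on the former, and score the early-stopped perceptron on the latter.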
10-fold Cross Validation
Run your "kernel" code on your personal dataset using the same ten folds
you used in previous homeworks. Do a t-test comparison to the best method
from your previous homeworks and also do a t-test comparison
to your non-kernel perceptron. Report and discuss the results.
An Additional Experiment of Your Own Choice
Choose any two of the following extensions, implement them, and report on them,
including t-test comparisons to the best algorithm on your personal concept
(the two approaches above as well as previous homeworks).
Be sure to briefly say why you chose the experiments you did.
There is no guarantee that all of these options are equally hard, but
you can get full credit regardless of which you choose, so select the ones that are most interesting to you.
We will not be auto-grading this portion of the HW.
- Boosted perceptrons for your non-kernel approach above (simply multiply the gradient by the example's current weight
or choose training examples for stochastic gradient descent proportional to their weight)
- Use ID3's information gain to scale the distances in the Gaussian kernel (such scaling
needs to be done separately on each fold of cross validation).
- Use voted perceptrons (see Lecture 20) for both your non-kernel and kernel experiments above.
- Neural networks with hidden units trained with backpropagation
- Use the kernel-based features created above with all of your previous homeworks (should be no need to
rewrite any old algorithms, which is why all is requested here)
- Create many random 'derived' features, convert your data into these features, and train a perceptron (with weight decay) on this representation.
One way to create random features is as follows. Repeat the following N times (N = 1000, say), creating one derived feature each time.
For each feature (including the '-1' feature for the threshold) in your dataset used in Part 1, randomly draw a weight in [-3,3],
and pass this weighted sum through the sigmoid function (i.e., the 'soft' step function). Don't forget to transform your TESTSET examples
the same way. Since these features are random, it is ok to do this ONCE for ALL folds, though also fine to create a fresh set
of random features for each fold.
- If you have taken a course like CS 525 (Intro to Linear Programming), use linear (or quadratic) programming to implement support vector machines
(if you choose this option, it is fine to use MATLAB, but you need to write your own code and not simply use some SVM package in MATLAB; it
is fine to use an existing LP or QP solver - e.g., see http://en.wikipedia.org/wiki/COIN-OR#CLP)
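For the random derived-features option, a minimal sketch might look like this. It assumes each example is a list of numeric inputs that already ends with the constant -1 threshold input; the function names and the seed parameter are illustrative.

```python
import math
import random


def make_random_features(n_inputs, n_derived=1000, seed=0):
    """Draw one weight in [-3, 3] per input for each derived feature."""
    rng = random.Random(seed)
    return [[rng.uniform(-3.0, 3.0) for _ in range(n_inputs)]
            for _ in range(n_derived)]


def sigmoid(z):
    """The 'soft' step function."""
    return 1.0 / (1.0 + math.exp(-z))


def transform(inputs, random_weights):
    """Map one example to its derived features.

    Apply this with the SAME random_weights to training, tuning, and
    test examples.
    """
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
            for ws in random_weights]
```

Fixing the seed makes the derived features reproducible, which is convenient if you generate them once and reuse them across all folds.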
Autograding
We will test your code on some datasets of our own.
The API for your code should be:
HW3 task.names train_examples.data test_examples.data useGaussianKernel
The last argument is Boolean-valued. If it is "true," use the
kernel-based approach developed for the second part of this
homework, otherwise use the perceptron, with weight decay,
using the task's features as the input units.
Your code should print out the test-set accuracy, as well as
the "names" of the miscategorized test-set examples. Recall that we name examples by their type and position in the testset, e.g., posTestEx1.
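If you were prototyping in Python, the argument handling and the final report might be sketched as below. The function names and the exact wording of the printed lines are assumptions; only the argument order and the example-naming convention come from the description above.

```python
def parse_args(argv):
    """Parse: task.names train_examples.data test_examples.data useGaussianKernel."""
    if len(argv) != 4:
        raise SystemExit("usage: HW3 task.names train_examples.data "
                         "test_examples.data useGaussianKernel")
    names_file, train_file, test_file, flag = argv
    # Anything other than "true" (case-insensitive) selects the plain perceptron.
    return names_file, train_file, test_file, flag.strip().lower() == "true"


def format_report(example_names, predictions, labels):
    """Build the test-set accuracy line and the list of misclassified examples."""
    wrong = [n for n, p, y in zip(example_names, predictions, labels) if p != y]
    accuracy = 1.0 - len(wrong) / len(labels)
    lines = ["Test-set accuracy: {:.2%}".format(accuracy)]
    lines += ["Misclassified: " + n for n in wrong]
    return "\n".join(lines)


print(format_report(["posTestEx1", "negTestEx1"], [1, 1], [1, -1]))
```

In a real run you would call `parse_args(sys.argv[1:])` and pass the resulting Boolean to whichever learner it selects.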
Requirements
Turn in a report of your experiments and a commented copy of the code
you wrote. Also turn in sample output from all your runs on your trainset/testset #1 (limit the sample output to one page per algorithm by editing the output your code produces).