Homework 3

Assigned: March 15, 2010

Due: 4pm April 7, 2010

125 points

Discuss how you will represent the data types in CS 760 "*.names" files for use by a numeric optimization method such as a perceptron.

Your code should automatically adjust the learning rate (eta) using
the method discussed in Lecture 19; let *k*=100, but feel free to do eta-adjustment
only once per epoch if doing it every *k* examples makes your code run too slowly (i.e.,
adjust eta after the first 100 examples of each epoch, then hold eta constant for the remainder of the epoch). You should also update
the weights after each training example (i.e., perform
stochastic gradient descent - see Equation 4.10 and the footnote
to Table 4.1), and don't forget to adjust the
bias (i.e., threshold) by treating it as another weight.
Have your code report the current value of eta every 10 *epochs* and include a plot of these values in your report.
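The per-example update described above might be sketched as follows. This is a minimal illustration, not the required implementation; the data layout (each example as a `(features, label)` pair with label in {0, 1}, and a constant -1 as `features[0]` so that `weights[0]` plays the role of the bias/threshold) is an assumption for the sketch.

```python
def sgd_epoch(weights, examples, eta):
    """One epoch of stochastic gradient descent for a perceptron:
    update the weights after EVERY training example (Equation 4.10 style).
    Assumed layout: each example is (features, label), label in {0, 1},
    and features[0] == -1 so weights[0] acts as the bias/threshold."""
    for features, label in examples:
        net = sum(w * x for w, x in zip(weights, features))
        output = 1 if net > 0 else 0
        error = label - output
        for i in range(len(weights)):
            weights[i] += eta * error * features[i]
    return weights
```

Because the bias is treated as just another weight attached to the constant -1 input, it is updated by the same loop as the ordinary weights.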

Your code should also use 20% of the training set as a tuning set, for use in "early stopping" (see Lecture 19) to prevent overfitting. Limit your runs to 1000 epochs (feel free to use a lower number if runtime is an issue; in that case report eta more frequently).
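The early-stopping loop can be organized roughly as below: train for up to the epoch limit, but return the weight state that scored best on the held-out tuning set. The `sgd_epoch` argument is an assumed one-epoch trainer (such as the sketch above), and the data layout is the same hypothetical `(features, label)` convention.

```python
def predict(weights, features):
    return 1 if sum(w * x for w, x in zip(weights, features)) > 0 else 0

def accuracy(weights, examples):
    return sum(predict(weights, f) == y for f, y in examples) / len(examples)

def train_with_early_stopping(sgd_epoch, train, tune, eta, max_epochs=1000):
    """Early stopping: run up to max_epochs, but keep (and return) the
    weight vector and epoch number that did best on the tuning set.
    `sgd_epoch(weights, examples, eta)` is an assumed one-epoch trainer;
    each example is (features, label) with features[0] == -1 for the bias."""
    weights = [0.0] * len(train[0][0])
    best_weights, best_epoch, best_acc = list(weights), 0, -1.0
    for epoch in range(1, max_epochs + 1):
        sgd_epoch(weights, train, eta)
        acc = accuracy(weights, tune)
        if acc > best_acc:
            best_weights, best_epoch, best_acc = list(weights), epoch, acc
    return best_weights, best_epoch
```

Copying the weight vector (`list(weights)`) at each improvement matters: the training loop keeps mutating `weights` in place after the best epoch has passed.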

Your code should also perform "weight decay" (Lecture 19). We really should tune the parameter lambda, but for simplicity set lambda to be 0.01.
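One common way to fold weight decay into the per-example update is to shrink each weight toward zero by a factor of (1 - eta*lambda) before applying the usual error-correction term. This sketch uses that form with the fixed lambda = 0.01 from the assignment; check Lecture 19 for the exact variant you are expected to use.

```python
def sgd_epoch_with_decay(weights, examples, eta, lam=0.01):
    """Per-example perceptron update with weight decay: each weight is
    multiplied by (1 - eta*lam) before the usual error correction.
    lam = 0.01 as fixed in the assignment. This particular decay form
    is one common choice, assumed here for illustration."""
    for features, label in examples:
        net = sum(w * x for w, x in zip(weights, features))
        output = 1 if net > 0 else 0
        error = label - output
        for i in range(len(weights)):
            weights[i] = (1 - eta * lam) * weights[i] + eta * error * features[i]
    return weights
```

Note that the decay term shrinks the weights even on correctly classified examples (error = 0), which is what pushes unneeded weights toward zero over many epochs.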

Have your code report the epoch chosen by early stopping and report for the chosen perceptron state (i.e., the weight and bias values) all those weights whose magnitude is larger than 0.1. Be sure to also report the features associated with the weights printed out.

To accomplish this, you will use a Gaussian kernel to create the "features" for a perceptron (see Lectures 21-22). For each fold in cross validation, do this as follows (this is only one of many valid experimental designs):

- Normalize all the features of your examples to be in [0,1].
- Randomly choose 10% of the training examples. Call these "exemplars."
- Randomly select another (disjoint set of) 10% of the training examples for a tuning set. Call this set *kernelTune*.
- The similarity to each exemplar will be the features given to the perceptron. (Often in kernel-based approaches, *all* training examples are used as these "exemplars," but to reduce runtime we will be using the "reduced SVM" idea of Lee and Mangasarian mentioned in class.)
- We will use the Gaussian kernel as the similarity between two examples, A and B, where Ai and Bi are the ith feature of the examples: kernel(A, B) = exp{ -[ SUM_i (Ai - Bi)^2 ] / sigma^2 }
- We will need to tune the value for *sigma*. We will simply try {0.03, 0.1, 0.3, 1, 3, 10}. For each candidate value of sigma, (1) create a dataset using the resulting kernel and all examples *except* the tuning examples in *kernelTune*, (2) have your perceptron code learn on it, and (3) evaluate the perceptron state chosen by "early stopping" on the set *kernelTune*. (Note that we are using two tuning sets in this design, one for choosing sigma and one for deciding when to stop training.) Give your "features" meaningful names like "similarityToPosExample5."
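The kernel-feature construction above can be sketched as follows. The `kernelize` helper and its leading -1 bias input are assumed conventions (matching Part 1), not prescribed by the handout; the kernel formula itself is the one given above.

```python
import math

def gaussian_kernel(a, b, sigma):
    """kernel(A, B) = exp(-[sum_i (Ai - Bi)^2] / sigma^2), as in the handout.
    a and b are feature vectors already normalized to [0, 1]."""
    return math.exp(-sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / sigma ** 2)

def kernelize(example, exemplars, sigma):
    """Map a (normalized) feature vector to its kernel-based representation:
    one feature per exemplar, i.e., the similarity to that exemplar, plus
    a leading -1 input for the bias weight (an assumed convention)."""
    return [-1.0] + [gaussian_kernel(example, e, sigma) for e in exemplars]
```

Each resulting feature is a similarity in (0, 1], so the derived dataset is already on a scale a perceptron handles comfortably.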

Report the tune-set accuracies for each of the possible values for sigma, and report the perceptron (only those kernel-features for which the absolute value of the weight is greater than 0.1) that has the highest accuracy on *kernelTune*.

There is no guarantee that all of these options are equally hard, but you can get full credit regardless of which you choose, so select the ones that are most interesting to you.

We will not be auto-grading this portion of the HW.

- Boosted perceptrons for your non-kernel approach above (simply multiply the gradient by the example's current weight,
or choose training examples for stochastic gradient descent proportional to their weight)
- Use ID3's information gain to scale the distances in the Gaussian kernel (such scaling
needs to be done separately on each fold of cross validation).
- Use voted perceptrons (see Lecture 20) for both your non-kernel and kernel experiments above.
- Neural networks with hidden units trained with backpropagation
- Use the kernel-based features created above with all of your previous homeworks (there should be no need to rewrite any old algorithms, which is why *all* is requested here)
- Create many *random* 'derived' features, convert your data into these features, and train a perceptron (with weight decay) on this representation. One way to create random features is as follows. *N* (1000, say) times, do the following to create a derived feature: for each feature (including the '-1' feature for the threshold) in your dataset used in Part 1, randomly draw a weight in [-3,3], then pass the resulting weighted sum through the sigmoid function (i.e., the 'soft' step function). Don't forget to transform your TESTSET examples the same way. Since these features are random, it is OK to do this ONCE for ALL folds, though it is also fine to create a fresh set of random features for each fold.
- If you have taken a course like CS 525 (Intro to Linear Programming), use linear (or quadratic) programming to implement support vector machines
(if you choose this option, it is fine to use Matlab, but you need to write your own code and not simply use some SVM package in Matlab; it
is fine to use an existing LP or QP solver - e.g., see http://en.wikipedia.org/wiki/COIN-OR#CLP)
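The random-derived-features option above might look like this in code. The function names and the convention that examples already carry the leading -1 threshold input are assumptions for the sketch.

```python
import math
import random

def make_random_features(n_derived, num_inputs, low=-3.0, high=3.0):
    """Draw n_derived random weight vectors, one per derived feature.
    num_inputs counts the original features PLUS the constant -1
    threshold input. Weights are drawn uniformly from [-3, 3]."""
    return [[random.uniform(low, high) for _ in range(num_inputs)]
            for _ in range(n_derived)]

def sigmoid(z):
    """The 'soft' step function."""
    return 1.0 / (1.0 + math.exp(-z))

def derive(example, random_weights):
    """Map an example (with the -1 threshold input already prepended)
    into the random derived-feature space: one sigmoided weighted sum
    per random weight vector."""
    return [sigmoid(sum(w * x for w, x in zip(ws, example)))
            for ws in random_weights]
```

Generating the weight vectors once and applying `derive` to both training and test examples keeps the transformation consistent, as the bullet above requires.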

```
HW3 task.names train_examples.data test_examples.data useGaussianKernel
```

The last argument is Boolean-valued. If it is "true," use the
kernel-based approach developed for the second part of this
homework; otherwise, use the perceptron, with weight decay,
using the task's features as the input units.
Your code should print out the test-set accuracy, as well as the "names" of the miscategorized test-set examples. Recall that we name examples by their type and position in the testset, e.g., posTestEx1.
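The required test-set report might be produced by something like the sketch below. The `(features, label, name)` triple layout is an assumed way of carrying each example's name alongside its data; only the naming convention (e.g., posTestEx1) comes from the handout.

```python
def report_test_results(weights, test_examples):
    """Print the test-set accuracy and the names of miscategorized
    test-set examples (named by type and position, e.g. posTestEx1).
    Assumed layout: each test example is (features, label, name),
    with features[0] == -1 for the bias weight."""
    wrong = [name for features, label, name in test_examples
             if (1 if sum(w * x for w, x in zip(weights, features)) > 0
                 else 0) != label]
    acc = 1 - len(wrong) / len(test_examples)
    print("Test-set accuracy: %.3f" % acc)
    for name in wrong:
        print("Miscategorized:", name)
    return acc, wrong
```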