CS 302, Summer 2012: Programming Assignment #2

Program 2, Summer 2012

  DUE by 11:59:59 PM on Thursday, July 26th
  NO LATE PROGRAMS OR PROGRAMS SENT VIA EMAIL WILL BE ACCEPTED!

P2 Announcements | Overview | Specifications | Terminology | Program Operation | Development Milestones for part I: single neural network | Testing for single neural network | Part 2: Ensemble neural network | Submission

P2 Announcements

Corrections, clarifications, and other announcements regarding this programming assignment will be found below.

  • 7/15/2012: If you want to work with a classmate, let me know by Wednesday midnight, July 18th.
  • 7/15/2012: Program 2 assigned.

Overview:

Assignment Goals:

  • Learn how static methods improve code by making it modular and task-oriented.
  • Implement programs that use multiple static methods.
  • Implement methods that call other methods.
  • Practice passing arguments to methods.
  • Practice returning values from methods.
  • Develop code using one-dimensional and two-dimensional arrays.
  • Access data from another class.
  • Read and interpret Javadoc method comments.
  • Gain more experience with concepts from your first programming assignment.

Background: Designing computer programs that learn from the outside world is a major field of inquiry in computer science. Machines that learn to recognize patterns which lead to particular outcomes are used to do everything from optical character recognition to predicting a person's chance of contracting a particular disease. One popular machine learning technique is a neural network (also called a learning network), which draws its inspiration from studies of learning in the human brain. A learning network consists of several nodes, called "neurons," connected together to form a network. Each neuron is responsible for a small amount of processing, and information passes between a neuron and the other neurons connected to it. Once it is created, a learning network must be trained to respond in the desired way (that is, to produce correct output when furnished with input). The network is trained by furnishing it with several examples of inputs paired with the corresponding expected outputs. If the network is properly constructed and provided with enough training examples, it will learn the patterns and can then be used to make predictions about test examples.

In this assignment, you will implement a simple kind of neural network known as a radial basis function network. This network will solve a type of problem called a classification problem. In a classification problem, the goal is to take a set of measurements which describe an example of something and then decide the classification for that example. For instance, one of the data sets (called "segment") which we will provide to you for this assignment contains image data from photographs of outdoor scenes. This data includes color values for each image as well as other measurements gathered by various image-processing algorithms. The expected output paired with each example indicates what texture predominates in the image; textures include sky, grass, bricks, and other common outdoor surfaces. Once it has been trained with this data, a learning network will be able to accept the data as input and will produce as output an indication of what texture appears in the image. Your program won't achieve 100% accuracy in classifying image textures (or any other data set we provide), but it should be able to do far better than simple random guessing.

Specifications:

These specifications define some important concepts and describe how each method should be approached.

We strongly recommend that you develop your code incrementally, since this is a very effective way to program. By this we mean: code a small part of the specification, carefully test that it works as desired, and then add another small part. Developing code incrementally narrows the search for bugs to the part you most recently added, since you've already tested the previous parts. This saves a lot of time, since it's unlikely you'll need to look through your entire program for the problem. Please see the Development Milestones for part I: single neural network section below.

To begin, download the files below. When creating your Eclipse project, DO NOT select "Use project folder as root for sources and class files" in the "Project layout" section. Instead, the project will have separate "src" and "class" (bin) folders. Put the Java files in your project's "src" folder:

  1. NeuralNetwork.java
  2. Dataset.java
  3. datasets.zip (you will have to unzip this file. See Program Operation below for details)

NOTE: You must implement the methods EXACTLY as they appear in the program skeleton we've provided. By this we mean:

  • Do not change method names.
  • Do not add or remove parameters.
  • Do not change method return types.

Of course, you will need to change each method's body so that it works as described in the program skeleton. When grading, we will be testing the required methods individually to verify that they work as described.

You can add additional static methods as you feel appropriate to complete the assignment.

Your program will generate random numbers when selecting the kernel set from the available training examples. As in program 1, you must generate your random numbers from a single Random object initialized with the proper seed value. We've created the Random object in the main method for you, and it is passed as an argument to the methods that need it.

Terminology:

  • Classification: A category into which an example falls. The output of your neural network will be an indication of which classification is likely to correspond with the example given as input. For this program, a classification is represented as an integer. For example, the "segment" dataset has classifications of various surface textures (such as grass, sky, bricks, etc.), and the learning network is supposed to identify which texture appears in the input image. The classifications are the different textures with values of 1 for sky, 2 for bricks, 3 for foliage, and so on.

  • Example: An array of input values and a single expected output value. An example represents a single object in the real world (like a single image which needs to be classified). In your program, an example is represented as a double array (double[]). The last element in the array is the example's expected output (its classification), and the remaining elements are the input values.

  • Training Set: A set of examples which are used to train the network. During training, the network learns characteristics of each classification and adjusts the weights of the output nodes (perceptrons) accordingly. In your program, the training set is an array of examples, that is, an array of double arrays (double[][]). The row index into this 2-D array selects an example from the training set, and the second index chooses a value from the selected example.

  • Kernel Set: A set of examples, taken randomly from the training set, which are given to the comparison nodes in the network. When the network runs, these nodes will compare the input they receive with each of their examples, computing the distances between them. Like the training set, you will represent the kernel set as a double[][].

  • Test Set: A set of examples which are used to test the network once it has been trained. Once it has trained the network using the training set, your program will run each input from the test set through the network. The output of the network for each example in the test set will be compared with the example's expected output (the example's classification). Your program will calculate the percentage of examples which the network classifies correctly (that is, where the expected and actual outputs are the same). Like the training and kernel sets, you will represent the test set as a double[][].

  • Distance: The distance between two examples gives a measure of how dissimilar they are. Recall that an example is represented as an array of doubles. The last element in the array is the expected classification for the example; this element is ignored for computing distances. Given two examples a and b, each containing n elements (not counting the ignored final element), the distance d between them is defined by the Euclidean distance:

        d = sqrt( (a1 - b1)^2 + (a2 - b2)^2 + ... + (an - bn)^2 )
  • Gaussian: A mathematical function whose graph is a bell-shaped curve in the x-y plane. The domain of the function is all real numbers, and the range is (0, 1]. The only parameter to the gaussian function is the variance, which determines how wide the peak of the curve is. In your program, you will use the distances from the comparison nodes as inputs to the gaussian function, and the output of the gaussian (a number between 0 and 1) will be fed into the output node (also called the perceptron). You will write a method to determine the variance for your gaussian based on the training data.

  • Perceptron: A node in a neural network that takes several inputs, multiplies each one by a weight value, and sums them together to produce a final value. (This technique is also known as taking the weighted sum of the input values.) The weights in a perceptron are decided during training; this is how the network learns.

  • Weighted Sum: Given an array of weights a and an array of inputs b, each containing n elements, the weighted sum of the weights and inputs is given by the following formula:

        weighted sum = a1*b1 + a2*b2 + ... + an*bn

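The distance, gaussian, and weighted-sum definitions above can be sketched in Java roughly as follows. This is an illustration only: the exact gaussian formula (in particular, how the variance appears in the exponent), whether a constant offset term is folded into the weights, and the required method names and signatures are all fixed by the Javadoc comments in the provided skeleton, which is authoritative.

```java
// Illustrative sketches of the basic math helpers described above.
// The gaussian exponent shown here (distance^2 / variance) is an
// assumption; use the formula given in the skeleton's Javadoc.
public class MathSketch {

    // Euclidean distance between two examples, ignoring the final
    // element of each array (the expected classification).
    public static double calcDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length - 1; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Gaussian of a distance: close to 1 for small distances,
    // close to 0 for large ones.
    public static double applyGaussian(double distance, double variance) {
        return Math.exp(-(distance * distance) / variance);
    }

    // Weighted sum: multiply each input by its weight and add.
    public static double calcWeightedSum(double[] weights, double[] inputs) {
        double sum = 0.0;
        for (int i = 0; i < inputs.length; i++) {
            sum += weights[i] * inputs[i];
        }
        return sum;
    }
}
```
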
Learning Network Concept:

The first step in training a learning network is to select a set of examples to use for training (usually the more the better). Next, for a radial basis function network, a subset of the training examples is selected as the kernel set. The layout of a radial basis function network is shown in the following diagram.

In this diagram, the input to the network is an array of numbers shown at the bottom. These numbers could be data from an image, measurements from a sensor, or medical information relevant to predicting a disease. Above the input is a row of comparison nodes. There is one comparison node for each example in the kernel set. Each comparison node is provided with the entire array of input values so that it can determine the distance between that input and its kernel example. If this distance is high, then the input and kernel example are dissimilar; a low distance means they are similar. For this program, we will use the Euclidean distance formula to compute the distance between examples having some number of values representing the input. Each comparison node then applies a Gaussian function (i.e. a bell curve) to its distance. This function produces values close to 1 for small distances and values close to zero for large distances. The result of the Gaussian function is then passed to the top layer of nodes, called the perceptrons or output nodes.

The general idea for this network is the following. Each comparison node produces a value which measures how similar the input is to that node's example. Each output node (perceptron) has the job of deciding whether the input falls into a particular classification. In our texture recognition example, there would be an output node for each possible image texture (sky, grass, bricks, etc.). Each output node combines the measures generated by the comparison nodes to decide if its particular classification is the correct one for the input. To do this, the output node calculates a weighted sum, which means that it multiplies the output of each comparison node by a weight value, adds all these results together, and then adds a constant offset. If the weights used by each output node have been correctly trained, the resulting weighted sum will indicate how likely it is that the input given to the network is from this output node's assigned classification. The output node with the highest value is most likely to be the input's correct classification. Therefore, once each output node's value is computed, we will pick the output node with the highest value (circled in red in the picture) as the input's classification, and this classification is the result of running our network.
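
The "pick the output node with the highest value" step above is just an argmax over the perceptron outputs. A minimal sketch (the skeleton's findMaxIndex plays this role; its exact contract, including tie-breaking, comes from the provided Javadoc):

```java
public class ArgmaxSketch {
    // Return the index of the largest value; earlier indices win ties.
    public static int findMaxIndex(double[] values) {
        int maxIndex = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[maxIndex]) {
                maxIndex = i;
            }
        }
        return maxIndex;
    }
}
```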

The code you write for this project will perform the above process as well as the initial training process. During training, the set of training examples is used to gradually adjust the weights in each of the output nodes so that each node has a set of weights appropriate for distinguishing inputs of its assigned classification. The steps of the training algorithm are provided in the code skeleton.

Program Operation:

Your program does not have to deal with reading in data sets. We have provided a class called Dataset along with the data files. In order for your program to work properly, you must place the Dataset.java source file in your Eclipse "src" folder (the same folder that contains your main code file for this program). You will also need to download the datasets ZIP archive we provide and extract it to a folder named "datasets" next to the "src" folder in your Eclipse project directory.

The Dataset class has several variables in it, shown below, which determine which data file to load.

     private static int datasetN = 1;
     public static double kernelProportion = .1;

Each time you run your program, you can tweak these variables to change the data set used. The datasetN variable determines which data set to use. A value of 0 indicates the "segment" dataset, 1 indicates "diabetes", 2 indicates "waveform-5000", 3 indicates "letter", 4 indicates "vehicle", and 5 indicates "balance-scale". While each data set comes from a different field or problem area, for the purposes of your program they can all be treated the same. Each data set simply provides a training set (a double[][]) and a test set (also a double[][]), which your program will use to train and run its learning network.

The kernelProportion variable determines what fraction of the set of training examples should be used as kernel examples; your program will have to use this variable to compute the kernel set.
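
As an illustration of how kernelProportion might be used, here is a hedged sketch of sizing and filling a kernel set. The name mirrors the skeleton's selectKernelSet, but the signature, the rounding rule, and whether duplicate picks are allowed are assumptions; follow the skeleton's Javadoc.

```java
import java.util.Random;

public class KernelSketch {
    // Pick a random subset of the training examples as the kernel set.
    // The size is a fraction (proportion) of the training set size,
    // truncated to an int, which is an assumption for illustration.
    public static double[][] selectKernelSet(double[][] trainingSet,
                                             double proportion, Random rng) {
        int kernelSize = (int) (trainingSet.length * proportion);
        double[][] kernelSet = new double[kernelSize][];
        for (int i = 0; i < kernelSize; i++) {
            // Note: duplicate picks are possible in this sketch.
            kernelSet[i] = trainingSet[rng.nextInt(trainingSet.length)];
        }
        return kernelSet;
    }
}
```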

Each data set we give you is already divided into a training set and a test set. You will use the training set as input to your training algorithm. Once your network is trained, you will run it on the test set. Each example in the test set includes a set of inputs and an expected output. Your program will compare the expected output for each example to the actual output produced by your network, and you will thus be able to measure the effectiveness of this machine learning technique on various kinds of data.
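
The comparison described above amounts to counting how often the predicted classification matches the example's last element. A minimal sketch, with the network's classification step abstracted into a function parameter (the skeleton's testNetwork defines the real interface):

```java
import java.util.function.ToIntFunction;

public class AccuracySketch {
    // Fraction of test examples whose predicted classification matches
    // the expected output stored in the example's last element.
    public static double accuracy(double[][] testSet,
                                  ToIntFunction<double[]> classify) {
        int correct = 0;
        for (double[] example : testSet) {
            int expected = (int) example[example.length - 1];
            if (classify.applyAsInt(example) == expected) {
                correct++;
            }
        }
        return (double) correct / testSet.length;
    }
}
```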

Development Milestones for part I: single neural network: (60%)

Incrementally develop this program for the single neural network by following the milestones in the order listed below. This list starts with the simpler methods that are used by others and works its way up to the methods called in main. Verify that your implementations work properly before moving on to the next milestone.

  • Milestone 1 - Basic Mathematical Methods (16%). Implement the following: calcDistance, applyGaussian, calcWeightedSum, and findMaxIndex. These methods are all relatively basic mathematical functions and do not require any special knowledge of how neural networks operate.
  • Milestone 2 - Initial Calculations (12%). Implement the following: selectKernelSet and findMaxDistance. These methods compute initial values (kernel set and variance) to start the learning algorithm.
  • Milestone 3 - Classification (10%). Implement calcPerceptronInput, and then implement classifyInput. This is where you will put the real logic of running the learning network on a given input to produce an output classification.
  • Milestone 4 - Training (15%). You should now have all the code to run the network on an input, but it won't work very well unless you write the code necessary to train the network. Implement the trainNetwork method.
  • Milestone 5 - Testing (7%). Once the network is trained, your program will run it on several inputs to see how well it learned. This is done by the testNetwork method, which you should now implement. Once you have done this, your program for part I: single network should be complete (since we have already given you the code for main). Make sure to test the whole program before you submit it!

Testing for single neural network:

For this program, no user input is read so you will not need to consider mistakes made by the program user. You also do not need to verify in each method that its parameter values are valid. Every method may assume that it receives valid parameter values, and you should make sure to provide valid parameter values every time you call a method.

Now that we are writing programs with multiple methods, testing the program for correct behavior becomes more involved. Each method can be tested individually by calling it with some inputs and checking if the output it produces is correct; this is often known as unit testing. Unit testing is often helpful; if you only test one small piece of your program at a time, you can easily determine where the cause of a test failure is in your code. In addition to testing of methods in isolation, the entire program can also be tested all-at-once by calling the main method and checking if the program's output is correct. This form of testing has the advantage that it tests your program as a whole, making sure all the pieces work together properly.

We have given you the tools to perform both kinds of testing on your program. The main method we provide in the skeleton contains code to (1) run the entire learning network program on the chosen dataset (selected by the setting of the datasetN field in Dataset.java), and (2) test your methods individually and print out any test failures. We have already written code which is called by main and which calls the methods you will write, passing them sample inputs and checking that they produce the proper return values. You should not change the main method from how it is in the skeleton, with one exception: there is a boolean constant at the top of main called TEST. Changing this value controls whether the tests are run when you run your program. Set TEST to true to run the tests; the program will then print out the results of testing your individual methods, as well as the results of running the learning network on the chosen data set. If these tests print out failure messages, it means you have made a mistake somewhere in your code. If you don't want to see the test results printed out when you run the program, set TEST to false.

It is important to remember during testing that some of the more complicated methods you write (such as trainNetwork) will call some simpler methods which you will also write. If the simpler methods do not work properly, then the more complicated methods that call them cannot be expected to work properly either. By following the Development Milestones for part I: single neural network section above, you will develop the methods in the order they are used. If you find that a method later in the milestones list (such as findMaxDistance) is failing, you should check to see whether any of the methods that it calls (such as calcDistance) are also failing the tests. Fix these methods that are called before attempting to fix the method that calls them. If you develop and test the methods as listed in the milestones, you can be fairly sure that the mistake is in the methods of the milestone you're working on.

Part 2: Ensemble neural network (30%):

Discussion:

Part II of this programming assignment adds two features to your learning network algorithm: (1) variance tuning and (2) an ensemble of networks. You must implement both to receive full points for part II.

Variance Tuning

In part I of this program, we set the variance for all the gaussians to be the maximum distance between any two kernel instances. Depending on the data set, however, the network may perform better with a different setting for the variance. We would like to find out which variance will give the most accurate results for any particular data set. One way to determine this is to try the network with several different variances to see which one is best, and then use that one. You will write code to do this.
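
The try-several-variances idea can be sketched as a simple loop. This is not the skeleton's calcMaxAccuracyVariance; it is a generic illustration in which the "train the network, then measure its accuracy on the test set" step is abstracted into a function parameter.

```java
import java.util.function.DoubleUnaryOperator;

public class TuningSketch {
    // Evaluate each candidate variance with the supplied accuracy
    // function and keep whichever variance scores highest. The set of
    // candidate values is supplied by the caller.
    public static double tuneVariance(double[] candidates,
                                      DoubleUnaryOperator accuracyOf) {
        double bestVariance = candidates[0];
        double bestAccuracy = accuracyOf.applyAsDouble(candidates[0]);
        for (int i = 1; i < candidates.length; i++) {
            double acc = accuracyOf.applyAsDouble(candidates[i]);
            if (acc > bestAccuracy) {
                bestAccuracy = acc;
                bestVariance = candidates[i];
            }
        }
        return bestVariance;
    }
}
```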

Ensemble

In this program, we randomly select a certain number of kernel examples from the training set. Which examples get picked for the kernel set can affect the accuracy of the network's answers. How can we be sure that we pick good kernel examples if kernel example selection is random? One way is to create several neural networks, each with its own randomly selected kernel set. We can train all the networks with the same training data, and each network will maintain its own set of trained weights. Then, if we want to classify an input from a test set, we will run that same input through each trained network. Since the networks have different kernel sets, they may come up with different answers. We will have the networks vote on the correct answer, meaning we will pick the answer that receives the most votes from the networks in the ensemble. That way, networks with poorly-chosen kernel sets (which may cause low accuracy) will be counterbalanced during voting by networks with well-chosen kernel sets. The overall effect should be that the ensemble of networks is more reliable than any single network. While it is possible that a single network with a very fortuitous kernel set may outperform the ensemble, the ensemble is less dependent on random chance for its accuracy.
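
The voting step can be sketched as a simple tally. The numClasses parameter, the convention that classifications are numbered starting at 1, and the tie-breaking rule (lowest classification wins) are assumptions for illustration; the skeleton's ensembleClassify Javadoc is authoritative.

```java
public class VotingSketch {
    // Each network votes with its predicted classification (an int);
    // the ensemble's answer is the classification with the most votes.
    public static int majorityVote(int[] votes, int numClasses) {
        int[] counts = new int[numClasses + 1]; // classifications start at 1
        for (int vote : votes) {
            counts[vote]++;
        }
        int winner = 1;
        for (int c = 2; c <= numClasses; c++) {
            if (counts[c] > counts[winner]) {
                winner = c;
            }
        }
        return winner;
    }
}
```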

Details:

To get started on part II of this assignment, download the two revised skeleton files below.

  1. NeuralNetworkEnsemble.java
  2. DatasetEnsemble.java

Because you will still need to submit valid code for the 1st portion of this assignment, we suggest that you create a new Eclipse project (possibly called NeuralNetworkEnsemble) and put the two "Ensemble" source files above in this new project. You will also have to copy the datasets folder into this project, setting it up in the same way as described above.

See the comments in NeuralNetworkEnsemble.java for details about what methods to implement to receive credit for part II. To summarize briefly: we have already provided a main method to run variance tuning and an ensemble of networks. Your job is to implement the following three methods: calcMaxAccuracyVariance, testNetworkEnsemble, and ensembleClassify.

To test your implementation, you can run it using the default parameters supplied by DatasetEnsemble.java. These settings will test your implementation on the "vehicle" dataset (which involves identifying types of vehicles from silhouette images). If your implementation is correct, NeuralNetworkEnsemble.main() should print a final accuracy value of approximately 0.57647 (which is, by the way, an improvement over the accuracy achieved on this dataset without tuning and ensembles).

Submission:

In addition to submitting the file NeuralNetwork.java as described below, you will also need to submit a file NeuralNetworkEnsemble.java with your part II code in it. Besides the methods provided by the skeleton for NeuralNetworkEnsemble.java, you will have to include in this file all of the methods from your NeuralNetwork.java (except the main method, since the part II skeleton defines its own main method).

Before handing in both of your programs, check that you've done the following:

  • Did you verify that your single network program works correctly by following the information in the Testing for single neural network section above? 60% of your grade is based on your single network implementation.
  • Did you turn in your ensemble network program? 30% of your grade is based on it.
  • Did you use good programming style by following the CS302 Style Guide? About 5% of your grade is based on good programming style.
  • Did you properly comment your program by following the CS302 Commenting Guide? About 5% of your grade is based on proper commenting. Note, we've provided most of your method header comments.
  • Is your program's 1st source file named exactly NeuralNetwork.java and is your class named exactly NeuralNetwork? Is your program's 2nd source file named exactly NeuralNetworkEnsemble.java and is your class named exactly NeuralNetworkEnsemble? Deductions are made if you fail to properly name these.

Submit these files to the dropbox for Programming Assignment #2:

  • "NeuralNetwork.java" containing your Java source code for part I with the original main method
  • "NeuralNetworkEnsemble.java" containing your Java source code for part II with the original main method