Machine Learning Java library

Project link: https://github.com/BrandonHan73/apple_lib

Abstract

To avoid writing the same code for multiple projects, I decided to make a general-use library containing various tools I may need. The most useful component ended up being a neural network framework. Improving and expanding this subsection of the library deepened my understanding of machine learning and optimization. In addition, it reduces the need to rely on Python for machine learning projects. Most developers use Python for AI due to its established libraries and the simplicity of the language. In contrast, far fewer established machine learning libraries are available for languages such as C++ and Java. The goal of this project is to provide a framework that allows Java developers to make a stronger contribution to the AI field.

Artificial Neural Network

A neural network is used to approximate a function. It takes a list of n numbers and produces an output containing m values. The starting point is therefore a Java interface that does exactly this: it takes an array of floating point values and returns an array of floating point values. For backpropagation, the function needs to be differentiable, so the interface needs another method that provides the derivative of the function evaluated at a given input. I also added implementations of some common activation functions such as ReLU, softmax, softplus, hyperbolic tangent, logistic, and swish.
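A minimal sketch of what this interface and one activation function could look like is shown below. The names here (VectorFunction, pass, gradient) are illustrative assumptions, not necessarily the actual apple_lib API.

    // Hypothetical interface for a differentiable vector function.
    public interface VectorFunction {

        /** Evaluates the function, mapping n inputs to m outputs. */
        float[] pass(float[] input);

        /**
         * Returns the Jacobian of the function at the given input.
         * Entry [i][j] is the derivative of output i with respect to input j.
         */
        float[][] gradient(float[] input);
    }

    /** ReLU activation: max(0, x) applied element-wise. */
    class ReLU implements VectorFunction {
        @Override
        public float[] pass(float[] input) {
            float[] out = new float[input.length];
            for (int i = 0; i < input.length; i++)
                out[i] = Math.max(0f, input[i]);
            return out;
        }

        @Override
        public float[][] gradient(float[] input) {
            // Diagonal Jacobian: 1 where the input is positive, 0 otherwise.
            float[][] jac = new float[input.length][input.length];
            for (int i = 0; i < input.length; i++)
                jac[i][i] = input[i] > 0 ? 1f : 0f;
            return jac;
        }
    }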
These activation functions do not have parameters; they will always do exactly the same thing. To allow the network to learn, we need a layer with parameters. The most common layer is the affine transformation, in which each output is a linear combination of the inputs, with an additional term added afterward. The linear combination is performed by a matrix multiplication with a weight matrix. The additional term added at the end is called the bias. To handle the bias, the bias trick is used: a constant extra input of one is appended to the provided input, so the bias values can be stored in the weight matrix.
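Continuing the same sketch, a hypothetical affine layer that uses the bias trick might look like the following (again, the class and method names are assumptions, not the library's API):

    // Hypothetical affine layer using the bias trick. The weight matrix has
    // one extra column that stores the bias terms. Illustrative only.
    class AffineLayer implements VectorFunction {

        // Shape [outputs][inputs + 1]; the last column is the bias.
        final float[][] weights;

        AffineLayer(int inputs, int outputs) {
            // Weights start at zero here; better initialization is discussed later.
            weights = new float[outputs][inputs + 1];
        }

        /** Appends a constant 1 to the input so the bias lives in the weight matrix. */
        static float[] augment(float[] input) {
            float[] augmented = new float[input.length + 1];
            System.arraycopy(input, 0, augmented, 0, input.length);
            augmented[input.length] = 1f;
            return augmented;
        }

        @Override
        public float[] pass(float[] input) {
            float[] x = augment(input);
            float[] out = new float[weights.length];
            for (int i = 0; i < weights.length; i++)
                for (int j = 0; j < x.length; j++)
                    out[i] += weights[i][j] * x[j];
            return out;
        }

        @Override
        public float[][] gradient(float[] input) {
            // The derivative of each output with respect to each (non-bias)
            // input is simply the corresponding weight.
            float[][] jac = new float[weights.length][input.length];
            for (int i = 0; i < weights.length; i++)
                System.arraycopy(weights[i], 0, jac[i], 0, input.length);
            return jac;
        }
    }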
Because the affine transformation has parameters, they must be learned through optimization. Backpropagation uses gradients to determine how to iteratively update each parameter. As such, a new abstract class is needed. This abstract class is initialized with the function it will optimize. It should be able to take inputs with the corresponding derivatives and update the parameters of the function accordingly. The most basic optimization method is stochastic gradient descent. The creation of an optimizer allows for the first test using a dataset. I will use the MNIST dataset because it is easily accessible and simple to learn.
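As a rough sketch, a stochastic gradient descent update for the hypothetical affine layer above could look like this, where the caller supplies the derivative of the loss with respect to each output of the layer:

    // Hypothetical stochastic gradient descent update for the affine layer
    // sketched above; the structure is illustrative, not the library's API.
    class SGDOptimizer {

        final AffineLayer layer;
        final float learningRate;

        SGDOptimizer(AffineLayer layer, float learningRate) {
            this.layer = layer;
            this.learningRate = learningRate;
        }

        /**
         * Updates the weights given an input and the derivative of the loss
         * with respect to each output of the layer.
         */
        void update(float[] input, float[] lossDerivative) {
            float[] x = AffineLayer.augment(input);
            for (int i = 0; i < layer.weights.length; i++)
                for (int j = 0; j < x.length; j++)
                    // d(loss)/d(weight[i][j]) = d(loss)/d(output[i]) * x[j]
                    layer.weights[i][j] -= learningRate * lossDerivative[i] * x[j];
        }
    }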
Plot of the validation accuracy after each training cycle, along with three examples of incorrectly classified images.
The simple affine transformation was able to achieve around 70% accuracy on this dataset. However, there are still many improvements to be made. The first is to connect multiple functions in series: after each affine transformation, a non-linear activation function is added to remove the linear property of the classifier. A new structure is therefore needed. This structure takes an arbitrary number of functions and connects them in series, so the structure itself is the composition of all the functions. If the components of the composition contain parameters, an optimizer is needed to learn them. This new optimizer takes a list of optimizers, one for each function, and runs backpropagation through each layer. As a demonstration, I connected four functions in series, two of which are affine transformations. The last layer is a softmax function, which ensures the output of the network models a probability distribution. As such, a loss function class can be defined using the cross entropy loss.
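A possible sketch of such a series structure, reusing the hypothetical VectorFunction interface from earlier and applying the chain rule to compute the composite Jacobian:

    import java.util.List;

    // Hypothetical composition of functions connected in series. Illustrative only.
    class Series implements VectorFunction {

        final List<VectorFunction> components;

        Series(List<VectorFunction> components) {
            this.components = components;
        }

        @Override
        public float[] pass(float[] input) {
            float[] value = input;
            for (VectorFunction f : components)
                value = f.pass(value);
            return value;
        }

        @Override
        public float[][] gradient(float[] input) {
            // Chain rule: the Jacobian of the composition is the product of the
            // component Jacobians, evaluated at the intermediate values.
            float[] value = input;
            float[][] jacobian = null;
            for (VectorFunction f : components) {
                float[][] local = f.gradient(value);
                jacobian = (jacobian == null) ? local : multiply(local, jacobian);
                value = f.pass(value);
            }
            return jacobian;
        }

        static float[][] multiply(float[][] a, float[][] b) {
            float[][] out = new float[a.length][b[0].length];
            for (int i = 0; i < a.length; i++)
                for (int k = 0; k < b.length; k++)
                    for (int j = 0; j < b[0].length; j++)
                        out[i][j] += a[i][k] * b[k][j];
            return out;
        }
    }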
Plot of the validation accuracy after each training cycle. I used six different networks, two for each type of activation function. Network size is represented by opacity and activation functions are represented by colors. The accuracy and convergence speed are noticeably lower than those of the original version.
The optimizer was still able to learn a function. However, three major problems arose. The first is that the resulting test accuracy was not as high. The second is that the convergence rate was much lower. Finally, the program took much longer to run. To fix the first two problems, a few tricks can be used. The first is weight initialization. Each weight can be sampled from a Gaussian distribution, and the variance of the distribution can be chosen to control the variance of the outputs of each layer. If the output variance of each layer stays roughly the same, optimization using gradient-based methods becomes much more efficient.
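As an illustration, a He-style initializer (one common way to control the output variance for ReLU layers) could be written as follows; this is a sketch, not the library's actual initialization code:

    import java.util.Random;

    // Hypothetical weight initialization in the style of He initialization:
    // each weight is drawn from a Gaussian whose variance is scaled by the
    // number of inputs, so the output variance of the layer stays roughly
    // constant from layer to layer. Illustrative only.
    class WeightInit {

        /** Fills the weight matrix with samples from N(0, 2 / inputs). */
        static void heInitialize(float[][] weights, int inputs, Random rng) {
            float std = (float) Math.sqrt(2.0 / inputs);
            for (float[] row : weights)
                for (int j = 0; j < row.length; j++)
                    row[j] = (float) rng.nextGaussian() * std;
        }
    }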
Plot of the validation accuracy after each training cycle. With weight initialization, the function reached an accuracy of 92%.
The magic of weight initialization will always be amazing to me. The accuracy and convergence rate improved astronomically. Doing the same experiment with the basic affine transformation produced a similar improvement, but it only reached a final accuracy of 85%. We can finally see that the multi-layer function can outperform the standard affine transformation. Another thing to note is that the larger networks took longer to converge. However, this is to be expected because there are more parameters to learn.
To further improve performance, other optimization techniques can be tested. These include momentum, AdaGrad, RMSProp, and Adam. The idea behind these algorithms is to use the first and second moments of the derivatives to deduce more information about the loss landscape, steering away from local minima and increasing the convergence rate. The network now performs too well on the MNIST dataset to distinguish the optimizers, so the CIFAR-10 dataset will be used with a larger network. This makes learning much harder.
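For reference, the Adam update rule, which maintains exponential moving averages of the first and second moments of each parameter's derivative, can be sketched as follows (illustrative class, not the library's optimizer):

    // Hypothetical sketch of the Adam update rule. Illustrative only.
    class AdamOptimizer {

        final float learningRate, beta1, beta2, epsilon;
        final float[] m, v;   // first and second moment estimates
        int step = 0;

        AdamOptimizer(int parameterCount, float learningRate) {
            this.learningRate = learningRate;
            this.beta1 = 0.9f;
            this.beta2 = 0.999f;
            this.epsilon = 1e-8f;
            this.m = new float[parameterCount];
            this.v = new float[parameterCount];
        }

        /** Applies one Adam step given the derivative of the loss for each parameter. */
        void update(float[] parameters, float[] derivatives) {
            step++;
            for (int i = 0; i < parameters.length; i++) {
                m[i] = beta1 * m[i] + (1 - beta1) * derivatives[i];
                v[i] = beta2 * v[i] + (1 - beta2) * derivatives[i] * derivatives[i];
                // Bias-corrected moment estimates
                float mHat = m[i] / (1 - (float) Math.pow(beta1, step));
                float vHat = v[i] / (1 - (float) Math.pow(beta2, step));
                parameters[i] -= learningRate * mHat / ((float) Math.sqrt(vHat) + epsilon);
            }
        }
    }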
Plot of the validation accuracy after each training cycle. Compares the results of different optimizers.
The accuracy on this new dataset is much lower. The complexity of the new dataset will require a deeper network. To make deeper networks easier to optimize, residual layers will be used. Residual layers reintroduce the outputs of earlier layers by adding them to the outputs of later layers. Because a residual layer can reduce to the identity, a deep network can always fall back to behaving like a shallower one, so deeper networks rarely perform worse than shallow networks.
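A residual connection can be sketched as a wrapper that adds its input back to the output of the wrapped function, assuming the input and output dimensions match; again, this is illustrative rather than the library's actual implementation:

    // Hypothetical residual wrapper: the layer computes x + f(x). Illustrative only.
    class ResidualLayer implements VectorFunction {

        final VectorFunction inner;

        ResidualLayer(VectorFunction inner) {
            this.inner = inner;
        }

        @Override
        public float[] pass(float[] input) {
            float[] out = inner.pass(input);
            for (int i = 0; i < input.length; i++)
                out[i] += input[i];   // skip connection requires matching dimensions
            return out;
        }

        @Override
        public float[][] gradient(float[] input) {
            // Jacobian of x + f(x) is the inner Jacobian plus the identity matrix.
            float[][] jac = inner.gradient(input);
            for (int i = 0; i < input.length; i++)
                jac[i][i] += 1f;
            return jac;
        }
    }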