The applications of machine learning are undeniably widespread. Many projects I have done in the past required some form of machine learning or a neural network. My preference for experimentation lies in the Java programming language. However, Java does not have any commonly used machine learning libraries, so I have repeatedly implemented artificial neural network frameworks across various projects. This project is my solution to that problem. I submit a Java library containing highly generalized machine learning structures to facilitate future projects in a modular, efficient, and simple way.
The goal of machine learning is to estimate a function by optimizing its parameters with respect to a loss function. The most common class of function used in machine learning is the matrix multiplication: given an input vector, the output is produced by multiplying it by a matrix (in practice, a bias vector is usually added as well, making the map affine). The parameters to optimize are the entries of the matrix. This is called linear regression because the output is a linear function of the inputs.
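To make this concrete, here is a minimal sketch of such a layer in Java. The class and method names are mine for illustration, not the library's actual API:

```java
// Minimal sketch of a linear (affine) layer: output = W * input + bias.
public class LinearLayer {
    private final double[][] weights; // shape: [outputs][inputs]
    private final double[] bias;

    public LinearLayer(double[][] weights, double[] bias) {
        this.weights = weights;
        this.bias = bias;
    }

    public double[] forward(double[] input) {
        double[] output = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            double sum = bias[i];
            for (int j = 0; j < input.length; j++) {
                sum += weights[i][j] * input[j];
            }
            output[i] = sum;
        }
        return output;
    }
}
```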
In many cases, the function we would like to estimate is not linear, so the class of linear functions is not sufficient. To resolve this issue, non-linear functions are introduced. Some commonly used functions include the rectified linear unit (ReLU), the logistic function, the softplus function, and the swish function. There is also the softmax function, which normalizes a vector into a probability distribution. These non-linear functions are called activation functions. To create a neural network, inputs are passed through multiple functions in sequence; standard networks alternate between linear regression layers and activation layers.
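For reference, here are direct implementations of these functions. The formulas are standard, though the helper names are mine:

```java
// Common activation functions, applied elementwise except for softmax.
public final class Activations {
    public static double relu(double x) {
        return Math.max(0.0, x);
    }

    public static double logistic(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    public static double softplus(double x) {
        return Math.log1p(Math.exp(x)); // log(1 + e^x)
    }

    public static double swish(double x) {
        return x * logistic(x);
    }

    // Softmax normalizes a whole vector to a probability distribution.
    // Subtracting the maximum first avoids overflow in exp.
    public static double[] softmax(double[] v) {
        double max = Double.NEGATIVE_INFINITY;
        for (double x : v) max = Math.max(max, x);
        double[] out = new double[v.length];
        double sum = 0.0;
        for (int i = 0; i < v.length; i++) {
            out[i] = Math.exp(v[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) out[i] /= sum;
        return out;
    }
}
```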
We start with the classification problem: we would like our network to assign a class to each input item, so the output of the network is a probability distribution across all possible classes. Optimization also requires a loss function. The most common loss function for networks that use the softmax activation is the cross-entropy loss, which is the expected surprise (negative log-probability) of the correct output class across all items passed through the network.
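In code, the cross-entropy loss is simply the average negative log-probability assigned to each item's correct class. A sketch, assuming probabilities[i] holds the softmax output for item i:

```java
// Cross-entropy loss over a batch: the average "surprise" (-log p)
// for each item's correct class.
public static double crossEntropy(double[][] probabilities, int[] correctClasses) {
    double totalSurprise = 0.0;
    for (int i = 0; i < probabilities.length; i++) {
        totalSurprise -= Math.log(probabilities[i][correctClasses[i]]);
    }
    return totalSurprise / probabilities.length;
}
```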
This loss function is easy to differentiate. Using the derivative of the loss function with respect to the outputs of the neural network, backpropagation can be applied to find the derivatives of the loss function with respect to network parameters. A simple gradient descent algorithm can be applied to find a set of parameters with low loss.
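In particular, for softmax followed by cross entropy, the derivative of the loss with respect to the pre-softmax outputs is just the predicted probabilities minus the one-hot target vector. Once backpropagation has produced a gradient for each parameter, a basic gradient descent step is a one-liner per parameter:

```java
// Plain gradient descent: move each parameter against its gradient.
// The gradient array would come from backpropagation.
public static void gradientDescentStep(double[] parameters, double[] gradient,
                                       double learningRate) {
    for (int i = 0; i < parameters.length; i++) {
        parameters[i] -= learningRate * gradient[i];
    }
}
```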
Various techniques exist to improve the results of network optimization. A good way to improve the convergence rate is to keep the variance of values stable across the network layers. The derivative of the loss function with respect to each network layer directly determines how that layer's parameters are updated. If these derivatives grow or shrink too much as they backpropagate through the layers, the resulting parameter updates may be too extreme.
The linear layers are the biggest cause of this issue. If the parameters of a linear layer are initialized with zero mean and unit variance, then backpropagation through it scales the variance of the derivatives by a factor proportional to the number of inputs. To fix this, we make the variance of the weight initialization inversely proportional to the number of inputs to the layer.
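A sketch of this initialization scheme, which matches the well-known LeCun/Xavier family of initializers (the method name is mine):

```java
import java.util.Random;

// Initialize weights with variance 1/fanIn, i.e. standard deviation
// 1/sqrt(fanIn), so signal variance stays roughly constant across layers.
public static double[][] initWeights(int outputs, int inputs, Random random) {
    double stdDev = 1.0 / Math.sqrt(inputs);
    double[][] weights = new double[outputs][inputs];
    for (int i = 0; i < outputs; i++) {
        for (int j = 0; j < inputs; j++) {
            weights[i][j] = random.nextGaussian() * stdDev;
        }
    }
    return weights;
}
```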
An update step in plain stochastic gradient descent only uses information about the derivative evaluated at the current iteration. Variations of stochastic gradient descent attempt to improve the convergence rate and avoid local minima by also using information from previous iterations.
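Momentum is one representative example of such a variation (my choice of example; the text does not commit to specific algorithms): it keeps an exponentially decaying average of past gradients so that each update carries information from earlier iterations.

```java
// Momentum update (one common SGD variation). The velocity array persists
// across iterations and accumulates an exponentially decaying sum of gradients.
public static void momentumStep(double[] parameters, double[] gradient,
                                double[] velocity, double learningRate,
                                double momentum) {
    for (int i = 0; i < parameters.length; i++) {
        velocity[i] = momentum * velocity[i] + gradient[i];
        parameters[i] -= learningRate * velocity[i];
    }
}
```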
In my experimentation, all of the algorithms performed very similarly. However, I did not spend much time finding optimal hyperparameters, so I will not draw any conclusions about the effectiveness of these variations.
Most machine learning applications use a GPU to accelerate computations, and for good reason. At this point, a single epoch of training with the current implementation takes around 27 seconds of runtime, even for this simple network structure. Unfortunately, I do not have a GPU in my mini desktop, so alternative methods must be explored.
Upon closer investigation, the majority of the computation time is spent on matrix multiplication. The most basic strategy to improve runtime is to restructure the computation to benefit more from hardware features like caches. This includes merging loops that iterate over the same variables and have no temporal dependence between them, changing the order of iteration to benefit from spatial locality, and using for-each loops. These optimizations seem obvious, but during development it is often beneficial to sacrifice execution speed for readability; the optimizations can be applied later, once the full code structure has been created.
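As an example of the iteration-order point, here is a matrix multiplication with the loops reordered for locality:

```java
// Cache-friendly matrix multiplication. Compared to the naive i-j-k order,
// the i-k-j order accesses b and out sequentially along their rows,
// benefiting from spatial locality.
public static double[][] multiply(double[][] a, double[][] b) {
    int n = a.length, inner = b.length, m = b[0].length;
    double[][] out = new double[n][m];
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < inner; k++) {
            double aik = a[i][k];
            double[] bRow = b[k];
            double[] outRow = out[i];
            for (int j = 0; j < m; j++) {
                outRow[j] += aik * bRow[j];
            }
        }
    }
    return out;
}
```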
Another note is that this library was designed to be as general as possible. The main vector function class assumes that each function output depends on every function input. As such, a full matrix of derivatives is created during backpropagation, even when the matrix is known to be diagonal. A simple optimization is to define a subclass of the main vector function class for functions with diagonal derivative matrices, so that only the diagonal entries are computed. This is useful because many activation functions apply elementwise and therefore have diagonal derivative matrices.
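A sketch of what such a subclass could look like (the class names are illustrative, not the library's actual hierarchy):

```java
// Elementwise function with a diagonal derivative matrix: output i depends
// only on input i, so backpropagation reduces to an elementwise product
// instead of a full matrix multiplication.
public abstract class ElementwiseFunction {
    protected abstract double apply(double x);
    protected abstract double derivative(double x);

    public double[] forward(double[] input) {
        double[] output = new double[input.length];
        for (int i = 0; i < input.length; i++) {
            output[i] = apply(input[i]);
        }
        return output;
    }

    // dLoss/dInput[i] = dLoss/dOutput[i] * f'(input[i])
    public double[] backward(double[] input, double[] outputGradient) {
        double[] inputGradient = new double[input.length];
        for (int i = 0; i < input.length; i++) {
            inputGradient[i] = outputGradient[i] * derivative(input[i]);
        }
        return inputGradient;
    }
}
```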
The most important runtime optimization is multithreading, which allows all CPU cores to be used in parallel rather than restricting execution to a single thread. The most obvious target for multithreading is the affine transformation. When multithreading is applied to this single point of interest, execution time is immediately halved.
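Each output element of the affine transformation is an independent dot product, so the rows can be distributed across cores. One simple way to express this in Java is a parallel stream (my sketch; the library may divide the work differently):

```java
import java.util.stream.IntStream;

// Multithreaded affine transformation: each output element is an
// independent dot product, so rows are computed on separate cores.
public static double[] forwardParallel(double[][] weights, double[] bias,
                                       double[] input) {
    double[] output = new double[weights.length];
    IntStream.range(0, weights.length).parallel().forEach(i -> {
        double sum = bias[i];
        double[] row = weights[i];
        for (int j = 0; j < input.length; j++) {
            sum += row[j] * input[j];
        }
        output[i] = sum;
    });
    return output;
}
```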
A simple way to expand the multithreading capabilities is to allow the main vector function class to calculate its output for multiple inputs at the same time. This will also turn out to be useful when normalization techniques are added in future versions. The runtime immediately sees a significant improvement because the forward pass must be done both for optimization and for accuracy assessments.
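Since the items in a batch are independent, a batched forward pass parallelizes naturally over the items. A sketch, reusing the LinearLayer class from earlier:

```java
import java.util.Arrays;

// Batched forward pass: process every item in the batch in parallel.
public static double[][] forwardBatch(LinearLayer layer, double[][] batch) {
    return Arrays.stream(batch)
            .parallel()
            .map(layer::forward)
            .toArray(double[][]::new);
}
```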
The next area where multithreading was applied was the backpropagation calculations. In this implementation, backpropagation is mostly done through large matrix multiplications. However, the backpropagation for the affine layers was done separately, so the runtime improvement there was not as significant. The affine layers also require parameter updates, and applying multithreading to those updates showed the most significant improvement.
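The weight gradients of an affine layer form an outer product of the output gradient and the input, so each row of the weight matrix can be updated independently. A sketch for a single input item (names and structure are illustrative):

```java
import java.util.stream.IntStream;

// Multithreaded parameter update for an affine layer. The gradient of the
// loss with respect to weight (i, j) is outputGradient[i] * input[j], so
// each weight row (and its bias entry) is updated on its own core.
public static void updateParallel(double[][] weights, double[] bias,
                                  double[] input, double[] outputGradient,
                                  double learningRate) {
    IntStream.range(0, weights.length).parallel().forEach(i -> {
        double g = outputGradient[i];
        double[] row = weights[i];
        for (int j = 0; j < row.length; j++) {
            row[j] -= learningRate * g * input[j];
        }
        bias[i] -= learningRate * g;
    });
}
```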
Normalizing data is a common way to improve the performance of a network. Up until this point, I manually rescaled the input data to zero mean and unit variance. It would be beneficial for the network to perform this normalization itself. Normalization layers can also be inserted in the middle of a large network to regularize the mean and variance of intermediate layer outputs.
Batch normalization normalizes a value by taking the mean and variance across a batch of input data. During training time, mini-batch gradient descent is used, so every pass through the batch normalization layers can calculate a non-trivial mean and variance. During test time, items may be passed one at a time. To resolve this issue, batch normalization tracks an exponential average of the mean and variance values encountered during training. These running values will be used during test time.
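A sketch of this behavior for a single feature, omitting the learnable scale and shift that full batch normalization adds (the momentum constant and names are illustrative):

```java
// Batch normalization for one feature. During training, the batch statistics
// are used and folded into running exponential averages; at test time the
// running values are used instead, so single items can be normalized.
public class BatchNorm {
    private static final double EPSILON = 1e-5;   // avoids division by zero
    private static final double MOMENTUM = 0.9;   // weight on the old running value

    private double runningMean = 0.0;
    private double runningVariance = 1.0;

    public double[] forward(double[] batch, boolean training) {
        double mean, variance;
        if (training) {
            mean = 0.0;
            for (double x : batch) mean += x;
            mean /= batch.length;
            variance = 0.0;
            for (double x : batch) variance += (x - mean) * (x - mean);
            variance /= batch.length;
            // Exponential average of the statistics seen during training.
            runningMean = MOMENTUM * runningMean + (1 - MOMENTUM) * mean;
            runningVariance = MOMENTUM * runningVariance + (1 - MOMENTUM) * variance;
        } else {
            mean = runningMean;
            variance = runningVariance;
        }
        double[] out = new double[batch.length];
        double scale = 1.0 / Math.sqrt(variance + EPSILON);
        for (int i = 0; i < batch.length; i++) {
            out[i] = (batch[i] - mean) * scale;
        }
        return out;
    }
}
```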