BIO-logical: Java implementation of a neural network
It is my great pleasure today to walk you through the creation and utilization of a heavily biologically inspired neural network paradigm I’ve dubbed Bio-logical. This will be a fairly long post, with details on each object from the dendrites to the axons and more. I would also like this to be accessible to those who have never created a neural network. For these reasons, I will take the time to introduce exactly what we’re trying to mimic: the neuron.
Think of the above picture as a tube that processes data from left to right. The dendrites collect input signals from the previous neurons and attempt to stimulate the nucleus. Upon successful stimulation, the nucleus will send an output signal down its own axon, across a synaptic gap, to the dendrites of another neuron. This process will continue in a chain.
Think of each neuron as making a decision. The nucleus can be thought of as an insurance provider, or agent, trying to make sure they only insure houses that pass inspection so as to make the most money possible. Dendrites can then be thought of as all of the agent’s colleagues on the phone, giving the agent a vote of yes/no on important new buildings, trying to sway the agent’s decision. The agent then picks up the boss group’s phone, or the axon, and relays his or her final decision. The agent is free to assign a ‘belief’ number to each of the incoming phones. Having never met any of the colleagues, it is impossible for the agent to know how much to trust them. However, when the boss calls back to tell the agent whether the house was a good one to insure, the agent remembers which of the colleagues said ‘yes’ or ‘no’, and marks their ‘belief’ number up or down according to whether they were right or wrong. When a new call comes in about a new house, and one colleague votes ‘yes’ with a belief number -0.9 (negative), and the other colleague votes ‘no’ with a belief number 1.1 (positive), it is most likely that the agent will then phone the boss group’s line with an answer of ‘no.’ This ensures that the best classifications are made. Keep in mind that it is possible to trust somebody’s incorrectness more than somebody’s correctness. If the phone calls came in ‘yes’ with -1.4 and ‘yes’ with belief 0.8, it would still behoove the agent to vote ‘no’ overall, simply because the first colleague has a strong history of being wrong. These ‘belief’ numbers are the input weights for each neuron.
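To make the arithmetic of these ‘belief’ numbers concrete, here is a small sketch in Java. The class and method names are purely illustrative and do not come from the Bio-logical source:

```java
// A hedged sketch of the agent's decision rule described above: votes are
// encoded as +1 ("yes") or -1 ("no"), and each colleague's 'belief' number
// acts as the input weight.
public class BeliefVote {
    // Weighted sum of votes; a positive result leans 'yes', negative leans 'no'.
    public static double weighIn(double[] votes, double[] beliefs) {
        double sum = 0.0;
        for (int i = 0; i < votes.length; i++) {
            sum += votes[i] * beliefs[i];
        }
        return sum;
    }
}
```

With votes encoded this way, both scenarios above come out negative, so the agent phones in a ‘no.’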
I want to use this analogy as a segue into introducing a few topics that don’t necessarily manifest from the simple situation above, but that are easily seen through its lens. These topics are dropout via ‘masking,’ activation functions, momentum, gradient clipping, weight decay, and mini-batch size.
Dropout is an amazing and simple concept that prevents over-reliance of the agent on any single colleague. Conceptually, we will be preventing some of the colleagues from calling the agent. In practice, this is done by masking the input with a filter designed to ‘zero-out’ and block some inputs. The agent may have 100 colleagues from which she expects phone calls before relaying her decision. With a dropout chance of 20%, it is reasonable to expect that only about 80 phone calls make it through to her, selected at random. This forces the agent to make her decision using a smaller sub-net of colleagues, and limits her ‘belief’ updates to the few from which she received calls. Disrupting the system like this may not seem beneficial from the start, but soon the agent will realize more appropriate ‘belief’ values to apply to each colleague, instead of relying on a single one to be wrong or right. This method does mean it takes the agent longer than normal to receive enough feedback from the boss group to settle on these values. The final trick with dropout is to disregard it after training. In the example, this implies that every phone call is now working. Now that the agent has fine-tuned her ‘belief’ values without over-reliance, she will begin to get phone calls from each colleague, all carefully weighted, before making her decision and lifting the next phone.
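In code, masking reduces to a coin flip per input. A hedged sketch with illustrative names (the project’s actual Mask class appears later in this post):

```java
import java.util.Random;

// Illustrative sketch of dropout-by-masking: each input survives with
// probability (1 - dropRate); zeroed entries never reach the agent.
public class DropoutSketch {
    public static double[] mask(double[] inputs, double dropRate, Random rng) {
        double[] out = new double[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            out[i] = (rng.nextDouble() < dropRate) ? 0.0 : inputs[i];
        }
        return out;
    }
}
```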
Activation functions are a unique topic in and of themselves. You’ll often hear that activation functions provide non-linearity. But what does that mean? Why do we need it? What is a linear model? To this point I have essentially described a linear model, which is to say a model with the Identity activation. The Identity activation is simply (in = out), where nothing is done to modify the input before the result. I offer now the concept of being excited or eager about a decision. This is where the agent truly comes into play. Without his or her own thoughts on the matter, the bosses may as well have bypassed the agent and queried the colleagues themselves. There would be no need for an intermediary who simply collects the colleagues’ answers before relaying them. In practice, any hidden layer with an Identity activation can essentially be removed and the Network made shallower. Several Identity activation layers in a row are equivalent to a single Identity layer. To see this more clearly, let us make the agent very excitable. I will take a look at the SoftPlus activation to make sense of this.
We can see from the graph what happens to the agent’s decision based on the summary of the inputs. Inputs that push the agent towards ‘no’ are mitigated, and it takes massive influence to push the agent towards 0, or ‘absolutely no.’ However, we see that any positive incoming influences are met with eager approval. This will give the agent’s own perspective on the inputs, justifying to the bosses that he or she is indeed making an impact on the decisions and need not be fired. As far as the boss group’s corrections are concerned, the positive decisions made by the agent are going to be updated more vigorously than the negative decisions. See this in the slope of the graph. The greater the slope of the line, the more the neuron will update the weights used to generate the decision. To cast it back into the analogy, the more the agent was willing to risk their reputation for the decision, the more attention is paid to updating the ‘belief’ values of the colleagues. More aptly, decisions with SoftPlus where the agent says ‘no’ by outputting zero are more akin to simply removing oneself from the decision. By not having participated in the decision, the agent incurs no wrath from the bosses, and has little to do when it comes to updating the ‘belief’ values. Yes, there are absolutely activation functions that allow negative values to pass, and they will allow the agent to be forceful in their decision for ‘no.’ Consider also the comparison between being a close-minded agent and having a very flat activation function. No matter what the bosses say, the agent will take little to no action in updating the ‘belief’ values. This is common for an activation function like Sigmoid, which ‘saturates’ quickly towards a decision, becoming very assured of it. Sigmoid looks like this:
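For reference, SoftPlus and its slope can be sketched in a few lines; conveniently, the derivative of SoftPlus is exactly the logistic Sigmoid:

```java
// Sketch of SoftPlus and its derivative. The slope controls how vigorously
// the weights behind a decision are updated during back-propagation.
public class SoftPlusSketch {
    public static double softPlus(double x) {
        return Math.log(1.0 + Math.exp(x));  // always positive, near-linear for large x
    }
    public static double softPlusPrime(double x) {
        return 1.0 / (1.0 + Math.exp(-x));   // the Sigmoid: slope of SoftPlus
    }
}
```

Note how the slope vanishes for very negative inputs (the agent withdraws, little updating) and approaches 1 for positive inputs (eager approval, vigorous updating).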
This leaves the agent in a position to either withdraw from the decision, or cast a modest vote, both with a very closed mind. In this project I include several activation functions, including some I’ve come up with myself. However, be aware that filling your insurance company with very excitable agents can cause some serious issues. I suggest pairing loudly active agents with layers of calm agents to mitigate their over-excitability. With this process I have found much greater success than simply using the same activation function for every layer.
Momentum smooths out the weight update process and can accelerate it if several updates are in the same direction. Likewise, it can dampen erratic updates that are not backed by other similar ones. Momentum typically will not only accelerate the training of the Network, but should also allow the Network to find more effective minima in the loss-landscape. The agent in the running example could assign a ‘consistency’ score to each colleague. The more consistently they are right or wrong, the more comfortable the agent will feel in updating their ‘belief’ rating. Inconsistent colleagues will receive smaller updates, while a colleague that is reliably wrong will receive a larger update in the negative direction, implying to always do the opposite of what he or she suggests. This has the effect of taking a ‘running average’ of updates and applying the running average to the weights instead of applying the updates directly. This exposes consensus along the timeline of a colleague’s inputs, allowing the agent to classify them as ‘generally wrong’ or ‘generally right’ as opposed to undoing an update if they were right on one sample and wrong on the next. Momentum comes as a factor in [0, 1). Zero momentum indicates that updates will be applied directly, while 0.99 momentum implies heavy reliance on the average over any individual update. For starters, it seems most beneficial to begin without momentum and see how adding it affects the results. Often, a momentum value of 0.9 is chosen.
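A minimal sketch of a momentum-smoothed update, with illustrative names:

```java
// Sketch of momentum: the velocity is a running average of gradients, and
// the weight moves along the velocity rather than the raw gradient.
public class MomentumSketch {
    double velocity = 0.0;
    public double update(double weight, double gradient, double momentum, double learnRate) {
        velocity = momentum * velocity + gradient;
        return weight - learnRate * velocity;
    }
}
```

With a gradient repeatedly pointing the same way, the step size grows; an erratic, sign-flipping gradient would instead be averaged away.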
Gradient clipping, as simple as it is, may be even simpler to see in the analogy. The gradient is the information the agent gets on how wrong or right of a decision they made, and how much it should influence the agent’s ‘belief’ values. Clipping this enforces an absolute maximum and minimum. This prevents the boss group from violently upsetting the agent’s ‘belief’ values with a massive update. With gradient clipping, the feedback that the agent receives is always bounded between two values. The current state of the Network uses an extraordinarily liberal gradient range of [-5, 5].
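Clipping itself is a one-liner:

```java
// Sketch of gradient clipping with the Network's liberal [-5, 5] range.
public class ClipSketch {
    public static double clip(double gradient, double bound) {
        return Math.max(-bound, Math.min(bound, gradient));
    }
}
```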
Weight decay corresponds directly to ‘belief value’ decay in the analogy. What does this imply? There will not be a stationary weight in the Network, even if it isn’t receiving updates. Every round of training, weights lose a small percentage of themselves, moving towards zero. This implies that larger weights are affected more seriously than smaller weights. This is another way to prevent over-reliance on any single colleague, and it will require several updates in a given direction to sustain a large weight in that direction, against decay. For further reading on this, look into L2 regularization in neural networks, as weight decay is an implementation of it. This Network uses a constant weight decay value of 0.0001.
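And decay is just as small:

```java
// Sketch of weight decay: every training round, each weight loses a small
// fraction of itself toward zero, matching the constant 0.0001 quoted above.
public class DecaySketch {
    public static double decay(double weight, double decayRate) {
        return weight - decayRate * weight;  // larger weights shrink by more in absolute terms
    }
}
```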
The final topic I’d like to cover before coding is the implementation of mini-batches in the Network. While the Network can only analyze a single sample at a time, it has the functionality to accumulate data across runs and update once every mini-batch of runs by averaging the accumulated values over the size of the mini-batch. This approach also opens the door to on-the-fly changes to mini-batch size during training. I am still experimenting and have not yet found an astounding use for this, but the option is there for the user. Be aware that larger mini-batches, while they provide a steadier approach to the solution, may cause the Network to take longer to converge for one main reason: the Network now samples a mini-batch’s worth of times, with involved training, before making a single update. Cautious steps become slower steps.
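The accumulate-then-average behavior can be sketched like so (names illustrative, reduced to a single weight for clarity):

```java
// Sketch of mini-batch accumulation: per-sample gradients are summed, and
// an averaged update is applied once every batchSize samples.
public class MiniBatchSketch {
    double accumulated = 0.0;
    int seen = 0;
    double weight;
    final int batchSize;
    final double learnRate;
    public MiniBatchSketch(double weight, int batchSize, double learnRate) {
        this.weight = weight; this.batchSize = batchSize; this.learnRate = learnRate;
    }
    public void observe(double gradient) {
        accumulated += gradient;
        if (++seen == batchSize) {                       // update every mini-batch'th sample
            weight -= learnRate * accumulated / batchSize;
            accumulated = 0.0;
            seen = 0;
        }
    }
}
```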
Bio-logical does not boast the most throughput for a neural network. However, it is broken down into small building blocks and comes with the tools to build them, while trying to emulate the process behind the neuron and its insurance analogy above. To many, this is the heart of object-oriented programming. To me, well, Bio-logical is one of my favorites for this reason. This also has the unfortunate side effect of creating a larger project in my file space. I offer you here a self-sufficient grouping of 12 Network sub-directory classes accompanied by a Vector class in the Math sub-directory and an example XORGate class in the Testing.XORTesting sub-directory of the project.
I typically get started from ‘the top down,’ where we look first at how the project is put together, then track dependencies. However, I find this project more manageable if viewed from ‘inside to out,’ and will start with the basest of classes, the Vector. This simply houses a data array and provides functionality to operate on it or other Vectors and their respective data arrays.
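In lieu of the full embedded source, here is a minimal sketch of the kind of functionality the Vector provides; the real class offers many more operations than the handful shown:

```java
// A minimal sketch of the Vector class: it wraps a double[] and offers
// element-wise operations on itself and other Vectors.
public class Vector {
    private final double[] data;
    public Vector(double[] data) { this.data = data.clone(); }
    public int length() { return data.length; }
    public double get(int i) { return data[i]; }
    public Vector add(Vector other) {
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) out[i] = data[i] + other.data[i];
        return new Vector(out);
    }
    public Vector multiply(Vector other) {               // element-wise (Hadamard) product
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) out[i] = data[i] * other.data[i];
        return new Vector(out);
    }
    public Vector scale(double s) {
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) out[i] = data[i] * s;
        return new Vector(out);
    }
    public double sum() {
        double total = 0.0;
        for (double d : data) total += d;
        return total;
    }
    public double dot(Vector other) { return multiply(other).sum(); }
}
```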
Excellent! Now we have the primary workhorse for the project. Almost every calculation will be done using these vectors. I hope this will become more evident shortly.
The next step in our journey from ‘in to out’ will be to define the activation functions that we use in the Network. Here we are just labeling the types, not encoding the math that defines them. That math will take place later inside of the Nucleus class that applies the activation.
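A sketch of the enum, using the spellings from this post; the real source may name or order the constants differently:

```java
// Labels only: the math behind each of these is deferred to the Nucleus class.
public enum Activation {
    Identity, Sigmoid, TanH, SoftPlus, ELU, SELU, Sine, SArSinH
}
```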
As you can see, I’ve provided quite a few potential activation functions to use in the Network. Most of these tend to be common; SArSinH, however, is a creation of my own. It is a very volatile activation, and for this reason should not be used in consecutive layers. Its violent activations need to be mitigated by another layer of tapered activation. In fact, there is one in the list that fits perfectly, because it handles large inputs with grace. I have achieved greater success by pairing SArSinH (scaled hyperbolic arc-sine) with alternating Sine activation than by simply chaining things like ELU (exponential linear unit) or SELU (scaled exponential linear unit). I truly encourage you to experiment with as many types as possible. If interested, you may even automate the process similarly to the method presented in another post of mine on genetic algorithms and parallel training. If you wish to incorporate your own activation in this project, the methodology is fairly straightforward. Simply add the name of the activation to the list here, and define its function and derivative inside of the upcoming Nucleus class, provided you can define its derivative. If you’re not terribly familiar with derivatives, there are several resources online that can help you calculate them (think Wolfram Alpha).
The Nucleus class is responsible for actually applying the activation to the weighted decision of the Dendrites, i.e. the agent becoming eager or despondent over the input of colleagues. It is a relatively small class with a simple goal, as we are about to see:
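In place of the embedded source, a hedged sketch of the Nucleus idea, covering just two activations (names illustrative; the real class handles the whole Activation enum):

```java
// Sketch of a Nucleus: applies the chosen activation on the forward pass and
// remembers the derivative for back-propagation.
public class NucleusSketch {
    private final String activation;  // e.g. "Identity" or "Sigmoid"
    private double derivative = 1.0;
    public NucleusSketch(String activation) { this.activation = activation; }
    public double activate(double x) {
        switch (activation) {
            case "Sigmoid":
                double s = 1.0 / (1.0 + Math.exp(-x));
                derivative = s * (1.0 - s);    // stored for the backward pass
                return s;
            default:                           // Identity: in = out
                derivative = 1.0;
                return x;
        }
    }
    // Backward pass: scale incoming error by the stored activation slope.
    public double gradient(double error) { return error * derivative; }
    public void update() { /* currently a no-op; reserved for future use */ }
}
```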
An astute eye may catch the useless update method at the bottom of the class. Please forgive me, as this project is destined to become more than it currently is. Another keen observation may highlight the presence of EntropySigmoid, which was not listed in the activation list. Again, to get the project to compile, just add EntropySigmoid to the Activation enum. EntropySigmoid, another creation from a lot of free time, was designed to be used on the output layer and approximate the behavior of log-loss while still using the formula for quadratic loss. I have not incorporated the concept of a loss function in this post for a simple reason: quadratic loss is easily interpreted as ‘how far was I from the target’ by simply subtracting the target from the calculated value. Other loss functions, while more complex, are not an important part of this project.
The next class I would like to introduce is an even smaller utility class called the Axon. The axon is simply a storage space for the activated value that will wait until it is accessed for the next layer. That is to say, this is where the decision of the neuron will sit until needed.
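A sketch of just how small the Axon really is:

```java
// Sketch of the Axon: nothing but a mailbox for the activated value, waiting
// to be read by the next layer.
public class AxonSketch {
    private double value;
    public void store(double value) { this.value = value; }
    public double retrieve() { return value; }
}
```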
In anticipation, you may notice that we now have two of the three vital pieces of the Neuron. The missing link is what is known as the Dendrites class. The Dendrites class, while not quite as succinct and pleasant as the previous neural components, does a lot of the heavy lifting for the process. This is where inputs are captured to prepare for the backwards pass, then processed by the weights of the Dendrites before being funneled into the nucleus for activation. I feel it appropriate here to cover the backwards pass itself in a bit more detail. We must store two ‘derivative vectors.’ These vectors, dwrtI and dwrtW (derivative with respect to Inputs, and derivative with respect to Weights, respectively), are needed during the backwards pass to apportion error through the Network and to the weights. There is no need for intensive calculus here, as the partial derivatives of a weighted sum of inputs (e.g. w1*i1 + w2*i2 + w3*i3 + …) are simply either the weights or the inputs themselves. In fact, dwrtW is simply a copy of the input Vector, and dwrtI is simply a copy of the weights Vector. To cast it back to the analogy, the agent must retain information about what was advised by his or her colleagues and how much the agent believed them (the inputs and weights, respectively), in order to assign new ‘belief’ values after the boss group calls back with an error report. Simply enough, each input, prior to activation, swayed the agent’s decision by the amount of ‘belief’ assigned to it. Correspondingly, each ‘belief’ value swayed the decision by the size of the corresponding input. While updating based on error, we must know the influence that each weight had on the error. This is given by dwrtW, or the input Vector. In order to tell his or her colleagues how much they influenced the error in turn, the agent must first scale the error by how much attention was paid to their input. With little attention, i.e. a small weight or ‘belief’ value, comes little influence on the error, as the colleague’s decision was mostly discarded, and the corresponding colleague will not have much to update with. Please note that, during back-propagation, the agent will have already applied the gradient of his or her own activation before any calculation of incoming influence. That is to say, a hard-headed, or ‘saturated,’ agent will have already minimized the backward-running error before placing any blame or praise for the decision on colleagues.
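The bookkeeping described above can be sketched as follows; method names are illustrative rather than copied from the source:

```java
// Sketch of the Dendrites' bookkeeping: on the forward pass, dwrtW is a copy
// of the inputs and dwrtI a copy of the weights; on the backward pass these
// give each weight's and each input's share of the error.
public class DendritesSketch {
    double[] weights;
    double[] dwrtW, dwrtI;
    public DendritesSketch(double[] weights) { this.weights = weights.clone(); }
    public double forward(double[] inputs) {
        dwrtW = inputs.clone();    // derivative w.r.t. each weight is its input
        dwrtI = weights.clone();   // derivative w.r.t. each input is its weight
        double sum = 0.0;
        for (int i = 0; i < inputs.length; i++) sum += weights[i] * inputs[i];
        return sum;
    }
    // Error each input (each 'colleague') is responsible for.
    public double[] backward(double error) {
        double[] out = new double[dwrtI.length];
        for (int i = 0; i < out.length; i++) out[i] = error * dwrtI[i];
        return out;
    }
    // Plain gradient-descent step on the weights.
    public void update(double error, double learnRate) {
        for (int i = 0; i < weights.length; i++) weights[i] -= learnRate * error * dwrtW[i];
    }
}
```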
Great! We now have all of the pieces of the Neuron. I will now house these segments inside of the Neuron class. This is a simple class that oversees receiving the inputs, processing them with the Dendrites, activation through the Nucleus, storage in the Axon, and offering direct access to the value stored in the Axon. This is simply a ‘pass-through manager’ of sorts that groups the functionality of the individual segments into one enclosed process, for both forward and backward propagation through the Network.
We now have a Neuron for a neural network! Fantastic!
Next step is to create layers. A single layer of neurons can be considered a neural network when equipped with an input layer and an output layer. As you will see later on, adding layers will be a trivial and inviting endeavor. I would like to start first with the basic concept of a Layer. I offer here what is known in Java as an interface. This is a concept that implies and guarantees basic functionalities of any inheriting classes. The inheriting classes here will be the InputLayer, HiddenLayer, and OutputLayer. We can group them into a general Layer format in that each layer will essentially carry out the same processes as other Layers, albeit in different ways. To see it in action, let’s take a look at the code:
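In place of the embedded source, a sketch of the interface’s likely shape; the method names are a guess at what the post describes, and plain double[] arrays stand in for the project’s Vector to keep this self-contained:

```java
// A sketch of the Layer contract: it only promises the behaviors; each
// implementing class decides how to carry them out.
public interface Layer {
    double[] feedForward(double[] inputs);
    double[] backPropagate(double[] errors);
    void update();
    void modifyBias(double bias);
}
```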
Notice that interfaces do not provide implementations of the methods. They simply ensure that any inheriting class will have that functionality. This process will be opaque to the Network class still forthcoming. In fact, the Network class will only know that it is comprised of generic Layers, and will not care whether they are InputLayers, or HiddenLayers, etc. We will take care in later steps to ensure that, upon building the Network, we insert the InputLayer and OutputLayer automatically. The user will only build the HiddenLayers of the Network. In this sense, the OutputLayer is simply a cap to apply over the last HiddenLayer and will be of the same size.
The InputLayer is going to be the most simplistic of Layers. It merely houses the input Vector, applies any requested Mask (recall the above section on Drop-Out), and hands it off to the first HiddenLayer. This, importantly, does not require any Neurons. Referencing the annotations above, it should be a fairly straightforward implementation.
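A hedged sketch of that implementation, with illustrative names:

```java
// Sketch of an InputLayer: no Neurons, just house the input, optionally
// zero out entries per a drop mask, and hand the result onward.
public class InputLayerSketch {
    private final boolean[] dropMask;  // null stands in for MaskType.NONE
    public InputLayerSketch(boolean[] dropMask) { this.dropMask = dropMask; }
    public double[] feedForward(double[] inputs) {
        double[] out = inputs.clone();
        if (dropMask != null) {
            for (int i = 0; i < out.length; i++) {
                if (dropMask[i]) out[i] = 0.0;  // masked inputs never reach the first HiddenLayer
            }
        }
        return out;
    }
    public void update() { /* nothing to do: no Neurons here */ }
    public void modifyBias(double bias) { /* inputs receive no bias */ }
}
```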
Note that there is nothing to do for the update() or modifyBias() methods inside of the InputLayer, as it has no Neurons inside of itself to update, and does not modify the inputs with any bias.
In the OutputLayer, we will see the loss calculation at work. To calculate error, I’m using MSE, or mean squared error. As the name implies, this calculation is the average of the squared differences between each output and its target. This produces a non-negative value for error, as squares are never negative. It is the goal of the Network to minimize this number. The derivative of this ‘cost’ is used to calculate the direction and size of the ‘loss’ of the Network. The derivative of MSE (up to a constant factor) is simply the difference between the output and the target value, scaled by the inverse of the number of outputs. If the guess of the Network was above the target value, this derivative result will be positive. Likewise, an estimation produced under the target value will provide a negative loss result during back-propagation.
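The loss arithmetic can be sketched directly; note I fold the conventional factor of one-half into the loss so that its derivative comes out to exactly the scaled difference described above (the project’s own constant factor may differ):

```java
// Sketch of the OutputLayer's loss arithmetic: MSE (with one-half folded in)
// and its derivative, the scaled difference between output and target.
public class LossSketch {
    public static double meanSquaredError(double[] output, double[] target) {
        double sum = 0.0;
        for (int i = 0; i < output.length; i++) {
            double diff = output[i] - target[i];
            sum += 0.5 * diff * diff;
        }
        return sum / output.length;
    }
    public static double[] derivative(double[] output, double[] target) {
        double[] out = new double[output.length];
        for (int i = 0; i < output.length; i++) {
            out[i] = (output[i] - target[i]) / output.length;  // positive if we overshot
        }
        return out;
    }
}
```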
Once again, there are a few unused methods inside of the OutputLayer. This is for the same reason as the InputLayer. There are simply no Neurons in this construct to demand the functionality.
The final Layer construct we require will be the HiddenLayer. This one will be significantly more involved, but keep in mind it is simply a ‘pass-through manager’ for the Neurons as the Neurons were for their respective components. This Layer should accept an incoming Vector of features, and respond with an output Vector of decisions, which may in turn become the feature Vector for the next HiddenLayer, or the response Vector for the OutputLayer. During back-propagation, it should receive a Vector of error information, and propagate it through the Neurons to build the influence Vector for the ‘previous’ (in this case ‘next’, as we are moving backwards during back-propagation) layer as it goes.
Great! We now have every type of Layer that we need for a complete Network.
Before building the Network, I would like to touch on another concept. It may not seem apparent yet, but we will need some way to hold instructions to build these layers. The user will create the instructions, and the Network will utilize them to do the physical construction. I want to introduce this concept now, though it will become apparent when used later, because it houses a specific enum that we also need. This enum is called MaskType. It offers three available masks, one of which does nothing and is known as MaskType.NONE. The other two offer dropout Masks, one used for the input layer, and another used for the hidden layers. Input dropout, while it can still be useful, is a bit taboo for me. I just do not like the idea of not giving the Network all of the available information when I’m asking it to make a decision. I advise only using dropout on the input Vector if it is a very large Vector. The reason that input dropout is separate from the standard version is that often (though I have no resources at hand to corroborate this) it is wise to use a lower dropout rate on the inputs than on the hidden layers. I truly apologize for the lack of references here. Despite my misgivings, it is an option I have included. Tune these dropout rates and apply the Masks to the Network following any fanciful whim you may feel. There is truly no replacement for experience through exploration.
The final component of the Network may be self-evident. We need a Mask object. Before I introduce it, let me say I have not fully covered the topic of dropout. There is one slight modification to make rather than simply assigning zeroes at some random points in the Vector. This has to do with the ‘loudness’ of the incoming Vector. By assigning zeroes to some of its points, we are dampening the norm of the input Vector, effectively making it ‘quieter.’ When we disregard dropout at the end of the training phase, we would inadvertently overload the Neurons with all available input at full volume. To compensate for this, it is suggested by Geoffrey Hinton, et al., in the paper “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” that any Vector not subject to dropout be scaled by the probability that it would not have been zeroed in the first place. For example, if our dropout percentage were set to 20%, or the value 0.2, any Vector passing through the Network while dropout is not applied must be scaled by 80%, or the value 0.8***. This has the effect of dampening the vector so as not to overload the Neuron. I offer this interpretation of the process in practice:
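A sketch of that train-versus-inference distinction (illustrative names; the project’s Mask class differs in its details):

```java
// Sketch of Mask behavior: during training, drop entries at random; after
// training, pass everything but scale by the keep probability (1 - dropRate)
// so the expected 'loudness' of the Vector stays the same.
public class MaskSketch {
    public static double[] apply(double[] v, double dropRate, boolean training, java.util.Random rng) {
        double[] out = new double[v.length];
        if (training) {
            for (int i = 0; i < v.length; i++) {
                out[i] = (rng.nextDouble() < dropRate) ? 0.0 : v[i];
            }
        } else {
            for (int i = 0; i < v.length; i++) {
                out[i] = v[i] * (1.0 - dropRate);   // e.g. scale by 0.8 for a 20% drop rate
            }
        }
        return out;
    }
}
```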
Perfecto! We are now ready to build the Network around the components that we have made so far! The Network itself, once again, is truly just another ‘pass-through manager’ that adds a bit of functionality to query error. It also offers two methods for calculating error without actually training the Network. One method, known as dryTrain(), will only back-propagate through the OutputLayer. Then you can query the Network for its error as usual. Another is the static method getError(), which requires the Output Vector and the Target Vector to return the error. These methods will be equivalent.
I will now show a demonstration of the Network. In order to really show it off, I will task it with synthesizing an XOR gate. In case you are not familiar, the XOR gate takes two inputs and produces a single output value. Consider two binary inputs A and B, each either 0 or 1. The XOR gate operates on them as “one or the other, but not both.” This implies that, given an input Vector (0,0), the Network should output the Vector (0). Given (1,0), the output should be (1). Likewise (0,1) outputs (1). However, the XOR gate says not both, so the input (1,1) should result in an output of (0). I use this example for one major reason: while it is possible to run the Network with no HiddenLayers, it is not possible for the Network to reproduce the behavior of the XOR gate without a HiddenLayer. I encourage you to experiment with this. Only by adding at least one HiddenLayer should the Network be able to converge on this problem. Walking through the example below, we first build the Network, then we create our data and training Vectors. Do note that if you change the output Activation of the Network to TanH, you can and should change the outputs that would otherwise be 0 to -1. TanH is a wonderful decision tool, and has a wider range than Sigmoid, implying a different threshold for the decision too. With Sigmoid, the decision is 1 if the Network chooses above 0.5, or 0 if it chooses below 0.5. With TanH, the decision is +1 if the Network chooses above 0, and -1 if it chooses below 0. TanH will allow for more active back-propagation than your typical Sigmoid, due to its more active derivative. However, I am not using typical Sigmoid in this demonstration, and am instead taking advantage of my EntropySigmoid. This monster knows no equal when it comes to the OutputLayer. (Note: despite my love for experimenting, I would advise against using this activation in any other layer, as it is very specifically designed. It would be the equivalent of taking a motor boat to a spaceship race.) The example Network I will show you will use all the additional methods that the Network employs, including weight decay, dropout, multiple layers, different activations, momentum, and a mini-batch size of four. This is a traditional example of extreme over-fitting******. Note that, due to the mini-batch size of the Network being set to 4, this process will only update the Network 5000 times total, despite seeing 20000 samples.
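Since the Network’s builder API is not reproduced in this excerpt, here is a hand-weighted stand-in that shows why the HiddenLayer matters: XOR falls out of combining two linear thresholds (an OR unit and an AND unit), something no single threshold can do alone:

```java
// A hand-weighted 2-2-1 network that computes XOR, illustrating why at least
// one HiddenLayer is required. This is a stand-in for the trained Bio-logical
// Network, not its actual API.
public class XorSketch {
    private static double step(double x) { return x > 0 ? 1.0 : 0.0; }
    public static double xor(double a, double b) {
        double or  = step(a + b - 0.5);        // hidden unit 1: fires if either input is on
        double and = step(a + b - 1.5);        // hidden unit 2: fires only if both are on
        return step(or - 2.0 * and - 0.5);     // output: OR but not AND
    }
}
```

A trained Network has to discover an equivalent decomposition on its own, which is exactly what the HiddenLayer gives it room to do.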
Wunderbar! You’ve made it! Congratulations on developing your own BIO-logical network! I hope to hear of any applications you may find for it. Thank you for reading.
— — — — — — — — — — — — —
- *** This seems to leave a few interpretations for exactly how to re-scale components during dropout. I have a few other ideas that seem to work well on small-scale, synthesized datasets.
- ****** I would expect adding Drop-Out of any type to at least slightly hinder our chances of extreme over-fitting. For this XOR gate, it still shouldn’t be a problem, given a wide enough layer and not too extreme a drop-rate. All things considered, weight decay should still have the greater impact.