Building deep convolutional neural networks from scratch in Java, MNIST implementation — Full Walkthrough

Robert Hildreth · Published in Geek Culture · Jul 5, 2021 · 11 min read

This is an extensive tutorial and walkthrough showing a method of using OOP to construct different convolutional networks with little to no major ML optimizations (i.e. vanilla networks). I will also demonstrate their usage on the MNIST dataset with an explanation of one method for obtaining the images (edit: this explanation will come in a separate post). I am assuming familiarity with matrix convolutions as far as typical neural network operations are concerned. I also assume familiarity with Java as an object-oriented language.

Forward pass example:

Backward pass example:

To begin, we will need an object to represent the image (e.g. the ‘i’ column on the left in the first picture). I creatively called mine ‘Image’ in this example. This Image class will have multiple channels available to store data; Red, Green, Blue, and Alpha, for example. It will be constructable in several ways, as it will represent the image both pre- and post-convolution (i.e. convolving an Image object will create another Image object), and it will be presented on the far side of the network as a solution. Note here that allowing the user access to the object opens the vulnerability of that object having been changed. I note this because we typically use the most recent output during training to calculate loss, which could come after user interaction; if that object had been manipulated by a user in the meantime, our training could be thrown off. Therefore we should build in a way to quickly make an independent clone of the object that we can feel more comfortable sending to the user, and you will see this included. Note also that CHANNELS is set equal to one and is not dynamic. It must be updated if not using a grayscale image. For MNIST, we can and should leave it alone.
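The author's original gist is not reproduced here, but a minimal sketch of an Image along those lines might look like the following. It assumes the HeatMap class introduced just below, and it stashes the ground-truth label on the object for convenience; the names and exact layout are illustrative stand-ins, not the original code.

```java
// A minimal Image sketch: a stack of HeatMap channels plus a deep-copy method.
public class Image {
    public static final int CHANNELS = 1; // grayscale; bump this for RGB(A) data

    private final HeatMap[] channels;
    private final int label; // the digit this image represents, -1 if unknown

    public Image(HeatMap[] channels, int label) {
        this.channels = channels;
        this.label = label;
    }

    public HeatMap channel(int i) { return channels[i]; }
    public int depth() { return channels.length; }
    public int label() { return label; }

    // Independent deep copy, so a copy handed to the caller cannot disturb
    // the data the network is still training against.
    public Image copy() {
        HeatMap[] cloned = new HeatMap[channels.length];
        for (int i = 0; i < channels.length; i++) {
            HeatMap src = channels[i];
            HeatMap dst = new HeatMap(src.rows(), src.cols());
            for (int r = 0; r < src.rows(); r++)
                for (int c = 0; c < src.cols(); c++)
                    dst.set(r, c, src.get(r, c));
            cloned[i] = dst;
        }
        return new Image(cloned, label);
    }
}
```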

This particular Image class utilizes special matrix classes to populate its channels, which I call HeatMaps. They harbor a private data matrix and provide operations against other HeatMaps such as adding, point-multiplying, and subtracting. They are essentially used in place of any general Matrix class for the network and are as follows:
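Again, a minimal sketch of the idea, assuming all the class needs is element access plus the element-wise operations named above; names and signatures are my own stand-ins.

```java
// A HeatMap-style matrix wrapper: a private grid of doubles with element-wise operations.
public class HeatMap {
    private final double[][] data;

    public HeatMap(int rows, int cols) {
        this.data = new double[rows][cols];
    }

    public int rows() { return data.length; }
    public int cols() { return data[0].length; }

    public double get(int r, int c) { return data[r][c]; }
    public void set(int r, int c, double v) { data[r][c] = v; }

    // element-wise addition, returning a new HeatMap
    public HeatMap add(HeatMap other) {
        HeatMap out = new HeatMap(rows(), cols());
        for (int r = 0; r < rows(); r++)
            for (int c = 0; c < cols(); c++)
                out.data[r][c] = data[r][c] + other.data[r][c];
        return out;
    }

    // element-wise subtraction
    public HeatMap subtract(HeatMap other) {
        HeatMap out = new HeatMap(rows(), cols());
        for (int r = 0; r < rows(); r++)
            for (int c = 0; c < cols(); c++)
                out.data[r][c] = data[r][c] - other.data[r][c];
        return out;
    }

    // Hadamard (pointwise) product
    public HeatMap pointMultiply(HeatMap other) {
        HeatMap out = new HeatMap(rows(), cols());
        for (int r = 0; r < rows(); r++)
            for (int c = 0; c < cols(); c++)
                out.data[r][c] = data[r][c] * other.data[r][c];
        return out;
    }
}
```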

Now we should have an operable Image. I will demonstrate how to load several images from files stored locally on your computer (i.e. to follow this tutorial you must have all 70,000 MNIST images stored on your machine, split between two folders: ‘Test’ and ‘Train’. More on how to achieve this will come later). You will have to write your own custom PATH for where you have the pictures stored.
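One plausible way to do the loading is sketched below. It assumes each split folder contains one subfolder per digit (0–9) full of PNG files; that layout, the PATH constant, and the class and method names are assumptions for illustration, not the author's loader.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Loads Image objects from PATH/Train/<digit>/*.png or PATH/Test/<digit>/*.png.
public class MnistLoader {
    private static final String PATH = "/path/to/mnist"; // edit for your machine

    public static List<Image> load(String split) throws IOException {
        List<Image> images = new ArrayList<>();
        File root = new File(PATH, split); // "Train" or "Test"
        for (File digitDir : root.listFiles(File::isDirectory)) {
            int label = Integer.parseInt(digitDir.getName());
            for (File png : digitDir.listFiles()) {
                BufferedImage buf = ImageIO.read(png);
                HeatMap map = new HeatMap(buf.getHeight(), buf.getWidth());
                for (int y = 0; y < buf.getHeight(); y++) {
                    for (int x = 0; x < buf.getWidth(); x++) {
                        // grayscale: any one colour channel of the raster will do;
                        // scale 0..255 down to 0..1
                        int gray = buf.getRGB(x, y) & 0xFF;
                        map.set(y, x, gray / 255.0);
                    }
                }
                images.add(new Image(new HeatMap[] { map }, label));
            }
        }
        return images;
    }
}
```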

This proves to be a thoroughly speedy implementation; however, it will utilize your entire CPU for the first few seconds of the program. Regardless, we now have lists of Image objects created from real images that we can begin to have all sorts of fun with. Indeed, we can now even outline the current project. The following is the main method inside of my program that runs the project.
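The original main method is not reproduced here; the sketch below captures its shape as described in this post, using the loader above and the ImgNet and Trainer sketches that appear later. Only the first two layer specs come from the article; the remaining rows are my own filler, chosen so the spatial size lands on the 1 x 1 x 10 output discussed below.

```java
import java.util.List;

public class Main {
    public static void main(String[] args) throws Exception {
        List<Image> train = MnistLoader.load("Train");
        List<Image> test  = MnistLoader.load("Test");

        // each row is {kernel size, stride, neuron count}
        int[][] layout = {
            { 6, 2, 5 },   // 28x28x1 -> 12x12x5 (from the article)
            { 2, 1, 8 },   // 12x12x5 -> 11x11x8 (from the article)
            { 5, 1, 8 },   // 11 -> 7  (filler layers of my own)
            { 5, 1, 8 },   //  7 -> 3
            { 3, 1, 8 },   //  3 -> 1
            { 1, 1, 10 }   //  1x1x10 classifier layer
        };
        ImgNet net = new ImgNet(layout, Image.CHANNELS);

        Trainer.train(net, train);
        System.out.println("Test accuracy: " + Trainer.test(net, test));
    }
}
```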

As we follow from above, we see that we actually create a network using tools that I have not yet presented. However, I think this is important to see now, since it gives us a good overview of what it is exactly that we’re trying to accomplish. First, we want to design the network from the top down. This involves a little bit of math and forward thinking. There are several things to notice about this process that can make it simple enough to be done in your head, but there is still a formula that can help you choose your kernel sizes. The new width w of the Image created by convolving Image I, of width w0, with a kernel K of size k and stride s is given by *

w=(w0-k)/s + 1

For example, the MNIST images come as 28 x 28 x 1 images (grayscale). While building this network, I designed it to begin with five 6 x 6 x 1 kernels with stride 2. This severely down-samples the image. Running through the formula:

w = (28 - 6)/2 + 1 = 12

we see that the resulting Image, now effectively a ‘feature map’ in common terms, is a 12 x 12 x 5 image. Hereafter we will be limiting the kernel strides to one, as we no longer require violent down-sampling. Doing this again for the next layer, albeit with s = 1, with eight total 2 x 2 x 5 kernels results in a feature map of size 11 x 11 x 8. We also see that applying a 3 x 3 x C kernel downsizes the image by 2 when the stride is kept to one, and applying a 5 x 5 x C kernel downsizes the image by 4 when the stride is kept to one. The final layer acts as a fully connected network layer would (that is to say, as a classifier), since a 1 x 1 x C kernel over a 1 x 1 x C input is fully connected. At the end of the convolutional chain we end up with a 1 x 1 x 10 Image as an output. Each of the ten channels can now be said to contain information regarding the assuredness of the network that a given input Image object is a member of one of 10 classes, implying that an untrained network is really just a random guess generator. Once we have documented the kernel sizes and their respective strides and multiplicity (number of ‘neurons’ in that layer), we can allow ImgNet to construct us a network. Now that we have lists of images and a network to utilize, we can train the network before sending it off for testing. At the end of all of this complexity resides a happy little reminder that absurdities can happen:

‘Hello World’
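Before moving on, here is a quick numeric check of the size formula against the layers just described. It is a tiny standalone class of my own, assuming no padding (see the footnote at the end).

```java
// Verifies the output-size formula w = (w0 - k)/s + 1 for a few of the layers above.
public final class SizeCheck {
    static int outputSize(int w0, int k, int s) {
        return (w0 - k) / s + 1;
    }

    public static void main(String[] args) {
        System.out.println(outputSize(28, 6, 2)); // 12: five 6x6 kernels, stride 2
        System.out.println(outputSize(12, 2, 1)); // 11: eight 2x2 kernels, stride 1
        System.out.println(outputSize(11, 3, 1)); //  9: a 3x3 kernel shrinks the map by 2
        System.out.println(outputSize(11, 5, 1)); //  7: a 5x5 kernel shrinks the map by 4
    }
}
```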

I will take this opportunity to dive into the training and testing phases of the project. I hope this will give a more thorough understanding of how the network is supposed to work if we actually get to see it in practice, despite not yet knowing the cogs behind the curtain. This particular implementation utilizes a multi-pass method that allows the network to crawl across the dataset as many times as needed to fulfill the exit requirement. Presently, that requirement is that the network is incorrect on no more than 6100 of the 60000 training samples after a complete sampling has taken place. In addition, note that the network does not train on an example it actually answered correctly. This causes training to eventually speed up significantly, at the cost of fewer updates throughout a ‘visual epoch’ (i.e. seeing all examples). While the limit for correctness is arbitrary, I’ve concluded that this is a good trade-off that limits my runs to approximately 8–16 minutes for scores around 90+%. My most successful run typically scores 92% in roughly six and a half minutes, testing time included. (Edit note: architectural changes and changes in the exit conditions, combined with the addition of momentum to this network, have since produced networks capable of scoring 97+% in roughly three hours.) Note here that between every epoch the dataset is shuffled.
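A sketch of that training loop follows. The Trainer class and its method names are mine, and it relies on the classify, backpropagate, and update methods assumed on the ImgNet sketch later in this post (whose backward half is stubbed out and would need to be filled in for this to actually converge). It keeps re-crawling the shuffled training list until a full epoch produces no more than 6,100 mistakes, and skips training whenever the network was already correct.

```java
import java.util.Collections;
import java.util.List;

public class Trainer {
    private static final int MAX_WRONG = 6100; // exit condition described in the article

    public static void train(ImgNet net, List<Image> train) {
        int wrong;
        do {
            wrong = 0;
            Collections.shuffle(train); // reshuffle between every visual epoch
            for (Image img : train) {
                if (net.classify(img) == img.label()) {
                    continue; // correct answers are not trained on, so later epochs speed up
                }
                wrong++;
                net.backpropagate(img.label()); // accumulate deltas (assumed method)
                net.update();                   // apply and reset deltas (assumed method)
            }
        } while (wrong > MAX_WRONG);
    }

    public static double test(ImgNet net, List<Image> test) {
        int correct = 0;
        for (Image img : test) {
            if (net.classify(img) == img.label()) correct++;
        }
        return correct / (double) test.size();
    }
}
```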

Now I feel comfortable moving on to the interior operations and specifications of the network itself. Though it’s hard to say, I do believe ImgNet is one of my favorite creations. Before going headlong into the core of the network, however, I believe it is prudent to showcase some of the utilized peripheries and cover them in some depth. To begin, we may first take a look at the activation functions used by the nodes, or ‘neurons,’ of the network inside of each layer after the convolutions have been performed. As you may know, activation functions add a layer of non-linearity to the network, increasing its expressivity with depth. One opportunity for optimization makes itself noticed here. In order to know which activation to apply, it is simple to give each ‘neuron’ an assignment and then use an if-then check at runtime to decide how to apply the activation. If this doesn’t make sense, consider the following example: say that during training some network has a layer with neurons in it ready to activate their results. Each neuron must check if (activation == SIGMOID) then calculate that activation, else if (activation == RELU) then calculate that activation. I strove to avoid this approach because of the plethora of branches it casts for no real gain. To optimize, I create an enumerated type that defines all activations and choose which of the Enum constants to provide to each layer during compilation, so that it can be referenced directly to provide its activation during runtime with no branching. Simply calling the Enum is enough. The sample from the interior of my Neuron class follows (please forgive the line-skipping, as the code not present is not important for the state of the network).
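The author's gist is not reproduced here, but the branch-free idea looks roughly like the enum below: each constant carries its own activation (and derivative) code, so a neuron simply calls whatever constant it was handed at construction time. The constant set and method names are illustrative.

```java
// Each enum constant supplies its own activation and derivative; no runtime if-else chain.
public enum Activation {
    SIGMOID {
        @Override public double apply(double x)      { return 1.0 / (1.0 + Math.exp(-x)); }
        @Override public double derivative(double y) { return y * (1.0 - y); } // y = sigmoid(x)
    },
    RELU {
        @Override public double apply(double x)      { return Math.max(0.0, x); }
        @Override public double derivative(double y) { return y > 0.0 ? 1.0 : 0.0; }
    };

    public abstract double apply(double x);
    public abstract double derivative(double y);
}
```

At runtime a neuron simply evaluates activation.apply(weightedSum) on whichever constant it holds; no branch on the activation type is ever consulted.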

Now we really need to break down what we’re trying to accomplish (here comes the hairy stuff). Each network will be built of successive sampling ‘Layers,’ and each Layer will consist of a slew of ‘Neurons.’ Each Neuron will have a number of ‘Striding Kernels’ equal to the depth of the image after processing by the previous layer. Each Striding Kernel will be a controller and manager for a ‘Window,’ which acts here as a subtype of HeatMap that adds the functionality of having a source point (a location relative to another grid, e.g. the input HeatMap) and another nested HeatMap that contains the data for the deltas of the Window’s effective ‘weights.’ This allows the window to save a memory of what is passed through it and to update itself accordingly when requested to. Notice that updating resets the always-accumulating weight deltas back to zero, and there is also functionality to reset the deltas without updating. This allows for unadulterated accumulation over the next run; one of these two delta-resetting functions must be applied if one intends to continue training the network. Before getting disheartened and bogged down, let us see the scope of the project by taking a look at the ImgNet code itself, and seeing how it builds and utilizes these components.
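In place of the original gist, here is a rough sketch of what an ImgNet-style container could look like, assuming the Layer class sketched further down. It pushes an Image forward through each layer and applies SoftMax to the final 1 x 1 x 10 output; the backward pass and weight update are stubbed out to keep the sketch short. The constructor format and method names are assumptions, not the author's originals.

```java
import java.util.ArrayList;
import java.util.List;

public class ImgNet {
    private final List<Layer> layers = new ArrayList<>();

    // each row of layout is {kernel size, stride, neuron count}
    public ImgNet(int[][] layout, int inputDepth) {
        int depth = inputDepth;
        for (int[] spec : layout) {
            layers.add(new Layer(spec[0], spec[1], spec[2], depth));
            depth = spec[2]; // the next layer sees one channel per neuron in this one
        }
    }

    // forward pass: every layer convolves the previous layer's output Image
    public double[] forward(Image input) {
        Image current = input;
        for (Layer layer : layers) {
            current = layer.forward(current);
        }
        return softMax(current); // the final 1x1x10 Image becomes class probabilities
    }

    public int classify(Image input) {
        double[] probs = forward(input);
        int best = 0;
        for (int i = 1; i < probs.length; i++) {
            if (probs[i] > probs[best]) best = i;
        }
        return best;
    }

    // backward pass for the expected label: omitted from this sketch; the Window
    // sketch at the end shows the per-kernel delta bookkeeping it would rely on
    public void backpropagate(int expectedLabel) { /* omitted */ }

    // apply and reset the accumulated deltas in every layer: also omitted here
    public void update() { /* omitted */ }

    // numerically stable SoftMax over the channels of a 1x1xC output Image
    private static double[] softMax(Image out) {
        double[] scores = new double[out.depth()];
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < scores.length; i++) {
            scores[i] = out.channel(i).get(0, 0);
            if (scores[i] > max) max = scores[i];
        }
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            scores[i] = Math.exp(scores[i] - max); // subtract max for stability
            sum += scores[i];
        }
        for (int i = 0; i < scores.length; i++) {
            scores[i] /= sum;
        }
        return scores;
    }
}
```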

Nothing too complex is going on here. This class simply manages seeing the images through the network in either direction and provides some useful SoftMax functionality while doing so. One additional thing I would like to note is that the only packages I’ve imported so far come either from my other folders utilized in the hierarchy or from Java itself. I truly mean from scratch (self-applause concluded). In this we see the requirement for an object called a Layer. This Layer will manage its neurons and report their conclusions. The Layer is as follows.
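A minimal sketch along those lines, assuming the Neuron class described next. The per-layer activation is hard-coded here only to keep the sketch short; in the real project it would be chosen when the network is compiled.

```java
// A Layer is a bundle of Neurons that all read the same input Image; stacking each
// neuron's single output channel produces the next feature map.
public class Layer {
    private final Neuron[] neurons;

    public Layer(int kernelSize, int stride, int neuronCount, int inputDepth) {
        neurons = new Neuron[neuronCount];
        for (int i = 0; i < neuronCount; i++) {
            neurons[i] = new Neuron(kernelSize, stride, inputDepth, Activation.RELU);
        }
    }

    // every neuron contributes one channel of the outgoing feature map
    public Image forward(Image input) {
        HeatMap[] channels = new HeatMap[neurons.length];
        for (int i = 0; i < neurons.length; i++) {
            channels[i] = neurons[i].forward(input);
        }
        return new Image(channels, input.label());
    }
}
```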

Great! Now we have the Layer that the Network class requires. But, as you can see and as described earlier, this Layer class depends on the Neuron class. The Neuron class typically does the heavy lifting but, as we’ll see, I’ve delegated some of the pure maths into another class, leaving my Neuron as simply the manager of a set of Sliding Kernels. Let’s take a look:
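Sketching that idea (the article's gist is not reproduced): one kernel per input channel, responses summed, then biased and activated. I use the class name StridingKernel below, since the article calls this piece both the Striding Kernel and the Sliding Kernel; all names here are stand-ins.

```java
// A Neuron as "a manager of kernels": one StridingKernel per input channel.
public class Neuron {
    private final StridingKernel[] kernels; // one kernel per input channel
    private final Activation activation;
    private final double bias = 0.0;        // kept at zero to keep the sketch short

    public Neuron(int kernelSize, int stride, int inputDepth, Activation activation) {
        this.kernels = new StridingKernel[inputDepth];
        for (int i = 0; i < inputDepth; i++) {
            kernels[i] = new StridingKernel(kernelSize, stride);
        }
        this.activation = activation;
    }

    // slide every kernel over its channel, sum the per-channel responses, then activate
    public HeatMap forward(Image input) {
        HeatMap sum = kernels[0].convolve(input.channel(0));
        for (int i = 1; i < kernels.length; i++) {
            sum = sum.add(kernels[i].convolve(input.channel(i)));
        }
        HeatMap out = new HeatMap(sum.rows(), sum.cols());
        for (int r = 0; r < sum.rows(); r++) {
            for (int c = 0; c < sum.cols(); c++) {
                out.set(r, c, activation.apply(sum.get(r, c) + bias));
            }
        }
        return out;
    }
}
```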

I will stop here momentarily to point out that we have already seen this code and the following nested Enum that describes our activations. I will succinctly skip reposting the same code.

Once again we have a dependency that I have not yet addressed: the Neuron’s dependency on what I call the Sliding Kernel. I suppose I should note that I never flip the convolving kernels, so this network really operates more on the side of cross-correlation than convolution, but such nuance seems lost so often that I’m sticking with the common name for this procedure. The sliding kernel contains the logic for repositioning its special HeatMap, called a Window, over the input HeatMap and performing pointwise multiplication and summation at each instance of overlap. This is the true process of the convolution.
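A compact sketch of that behavior, assuming the Window class shown last; the weight initialization and exact method names are my own choices.

```java
import java.util.Random;

// Owns a Window and slides it over an input HeatMap, taking a pointwise
// product-and-sum at every overlap position (unflipped, i.e. cross-correlation).
public class StridingKernel {
    private final Window window;
    private final int stride;

    public StridingKernel(int size, int stride) {
        this.window = new Window(size, new Random());
        this.stride = stride;
    }

    // each overlap of the window with the input becomes one output cell
    public HeatMap convolve(HeatMap input) {
        int outRows = (input.rows() - window.size()) / stride + 1;
        int outCols = (input.cols() - window.size()) / stride + 1;
        HeatMap out = new HeatMap(outRows, outCols);
        for (int r = 0; r < outRows; r++) {
            for (int c = 0; c < outCols; c++) {
                window.moveTo(r * stride, c * stride); // remember the source point for backprop
                out.set(r, c, window.dot(input));
            }
        }
        return out;
    }
}
```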

Excellent! Almost there! We have one dependency left, and I will refrain from stalling too much. The Window class is a special type of HeatMap that keeps information about what passes ‘through’ it. This is necessary for backpropagation. Perhaps a more in-depth write-up will tackle this issue but, for now, here is the operating code.
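Since the original gist is not included here, the sketch below captures the described behavior: a grid of weights that remembers its current source point on the input, accumulates weight deltas, and offers both an apply-and-reset update and a reset-without-update. All names and the initialization scheme are illustrative.

```java
import java.util.Random;

// A Window: kernel weights plus delta bookkeeping and a source point on the input.
public class Window {
    private final HeatMap weights;
    private final HeatMap deltas;     // accumulated gradient for each weight
    private final int size;
    private int sourceRow, sourceCol; // where the window currently sits on the input

    public Window(int size, Random rng) {
        this.size = size;
        this.weights = new HeatMap(size, size);
        this.deltas = new HeatMap(size, size);
        for (int r = 0; r < size; r++)
            for (int c = 0; c < size; c++)
                weights.set(r, c, rng.nextGaussian() * 0.1); // small random init
    }

    public int size() { return size; }

    public void moveTo(int row, int col) {
        this.sourceRow = row;
        this.sourceCol = col;
    }

    // pointwise multiply-and-sum against the patch of input under the window
    public double dot(HeatMap input) {
        double sum = 0.0;
        for (int r = 0; r < size; r++)
            for (int c = 0; c < size; c++)
                sum += weights.get(r, c) * input.get(sourceRow + r, sourceCol + c);
        return sum;
    }

    // backprop hook: remember how much each weight should eventually move
    public void accumulateDelta(HeatMap input, double errorSignal) {
        for (int r = 0; r < size; r++)
            for (int c = 0; c < size; c++)
                deltas.set(r, c, deltas.get(r, c)
                        + errorSignal * input.get(sourceRow + r, sourceCol + c));
    }

    // apply the accumulated deltas, then zero them out for the next run
    public void update(double learningRate) {
        for (int r = 0; r < size; r++)
            for (int c = 0; c < size; c++)
                weights.set(r, c, weights.get(r, c) - learningRate * deltas.get(r, c));
        resetDeltas();
    }

    // discard accumulated deltas without touching the weights
    public void resetDeltas() {
        for (int r = 0; r < size; r++)
            for (int c = 0; c < size; c++)
                deltas.set(r, c, 0.0);
    }
}
```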

Voilà! There you have it. You should now be able to score well on the MNIST dataset, or simply categorize images in general. I will make a separate post on exactly how I got the images from MNIST saved onto my computer, but this network is capable of computation with any general image that Java can extract a raster from. If you actually read this and have any comments or questions, please feel free to reach out to me. Your viewership is appreciated. Namaste.

* Do note here that this formula does not include the concept of padding the input, as I have not incorporated that into this construct.
