Beginner guides to Convolutional Neural Network from scratch — Kuzushiji-MNIST

Satsawat Natakarnkitkul
Jun 3 · 8 min read

In the previous post, which you can check it out here, I have demonstrated various dimensional reduction techniques on Kuzushiji-MNIST (KMNIST) data set. In this post, I will build the convolutional neural network from scratch using keras to predict the class of KMNIST.

What is CNN?

Artificial Intelligence is growing a lot faster in the past few years, it is sub-field within computer science that aims to create intelligent machines.

There are a lot of research domain within AI, for example: speech recognition, natural language processing (NLP), and computer vision. I will focus on computer vision in this post.

Figure 2: Example of computer vision tasks (Source: Fei-Fei Li, Andrej Karpathy & Justin Johnson (2016))

The aim of computer vision is to enable machines, computers or programs to view the world as humans do, and apply the knowledge to specific tasks such as image and video recognition, image classification. This has been enabled with the advancement of deep learning algorithm, to be more specific, a Convolutional Neural Network (CNN / ConvNet).

The architecture of the Convolutional networks are the connectivity pattern of neurons in the animal brain, and are inspired by biological processes, the connectivity pattern between neurons resembles the organization of the animal visual cortex.

Convolutional neural network has the ability to capture the spatial and temporal dependencies in the input image through the application of relevant filters. It performs better to the image data set due to the reduction in the number of parameters involved and reusability of weights.

How does it work?

Convolutional neural networks (or ConvNets) do not view the images like we, humans, do. We view image as flat canvases with color, we may not care much about the width and height of the images but we can perceive those parameters.

Figure 3: RGB color composition

However ConvNets views images into different dimensional objects such as three-dimensional objects with the third-dimension as color encoding (or color channel), mainly red-green-blue (RGB) which combines and produces the color we see (of course, there are more than RGB, examples are grayscale, HSV, and CMYK).

Figure 4: 3x3x3 RGB images

The first key point in implementing ConvNets is the precise measurement of each dimension of the image as it will becomes the foundation of the linear algebra operation used to process the images.

Figure 4 demonstrates how ConvNets works on the RGB image, each layer represents the color (R-G-B) with the number represented the intensity (range between 0 and 255).

Figure 5: RGB(5, 1, 4)

Continue from figure 4, if we take the upper left pixel, RGB (5, 1, 4). Our eye will see purple-ish color.

Imagine now the digital pictures, a seven-megapixel camera may produce 3072 x 2304 pixels (I’m no expert in camera though!!), so how much computationally intensive this will be?

Convolutional neural networks (ConvNets) is to find which of those numbers are significant signals (by reducing the images into a form which is easier to process, and without losing features) that actually help it classify images more accurately. This is another important concept to keep in mind, when designing the ConvNet architecture which is not only good at learning features but it is more scalable to massive images data set.

Convolutional Neural Network Layers

Figure 6: Example of ConvNet architecture

There are several main layers implemented in ConvNets.

Conv2D layer

It is a 2-Dimension convolutional layer, mainly it is used as the layer to extract features from the raw input image. The first layer is responsible for capturing the low-level features such as edges, color, gradient orientation. With added layers, the architecture will attempt to capture high-level features.

Figure 7: 1-d convolution

Conv2D filters extend through the three channels in the image (red-green-blue). The filters may be different for each channel too.

The red area displays the region of current filter is computed and is called receptive field. The number in the filter is called weights or parameters. The filter is sliding (or convolving) around the image, it will compute the element-wise multiplications and fill the output to the output, which is called activation map (or feature map).

We can have multiple convolution layer to identify and capture high-level features, but this comes with more computational power.

Pooling layer

The next layer following the convolution layer is pooling layer (or down-sampling or sub-sampling). The activation maps are fed into a downsampling layer, this layer will applied one patch at a time (like convolution layer).

The pooling layer progressively reduces the spatial size of the representation. Thus, it reduces the number of parameters and the amount of computational in the network. Pooling layer is also aimed at controlling over-fitting.

Figure 8: Type of pooling layer

There are two main types of pooling layers: max pooling returns the maximum value from the specific pool size (2x2) and average pooling returns the average of all values from the specific pool size, which can be seen from figure 8.

Max pooling performs as a Noise Suppressant, whereas average pooling performs dimensional reduction as noise suppressing mechanism. Hence, max pooling performs a lot better than the average pooling, and is the most common pooling layer to implement.

After going through the above max pooling layer, we have enabled the ConvNet model to understand the image features. Next part, we will feed it to the fully connected neural network for classification task.

However, before feeding the output of pooling layer to the fully connected layer, we need a middle layer to transform the dimension of the data for classification task in FC layer, this middle layer is called Flatten layer.

Flatten Layer

Figure 9: Flatten layer in action

As mention earlier and the name suggested, this layer will flatten / convert the multi-dimensional arrays into a single long continuous linear vector. In more technical term, it breaks the spatial structure of the data and transform the multi-dimensional tensor into mono-dimensional tensor, hence a vector.

Dense / Fully connected Layer

Each neuron receives input from all neurons in the previous layer, thus densely connected. The layer has a weight matrix 𝑊, a bias vector 𝛽, and the activation of previous layer 𝑎. Example can be seen from figure 10a.

Dense implements the operation of:output = activation(dot(input, kernel) + bias) .


Figure 10: Dropout method (Source: Giorgio Roffo, Ranking to Learn and Learning to Rank: On the Role of Ranking in Pattern Recognition Applications)

Actually dropout is a regularization method, during training, some of numbers of layer outputs are randomly ignored (dropped out, switched out). This implementation will has the effect of making layers be treated-like different layers with different number of nodes and connectivity to the prior layer.

In convolutional network, dropout is usually implemented in the fully-connected layers and not the convolution layers.

Python Implementation on Kuzushiji-MNIST

In the previous section, the concepts, definitions of all relevant layers are provided. I will combine those concepts and implement the ConvNet from scratch using keras to classify the Kuzushiji-MNIST¹ in Python language. I will demonstrate how we can write our own callbacks object to use in the model as well.

KMNIST is drop-in replacement for the MNIST data set, it also represents ten classes of Hiragana characters. The given data set has 60,000 training images and 10,000 testing images. All pixels have 28x28 pixels. Each class is distributed evenly (ten class with 6,000 images each).

Figure 11: Kuzushiji-MNIST

The label is provided as numeric number from 0 to 9, with each number represents different Hiragana character (i.e. 0: お, 1: き). The images are also provided in numpy array format with shape of (60000, 28, 28) .

We need to do several input transformation both the images (scaled to 0 and 1) and labels (categorical encoding) data set. The most effective way is to implement function to handle as we can apply it on both training and testing data. I will also subset the training data further to create validation data set.

The training, validation and testing data set consist of 48,000, 12,000 and 10,000 images respectively with the new shape of (28, 28, 1).

As shown in the below code snippet,

  1. we will define the Sequential() model in Keras and add layers to build up the ConvNets. The last layer of the fully-connected is the output layer with 10 output neurons (same number as the class of images). The softmax activation will give you the probabilities of each class label.
model.add(Dense(NUM_CLASS, activation='softmax'))

2. Choosing loss, optimizer, and metrics to be used in the model

3. Define callbacks function

4. Train and evaluate model performance

What is Callbacks function?

A callback is a set of functions to be applied at given stages of the training procedure. You can use callbacks to get a view on internal states and statistics of the model during training.

Ultimately, it can help you fix bugs, build better models, and keep track on model’s training. It can use to visualize how the training is going, help prevent over-fitting by using early stopping and save the best model’s weight automatically.

Figure 12: The 7th round epoch during training

In this demonstration, I manually create callbacks class to keep auc score for each epoch and use ModelCheckPoint and EarlyStopping to save the best weights of the model (based on loss score on validation data set) and stop train the model (if loss score on validation data set stops improving).

Figure 12 shows the callbacks in action, it prints AUC score during each epoch and save the best weight to file.

We then compute the model performance against the test images. The history (or callback) from previous steps can be feed to visualize how the model learns in each epoch.

Figure 13: CNN from scratch model performance

Thanks for reading up to this point. In this article, I started from walkthrough what is ConvNet, and how it works. The layers for the ConvNet architecture are also explained. I close out this article by implementing the ConvNet from scratch using Keras and write our own callback functions, which further help the model development process.

If you have any questions, please feel free to comment or reach me via LinkedIn here.

[1]: “KMNIST Dataset” (created by CODH), adapted from “Kuzushiji Dataset” (created by NIJL and others), doi:10.20676/00000341

Satsawat Natakarnkitkul

Written by

The AI / ML guy, senior data scientist #keeplearning #datascience

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade