Implementing Pokedex from scratch Part I

In my last post, I tried to classify Pokemon cards by their type. The results were pretty good, but I didn't really understand what I was doing. I was training an MLP neural network using sklearn to do the classification, but had no idea what was happening behind the scenes.

I took a week or so to study the basics of machine learning and the internals of each model. When I'm studying something new I like to use it for a real purpose, so I was looking for a cool project.

Thankfully my nephew (Yali) gave me the idea. After abandoning the Pokemon cards, he really got into the Pokemon TV show. Yali is a very curious kid and wants to know as much as possible about each Pokemon he sees. I've forgotten most of my Pokemon knowledge, so this time I cannot act as a human Pokedex. Being the good uncle I am, I started to write my own Pokedex from scratch using only Numpy.

The idea is to give Yali an app where the only thing he is required to do is point the camera at a Pokemon picture (a toy or a card), and the app immediately displays which Pokemon it is and some details about it.

Being an ML guru after a week of studying, I decided to use a convolutional neural network, since this type of network is best suited for working with images.

Convolutional Neural Network

This is what a typical convolutional network looks like:

As you can see from the picture, a convolutional network has multiple steps (convolution, activation function, pooling, dropout, and output). The way I see it, all of them are basically steps that take an n-dimensional array as input and produce an m-dimensional array as output.

The development process can be divided into two main parts: forward propagation and backpropagation. Forward propagation is the prediction for a given input, and backpropagation is the part in which we learn and make our network better at recognizing images.

Somewhere in the middle of working on the network I stumbled across Keras, and I was amazed to see how similar my ideas were to Keras's. Keras is an amazing library, but I wanted to write my own, mainly for learning purposes.

Convolution Layer

The goal of the convolution layer is to mark the parts that describe the picture best. Using its filters (which we are learning), the layer travels over the picture and scores each section (we choose how big a section is). Each filter looks for something different in the picture and scores each section accordingly.

A good example is a filter that tries to recognize car wheels. The layer will travel over the picture of the car and score how much each section looks like a wheel (the score is the sum of the products of the section matrix multiplied element-wise with the feature matrix).

To calculate the I∗K matrix we need all of the sub-matrices of matrix I. The iteration starts from the top-left value of matrix I, and on each iteration we skip n values (n is given as a parameter named stride) until there are no values left to read.
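As a sketch of that loop, here is a naive NumPy version of a strided valid convolution (the function name and signature are my own, not the project's actual API):

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image; each output value is the sum of
    the element-wise products of an image sub-matrix and the kernel."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The current sub-matrix, starting at the top-left and
            # skipping `stride` values each iteration.
            section = image[i * stride:i * stride + kh,
                            j * stride:j * stride + kw]
            out[i, j] = np.sum(section * kernel)
    return out
```

For example, a 4x4 image convolved with a 2x2 kernel at stride 2 gives a 2x2 output, one score per non-overlapping section.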

Taking a Charizard card, this is one of the layers after the convolution. It seems like this filter removes everything but the background of the Pokemon.

Activation Layer

Like in a human brain, some neurons fire and some do not. The activation layer helps us simulate that.

For example, given an activation function A, a neuron will fire if the result of A(x) is higher than a given threshold. Read this article for a better understanding of the activation layer and the different possible functions to use.

Keeping with the idea that the framework should be very generic and easy to extend, each activation implements only its equations. This way the user will be able to implement his own activation layers (if I didn't implement them already).
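A minimal sketch of what such a base class could look like, with two common activations as subclasses (the class and method names here are illustrative, not the project's actual API):

```python
import numpy as np

class Activation:
    """Base class: subclasses implement only the equation and its derivative."""
    def forward(self, x):
        raise NotImplementedError
    def derivative(self, x):
        raise NotImplementedError

class ReLU(Activation):
    def forward(self, x):
        return np.maximum(0, x)
    def derivative(self, x):
        return (x > 0).astype(float)

class Sigmoid(Activation):
    def forward(self, x):
        return 1.0 / (1.0 + np.exp(-x))
    def derivative(self, x):
        s = self.forward(x)
        return s * (1.0 - s)
```

The derivative is part of the interface because the backward pass (described later) needs it.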

Pooling Layer

The pooling layer is designed to downsample the data. The reason we want to downsample is to reduce the number of parameters and thereby control overfitting.

There are many ways to pool the data, and the most common one is max pooling. Each time we take a different part of the matrix, and only the highest number passes to the output.

Always thinking about the generic network idea, each pooling implementation is derived from a basic pooling class. The basic pooling class needs to be given the stride size (how many items to pool), and each implementation pools in its own way.
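Here is a naive NumPy sketch of max pooling over non-overlapping stride x stride blocks (assuming the matrix dimensions divide evenly by the stride; the function name is mine):

```python
import numpy as np

def max_pool(matrix, stride=2):
    """Split the matrix into stride x stride blocks and keep only the
    highest number in each block."""
    h, w = matrix.shape
    out_h, out_w = h // stride, w // stride
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            block = matrix[i * stride:(i + 1) * stride,
                           j * stride:(j + 1) * stride]
            out[i, j] = block.max()
    return out
```

A 4x4 input with stride 2 becomes a 2x2 output, a quarter of the original size.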

Using the same Charizard card as before, this is the card before and after the pooling process. You can see that all the important features of the picture are still there (the white parts).

Output Layer

The output layer gives us the final prediction. The size of the output layer's output is the number of possible classes, one neuron per class. The value of an output neuron is the probability that its class appears in the image. Taking only the classes with a score higher than some threshold gives us the prediction.

For example, if we are trying to identify animals in a picture, the value of a neuron will be how sure the network is that the animal this neuron represents is really in the image.

Taking all animals with a score higher than 0.5, for example, gives us great results.
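A tiny sketch of that thresholding step, using made-up class names and raw scores, with a sigmoid to turn scores into probabilities:

```python
import numpy as np

# Hypothetical raw scores from the last layer, one per class.
class_names = ["cat", "dog", "horse"]
scores = np.array([2.0, -1.0, 0.3])

# Squash each score into a probability between 0 and 1.
probabilities = 1.0 / (1.0 + np.exp(-scores))

# Keep only the classes whose probability passes the 0.5 threshold.
predicted = [name for name, p in zip(class_names, probabilities) if p > 0.5]
```

With these scores, `cat` and `horse` pass the threshold while `dog` does not.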

Learning the filters

If I had to explain very briefly how machine learning works, I would say that the main goal is to find how wrong the prediction is, and try to minimize the error.

Using a loss function we have a measurement of how far the prediction is from the real tag. There are many different types of loss functions, and each one is best for a different scenario. To minimize the cost function, we take its derivative with respect to the filters (the filters are the only factor that can be changed).

Having the result of the derivative, we know which direction we need to go to minimize the error. Because the only thing we can change is the filters (you can't change the input picture), we change them according to the derivative we just calculated.

The process of updating the filters by the error is called gradient descent. Each sample updates the filters so that the next time the inputs are processed, the results will be closer to the target. Having multiple samples of different types, we need a way to avoid changing the filters too drastically, because we could miss the minimum of the cost function.
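A minimal sketch of gradient descent on a toy one-parameter cost function (the cost function and learning rate here are just an illustration, not anything from the network):

```python
def gradient_descent_step(weights, gradient, learning_rate=0.01):
    """Move the weights a small step against the gradient of the cost."""
    return weights - learning_rate * gradient

# Toy example: minimize cost(w) = (w - 3)^2, whose derivative is 2 * (w - 3).
w = 0.0
for _ in range(1000):
    grad = 2 * (w - 3)
    w = gradient_descent_step(w, grad, learning_rate=0.1)
# w converges to 3, the minimum of the cost function.
```

The learning rate scales each step: too large and we can jump past the minimum, too small and we waste time, exactly like the map analogy below.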

You can see how close the prediction line gets to the targets as the error approaches its minimum value.

Think of it like navigating somewhere when you can't walk and look at the map at the same time. You will probably take a look at the map, see which direction you need to walk, and after a while do it again. Checking your map too frequently will cost you time, but if you don't look frequently enough you can walk past your target. How frequently you look at the map is the learning rate.

Everything written above would be good enough if we had a linear regression model (only one layer with filters to update). In deep networks we have multiple layers, and each of them has filters that need to be updated. After updating its own filters, each neuron "tells" its connected neurons how wrong it was, and the connected neurons update their filters using that error. This process runs until we hit the input layer (the input layer is not connected to any previous layer). This is called backpropagation: going from the end to the start and updating each layer's filters.

You can visualize it as each layer passing all of its neurons' errors to the previous layer:

Or you can visualize it as each neuron passing its own error to its connected neurons:

Using the chain rule we calculate the error for each layer. After calculating the error of the output layer (the first layer from the end), we propagate the error back to the previous layer. That layer in turn will use the error to calculate its own error, update its filters, and propagate its own error to the previous layer.

The first delta (error) is the derivative of the cost function's error. Let's mark z as the values of the neurons before the activation function a (z is the output of the previous layer multiplied with the layer's filters).

Having all the data we need, we can calculate the first delta. Now, from the end of the network to the start, each layer will use that delta, update its own filters, calculate a new delta, and propagate it to the next layer.


Using the delta from the cost function, we can update the filters connected to each output neuron.

Each filter calculates its own error: taking its related neurons' errors and multiplying them by the input it received, the derivative of the activation function, and the learning rate. Using this new error the layer updates its filters and passes the new error on.
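To make the bookkeeping concrete, here is a sketch of the backward pass for a fully connected layer, following the standard textbook derivation rather than the project's exact code:

```python
import numpy as np

def dense_backward(x, z, weights, delta, activation_derivative, learning_rate):
    """Backprop through one fully connected layer.

    x       -- input the layer received
    z       -- pre-activation values (x @ weights)
    weights -- the layer's filters, shape (inputs, outputs)
    delta   -- error arriving from the next layer, shape (outputs,)
    """
    # Error at this layer's pre-activation values (chain rule).
    local_delta = delta * activation_derivative(z)
    # Gradient for each weight: its input times its neuron's error.
    grad = np.outer(x, local_delta)
    # Update the filters and pass the error back to the previous layer.
    new_weights = weights - learning_rate * grad
    propagated = weights @ local_delta
    return new_weights, propagated
```

Note that the error propagated backward uses the weights *before* the update, since the forward pass was computed with those weights.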


The basic idea of the convolution backprop is the same as in the output layer, but a convolution layer is not a fully connected network where all neurons are connected to the next layer. Each neuron can be connected to different neurons in the next layer, so multiplying by the delta like we did in the output layer is not possible.

We are going to calculate two things here: the error we are going to propagate, and the error by which we are going to change the filter.

To calculate the propagated error we will calculate the convolution between the filter matrix and the delta matrix.

Taken from a post by Grzegorz Gwardys

And to calculate the error by which the filter needs to be changed, we calculate the convolution of the input matrix and the delta matrix. Before updating the filter, this error is multiplied by the learning rate I mentioned earlier.

Taken from a post by Grzegorz Gwardys
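Putting both parts together, here is a stride-1 sketch of the convolution backward pass. It follows the standard derivation for a cross-correlation forward pass; the helper names are mine, not the project's:

```python
import numpy as np

def valid_convolve(a, b):
    """Valid cross-correlation, as used in the forward pass (stride 1)."""
    ah, aw = a.shape
    bh, bw = b.shape
    out = np.zeros((ah - bh + 1, aw - bw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(a[i:i + bh, j:j + bw] * b)
    return out

def conv_backward(image, kernel, delta, learning_rate):
    """delta has the shape of the forward output of valid_convolve."""
    # Error by which the filter changes: convolve the input with the delta.
    kernel_grad = valid_convolve(image, delta)
    # Error to propagate back: "full" convolution of the zero-padded
    # delta with the flipped filter; same shape as the input image.
    kh, kw = kernel.shape
    padded = np.pad(delta, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    propagated = valid_convolve(padded, np.flip(kernel))
    new_kernel = kernel - learning_rate * kernel_grad
    return new_kernel, propagated
```

The zero padding makes the propagated error the same shape as the input, so the previous layer can use it directly.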


Pooling has no filters, so there is nothing to learn here. But pooling decreases the number of neurons, so the propagated delta needs to change too (the delta needs to be the same shape as the layer's outputs). For example, take a network with a convolution layer, max pooling, and an output layer. The delta from the output layer is not the same size as the convolution's output, so the convolution layer can't use that delta.

The backpropagation of the pooling layer goes through the input elements. If the element was the one that was pooled, we insert the delta there; otherwise, 0 is inserted.
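A sketch of that delta routing for max pooling, assuming non-overlapping blocks and matrix dimensions that divide evenly by the stride:

```python
import numpy as np

def max_pool_backward(matrix, delta, stride=2):
    """Route each delta value back to the element that won the max
    pooling; every other position gets 0."""
    grad = np.zeros_like(matrix, dtype=float)
    for i in range(delta.shape[0]):
        for j in range(delta.shape[1]):
            block = matrix[i * stride:(i + 1) * stride,
                           j * stride:(j + 1) * stride]
            # Position of the element that was pooled in this block.
            r, c = np.unravel_index(np.argmax(block), block.shape)
            grad[i * stride + r, j * stride + c] = delta[i, j]
    return grad
```

The result has the same shape as the pooling layer's input, so the convolution layer before it can use the delta directly.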


Test drive

Before classifying the first generation of Pokemon (151 of them) I wanted to test the network with something easier. Having all the cards from the previous post already downloaded, I'm going to build another Pokemon card classifier.

For the purpose of the test, I resized the images to 50x50. The results were very good, but the step images (the input image at each step) were too small for the human eye to recognize anything in them. I wanted to post the images to show you what the network really does, so I had to fit it again, this time with bigger images.

Fitting bigger images with bigger filters, I noticed I had a problem. Fitting the model on only one picture should always converge to a cost of 0: the model learns to identify the same image over and over again, and in the end will be overfitted and very good at recognizing this specific image.

Comparing my results to Keras's, my suspicions were confirmed. Using only one hidden layer, starting with the same initial filters and using the same learning rate should give the same cost function values, but it didn't.

After reading more about each layer, reviewing other people's code, debugging Keras, and trying my own shticks, I was able to fix my network.


For the classification I built the following network:

After a few hours of fitting, I got the desired results. The cool thing (don't judge me) is that we can see what each step looks like when the input is a Charizard card.

Input image

First of all, the image of the card is transformed into a 3-dimensional matrix (red, green, and blue; you can see the red layer is mostly white because the card is mostly red and 0 stands for white).

The deeper we go in the network, the less we can understand what the images stand for.

You can find the source code and the Pokemon example here.

As our hero wrote his own network and got the CNN badge, he moves on to his next adventure: implementing a working Pokedex.