Jun 1, 2018 · 13 min read


Image from MDPI Entropy

How I built a convolutional image classifier using TensorFlow from scratch (without using Dogs vs. Cats; from getting images from Google to saving our trained model for reuse).

I’m just very tired of the same implementation everywhere on the internet.

Though it is from scratch, I don’t explain the theory here, because you can find many better explanations online, with visualizations too. However, the execution and CNNs are briefly explained.

What are we gonna do:

  • We will build a three-layer, community-standard CNN image classifier to classify whether the given image is an image of Batman or Superman.
  • Learn how to build a model from scratch in TensorFlow.
  • How to train and test it.
  • How to save and use it further.

Tl;Dr: Procedure and code in Github.

Collecting Data:

Google Images Downloader. This is what I’ve used and it’s fast, easy, simple and efficient.

My Google image search screenshot image

I’ve collected 300 images each for Supes and Batsy, but more data is highly preferable. Try to collect as much clean data as possible.


As I said, 300 images is nowhere near enough for deep learning. So, we must augment the images to get more out of whatever we collected. You can do it easily with Augmentor. The code that I used is on GitHub, which is mentioned at the end.

Screenshot of my files in my computer

Same image, augmented using various transformations. I had 3500 images each after augmentation.

Careful: While augmenting, be careful about what kind of transformation you use. You can mirror-flip a Bat logo, but you cannot turn it upside down.
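To make that warning concrete, here is a minimal pure-Python sketch of the two flips on a tiny grid of pixel values. The function names are made up for illustration, and this is not how Augmentor implements its transformations; it only shows why one flip keeps a logo plausible and the other does not.

```python
# A tiny "image" as rows of pixel values; the 9s mark the top edge of a logo.
logo = [
    [9, 9, 9],
    [0, 5, 1],
    [0, 5, 1],
]

def mirror_flip(image):
    """Flip left to right: every row is reversed, the top stays on top."""
    return [list(reversed(row)) for row in image]

def upside_down_flip(image):
    """Flip top to bottom: the order of the rows is reversed."""
    return [list(row) for row in reversed(image)]

mirrored = mirror_flip(logo)       # still a plausible logo
inverted = upside_down_flip(logo)  # the top edge now sits at the bottom
```

The mirrored version keeps the top row where it was, so the classifier still sees a valid example. The inverted version is something it will never meet at prediction time.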


Once we have augmented our data, we need to standardize it: we convert all the images to the same format and size.

resizing image by me


  • Make a folder named rawdata in the current working directory.
  • Create folders with their respective class names and put all the images in their respective folders.
  • Run this file in the same directory as rawdata.
  • This will resize all the images to a standard resolution and the same format and put them in a new folder named data.

Note: This step is embedded in the training code, so it is unnecessary to run it explicitly.

Update: I’ve added the data folder itself online found here. Just download and extract in the same folder as the project.

Getting Serious:

Okay, till now it’s just scripting work. I haven’t gone into details since the steps are rudimentary. From now on I will go step by step with an explanation of what I’m doing in the code.

Building a model:

Don’t let it fool you with its complex behaviour; we have something at least a billion times more complicated sitting on top of our shoulders.

From Scratch:

  • I just created a file named
  • Create a class named model_tools with the following functions:
import tensorflow as tf

class model_tools:
    # Defined functions for all the basic tensorflow components that we needed for building a model.

    def add_weights(self):
        return 0

    def add_biases(self):
        return 0

    def conv_layer(self):
        return 0

    def pooling_layer(self):
        return 0

    def flattening_layer(self):
        return 0

    def fully_connected_layer(self):
        return 0

    def activation_layer(self):
        return 0

Now we are gonna define every function with its parameters,


Parameters: shape.

It returns a variable of the given shape, initialised with random values, whenever it is called. tf.truncated_normal draws those values from a normal distribution and re-picks any value that falls more than two standard deviations from the mean, so the initial weights stay small.

def add_weights(self, shape):
    # a common method to create all sorts of weight connections
    # takes in shapes of previous and new layer as a list e.g.
    # starts with random values of that shape.
    return tf.Variable(tf.truncated_normal(shape=shape, stddev=0.05))
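If the "truncated" part is unclear, the idea can be sketched in plain Python: draw from a normal distribution and throw away anything further than two standard deviations from the mean. This is only an illustration of the sampling rule, not TensorFlow's implementation; the function name and the seed parameter are my own.

```python
import random

def truncated_normal(count, mean=0.0, stddev=0.05, seed=None):
    """Draw `count` samples, re-picking any that land beyond 2 * stddev."""
    rng = random.Random(seed)
    values = []
    while len(values) < count:
        sample = rng.gauss(mean, stddev)
        # keep only samples within two standard deviations of the mean
        if abs(sample - mean) <= 2 * stddev:
            values.append(sample)
    return values

# With stddev=0.05, as in add_weights, every starting weight lies in [-0.1, 0.1].
weights = truncated_normal(1000, stddev=0.05, seed=0)
```

The practical effect is that no connection starts out with an unusually large weight.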


Parameters: shape

Biases are initialised with some constant for that shape. Returns bias variable.

def add_biases(self, shape):
    # a common method to create biases with default=0.05
    # takes in shape of the current layer e.g. x=10
    return tf.Variable(tf.constant(0.05, shape=shape))


Parameters: layer, kernel, input_shape, output_shape, stride_size.

Strides: Think of these as jump values for the sliding window in the convolutional map.

Convolution occurs here.

layer - takes in the last layer.
kernel - kernel size for convolving over the image.
input_shape - number of channels in the incoming layer.
output_shape - number of feature maps the convolution produces.
stride_size - determines the kernel jump size.
def conv_layer(self, layer, kernel, input_shape, output_shape, stride_size):
    # filter shape: [height, width, input channels, output channels]
    weights = self.add_weights([kernel, kernel, input_shape, output_shape])
    biases = self.add_biases([output_shape])
    stride = [1, stride_size, stride_size, 1]
    # does a convolution scan on the given image
    layer = tf.nn.conv2d(layer, weights, strides=stride, padding='SAME') + biases
    return layer
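A quick way to sanity-check a layer like this is to work out its output size and parameter count by hand. The sketch below assumes 'SAME' padding (where the output size is the input size divided by the stride, rounded up) and uses made-up helper names. The 3x3 kernel and the strides are assumptions for illustration; only the 100x100 input, the 16 feature maps and the 50x50 size come from this article.

```python
def same_padding_output_size(input_size, stride_size):
    """Width or height after a 'SAME'-padded layer: ceil(input_size / stride_size)."""
    return -(-input_size // stride_size)

def conv_parameter_count(kernel, input_shape, output_shape):
    """Weights of a [kernel, kernel, input_shape, output_shape] filter plus one bias per output."""
    return kernel * kernel * input_shape * output_shape + output_shape

# Hypothetical first layer: 100x100 RGB input, 3x3 kernel, 16 feature maps.
size_after_conv = same_padding_output_size(100, 1)              # stride 1 keeps the size
size_after_pool = same_padding_output_size(size_after_conv, 2)  # stride 2 halves it
parameters = conv_parameter_count(3, 3, 16)
```

Under these assumptions the convolution keeps the image at 100x100, a stride-2 pooling step brings it to 50x50, and the layer has 448 trainable values.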

More explanation is given in the Architecture section.


Parameters: previous_layer, kernel, stride

def pooling_layer(self, layer, kernel_size, stride_size):
    # basically it reduces the complexity involved by keeping only the important features
    # there are many types of pooling: average pooling, max pooling..
    # max pooling takes the maximum of the given kernel
    kernel = [1, kernel_size, kernel_size, 1]
    # stride=[image_jump,row_jump,column_jump,color_jump]=[1,2,2,1] mostly
    stride = [1, stride_size, stride_size, 1]
    return tf.nn.max_pool(layer, ksize=kernel, strides=stride, padding='SAME')
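Here is a small pure-Python sketch of what max pooling does to a single feature map. It is a simplification: windows that hang over the border are just clipped. For the usual 2x2 window with stride 2 that should match 'SAME' padding, but other kernel and stride combinations may differ from tf.nn.max_pool. The function name is mine.

```python
def max_pool_2d(feature_map, kernel_size=2, stride_size=2):
    """Keep the largest value in each kernel_size x kernel_size window."""
    height = len(feature_map)
    width = len(feature_map[0])
    pooled = []
    for top in range(0, height, stride_size):
        pooled_row = []
        for left in range(0, width, stride_size):
            # collect the window, clipping it at the image border
            window = [
                feature_map[r][c]
                for r in range(top, min(top + kernel_size, height))
                for c in range(left, min(left + kernel_size, width))
            ]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]
pooled = max_pool_2d(feature_map)
```

A 4x4 map becomes a 2x2 map, and each surviving value is the strongest response in its neighbourhood.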


Parameters: layer

As the name says, it converts all multidimensional matrices into a single dimension.

def flattening_layer(self, layer):
    # make it single dimensional
    input_size = layer.get_shape().as_list()
    new_size = input_size[-1] * input_size[-2] * input_size[-3]
    return tf.reshape(layer, [-1, new_size]), new_size
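The new_size arithmetic is easy to check by hand. The sketch below mirrors it in plain Python and flattens one tiny nested list. The shape [None, 13, 13, 64] is a made-up example, not necessarily the shape in this model, and both function names are mine.

```python
def flattened_size(shape):
    """Same arithmetic as new_size above: height * width * channels (batch dimension ignored)."""
    return shape[-1] * shape[-2] * shape[-3]

def flatten_image(feature_maps):
    """Flatten one [height][width][channels] nested list into a single flat list."""
    return [value for row in feature_maps for pixel in row for value in pixel]

# Hypothetical shape after the last pooling step: batch x 13 x 13 x 64.
size = flattened_size([None, 13, 13, 64])

# A 2 x 2 image with 2 channels becomes a list of 8 numbers.
flat = flatten_image([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
```

The second return value of flattening_layer is exactly this size, which the fully connected layer needs as its input shape.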


Parameters: the previous layer, the shape of the previous layer, the shape of the output layer.

This is a vanilla layer. It connects the previous layer with the output layer. Here is where the mx+b operation occurs.

def fully_connected_layer(self, layer, input_shape, output_shape):
    # create weights and biases for the given layer shape
    weights = self.add_weights([input_shape, output_shape])
    biases = self.add_biases([output_shape])
    # most important operation
    layer = tf.matmul(layer, weights) + biases  # mX+b
    return layer
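To see the mX+b operation without TensorFlow, here is a hand-rolled version on plain lists. It is a sketch for intuition only: the function name and the numbers are made up, and in practice tf.matmul does this far more efficiently.

```python
def fully_connected(inputs, weights, biases):
    """inputs: [batch][n_in], weights: [n_in][n_out], biases: [n_out]."""
    outputs = []
    for row in inputs:
        output_row = []
        for j in range(len(biases)):
            # weighted sum of every input for output neuron j, plus its bias
            total = sum(row[i] * weights[i][j] for i in range(len(row)))
            output_row.append(total + biases[j])
        outputs.append(output_row)
    return outputs

inputs = [[1.0, 2.0]]          # one example with two features
weights = [[1.0, 0.0, 2.0],    # two inputs connected to three outputs
           [0.0, 1.0, 3.0]]
biases = [0.5, 0.5, 0.5]
outputs = fully_connected(inputs, weights, biases)
```

Every output neuron sees every input, which is why this layer needs input_shape x output_shape weights.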


Parameters: layer

We use the rectified linear unit (ReLU), the standard activation layer.
There are other activations such as sigmoid and tanh, but ReLU is cheaper to compute.
Function: 0 if x<0 else x.

def activation_layer(self, layer):
    return tf.nn.relu(layer)
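The same function written out in plain Python, applied element by element the way tf.nn.relu treats a tensor. The helper name and the sample values are only for illustration.

```python
def relu(values):
    """Negative values become 0, everything else passes through unchanged."""
    return [value if value > 0 else 0.0 for value in values]

activated = relu([-2.0, 0.0, 3.5])
```

Only the positive response survives, which is what lets later layers ignore features that did not fire.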


Now we have to put all the elements that we have seen above in a way to make it work for us.

A neural network is a black box; we don’t have any control over what happens inside those connections. But at each layer we can get insights, and from them work out which sequence of these functions will give us good results.

As I said, we are going to build a really standard system. So, we can use a standard architecture which is found in most successful models.

A Simple Architecture:

Model Architecture of the classifier by me

#level 1 convolution

#level 2 convolution

#level 3 convolution

#flattening layer

#fully connected layer
#output layer

A Brief Architecture:

In our architecture, we have 3 convolutional layers. I chose 3 because it seemed like an optimum choice for a small classifier. There are no rules for the size or dimensions of each convolutional layer.

Custom drawn image

So, what does the above architecture really mean to you?

Explaining Convolutional Layers:

The last three layers are no rocket science; they are self-explanatory. So, let's talk about those convolutional layers. You can see the dimensional change in each convolutional layer.

Take an image. Now we are going to define this single image as 16 features for the first convolution of 50 x 50 height and width.

  • Okay, why 16? Well, there is no particular reason. It just works well, as in most architectures. So, what this intuitively means is that when you put all 16 features back together, you’ll get your image back.
  • Okay, what are those 16 features and how do we select them? Hmm, remember people say neural networks are black boxes? We are gonna see it now. Those 16 features are not defined by us and we don’t select any particular feature. It is inside the black box and we don’t have control over it.
  • Okay, inferences at least? Yeah, we can have inferences, but they are just not humanly readable. The network learns whatever it sees in those pictures and we can’t reason with it. But to explain it, say each feature map defines one feature of the object in the image. A feature may be colour, edges, corners, curves, shapes, transitions etc.
  • E.g.: take a dog. You can define a dog by its colour (brown, black or white; it doesn’t come in blue, green or red). It has four legs, hair, ears, face, height, tail and many other features. So, if all of these features are present, then you can confidently say it’s a dog.
CNN kernel at each layer drawn by me
  • Why 3 convolutional layers? Well, the more complex and larger the image is, the more features we need to define it. But you cannot break down a large image into n features directly. It won’t be effective because the features won’t connect with each other due to the vastness of the image. So, it is good to step down gradually and get feature maps as we go.
  • So, what is it really learning? It is learning which set of features defines an object. Layer 2 learns which combinations of the layer 1 features define its own features. The same goes for all the layers in the network.
  • As we go deeper, we reduce the size of the feature map and increase the number of features. So when you think of it, a group of point, edge and corner features forms a particular shape. A group of shape, transition, colour and pattern features forms a leg. A group of leg features in that image, along with head, body, colour and tail features, forms a dog. So, remember a dog is convoluted into points and edges.
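To get a feel for what a single "feature" can be, here is a hand-written convolution in plain Python with a hand-picked vertical-edge kernel. In the real network the kernel values are learned rather than chosen, and this sketch uses no padding, so unlike the 'SAME' padding in conv_layer the output is smaller than the input. All names and numbers are illustrative.

```python
def convolve_2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kernel_height = len(kernel)
    kernel_width = len(kernel[0])
    out_height = len(image) - kernel_height + 1
    out_width = len(image[0]) - kernel_width + 1
    feature_map = []
    for top in range(out_height):
        row = []
        for left in range(out_width):
            total = 0
            for i in range(kernel_height):
                for j in range(kernel_width):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Dark on the left, bright on the right: one vertical edge in the middle.
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
vertical_edge_kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = convolve_2d(image, vertical_edge_kernel)
```

The feature map is zero over the flat regions and lights up only where the window straddles the edge. A layer with 16 feature maps holds 16 such kernels, each free to respond to something different.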

Done with it:

Okay, I’ve run out of patience. I’m sure you have too. So, let's jump straight without so much explanation.


  • We have built our network. Now it is time to pass in some data and get those neurons fired.
  • We have 1000s of images. Even though they are small, the computation gets heavy as the network goes deep. So, we divide our images into small batches and send them to the network.
  • One complete pass of all the images through the network marks an epoch. Our network cannot learn all the features of an image at once; it needs to see it multiple times, compare it with all the other images it has seen, and decide which set of features makes it a class A image or a class B image.
  • “Show and Teach”: we show the network an image of a dog and ask it to learn its features over many iterations by comparing its prediction with the expected label. The comparison is nothing but how different the predicted value is from the expected output. The simplest way to calculate that is squared error.
  • We are going to use softmax cross-entropy, which plays the same role as squared error but is better suited to classification.
  • Once we have the errors for individual images, we average them to get the total error rate. It is also known as the cost.
  • Now, we need to reduce this cost using some learning technique. Reducing the cost means finding which particular set of neurons should fire so that the error is minimum. So, we have many variables (neurons) which should be optimized. There are many optimizers, but it all began with the virtuous Gradient Descent.
  • Plain gradient descent is old and slow, so we are going to use a more advanced technique: the Adam optimizer. It is a strong default choice for most kinds of networks.
  • So, the image placeholder will hold the images for one batch, and we are going to run our network using the Adam optimizer with our image data. The code is given below with explanatory comments:
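def trainer(network, number_of_images):
    # find error like squared error but better
    # (reconstructed line: softmax cross-entropy, as named in the text above)
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=network, labels=labels_ph)

    # now minimize the above error
    # calculate the total mean of all the errors from all the nodes
    cost = tf.reduce_mean(cross_entropy)

    # Now backpropagate to minimise the cost in the network.
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    # `session` is assumed to be a tf.Session created elsewhere in the script
    session.run(tf.global_variables_initializer())
    for epoch in range(epochs):
        tools = processing_tools()
        for batch in range(int(number_of_images / batch_size)):
            images, labels = tools.batch_dispatch()
            if images is None:
                break
            loss = session.run([cost], feed_dict={images_ph: images, labels_ph: labels})
            print('loss', loss)
            session.run(optimizer, feed_dict={images_ph: images, labels_ph: labels})
            print('Epoch number ', epoch, 'batch', batch)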
Actual Training itself:

  • Clone this repo.
  • Augment the images using Augmentor that is mentioned above.
  • Put the images in their respective folders in rawdata.
rawdata/batman: 3810 images
rawdata/superman: 3810 images

Update: If you want to train it with the same data, I’ve uploaded the data folder here. Just download and extract in the same folder.

Our file structure should look like this,

Folders and File structure screenshot by me

The data folder will be generated automatically from rawdata if it does not already exist.


If you want to edit something, you can do it in the configuration file:

import os

raw_data='rawdata'
data_path='data'
height=100
width=100
all_classes = os.listdir(data_path)
number_of_classes = len(all_classes)
color_channels=3
epochs=300
batch_size=10
model_save_name='checkpoints\\'
  • Run
  • Wait for a few hours.
  • For me, it took 8 hours for 300 epochs. I did it on my laptop, which has an i5 processor, 8 GB of RAM and an Nvidia GeForce 930M (2 GB). You can end the process anytime once the loss saturates, as the model is saved frequently.
Training screenshots in my system

Feel free to play with the variables.

Saving our model:

Once training is over, we can see that a folder named checkpoints has been created, which contains the model we trained. These two simple lines do that for us in TensorFlow:

saver = tf.train.Saver(max_to_keep=4)
saver.save(session, model_save_name)  # session: the active tf.Session (name assumed)

You can get my pre-trained model here.

Yeah, you’ve done it.

Yes, you have built your own accurate image classifier using CNNs from scratch. Now, let’s get the results of what we built.

To do that, we need a script that can run our model and classify the image.

We have three files in our checkpoints folder,

  • .meta file — it has your graph structure saved.
  • .index — it identifies the respective checkpoint file.
  • .data — it stores the values of all the variables.

How to use it?

TensorFlow is so well built that it does all the heavy lifting for us. We just have to write four simple lines to load our model and run inference.

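#Create a saver object to load the model
#(meta_path is a placeholder name for the path to the .meta file in the checkpoints folder)
saver = tf.train.import_meta_graph(meta_path)
#restore the model from our checkpoints folder
#(session is assumed to be an open tf.Session; latest_checkpoint is one way to locate the saved values)
saver.restore(session, tf.train.latest_checkpoint('checkpoints'))
#Create graph object for getting the same network architecture
graph = tf.get_default_graph()
#Get the last layer of the network by its name which includes all the previous layers too
network = graph.get_tensor_by_name("add_4:0")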

Yeah, simple. Now that we have our network as well as the tuned values, we have to pass an image to it using the same placeholders (image, labels).

im_ph= graph.get_tensor_by_name("Placeholder:0")
label_ph = graph.get_tensor_by_name("Placeholder_1:0")

If you run it now, you will see output like [1234,-4322]. While this is right, since the index of the maximum value represents the class, it is not as convenient as representing it with 1 and 0, like [1,0]. For that, we should include a line of code before running it,

network = tf.nn.sigmoid(network)  # reconstructed line: the sigmoid layer referred to in the results below

While we could have done this in the training architecture itself and nothing would have changed, I want to show you that you can add layers to the model even now, at the prediction stage. Flexibility.
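What that extra layer does to raw outputs such as [1234,-4322] can be sketched in plain Python. I am assuming the added layer is a sigmoid, as mentioned later in the results. The class order follows the [1,0] (Batman), [0,1] (Superman) convention used in this article, and the helper names are mine.

```python
import math

CLASS_NAMES = ["Batman", "Superman"]  # index 0 and index 1, as in the article

def sigmoid(value):
    """Squash any number into the range 0..1 without overflowing on large inputs."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    exp_value = math.exp(value)
    return exp_value / (1.0 + exp_value)

def predicted_class(scores):
    """The index of the largest score decides the class, with or without squashing."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return CLASS_NAMES[best_index]

raw_output = [1234, -4322]
squashed = [sigmoid(value) for value in raw_output]
label = predicted_class(raw_output)
```

The squashed values are easier to read, but the winning index is the same either way, which is why adding the layer after training does not change the prediction.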

Inference time:

Your training is nothing, if you don’t have the will to act.

— Ra’s Al Ghul.

To run a simple prediction,

  • Edit the image name in
  • Download the model files and extract in the same folder.
  • Run
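labels = np.zeros((1, 2))
# Creating the feed_dict that is required to feed the io:
feed_dict_testing = {im_ph: img, label_ph: labels}
# run the network on the image (session is assumed to be the tf.Session holding the restored model)
result = session.run(network, feed_dict=feed_dict_testing)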

You can see the results as [1,0](Batman), [0,1](Superman) corresponding to the index.

Please note that this output is not a one-hot encoding.


It is actually pretty good. It is almost right all the time. I even gave it an image with both Batman and Superman, and it gave me values of almost the same magnitude (after removing the sigmoid layer that we added just before).

result screenshot from my computer

From here on you can do whatever you want with those values. Loading the model initially takes some time (70 seconds), but once it is loaded, you can put a for loop or something to throw in images and get output in a second or two!


I have added some additional lines in the training code for Tensorboard options. Using Tensorboard we can track the progress of our training, both during and after training. You can also see your network structure and all the other components inside it. It is very useful for visualizing what is happening.

To start it, open a command line in the project directory and run,

tensorboard --logdir checkpoints

You should see the following,

Tensorboard screenshot by me

Now type the same address in your browser. Your Tensorboard is now started. Play with it.

Graph Structure Visualization:

Yeah, you can see our entire model with dimensions in each layer and operations here!

Tensorboard Screenshot from my browser

Future Implementations:

While this works for binary classification, it will also work for multiclass classification, but not as well. We might need to alter the architecture and build a larger model depending on the number of classes we want.

And Batman wins!!!

Screenshot From The Lego Batman Movie


Any suggestions, doubts, clarifications please raise an issue in Github.

Thank you!




Computer Vision & Deep Learning Developer, Ex-Udacity AI Mentor, Graduate at ASU, LinkedIn,