Image from MDPI Entropy
  • We will build a three-layer, community-standard CNN image classifier that decides whether a given image shows Batman or Superman.
  • Learn how to build a model from scratch in TensorFlow.
  • How to train and test it.
  • How to save and use it further.

Collecting Data:

Google Images Downloader. This is what I’ve used and it’s fast, easy, simple and efficient.

My Google image search screenshot image


As I said, 300 images is nowhere near enough for deep learning. So, we must augment the images to get more training data out of whatever we collected. An easy way to do that is Augmentor. The code that I've used is on GitHub, which is mentioned at the end.
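To make the idea of augmentation concrete, here is a minimal sketch in plain NumPy. It is not the Augmentor pipeline I used; the `augment` helper, the flip probability and the 10% shift limit are assumptions chosen only for illustration.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped and shifted copy of an H x W x C image array."""
    out = image.copy()
    # Mirror left-to-right about half of the time.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # Shift horizontally by up to 10% of the width, wrapping around the edge.
    max_shift = max(1, image.shape[1] // 10)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(out, shift, axis=1)

rng = np.random.default_rng(0)
# Stand-in for one collected photo (random pixels, 100 x 100 RGB).
original = rng.integers(0, 256, size=(100, 100, 3), dtype=np.uint8)
augmented = [augment(original, rng) for _ in range(5)]
print(len(augmented), augmented[0].shape)
```

Each call produces a slightly different variant of the same photo, which is how a few hundred downloads can grow into thousands of training images.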

Screenshot of my files in my computer


Once we augmented our data, we need to standardize it. We convert all the images to the same format and size.

resizing image by me


  • Make a folder named rawdata in the current working directory.
  • Create folders with their respective class names and put all the images in their respective folders.
  • Run this file in the same directory as rawdata.
  • This will resize all the images to a standard resolution and the same format, and put them in a new folder named data.
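The resizing step itself can be sketched as follows. This is only an illustration with nearest-neighbour sampling in NumPy, not the script from the repo; `resize_nearest` is a hypothetical helper, and the 100 x 100 target is the height and width used later in the training settings.

```python
import numpy as np

def resize_nearest(image, height, width):
    """Resize an H x W x C array to height x width by nearest-neighbour sampling."""
    src_h, src_w = image.shape[:2]
    # For every output row/column, pick the closest source row/column.
    rows = np.arange(height) * src_h // height
    cols = np.arange(width) * src_w // width
    return image[rows[:, None], cols[None, :]]

photo = np.zeros((240, 320, 3), dtype=np.uint8)  # stand-in for a downloaded image
standard = resize_nearest(photo, 100, 100)
print(standard.shape)  # (100, 100, 3)
```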

Getting Serious:

Okay, till now it’s just scripting work. I haven’t gone into details since the steps are rudimentary. From now on I will go step by step with an explanation of what I’m doing in the code.

Building a model:

Don’t let it fool you with its complex behaviour; we have something at least a billion times more complicated sitting on top of our shoulders.

  • I just created a file named
  • Create a class named model_tools with the following functions:
import tensorflow as tf  # TensorFlow 1.x API

class model_tools:
    # Defined functions for all the basic tensorflow components that we needed for building a model.

    def add_weights(self):
        return 0

    def add_biases(self):
        return 0

    def conv_layer(self):
        return 0

    def pooling_layer(self):
        return 0

    def flattening_layer(self):
        return 0

    def fully_connected_layer(self):
        return 0

    def activation_layer(self):
        return 0


Parameters: shape.

def add_weights(self, shape):
    # a common method to create all sorts of weight connections
    # takes in the shapes of the previous and new layer as a list
    # starts with random values of that shape
    return tf.Variable(tf.truncated_normal(shape=shape, stddev=0.05))


Parameters: shape

def add_biases(self, shape):
    # a common method to create biases with default=0.05
    # takes in the shape of the current layer e.g. x=10
    return tf.Variable(tf.constant(0.05, shape=shape))


Parameters: layer, kernel, input_shape, output_shape, stride_size.

layer - takes in the last layer.
kernel - kernel size for convolving over the image.
input_shape - number of channels in the incoming layer (3 for an RGB image).
output_shape - number of channels (feature maps) the convolution produces.
stride_size - determines the kernel jump size.
def conv_layer(self, layer, kernel, input_shape, output_shape, stride_size):
    # filter shape is [height, width, input channels, output channels]
    weights = self.add_weights([kernel, kernel, input_shape, output_shape])
    biases = self.add_biases([output_shape])
    stride = [1, stride_size, stride_size, 1]
    # does a convolution scan on the given image
    layer = tf.nn.conv2d(layer, weights, strides=stride, padding='SAME') + biases
    return layer
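To see what shapes this layer works with, here is a small sketch of the arithmetic. With 'SAME' padding, the output height and width are the input size divided by the stride, rounded up. The 100 x 100 x 3 input and the 16 output channels come from this article; the kernel size of 3 and the `conv_shapes` helper are assumptions for illustration.

```python
import math

def conv_shapes(height, width, in_channels, kernel, out_channels, stride):
    """Filter and output shapes for a SAME-padded 2-D convolution (NHWC layout)."""
    filter_shape = [kernel, kernel, in_channels, out_channels]
    out_h = math.ceil(height / stride)
    out_w = math.ceil(width / stride)
    return filter_shape, (out_h, out_w, out_channels)

filter_shape, output_shape = conv_shapes(100, 100, 3, kernel=3, out_channels=16, stride=1)
print(filter_shape)   # [3, 3, 3, 16]
print(output_shape)   # (100, 100, 16)
```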


Parameters: layer, kernel_size, stride_size

def pooling_layer(self, layer, kernel_size, stride_size):
    # reduces the complexity involved by keeping only the important features
    # there are many types of pooling: average pooling, max pooling, ...
    # max pooling takes the maximum inside the given kernel window
    kernel = [1, kernel_size, kernel_size, 1]
    # stride=[image_jump,row_jump,column_jump,color_jump]=[1,2,2,1] mostly
    stride = [1, stride_size, stride_size, 1]
    return tf.nn.max_pool(layer, ksize=kernel, strides=stride, padding='SAME')
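Here is a tiny NumPy sketch of what max pooling does to a single feature map. It assumes a 2x2 window with stride 2 and an even-sized input; `max_pool_2x2` is a hypothetical helper, not part of model_tools.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on an H x W array (H and W must be even)."""
    h, w = feature_map.shape
    # Split the map into non-overlapping 2x2 blocks and keep each block's maximum.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]])
pooled = max_pool_2x2(fm)
print(pooled)  # the 4x4 map shrinks to 2x2: [[4, 2], [2, 8]]
```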


Parameters: layer

def flattening_layer(self, layer):
    # make it single dimensional
    input_size = layer.get_shape().as_list()
    new_size = input_size[-1] * input_size[-2] * input_size[-3]
    return tf.reshape(layer, [-1, new_size]), new_size
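The same flattening can be sketched in NumPy. The batch of 10 matches the batch size used later; the 13 x 13 x 64 feature-map shape is only an assumed example, not a value taken from the model.

```python
import numpy as np

# Hypothetical batch: 10 images, each a 13 x 13 feature map with 64 channels.
batch = np.zeros((10, 13, 13, 64))
new_size = batch.shape[-1] * batch.shape[-2] * batch.shape[-3]
flat = batch.reshape(-1, new_size)  # one long row of features per image
print(new_size, flat.shape)  # 10816 (10, 10816)
```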


Parameters: the previous layer, the shape of the previous layer, the shape of the output layer.

def fully_connected_layer(self, layer, input_shape, output_shape):
    # create weights and biases for the given layer shape
    weights = self.add_weights([input_shape, output_shape])
    biases = self.add_biases([output_shape])
    # most important operation
    layer = tf.matmul(layer, weights) + biases  # mX+b
    return layer
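The mX+b operation is easy to check by hand in NumPy. This is a sketch with made-up sizes (8 input features, 2 output classes); only the 0.05 standard deviation and bias value mirror the helpers above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))                    # batch of 10 flattened inputs
weights = rng.normal(scale=0.05, size=(8, 2))   # [input_shape, output_shape]
biases = np.full(2, 0.05)                       # one bias per output unit
out = x @ weights + biases                      # same as tf.matmul(layer, weights) + biases
print(out.shape)  # (10, 2)
```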


Parameters: layer

def activation_layer(self, layer):
    return tf.nn.relu(layer)
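ReLU simply replaces negative values with zero and leaves positive values untouched. A minimal NumPy sketch, with `relu` as a stand-in for tf.nn.relu:

```python
import numpy as np

def relu(x):
    """Element-wise max(x, 0)."""
    return np.maximum(x, 0)

activated = relu(np.array([-2.0, -0.5, 0.0, 1.5]))
print(activated)  # negatives become 0, 1.5 stays 1.5
```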


Now we have to put all the elements that we have seen above in a way to make it work for us.

A Simple Architecture:

Model Architecture of the classifier by me

#level 1 convolution

#level 2 convolution

#level 3 convolution

#flattening layer

#fully connected layer
#output layer

A Brief Architecture:

In our architecture, we have 3 convolutional layers. I chose 3 because it seemed like an optimum choice for a small classifier. There are no fixed rules for the size or dimensions of each convolutional layer.

Custom drawn image

Explaining Convolutional Layers:

The last three layers are no rocket science; they are self-explanatory. So, let's talk about those convolutional layers. You can see the dimensional change in each convolutional layer.

  • Okay, why 16? Well, there is no particular reason. It just works well, as it does in most architectures. Intuitively, it means the layer describes the image through 16 different feature maps, and together they carry the information of the image.
  • Okay, what are those 16 features and how do we select them? Hmm, remember people say neural networks are black boxes? We are going to see it now. Those 16 features are not defined by us and we don’t select any particular feature. They are learned inside the black box and we don’t have direct control over them.
  • Okay, inferences at least? Yeah, we can draw inferences, but they are just not humanly readable. The network learns whatever it sees in those pictures and we can’t reason with it. But to explain it: each feature describes one characteristic of the object in the image. A feature may be colour, edges, corners, curves, shapes, transitions etc.
  • e.g.: Take a dog. You can define a dog by its colour (brown, black or white; it doesn’t come in blue, green or red). It has four legs, hair, ears, a face, height, a tail and many other features. So, if all of these features are present, then you can confidently say it’s a dog.
CNN kernel at each layer drawn by me
  • Why 3 convolutional layers? Well, the more complex and larger the image is, the more features we need to define it. But you cannot break down a large image into n features directly. It won’t be effective, because the features won’t connect with each other due to the vastness of the image. So, it is good to level down and get feature maps as we go.
  • So, what is it really learning? It is learning which set of features defines an object. Layer 2 learns which combinations of the layer 1 features define its own features. The same goes for all the layers in the network.
  • As we go deeper, we reduce the size of the feature map and increase the number of features. When you think of it, a group of point, edge and corner features forms a particular shape. A group of shape, transition, colour and pattern features forms a leg. A group of leg features in that image, along with head, body, colour and tail features, forms a dog. So, remember, a dog is convolved into points and edges.
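This shrinking-and-widening can be traced with a few lines of arithmetic. The sketch assumes each convolution keeps the size (stride 1, 'SAME' padding) and is followed by pooling with stride 2. The 100 x 100 input and the 16 channels in the first layer come from this article; the 32 and 64 for the later layers are assumed values for illustration.

```python
import math

def trace_shapes(size, channels_per_layer, pool_stride=2):
    """Follow a square feature map through conv (SAME, stride 1) + pooling stages."""
    shapes = []
    for channels in channels_per_layer:
        size = math.ceil(size / pool_stride)  # pooling with SAME padding rounds up
        shapes.append((size, size, channels))
    return shapes

shapes = trace_shapes(100, [16, 32, 64])
print(shapes)  # [(50, 50, 16), (25, 25, 32), (13, 13, 64)]
```

The spatial size drops at every level while the number of feature maps grows.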

Done with it:

Okay, I’ve run out of patience. I’m sure you have too. So, let's jump straight without so much explanation.


  • We have built our network. Now it is time to pass in some data and get those neurons fired.
  • We have 1000s of images. Even though they are small in size, the computation gets complex as the network goes deep. So, we divide our images into small batches and send them to the network.
  • One complete cycle of all the images passing through the network marks an epoch. Our network cannot learn all the features of an image at once. It needs to see it multiple times, compare it with all the other images that it has seen, and decide which set of features made it a class A image or a class B image.
  • “Show and Teach”: we show the network that this is an image of a dog and ask it to learn its features over iterations by comparing its prediction with the expected label. The comparison is nothing but how different the predicted value is from the expected output. A simple way to calculate that is squared error.
  • We are going to use softmax cross-entropy instead. It plays the same role as squared error, but it compares the predicted class probabilities with the true label and punishes confident wrong answers much harder, which suits classification better.
  • Once we have the errors for individual images, we average them to get the total error. It is also known as the cost.
  • Now, we need to reduce this cost using some learning technique. Reducing the cost means finding the weights and biases for which the error is minimum. So, we have many variables which should be optimized. There are many optimizers, but it all began with the virtuous Gradient Descent.
  • We are going to use a more advanced technique, as plain Gradient Descent is old and slow: the Adam optimizer. It is a strong default choice for most kinds of networks.
  • So, the image placeholder will hold the images for one batch, and we are going to run our network using the Adam optimizer with our image data. The code is given below with explanations in the comments:
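def trainer(network, number_of_images):
    # find error like squared error but better
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=network, labels=labels_ph)

    # now minimize the above error
    # calculate the total mean of all the errors from all the nodes
    cost = tf.reduce_mean(cross_entropy)

    # Now backpropagate to minimise the cost in the network.
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    # assumption: `session` is an open tf.Session shared with the rest of the script
    session.run(tf.global_variables_initializer())
    for epoch in range(epochs):
        tools = processing_tools()
        for batch in range(int(number_of_images / batch_size)):
            images, labels = tools.batch_dispatch()
            if images is None:
                break
            loss = session.run([cost], feed_dict={images_ph: images, labels_ph: labels})
            print('loss', loss)
            session.run(optimizer, feed_dict={images_ph: images, labels_ph: labels})
            print('Epoch number ', epoch, 'batch', batch)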
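To show what the cost in the trainer measures, here is a NumPy sketch of softmax cross-entropy on three made-up predictions. It is an illustration of the formula, not TensorFlow's implementation; the logits and labels are invented.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum first for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def softmax_cross_entropy(logits, labels):
    """Per-image cross-entropy between one-hot labels and softmax(logits)."""
    log_probs = np.log(softmax(logits))
    return -(labels * log_probs).sum(axis=1)

# Rows: a confident correct guess, a confident wrong guess, an undecided guess.
logits = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
labels = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
per_image = softmax_cross_entropy(logits, labels)
cost = per_image.mean()  # the average over the batch is the cost
print(per_image, cost)
```

The confident wrong guess gets by far the largest error, which is exactly the behaviour we want from a classification loss.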

Actual Training itself:

  • Clone this repo.
  • Augment the images using Augmentor that is mentioned above.
  • Put the images in their respective folders in rawdata.
rawdata/batman: 3810 images
rawdata/superman: 3810 images
Folders and File structure screenshot by me
raw_data='rawdata'
data_path='data'
height=100
width=100
all_classes = os.listdir(data_path)
number_of_classes = len(all_classes)
color_channels=3
epochs=300
batch_size=10
model_save_name='checkpoints\\'
  • Run
  • Wait for a few hours.
  • For me, it took 8 hrs for 300 epochs. I did it on my laptop, which has an i5 processor, 8 GB of RAM and an Nvidia GeForce 930M with 2 GB. You can stop the process at any time once the loss saturates, as the model is saved frequently.
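A quick back-of-the-envelope calculation shows how much work those 8 hours contain. It uses the numbers from this article and assumes every image in rawdata is used for training.

```python
images_per_class = 3810   # rawdata/batman and rawdata/superman
number_of_classes = 2
batch_size = 10
epochs = 300
training_hours = 8

number_of_images = images_per_class * number_of_classes
batches_per_epoch = int(number_of_images / batch_size)
total_steps = batches_per_epoch * epochs
seconds_per_step = training_hours * 3600 / total_steps
print(number_of_images, batches_per_epoch, total_steps, round(seconds_per_step, 3))
# 7620 762 228600 0.126
```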
Training screenshots in my system

Saving our model:

Once training is over, we can see that a folder named checkpoints has been created, which contains the model we trained. These two simple lines do that for us in TensorFlow:

saver = tf.train.Saver(max_to_keep=4)
# assumption: `session` is the tf.Session used for training
saver.save(session, model_save_name)

Yeah, you’ve done it.

Yes, you have built your own accurate image classifier using CNNs from scratch. Now, let’s get the results of what we built.

  • .meta file — it has your graph structure saved.
  • .index — it identifies the respective checkpoint file.
  • .data — it stores the values of all the variables.
# Create a saver object to load the model
# meta_path (hypothetical name): path to the .meta file inside the checkpoints folder
saver = tf.train.import_meta_graph(meta_path)
# restore the model from our checkpoints folder
# assumption: `session` is an open tf.Session
saver.restore(session, tf.train.latest_checkpoint('checkpoints'))
# Create graph object for getting the same network architecture
graph = tf.get_default_graph()
# Get the last layer of the network by its name, which includes all the previous layers too
network = graph.get_tensor_by_name("add_4:0")
im_ph = graph.get_tensor_by_name("Placeholder:0")
label_ph = graph.get_tensor_by_name("Placeholder_1:0")

Inference time:

Your training is nothing, if you don’t have the will to act.

— Ra’s Al Ghul.

To run a simple prediction,

  • Edit the image name in
  • Download the model files and extract in the same folder.
  • Run
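labels = np.zeros((1, 2))
# Creating the feed_dict that is required to feed the inputs:
feed_dict_testing = {im_ph: img, label_ph: labels}
# assumption: `session` is the tf.Session the model was restored into
result = session.run(network, feed_dict=feed_dict_testing)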


It is actually pretty good. It is almost right all the time. I even gave it an image with both Batman and Superman, and it gave me values of almost the same magnitude (after removing the sigmoid layer that we added just before).
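Turning the two output values into a label is just a matter of picking the larger one. The sketch below uses invented scores, and the class order in `class_names` is an assumption; in the real code the order follows whatever os.listdir returned during training.

```python
import numpy as np

class_names = ['batman', 'superman']  # assumed order, check it against your own run

def predict_label(scores):
    """Return the class name whose output value is largest."""
    scores = np.asarray(scores).reshape(-1)
    return class_names[int(np.argmax(scores))]

print(predict_label([[0.2, 0.9]]))   # superman
print(predict_label([[1.4, -0.3]]))  # batman
```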

result screenshot from my computer


I have added some additional lines in the training code for TensorBoard. Using TensorBoard, we can track the progress of our training both while it runs and afterwards. You can also see your network structure and all the other components inside it. It is very useful for visualizing what is happening.

tensorboard --logdir checkpoints
Tensorboard screenshot by me

Graph Structure Visualization:

Yeah, you can see our entire model with dimensions in each layer and operations here!

Tensorboard Screenshot from my browser

Future Implementations:

This works for binary classification. It will also work for multiclass classification, but not as well. We might need to alter the architecture and build a larger model depending on the number of classes we want.
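One thing that changes with more classes is the label encoding: each label becomes a one-hot row with one column per class. A minimal sketch, where `one_hot` is a hypothetical helper and the three classes are just an example:

```python
import numpy as np

def one_hot(indices, number_of_classes):
    """Turn class indices into one-hot rows, one column per class."""
    return np.eye(number_of_classes)[np.asarray(indices)]

labels = one_hot([0, 2, 1], number_of_classes=3)
print(labels)
```

The output layer of the network would then need as many units as there are columns here.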

Screenshot From The Lego Batman Movie


