How to easily build a Dog breed Image classification model

James Le
Mar 15, 2019 · 13 min read

If you’re impatient, scroll to the bottom of the post for the GitHub repos.

Who’s a good dog? Who likes ear scratches? Well, it seems those fancy deep neural networks don’t have all the answers. However, maybe they can answer that ubiquitous question we all ask when meeting a four-legged stranger: what kind of good pup is that?

In this tutorial, we’ll walk through building a deep neural network classifier capable of determining a dog’s breed from a photo using the Dog Breed dataset. We’ll see how to train a model, design the input and output for category classifications, and finally display the accuracy results for each model.

Image Classification

The problem of Image Classification goes like this: Given a set of images that are all labeled with a single category, we are asked to predict these categories for a novel set of test images and measure the accuracy of the predictions. There are a variety of challenges associated with this task, including viewpoint variation, scale variation, intra-class variation, image deformation, image occlusion, illumination conditions, background clutter etc.

How might we go about writing an algorithm that can classify images into distinct categories? Computer Vision researchers have come up with a data-driven approach to solve this. Instead of trying to specify what every one of the image categories of interest looks like directly in code, they provide the computer with many examples of each image class and then develop learning algorithms that look at these examples and learn about the visual appearance of each class. In other words, they first accumulate a training dataset of labeled images, then feed it to the computer in order for it to get familiar with the data.

Given that fact, the complete image classification pipeline can be formalized as follows:

  • Our input is a training dataset that consists of N images, each labeled with one of K different classes.
  • Then, we use this training set to train a classifier to learn what every one of the classes looks like.
  • In the end, we evaluate the quality of the classifier by asking it to predict labels for a new set of images that it has never seen before. We will then compare the true labels of these images to the ones predicted by the classifier.

Image Classification Using Machine Learning

A machine learning approach to image classification involves identifying and extracting key features from images and using them as input to a machine learning model. The deep learning models used for this task are loosely inspired by the way the human brain processes visual information. With this method, computers are taught to recognize the visual elements within an image. By relying on large databases and noticing emerging patterns, they can make sense of images and formulate relevant tags and categories.

Convolutional neural networks

This section introduces convolutional neural networks, which are a clever way to reduce the number of parameters. Instead of dealing with a fully connected network, the CNN approach reuses the same parameter multiple times. The big idea behind CNNs is that a local understanding of an image is good enough. The practical benefit is that having fewer parameters greatly improves the time it takes to learn as well as reduces the amount of data required to train the model.

Instead of a fully connected network of weights from each pixel, a CNN has just enough weights to look at a small patch of the image. It’s like reading a book by using a magnifying glass; eventually, you read the whole page, but you look at only a small patch of the page at any given time.

Consider a 256 × 256 image. Instead of processing the whole image at once, a CNN can efficiently scan it chunk by chunk, say with a 5 × 5 window. The 5 × 5 window slides along the image (usually left to right, and top to bottom), as shown in the figure below. How “quickly” it slides is called its stride length. For example, a stride length of 2 means the 5 × 5 sliding window moves by 2 pixels at a time until it spans the entire image. This 5 × 5 window has an associated 5 × 5 matrix of weights.
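To make the sliding window concrete, here is a minimal sketch in plain Python (no TensorFlow). It counts how many positions a window visits along one axis and computes one weighted sum per position. The names `output_size` and `convolve2d` are illustrative and not part of the original code, and the sketch assumes a square kernel and no padding.

```python
def output_size(image_size, window_size, stride):
    """Number of window positions along one axis, assuming no padding."""
    return (image_size - window_size) // stride + 1


def convolve2d(image, kernel, stride=1):
    """Slide a square `kernel` over `image` (lists of lists) and return the feature map."""
    k = len(kernel)
    rows = output_size(len(image), k, stride)
    cols = output_size(len(image[0]), k, stride)
    feature_map = []
    for r in range(rows):
        row = []
        for c in range(cols):
            top, left = r * stride, c * stride
            # Weighted sum of the patch currently under the window.
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# A 256-pixel-wide image scanned by a 5-pixel window with a stride of 2.
positions = output_size(256, 5, 2)

# A tiny 4 x 4 image and a 2 x 2 kernel of ones: each output is the sum of a patch.
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 1], [1, 1]]
feature_map = convolve2d(image, kernel, stride=2)
```

With these assumptions, the window fits 126 times along each axis of the 256 × 256 image, and the toy feature map is a 2 × 2 grid of patch sums.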

The sliding-window shenanigans happen in the convolution layer of the neural network. A typical CNN has multiple convolution layers. Each convolutional layer typically generates many alternate convolutions, so the weight matrix is a tensor of 5 × 5 × n, where n is the number of convolutions.

As an example, let’s say an image goes through a convolution layer with a weight tensor of 5 × 5 × 64. It generates 64 convolutions by sliding a 5 × 5 window. Therefore, this layer has 5 × 5 × 64 (= 1,600) weights (for a single-channel image, ignoring biases), which is far fewer than a fully connected layer, where every single neuron would need 256 × 256 (= 65,536) weights to see the whole image.

The beauty of the CNN is that the number of parameters is independent of the size of the original image. You can run the same CNN on a 300 × 300 image, and the number of parameters won’t change in the convolution layer!
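The arithmetic is easy to check. The sketch below is only an illustration: the helper names are made up, and it assumes a single-channel image and ignores bias terms, which matches the numbers used above.

```python
def conv_layer_weights(window, num_filters, in_channels=1):
    """Weights in a convolution layer (biases ignored)."""
    return window * window * in_channels * num_filters


def dense_weights_per_neuron(height, width, channels=1):
    """Weights a single fully connected neuron needs to see every pixel."""
    return height * width * channels


# The convolution layer does not care about the image size...
conv_params = conv_layer_weights(window=5, num_filters=64)

# ...but a fully connected neuron does.
dense_256 = dense_weights_per_neuron(256, 256)
dense_300 = dense_weights_per_neuron(300, 300)

print(conv_params, dense_256, dense_300)
```

Going from a 256 × 256 image to a 300 × 300 image leaves `conv_params` untouched, while the per-neuron cost of a fully connected layer grows with the number of pixels.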

Dog Breed Dataset

The dataset that we’ll be working on can be accessed here. We are provided a training set and a test set of images of dogs. Each image has a filename that is its unique id. The dataset comprises 120 breeds of dogs. To keep things simple, we’ll reduce the dataset to the 8 most represented breeds. The tutorial below shows how to use TensorFlow to build a simple CNN with 3 convolutional layers to classify the dog breeds.

Data Processing

1 — Packages

Let’s import all the packages needed.

2 — Unzip Files

We now need to extract the train and test files from the zip archives. This is the code:

# We unzip the train and test zip files
archive_train = ZipFile("Data/", 'r')
archive_test = ZipFile("Data/", 'r')

# This line shows the first 5 image names of the train database
archive_train.namelist()[0:5]

# This line shows the number of images in the train database; note that we must remove the 1st value (column header)
len(archive_train.namelist()) - 1

The last line of code should return a value of 10,222.

3 — Resize and normalize data

The function below creates a pickle file to save all the images unzipped.

We then define the new image size applied for all images and call the function above.

image_resize = 60
DataBase_creator(archivezip=archive_train, nwidth=image_resize, nheight=image_resize, save_name="train")
DataBase_creator(archivezip=archive_test, nwidth=image_resize, nheight=image_resize, save_name="test")

On a laptop CPU, this should take around 40 seconds for the train zip file and 41 seconds for the test zip file.

We now have train and test pickle files. The next time we open this code in a Jupyter Notebook, we can load them directly and skip the step above.

# load TRAIN
train = pickle.load(open("train.p", "rb"))
train.shape

The shape of the training data should be (10222, 60, 60, 3).

# load TEST
test = pickle.load(open("test.p", "rb"))
test.shape

The shape of the test data should be (10357, 60, 60, 3).

Not all of the images have the same shape. For our model, we need to resize them to the same shape, and we follow the common practice of reshaping them into squares. We also need to normalize our dataset by dividing all the pixel values by 255. The new pixel values will be in the range [0, 1].
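As a small illustration of the normalization step, here is a dependency-free sketch. The `normalize_pixels` helper and the tiny image are made up for this example; with a NumPy array, the same thing is a single division by 255.

```python
def normalize_pixels(image):
    """Scale 8-bit pixel values (0..255) to floats in [0, 1].

    `image` is a nested list shaped [row][column][channel].
    """
    return [[[value / 255 for value in pixel] for pixel in row] for row in image]


# A hypothetical 1 x 2 RGB image.
tiny_image = [[[0, 128, 255], [51, 102, 204]]]
normalized = normalize_pixels(tiny_image)

# Every value now lies between 0 and 1.
flat = [value for row in normalized for pixel in row for value in pixel]
print(min(flat), max(flat))
```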

Let’s check one image from the training dataset:

lum_img = train[100,:,:,:]
plt.imshow(lum_img)

4 — Check out labels file

Now let’s zoom in on the label CSV file from train data.

labels_raw = pd.read_csv("Data/", compression='zip', header=0, sep=',', quotechar='"')
labels_raw.sample(5)

5 — Extract the most represented breeds

We will reduce the database to the N most represented breeds so that we can reduce the complexity of our model. This also lightens the computation, as there will be only N breeds to classify, and we will be able to run the model in less than 10 minutes.
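One way to pick the most represented breeds is simply to count the labels. The sketch below uses only the standard library; `top_breeds` and `filter_by_breeds` are hypothetical helpers rather than the functions used in this post, and the toy ids and labels are made up.

```python
from collections import Counter


def top_breeds(breed_labels, n):
    """Return the n most frequent breeds, most frequent first."""
    counts = Counter(breed_labels)
    return [breed for breed, _ in counts.most_common(n)]


def filter_by_breeds(image_ids, breed_labels, keep):
    """Keep only the (id, breed) pairs whose breed is in `keep`."""
    keep = set(keep)
    return [(i, b) for i, b in zip(image_ids, breed_labels) if b in keep]


# Toy data standing in for the labels file.
ids = ["a", "b", "c", "d", "e", "f"]
breeds = ["beagle", "pug", "beagle", "boxer", "pug", "beagle"]

main_breeds = top_breeds(breeds, 2)
filtered = filter_by_breeds(ids, breeds, main_breeds)
print(main_breeds, len(filtered))
```

In the real dataset, the same filter would also be applied to the image array so that images and labels stay aligned.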

We should be able to see this output:

Let’s look at one image:

lum_img = train_filtered[1,:,:,:]
plt.imshow(lum_img)

6 — One-Hot Labels

Let’s do one-hot encoding for our labels data.

# We select the labels from the N main breeds
labels = labels_filtered["breed"].values  # .as_matrix() was removed in pandas 1.0
labels = labels.reshape(labels.shape[0], 1)  # labels.shape[0] looks faster than using len(labels)
labels.shape

The labels shape is (922, 1).

labels_name, labels_bin = matrix_Bin(labels=labels)
labels_bin[0:9]
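The definition of `matrix_Bin` is not shown here, so the following is only a sketch of what a one-hot encoder might look like, not the author's implementation. The `one_hot_encode` name is hypothetical, and it works on a flat list of strings rather than the (n, 1) array used above.

```python
def one_hot_encode(labels):
    """Return (sorted class names, one-hot rows) for a list of string labels."""
    class_names = sorted(set(labels))
    index_of = {name: i for i, name in enumerate(class_names)}
    rows = []
    for label in labels:
        row = [0] * len(class_names)
        row[index_of[label]] = 1  # a single 1 marks the class of this sample
        rows.append(row)
    return class_names, rows


names, encoded = one_hot_encode(["pug", "beagle", "pug", "boxer"])

# Recover the class index from each one-hot row (what an argmax over axis 1 does).
class_indices = [row.index(1) for row in encoded]
print(names, encoded, class_indices)
```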

7 — Quick check on labels

Let’s see exactly which N labels we keep. The position of the 1 in each one-hot label tells you which breed it corresponds to.

for breed in range(len(labels_name)):
    print('Breed {0} : {1}'.format(breed, labels_name[breed]))
labels_cls = np.argmax(labels_bin, axis=1)
labels[0:9]

Convolutional Neural Networks

1 — Creation of a Train and Validation Data

We split our train data in two: a training set and a validation set. This way, we can check the accuracy of the model trained on the training set against the validation set.

num_validation = 0.30
X_train, X_validation, y_train, y_validation = train_test_split(train_filtered, labels_bin, test_size=num_validation, random_state=6)
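To see what such a split does under the hood, here is a simplified sketch in plain Python. It is not scikit-learn's implementation (the shuffling and rounding details differ), and `split_train_validation` is a made-up name; it only illustrates the idea of shuffling indices once and keeping samples and labels aligned.

```python
import random


def split_train_validation(samples, labels, validation_fraction, seed):
    """Shuffle indices with a fixed seed, then cut them into two groups."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_validation = round(len(samples) * validation_fraction)
    validation_idx = indices[:n_validation]
    train_idx = indices[n_validation:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(samples, train_idx), pick(samples, validation_idx),
            pick(labels, train_idx), pick(labels, validation_idx))


# Toy data: each label is derived from its sample, so alignment is easy to check.
samples = list(range(10))
labels = [s * 10 for s in samples]
X_tr, X_val, y_tr, y_val = split_train_validation(samples, labels, 0.30, seed=6)
print(len(X_tr), len(X_val))
```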

2 — Creation of a Train and Test Data

Here’s the code to split original data to train and test sets:

3 — CNN with TensorFlow — Defining Layers

The architecture will be like this:

  • 1st Convolutional Layer with 32 filters
  • Max pooling
  • Relu
  • 2nd Convolutional Layer with 64 filters
  • Max pooling
  • Relu
  • 3rd Convolutional Layer with 128 filters
  • Max pooling
  • Relu
  • DropOut
  • Flatten Layer
  • Fully Connected Layer with 500 nodes
  • Relu
  • DropOut
  • Fully Connected Layer with n nodes (n = number of breeds)
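It can help to trace how the tensor shape changes through this stack. The sketch below rests on assumptions that the post does not state explicitly: convolutions with stride 1 and 'SAME' padding (so they keep the spatial size) and 2 × 2 max pooling with stride 2 and 'SAME' padding (so it halves the size, rounding up). If the actual layers use other settings, the numbers will differ.

```python
import math


def trace_shapes(image_size, filters_per_layer, pool_stride=2):
    """Return the (height, width, depth) after each conv + pool block.

    Assumes the convolution keeps the spatial size and the pooling
    divides it by `pool_stride`, rounding up.
    """
    size = image_size
    shapes = []
    for num_filters in filters_per_layer:
        size = math.ceil(size / pool_stride)
        shapes.append((size, size, num_filters))
    return shapes


# 60 x 60 input images, three blocks with 32, 64 and 128 filters.
shapes = trace_shapes(60, [32, 64, 128])
height, width, depth = shapes[-1]

# The flatten layer turns the last feature maps into one long vector.
flattened_features = height * width * depth

# Weights between the flatten layer and the 500-node fully connected layer.
fc1_weights = flattened_features * 500
print(shapes, flattened_features, fc1_weights)
```

Under these assumptions, most of the network's weights sit in the first fully connected layer, not in the convolution layers.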

Here’s a brief explanation of these terms:

  • Convolution Layer: As explained in the CNN section above, at this layer, we preserve the spatial relationship between pixels by learning image features using small squares of input data. The small weight matrices that slide over these squares are called filters or kernels. The matrix formed by sliding the filter over the image and computing the dot product is called a Feature Map. The more filters we have, the more image features get extracted and the better our network becomes at recognizing patterns in unseen images.
  • ReLU Layer: For any kind of neural network to be powerful, it needs to contain non-linearity. ReLU, which stands for Rectified Linear Unit, is one such non-linear operation. It is an element-wise operation that replaces all negative values in the feature map with 0. We pass the result from the convolution layer through a ReLU activation function.
  • Max Pooling Layer: After this, we perform a pooling operation to reduce the dimensionality of each feature map. This enables us to reduce the number of parameters and computations in the network, which helps control overfitting. Our CNN uses max pooling, in which it defines a spatial neighborhood and takes the largest element from the rectified feature map within that window. After the pooling layer, our network becomes more robust to small transformations, distortions and translations in the input image.
  • Fully-Connected Layer: After these layers, we add a couple of fully-connected layers to wrap up the CNN architecture. The output from the convolution and pooling layers represents high-level features of the input image. The FC layers use these features for classifying the input image into various classes based on the training dataset. Apart from classification, adding FC layers also helps to learn non-linear combinations of these features.
  • Dropout Layer: Dropout is a regularization technique to help the network avoid overfitting. During training, a random fraction of the neurons on a particular layer (half of them, for a keep probability of 0.5) is deactivated. This improves generalization, as you force your layer to learn with different neurons. Normally we use Dropout on the fully connected layers, but it is also possible to use dropout after the max-pooling layers, creating some kind of image noise augmentation.
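Three of these operations are simple enough to write out by hand. The functions below are illustrative stand-ins for the TensorFlow ops, not the code used in this post. The dropout sketch uses the common "inverted" variant, which rescales the surviving values by 1 / keep_prob, and that choice is an assumption of this example.

```python
import random


def relu(feature_map):
    """Replace every negative value with 0."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """2 x 2 max pooling with stride 2; assumes even height and width."""
    pooled = []
    for r in range(0, len(feature_map), 2):
        row = []
        for c in range(0, len(feature_map[0]), 2):
            row.append(max(feature_map[r][c], feature_map[r][c + 1],
                           feature_map[r + 1][c], feature_map[r + 1][c + 1]))
        pooled.append(row)
    return pooled


def dropout(values, keep_prob, rng):
    """Zero each value with probability 1 - keep_prob, rescale the survivors."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


feature_map = [[-1, 2, 0, -3],
               [4, -5, 6, 1],
               [-2, -2, -7, -1],
               [3, 0, -4, 8]]

rectified = relu(feature_map)
pooled = max_pool_2x2(rectified)
dropped = dropout([1.0, 2.0, 3.0, 4.0], keep_prob=0.5, rng=random.Random(0))
print(rectified, pooled, dropped)
```

At evaluation time, the keep probability is set to 1 so that no neuron is dropped.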

From a bigger picture, a CNN architecture accomplishes 2 major tasks: feature extraction (convolution + pooling layers) and classification (fully-connected layers). In general, the more convolution steps we have, the more complicated features our network will be able to learn to recognize.

Here we define our weights, biases, and other constants.

Here we define our convolution layer.

Here we define our flatten layer.

Here we define our fully-connected layer.

4 — CNN with TensorFlow — Set up placeholder tensor

Here we set up a placeholder for the tensor in TensorFlow.

x = tf.placeholder(tf.float32, shape=[None, img_size, img_size, num_channels], name='x')
x_image = tf.reshape(x, [-1, img_size, img_size, num_channels])  # -1 lets TensorFlow infer this dimension (the batch size)
y_true = tf.placeholder(tf.float32, shape=[None, num_classes], name='y_true')
y_true_cls = tf.argmax(y_true, axis=1)
keep_prob_fc = tf.placeholder(tf.float32)
keep_prob_conv = tf.placeholder(tf.float32)

5 — CNN with TensorFlow — Design the layer

In this part, you can play with the filter sizes and the number of filters. The best model is one with the proper number of layers but also a good choice of filter sizes and the number of filters.

6 — CNN with TensorFlow — Cross-entropy loss

Here we define our loss function to train our model.

cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=layer_fc2, labels=y_true)
cost = tf.reduce_mean(cross_entropy)
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(cost)
correct_prediction = tf.equal(y_pred_cls, y_true_cls)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
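For intuition about what this loss measures, here is a plain-Python sketch of softmax cross-entropy for one-hot labels. It illustrates the math only and is not how TensorFlow implements it; the function names are made up.

```python
import math


def log_softmax(logits):
    """Log of the softmax probabilities, computed in a numerically stable way."""
    largest = max(logits)
    shifted = [z - largest for z in logits]
    log_total = math.log(sum(math.exp(z) for z in shifted))
    return [z - log_total for z in shifted]


def cross_entropy(logits, one_hot_label):
    """Negative log-probability the model assigns to the true class."""
    return -sum(t * lp for t, lp in zip(one_hot_label, log_softmax(logits)))


def mean_cost(batch_logits, batch_labels):
    """Average the per-sample losses over a batch."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)


true_label = [1, 0, 0, 0, 0, 0, 0, 0]  # 8 breeds, the first one is correct

# A model with no preference among 8 breeds: loss equals log(8).
uniform_loss = cross_entropy([0.0] * 8, true_label)

# A model that strongly favors the correct breed: loss close to 0.
confident_loss = cross_entropy([10.0] + [0.0] * 7, true_label)

cost = mean_cost([[0.0] * 8, [10.0] + [0.0] * 7], [true_label, true_label])
print(uniform_loss, confident_loss, cost)
```

A loss near log(8), roughly 2.08, therefore means the network is doing no better than guessing among the 8 breeds.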

7 — CNN with TensorFlow — Train the model

Now let’s actually train our neural network!

session = tf.Session()
def init_variables():

The function below creates a batch from a dataset. We use a batch to train our model.
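The original batch function is not shown here. As a rough idea of what such a helper might look like, the sketch below draws a random batch of aligned samples and labels using only the standard library. The `next_batch` name and its signature are assumptions made for this example.

```python
import random


def next_batch(samples, labels, batch_size, rng):
    """Pick `batch_size` random (sample, label) pairs without replacement."""
    indices = rng.sample(range(len(samples)), batch_size)
    return [samples[i] for i in indices], [labels[i] for i in indices]


# Toy data: 100 samples spread over 8 classes.
rng = random.Random(42)
samples = list(range(100))
labels = [s % 8 for s in samples]

batch_x, batch_y = next_batch(samples, labels, batch_size=25, rng=rng)
print(len(batch_x), len(batch_y))
```

Each training iteration would then feed one such batch to the optimizer instead of the whole training set.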

init_variables()
total_iterations = 0
optimize(num_iterations=3500, X=250)

As you can see, the model tends to overfit and is not very good.

8 — CNN with TensorFlow — Results

The results are not very good, as the accuracy is only 44%. Using a pre-trained model with Keras will give you a better result, but with this model, you will know how to build your own CNN from scratch with TensorFlow.

With more photos of dogs, we could increase the accuracy. In addition, we can create new images in our training dataset by rotating the existing images. This is what we call image augmentation. It helps the model detect a pattern that can appear at different positions in the image.
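As a minimal illustration of augmentation, the sketch below creates a mirrored copy and a copy rotated by 90 degrees of each image, and repeats the label for every copy. This is a simplification: real pipelines usually rotate by small random angles with an image library, and the helper names are made up.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(images, labels):
    """Return the originals plus a flipped and a rotated copy of each image."""
    out_images, out_labels = [], []
    for image, label in zip(images, labels):
        for variant in (image, flip_horizontal(image), rotate_90_clockwise(image)):
            out_images.append(variant)
            out_labels.append(label)  # the breed does not change
    return out_images, out_labels


# A toy 2 x 2 "image" whose pixels are plain numbers.
image = [[1, 2],
         [3, 4]]
aug_images, aug_labels = augment([image], ["pug"])
print(aug_images, aug_labels)
```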

Below are some functions to show some images from the new test data with the corresponding breeds and the predicted breeds. There is also the confusion matrix to see the results.

Let’s look at some results!

feed_dict_validation = {x: X_validation, y_true: y_validation, keep_prob_conv: 1, keep_prob_fc: 1}
df_validation_Predicted_cls =, feed_dict=feed_dict_validation)
plot_images(images=X_validation[50:62], cls_true=df_validation_toPred_cls[50:62], cls_pred=df_validation_Predicted_cls[50:62])
i = 63
print(("True : {0} / {1}").format(df_validation_toPred_cls[i], labels_name[df_validation_toPred_cls[i]]))
print(("Pred : {0} / {1}").format(df_validation_Predicted_cls[i], labels_name[df_validation_Predicted_cls[i]]))
lum = X_validation[i,:,:,:]

As you can see, the model has difficulty differentiating Breed 1: bernese_mountain_dog and Breed 2: entlebucher. These 2 breeds look very similar to each other (same color and shape), so it is not surprising that our model confuses them.
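A confusion matrix makes this kind of mix-up visible at a glance. Here is a small sketch of how one can be computed from true and predicted class indices. The toy predictions are made up and do not come from the model above.

```python
def confusion_matrix(true_classes, predicted_classes, num_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_classes, predicted_classes):
        matrix[t][p] += 1
    return matrix


def accuracy(true_classes, predicted_classes):
    """Fraction of samples whose predicted class matches the true class."""
    correct = sum(1 for t, p in zip(true_classes, predicted_classes) if t == p)
    return correct / len(true_classes)


# Toy example with 3 classes, where classes 1 and 2 get confused with each other.
true_cls = [0, 1, 1, 2, 2, 2]
pred_cls = [0, 1, 2, 2, 1, 2]

matrix = confusion_matrix(true_cls, pred_cls, num_classes=3)
acc = accuracy(true_cls, pred_cls)
print(matrix, acc)
```

Off-diagonal counts that cluster around one pair of classes point to breeds the model cannot tell apart.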

Nanonets makes Transfer Learning easier

Having experienced the accuracy problem with transfer learning, I set out to solve it by building an easy-to-use cloud-based Deep Learning service that uses Transfer Learning. It contains a set of pre-trained models that have been trained on millions of parameters. I can upload Fashion MNIST data, and then the service selects the best model to use for the task. Lastly, it creates a new NanoNet on top of the existing pre-trained model and fits the NanoNet to the data.

Because the NanoNets models are heavily pre-trained, I used a much smaller training dataset of only ~100 images per class. From this model, I got 83.3% test accuracy. This is 7% more than the VGG19 model in spite of using 1/60th of the data! The NanoNets model performs better for three reasons: a large amount of pre-training, optimal hyper-parameter selection, and data augmentation.

The great thing about NanoNets is that anyone can upload data and build their own models. You can build models in 2 ways:

1. Using a GUI:

2. Using NanoNets API:

Below, we will give you a step-by-step guide to training your own model using the Nanonets API, in 9 simple steps.

Step 1: Clone the Repo

git clone
cd image-classification-sample-python
sudo pip install requests

Step 2: Get your free API Key

Get your free API Key from

Step 3: Set the API key as an Environment Variable


Step 4: Create a New Model

python ./code/

Note: This generates a MODEL_ID that you need for the next step

Step 5: Add Model Id as Environment Variable


Step 6: Upload the Training Data

Collect the images of the objects you want to detect. Once you have the dataset ready in the images folder (image files), start uploading the dataset.

python ./code/

Step 7: Train Model

Once the images have been uploaded, begin training the model:

python ./code/

Step 8: Get Model State

The model takes ~30 minutes to train. You will get an email once the model is trained. In the meantime, you can check the state of the model:

watch -n 100 python ./code/

Step 9: Make Prediction

Once the model is trained, you can make predictions using the model:

python ./code/ PATH_TO_YOUR_IMAGE.jpg

