Image Classifier 101: A Dog Breed Example

17 min read · May 26, 2020

In this article, I will introduce a machine learning project on how to build an image classifier using convolutional neural networks (CNNs) to predict dog breed from an image, followed by some discussion on model evaluation and reflections.

This project is one of the capstone projects included in Udacity’s Data Science Nanodegree program.

1. Project Overview

1.1 Problem Domain

Usually, people want to know the breed of a dog just because of curiosity. However, identifying a dog breed in a quick and accurate way also has many serious applications, such as finding lost dogs, or better matching dogs to personal or organizational adopters based on their characteristics and potential behaviors associated with the breed instead of just appearance.

Machine learning, especially deep learning, has been widely used to solve image recognition and classification problems by learning from tons of data and identifying feature patterns in an iterative process to make decisions.

In this project, I will use deep learning to predict dog breed using a small number of images.

1.2 Data and Inputs

The data I will use include both human face images and dog images. The human dataset contains 13,233 images. The dog dataset contains 8,351 images in 133 categories (more details in 4. Data Exploration). I will use a Jupyter Notebook template to work on this project in GPU-enabled mode.

2. Problem Statement

2.1 Problem Statement

To make predictions, I will use Keras with TensorFlow as the backend to create a CNN model for dog breed classification from labeled images, and then wrap everything up into an algorithm that can process any user-supplied image.

The developed algorithm should be able to:

1. Detect a dog in an image and predict its breed.
2. Detect a human face in an image and predict the most closely resembling dog breed.
3. Report an error when neither a human face nor a dog is detected in an image.

Specifically, I will test both the performance of the human detector and the dog detector on a small sample of human and dog images, respectively. Then we can calculate the detection accuracy (see 3. Metrics for more details).

2.2 Solution Statement

I will implement the classifiers as CNNs with Keras and TensorFlow, first from scratch and then with transfer learning. Specifically, I will use OpenCV’s implementation of a pre-trained Haar feature-based cascade face detector to detect human faces in the input images, and a pre-trained ResNet-50 model to detect dogs in the input images.

After that, I will create a CNN architecture to classify dog breeds from scratch with a test accuracy of at least 1%. To improve the performance of the model, transfer learning will be applied to create a CNN to achieve much higher accuracy of at least 60% on the test set.

2.3 Project workflow

Step 0: Import Datasets
Import dog images and human images, and create NumPy arrays of the image file paths and labels for future processing.

Step 1: Detect Human

Create and assess a human face detector to detect human faces in images using the Haar feature-based cascade classifiers implemented in OpenCV (Open Source Computer Vision Library).

Step 2: Detect Dogs

Preprocess the dog images and use a pre-trained ResNet-50 model to detect dogs in the images.
Create and assess the dog detector.

Step 3: Create a CNN to Classify Dog Breeds (from Scratch)

Preprocess the dog images
Build a multi-layer CNN model architecture
Compile and train the built model
Load model with best validation loss
Test the model accuracy (should be greater than 1%)

Step 4: Create a CNN to Classify Dog Breeds (using Transfer Learning)

Obtain Bottleneck Features
Build model architecture using a pre-trained ResNet-50 model
Compile and train the transfer learning model
Load model with best validation loss
Test the model accuracy (should be greater than 60%)

Step 5: Write Your Algorithm

Combine the pre-defined human face and dog detector functions with the ResNet-50 based prediction model

Step 6: Test Your Algorithm

Test the algorithm with various types of user-provided images and output the predicted dog breed, resembling dog breed or an error.

3. Metrics

In this project, the goal is fairly simple: detect the correct dog breed in a dog image. I will therefore use accuracy, a very straightforward concept, as the only metric to measure detection and prediction performance.

For detection accuracy and model prediction accuracy, the formulas are:

Number of images with a dog detected / Total number of images = Accuracy of the dog detector

Number of dog breeds correctly predicted / Total number of images = Accuracy of the model
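The two ratios above can be sketched in a few lines of plain Python. This is only an illustration: the function names `accuracy` and `detection_rate` and the sample labels are made up for this example and do not come from the project notebook.

```python
def accuracy(predicted, actual):
    """Fraction of items where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty sample")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def detection_rate(detector, image_paths):
    """Fraction of images for which `detector(path)` returns True."""
    hits = sum(1 for path in image_paths if detector(path))
    return hits / len(image_paths)


# Hypothetical labels, for illustration only
predicted_breeds = ["Beagle", "Poodle", "Beagle", "Akita"]
true_breeds = ["Beagle", "Poodle", "Boxer", "Akita"]
print(f"Accuracy of the model: {accuracy(predicted_breeds, true_breeds):.1%}")  # 75.0%
```

Any function that maps an image path to True or False, such as a face or dog detector, can be passed as `detector`.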

4. Data Exploration

Step 0: Import Datasets

I have imported dog images and created:

NumPy arrays containing file paths to train, validation and test images
NumPy arrays containing one-hot encoded classification labels for training, validation and test
A list of string-valued dog breed names for translating labels, dog_names
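For readers new to one-hot encoding, here is a minimal sketch of how a label vector relates to a list of breed names. The helper `one_hot` and the three-item `dog_names_demo` list are hypothetical stand-ins; the project itself has 133 breeds in `dog_names`.

```python
def one_hot(label_index, num_classes):
    """Return a list with 1.0 at label_index and 0.0 everywhere else."""
    if not 0 <= label_index < num_classes:
        raise ValueError("label_index out of range")
    vector = [0.0] * num_classes
    vector[label_index] = 1.0
    return vector


# Hypothetical three-breed example (the real dog_names list has 133 entries)
dog_names_demo = ["Affenpinscher", "Afghan_hound", "Airedale_terrier"]
target = one_hot(1, len(dog_names_demo))

# Translate the one-hot label back to a breed name
decoded = dog_names_demo[target.index(1.0)]
print(target, decoded)
```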

I have also imported human images and created:
A NumPy array of human image file paths, human_files

After loading all the datasets using glob and the provided image paths, I found that the dog dataset contains 8,351 images in 133 categories (split into training, validation, and test sets), and that there are 13,233 human face images in the human dataset.

# Number of dog images
There are 133 total dog categories.
There are 8351 total dog images.

There are 6680 training dog images.
There are 835 validation dog images.
There are 836 test dog images.

# Number of human face images
There are 13233 total human images.

There are only about 50 images on average (6,680 / 133) per dog breed category in the training set, which is a pretty small dataset for building deep learning algorithms. But I will use it anyway to get a general idea of the model performance.

5. Data Visualization

Before preprocessing all images, I took a quick look at the distribution of the number of dog images in each breed category:

Distribution of dog images in each breed

The image counts are not very balanced across dog breeds (they vary from about 26 to about 77 images per breed).
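A per-breed count like this can be derived from the file paths alone. The sketch below assumes that each image sits in a folder named after its breed; the folder layout and the demo paths are assumptions made for illustration, not details taken from the project.

```python
from collections import Counter
from pathlib import PurePosixPath


def images_per_breed(file_paths):
    """Count images per breed, assuming the parent folder name identifies the breed."""
    return Counter(PurePosixPath(path).parent.name for path in file_paths)


# Hypothetical paths, for illustration only
demo_paths = [
    "dogImages/train/001.Affenpinscher/a.jpg",
    "dogImages/train/001.Affenpinscher/b.jpg",
    "dogImages/train/002.Afghan_hound/c.jpg",
]
counts = images_per_breed(demo_paths)
print(counts.most_common())
print("min:", min(counts.values()), "max:", max(counts.values()))
```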

After plotting the width and height of all the dog images, I found that the sizes of the dog images vary from ~100×100 pixels to ~3,700×4,200 pixels, as illustrated in some sample images below:

After plotting the width and height of all the human face images, I found that all of them are the same size, 250×250 pixels, as shown in the sample images below:

6. Data Preprocessing

6.1 Data Preprocessing for Detecting Human

In this step, I will create and assess a human face detector that detects human faces in an image using OpenCV’s implementation of Haar feature-based cascade classifiers. One of the pre-trained detectors has been downloaded and stored as:

'haarcascade_frontalface_alt.xml'

Then I will use the following human face detector to detect human faces in both dog and human images:

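import cv2

# load the pre-trained Haar cascade face detector stored above
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_alt.xml')

def face_detector(img_path):
    """
    Detect human face in the image.
    Input: img_path - path of the image file
    Output: returns True if a human face is detected, and False otherwise.
    """
    # OpenCV loads images in BGR order; the cascade works on grayscale
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray)
    return len(faces) > 0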

6.2 Data Preprocessing for Detecting Dogs

In this step, I will use a pre-trained ResNet-50 model (a deep residual network with 50 layers; more details are discussed in 10.2) to detect dogs in images. This model contains weights that have been trained on ImageNet, a very large dataset used widely for computer vision tasks.

from keras.applications.resnet50 import ResNet50

# define ResNet50 model
ResNet50_model = ResNet50(weights='imagenet')

But to take full advantage of a pre-trained model, I first need to go through some data preprocessing steps.

First, with TensorFlow as backend, Keras CNNs require a 4D array (or a 4D tensor) as input for each image, with the shape of (number of samples, rows, columns, channels), or (1, 224, 224, 3) in our case.

To get the 4D arrays, I need to apply the path_to_tensor function, which takes a file path to a color image as input and returns a 4D tensor that can be processed by Keras CNNs. This function will load the image and resize it to a square one with 224 × 224 pixels.

from keras.preprocessing import image
import numpy as np

def path_to_tensor(img_path):
    # load RGB image as PIL.Image.Image type, resized to 224 x 224
    img = image.load_img(img_path, target_size=(224, 224))
    # convert PIL.Image.Image type to 3D tensor with shape (224, 224, 3)
    x = image.img_to_array(img)
    # convert 3D tensor to 4D tensor with shape (1, 224, 224, 3) and return 4D tensor
    return np.expand_dims(x, axis=0)

In the next step, I need to convert the RGB image to BGR by reordering the channels, and subtract the ImageNet mean pixel value from every pixel in each image. This is the normalization that the pre-trained ResNet-50 model in Keras expects.

To do this, I will use the imported preprocess_input function. Passing the preprocessed tensor to the model’s predict method then returns an array of predicted probabilities, one for each of the available categories in ImageNet.
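To make the normalization step concrete, here is a rough NumPy sketch of a channel reorder followed by mean subtraction. The helper name `rgb_to_normalized_bgr` is made up for this illustration, and the per-channel means are the ImageNet values that Keras uses for this style of preprocessing; in the project itself, the library function does this work.

```python
import numpy as np

# Per-channel ImageNet means in BGR order
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)


def rgb_to_normalized_bgr(batch_rgb):
    """Reorder channels RGB -> BGR, then subtract the per-channel mean."""
    batch_bgr = np.asarray(batch_rgb, dtype=np.float32)[..., ::-1]
    return batch_bgr - IMAGENET_MEAN_BGR


# A made-up 2 x 2 pure-red image in a batch of one: shape (1, 2, 2, 3)
demo = np.zeros((1, 2, 2, 3), dtype=np.float32)
demo[..., 0] = 255.0  # red channel comes first in RGB order

normalized = rgb_to_normalized_bgr(demo)
print(normalized.shape)     # (1, 2, 2, 3)
print(normalized[0, 0, 0])  # blue, green, red after mean subtraction
```

After the reorder, the red value ends up in the last channel, so only that channel stays positive once the means are subtracted.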

Finally, taking the argmax of that probability vector gives an integer corresponding to the model’s predicted object class, which can be identified in this dictionary. This can be achieved by the ResNet50_predict_labels function:

from keras.applications.resnet50 import preprocess_input, decode_predictions

def ResNet50_predict_labels(img_path):
    # returns the index of the most likely ImageNet class for the image located at img_path
    img = preprocess_input(path_to_tensor(img_path))
    return np.argmax(ResNet50_model.predict(img))
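The predicted class index can then be turned into a yes/no dog decision. The dog_detector shown later in this article treats indices 151 to 268 (inclusive) as dog classes; the sketch below mirrors that check in plain Python. The probability vector and the helper names `argmax` and `is_dog_class` are invented for the illustration.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=values.__getitem__)


# Class indices treated as dogs by the dog detector in this article
DOG_CLASS_RANGE = range(151, 269)  # 151..268 inclusive


def is_dog_class(class_index):
    return class_index in DOG_CLASS_RANGE


# Made-up probability vector over 1,000 classes, for illustration only
probabilities = [0.0] * 1000
probabilities[207] = 0.9
probabilities[281] = 0.1

predicted = argmax(probabilities)
print(predicted, is_dog_class(predicted))  # 207 True
```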

6.3 Data Preprocessing for Creating a CNN to Classify Dog Breeds from Scratch

I will rescale the images by dividing every pixel in every image by 255, so that all pixel values fall between 0 and 1 before they are fed to the CNN.

# preprocess the data for Keras
train_tensors = paths_to_tensor(train_files).astype('float32')/255
valid_tensors = paths_to_tensor(valid_files).astype('float32')/255
test_tensors = paths_to_tensor(test_files).astype('float32')/255

6.4 Data Preprocessing for Creating a CNN to Classify Dog Breeds using Transfer Learning

Finally, I will extract the bottleneck features corresponding to the train, test, and validation sets for ResNet-50 pre-trained model to process:

# ResNet-50
bottleneck_features = np.load('bottleneck_features/DogResnet50Data.npz')
train_resnet50 = bottleneck_features['train']
valid_resnet50 = bottleneck_features['valid']
test_resnet50 = bottleneck_features['test']

7. Implementation

7.1 Step 1: Detect Human

Using the previously built human face detector, I got the following results on 100 human face images and 100 dog images:

There are 100.0% human faces detected in human images.
There are 11.0% human faces detected in dog images.

The 11% false-positive rate on dog images shows that this detector could be improved, but I will use it as is in this project.

7.2 Step 2: Detect Dogs

Using the previously built dog detector, I got the following results on human and dog images:

There are 0.0% dogs detected in human images.
There are 100.0% dogs detected in dog images.

These results look pretty good!

7.3 Step 3: Create a CNN to Classify Dog Breeds (from Scratch)

With functions available for detecting humans and dogs in images, I will first create a CNN model from scratch. It must reach a test accuracy of at least 1%, which is better than a random guess (1 in 133, or about 0.75%).

I built a CNN model architecture in the following steps:

Define CNN layers: Create 4 convolutional layers with 16, 32, 64 and 128 filters, respectively (with kernel size = 3 and padding = ‘same’, i.e. zero padding). The input layer takes 224 x 224 images with 3 color channels, and the output layer produces a 133-dimensional vector of predicted dog breed probabilities.
Dimension reduction: Halve the width and height of the feature maps with a max pooling layer after each convolutional layer.
Define fully connected layers: Create a hidden layer with 512 nodes and an output layer with 133 nodes to generate output features with a dimension of 133.
Apply flatten and dropout: Add one flatten layer to flatten the data, and two dropout layers with probability of 0.5 to reduce overfitting.
Define activation functions: ReLU is used as the activation function in the hidden layers and Softmax is used in the output layer.

Before testing how well the model performs on the test dog images, I compiled the model with a specified optimizer, loss function and metric, and then trained it with a given number of epochs and batch size, saving the best model during training.

Here, it pays to select the architecture components and parameters carefully, to balance accuracy against training time.

from keras.callbacks import ModelCheckpoint

model.compile(optimizer = 'rmsprop',
              loss = 'categorical_crossentropy',
              metrics = ['accuracy'])

epochs = 10
batch_size = 20

# save only the best weights seen during training
checkpointer = ModelCheckpoint(
           filepath='saved_models/weights.best.from_scratch.hdf5',
           verbose=1, save_best_only=True)

model.fit(train_tensors, train_targets,
          validation_data=(valid_tensors, valid_targets),
          epochs=epochs, batch_size=batch_size,
          callbacks=[checkpointer], verbose=1)
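With save_best_only=True, a checkpoint is only written when the monitored quantity (validation loss by default) improves. The plain Python sketch below mimics that bookkeeping on a made-up list of validation losses; the function names and the numbers are illustrative and not taken from the actual training run.

```python
def checkpoint_epochs(val_losses):
    """Epochs (1-indexed) at which a save-best-only checkpoint would be written."""
    saved, best = [], float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # only save when validation loss improves
            best = loss
            saved.append(epoch)
    return saved


def best_epoch(val_losses):
    """Return (epoch, loss) for the lowest validation loss."""
    index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return index + 1, val_losses[index]


# Made-up validation losses for six epochs
demo_val_losses = [4.8, 4.5, 4.3, 4.4, 4.2, 4.6]
print(checkpoint_epochs(demo_val_losses))  # [1, 2, 3, 5]
print(best_epoch(demo_val_losses))         # (5, 4.2)
```

After training, the saved weights file therefore holds the weights from the best epoch, which is what gets loaded (for example with the model’s load_weights method) before measuring test accuracy.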

7.4 Step 4: Create a CNN to Classify Dog Breeds (using Transfer Learning)

To reduce training time without sacrificing accuracy, I also (and more importantly!) trained a CNN using transfer learning. This time, I need to make sure that the test accuracy is greater than 60%.

# Compile the model
resnet50_model.compile(loss = 'categorical_crossentropy',
                       optimizer = 'rmsprop',
                       metrics = ['accuracy'])

# Train the model
checkpointer_resnet50 = ModelCheckpoint(
               filepath = 'saved_models/weights.best.resnet50.hdf5',
               verbose = 1, save_best_only = True)

resnet50_model.fit(train_resnet50, train_targets,
                   validation_data = (valid_resnet50, valid_targets),
                   epochs = 20, batch_size = 32,
                   callbacks = [checkpointer_resnet50], verbose = 1)

8. Refinement

In order to improve the performance of the CNN model built from scratch, I plan to try a few different things:

Adding more convolutional layers (e.g., adding a 256-filter convolutional layer after the 128-filter layer). More layers often lead to better accuracy, since deeper layers can combine the simple patterns found by earlier layers into more complex, higher-level features. But the training will take longer since we will have more parameters to train in each epoch.
Changing the stride from 1 (default) to 2. This adjustment may lead to a decrease in accuracy, since a larger step size may be prone to ignoring some detailed information that is key for an accurate result, although the training time can be shortened by using a larger stride.
Using different dropout probabilities. With a larger dropout, we randomly turn off more units in the network at each training step, which forces the network not to rely too heavily on any single unit. In this way, we can make the network more robust and reduce overfitting.
Training for more epochs. More training epochs may help the neural network to be fully trained, but they also make it prone to overfitting and increase the training time.
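The dropout idea in the list above can be sketched in plain Python. This is a simplified illustration of "inverted" dropout, where surviving units are scaled up by 1 / (1 - rate) during training so that the expected activation stays the same; the function name and the toy activations are made up for this example.

```python
import random


def dropout(activations, rate, rng):
    """Zero each unit with probability `rate`; scale survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)  # seeded so the run is repeatable
units = [1.0] * 10      # toy activations, for illustration only
dropped = dropout(units, rate=0.5, rng=rng)
print(dropped)  # each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
```

With a rate of 0.5, roughly half of the units are switched off at each training step, and a different random half is chosen every time.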

9. Model Evaluation and Validation

9.1 Model Evaluation of CNN Model Built from Scratch

The detailed model architecture and summary of the CNN model built from scratch are shown below:

from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Dropout, Flatten, Dense
from keras.models import Sequential

model = Sequential()

# four convolutional blocks, each followed by 2 x 2 max pooling
model.add(Conv2D(filters = 16, kernel_size = (3,3),
                 padding = 'same', activation = 'relu',
                 input_shape = (224, 224, 3)))
model.add(MaxPooling2D(pool_size = (2,2)))

model.add(Conv2D(filters = 32, kernel_size = (3,3),
                 padding = 'same', activation = 'relu'))
model.add(MaxPooling2D(pool_size = (2,2)))

model.add(Conv2D(filters = 64, kernel_size = (3,3),
                 padding = 'same', activation = 'relu'))
model.add(MaxPooling2D(pool_size = (2,2)))

model.add(Conv2D(filters = 128, kernel_size = (3,3),
                 padding = 'same', activation = 'relu'))
model.add(MaxPooling2D(pool_size = (2,2)))

# classifier head
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(512, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(133, activation = 'softmax'))

model.summary()

A summary of the CNN model built from scratch

From the summary table, we can see that each successive convolutional layer doubles the number of filters, and each max pooling layer halves the width and height of its input. After applying dropout, flattening (14×14×128 = 25,088) and a dense layer with 512 nodes, the model outputs probabilities for the 133 classes.
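The shapes and parameter counts in such a summary can be checked by hand. The sketch below redoes the arithmetic for the architecture listed above (3×3 kernels, ‘same’ padding, 2×2 max pooling), using the standard formulas for convolutional and dense layers; the helper names are made up for this illustration.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel * kernel * in_channels + 1) * out_channels


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit for a Dense layer."""
    return (n_in + 1) * n_out


size, channels, total = 224, 3, 0
for filters in (16, 32, 64, 128):
    total += conv_params(3, channels, filters)
    channels = filters
    size //= 2  # 2 x 2 max pooling halves width and height

flat = size * size * channels
total += dense_params(flat, 512) + dense_params(512, 133)

print(size, flat, total)  # 14 25088 13011237
```

Almost all of the roughly 13 million parameters sit in the first dense layer (25,088 × 512 weights), not in the convolutional layers.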

Since this model is only trained on original images without any image augmentation (rotate, flip, zoom, etc.) and our training data is very small (tens of images per category), this model may not be very robust when predicting new images, due to the limited variety of the training data.

9.2 Model Evaluation of CNN Model Built using Transfer Learning

The detailed model architecture and summary of the CNN model built using transfer learning are shown below:

from keras.layers import GlobalAveragePooling2D

# Model architecture
resnet50_model = Sequential()
resnet50_model.add(GlobalAveragePooling2D(
                   input_shape = train_resnet50.shape[1:]))
resnet50_model.add(Dense(133, activation = 'softmax'))

resnet50_model.summary()

A summary of the CNN model built using pre-trained ResNet-50 model

In transfer learning, we freeze the weights of the pre-trained network, remove its original output layer, and add a global average pooling layer and a new fully connected layer with randomly initialized weights. This new output layer is the only one that gets trained, so we have far fewer parameters to train (fewer than 300,000, compared to about 13 million for the model built from scratch), which significantly speeds up the training process.
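The "fewer than 300,000" figure can be reproduced with a short calculation. The sketch assumes that the ResNet-50 bottleneck features have 2,048 channels (the width of the network’s final convolutional block), so that after global average pooling the dense layer sees a 2,048-dimensional vector; the helper name is made up for this example.

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per output unit for a Dense layer."""
    return (n_in + 1) * n_out


# Assumption: bottleneck features with 2,048 channels
RESNET50_FEATURE_CHANNELS = 2048
NUM_BREEDS = 133

# Global average pooling has no trainable parameters,
# so the new Dense(133) layer is all that gets trained.
head_params = dense_params(RESNET50_FEATURE_CHANNELS, NUM_BREEDS)
print(head_params)  # 272517
```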

Although this model is not trained on augmented images either, it has been pre-trained on tons of other images from ImageNet, so its robustness should be better than that of our freshly built CNN model.

9.3 Step 5. Model Validation: Write Your Algorithm

Finally, I combined the pre-defined human face and dog detector functions with the ResNet-50 based prediction model to generate an algorithm for dog and human face detection:

# Detect human face
def face_detector(img_path):
    """
    Detect human face in the image.
    Input: img_path - path of the image file
    Output: returns True if a human face is detected, and False otherwise. 
    """
    
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray)
    return len(faces) > 0
# Detect dog
def dog_detector(img_path):
    """
    Detect dog in the image.
    Input: img_path - path of the image file
    Output: returns True if a dog is detected, and False otherwise. 
    """
    prediction = ResNet50_predict_labels(img_path)
    return ((prediction <= 268) & (prediction >= 151))
# Detect the dog breed or resembling dog breed in the image
from PIL import Image
import matplotlib.pyplot as plt

def photo_detection_resnet50(img_path):
    """
    Predict the dog breed or resembling dog breed in the image
    INPUT: img_path - path to an image
    OUTPUT: returns a prediction of (resembling) dog breed
    """
    # show the input image
    img = Image.open(img_path)
    plt.imshow(img)
    plt.show()

    human_detected = face_detector(img_path)
    dog_detected = dog_detector(img_path)
    
    if human_detected:
        print("Hello human! You look like a ... {}!"\
              .format(resnet50_predict_breed(img_path)))
    elif dog_detected:
        print("Hello dog! Your breed is predicted as ... {}!"\
              .format(resnet50_predict_breed(img_path)))
    else:
        print("We didn't detect any human or dog in the image. Would you like to try another one?")
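The branching logic of this algorithm can also be sketched as a function that returns its message and receives the detectors as arguments, which makes it easy to try out without any images or models. The name `describe_image` and the stub detectors below are hypothetical; they are not part of the project code.

```python
def describe_image(img_path, face_detector, dog_detector, predict_breed):
    """Return the greeting for an image, checking for a human face first."""
    if face_detector(img_path):
        return "Hello human! You look like a ... {}!".format(predict_breed(img_path))
    if dog_detector(img_path):
        return "Hello dog! Your breed is predicted as ... {}!".format(predict_breed(img_path))
    return "We didn't detect any human or dog in the image. Would you like to try another one?"


# Stub detectors keyed on the file name, for illustration only
def fake_face(path):
    return "human" in path


def fake_dog(path):
    return "dog" in path


def fake_breed(path):
    return "Beagle"


for path in ("human_01.jpg", "dog_01.jpg", "car_01.jpg"):
    print(describe_image(path, fake_face, fake_dog, fake_breed))
```

Because the face check runs first, a dog image that also triggers the face detector is greeted as a human, which is worth keeping in mind given the face detector’s false positives on dog images.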

10. Justification

10.1. Results of CNN Model Built from Scratch

The training and validation accuracy curves during 10 epochs are shown below:

Accuracy (top) and Loss (bottom) curve during training of CNN model built from scratch

With this model architecture, I got an accuracy of 12.20% on the test images.

The training loss starts to increase after about 6 epochs and the accuracy doesn’t improve much after 6 epochs either.

I can actually improve the model further by adding a 256-filter convolutional layer after the 128-filter layer. This improves the accuracy by about 2–3%, but the training takes longer.

I also changed the stride from 1 (default) to 2. However, this adjustment decreased the accuracy to below 10%, which suggests that a step size of 2 may be too large to collect enough important features for an accurate result.
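One way to see why a larger stride loses information is to track the feature map size through the network. The sketch below assumes that the stride is applied to every convolutional layer (the text does not say which layers were changed), with ‘same’ padding for the convolutions and 2×2 max pooling after each one; the helper names are made up for this illustration.

```python
import math


def same_conv_out(size, stride):
    """Output width/height of a convolution with 'same' padding."""
    return math.ceil(size / stride)


def max_pool_out(size, pool=2):
    """Output width/height of non-overlapping max pooling without padding."""
    return (size - pool) // pool + 1


def feature_map_sizes(input_size, stride, blocks=4):
    """Width/height after each conv + max-pool block."""
    sizes, size = [], input_size
    for _ in range(blocks):
        size = max_pool_out(same_conv_out(size, stride))
        sizes.append(size)
    return sizes


print(feature_map_sizes(224, stride=1))  # [112, 56, 28, 14]
print(feature_map_sizes(224, stride=2))  # [56, 14, 3, 1]
```

Under these assumptions, a stride of 2 shrinks the final feature map from 14×14 to 1×1, so far less spatial detail reaches the dense layers.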

Different dropout probabilities were also tested. It turns out that a dropout of 0.5 gives slightly better results than a dropout of 0.2 or 0.3, with an improvement of 1–2%, presumably because the stronger regularization reduces overfitting.

Increasing the training epochs from 10 to 20 doesn’t seem to help much, since the accuracy is about the same but the training time is doubled. Therefore 10 epochs should be enough to get a good accuracy.

With the above settings, this model provides a consistent accuracy of 11–13% across multiple training runs.

10.2. Results of CNN Model Built using Transfer Learning

The training and validation accuracy curves during 20 epochs are shown below:

Accuracy (top) and Loss (bottom) curve during training of resnet50_model

With this model architecture, we get an accuracy of 82.54% on the test images.

The training loss starts to increase after about 3 epochs and the accuracy doesn’t improve much after 3 epochs. This transfer learning model provides a consistent accuracy of above 80% across multiple training runs.

Apparently, the accuracy of the CNN model based on pre-trained model is much higher than the CNN model built from scratch. Because pre-trained networks usually have great performance on the new data they haven’t seen before, they are widely applied on image recognition problems.

ResNet’s key idea is to introduce an “identity shortcut connection” into the feedforward network, which skips one or more layers without introducing extra parameters or computational complexity.
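The shortcut idea can be illustrated with a toy example: the block’s output is the transformed input plus the unchanged input. This is a deliberately simplified sketch on plain lists; in the real network the transformation is a stack of convolutional layers and an activation follows the addition. The names below are made up for the illustration.

```python
def residual_block(x, transform):
    """Add the input back onto the transformed input (identity shortcut)."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the input size")
    return [f + xi for f, xi in zip(fx, x)]


# If the transformation learns nothing (outputs zeros), the block passes x through unchanged
passthrough = residual_block([1.0, 2.0], lambda v: [0.0] * len(v))

# Otherwise the block only has to learn the difference from the identity
doubled_plus_identity = residual_block([1.0, 2.0], lambda v: [2.0 * a for a in v])

print(passthrough, doubled_plus_identity)  # [1.0, 2.0] [3.0, 6.0]
```

The shortcut itself adds no weights, which is why it comes without extra parameters.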

Besides the very high accuracy, an additional advantage of using transfer learning is that it reduces the training time significantly, because we only need to train the weights in the last classification (output) layer instead of the entire model.

For example, with GPU enabled, the CNN model built from scratch took a few minutes to train for 10 epochs, but the CNN model based on ResNet-50 only took a few seconds to train for 20 epochs. Transfer learning is definitely a time saver!

10.3 Step 6. Test Your Algorithm

The prediction results of the ResNet-50 based CNN model on some new images using the ‘photo_detection_resnet50’ function introduced above are shown below:

The outcomes are actually better than I expected, considering that this algorithm is far from perfect!

11. Reflection

I chose this project because I had gained some EDA, data wrangling and basic prediction skills in the past, but had very limited experience in using machine learning, especially deep learning, to solve image classification or recognition problems.

Therefore I decided to challenge myself to gain some new skills that are not fully covered in the nanodegree. Fortunately, Udacity also provides many interesting free, entry-level courses to help beginners break into a new field.

There are three things in this project that I found difficult but also interesting: human face detection, data preprocessing for Keras CNNs, and building up a CNN model architecture. Although the code looks simple, it took a lot of time and research to understand the working principles behind these concepts.

At the end of this project, I have:

learnt how to build a human face detector and a dog detector.
learnt how to preprocess image data so that it can be properly processed by Keras CNN models.
understood how a CNN works and been able to create a CNN model from scratch with an accuracy > 10% (benchmark is 1%).
understood what transfer learning is and been able to create a CNN model using transfer learning with an accuracy > 80% (benchmark is 60%).
successfully wrapped up an algorithm to predict dog breed on user-supplied images with a satisfactory outcome.

I am also interested in learning more about how to improve the model robustness and performance by applying image augmentation and fine-tuning models. This may help me find out the secret sauce of those algorithms that get high ratings in data science competitions like Kaggle.

12. Improvement

There are multiple things that can be improved in this process:

1. A balanced dataset with more training images of dogs in each breed category is highly desirable.
2. A lot more fine tuning work could be done to improve the accuracy:
* Apply image augmentation (rotate, flip, zoom, etc.),
* Use a different CNN architecture,
* Use another pre-trained network,
* Use other types of optimizers and loss functions,
* Apply an optimized learning rate, etc.
3. Collect the prediction results of different dog breeds and list the probability of each prediction in descending order.
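The last item could look roughly like the sketch below, which ranks breeds by predicted probability and keeps the top few. The function name `top_k_breeds`, the breed names and the probabilities are made up for this illustration.

```python
def top_k_breeds(probabilities, breed_names, k=3):
    """Return the k most likely (breed, probability) pairs, highest first."""
    ranked = sorted(zip(breed_names, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical prediction over four breeds
demo_names = ["Beagle", "Boxer", "Poodle", "Akita"]
demo_probs = [0.10, 0.60, 0.05, 0.25]

top_two = top_k_breeds(demo_probs, demo_names, k=2)
for name, prob in top_two:
    print(f"{name}: {prob:.0%}")
```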

The code behind this article can be found in this GitHub repository.