Implementing an Image Classifier with PyTorch: Part 1
The first of three articles exploring a PyTorch project from Udacity’s AI Programming with Python Nanodegree program.
After a few months of challenging and rewarding learning, I recently graduated from the AI Programming with Python Nanodegree program at Udacity. The final project required for successful completion of the program was to use PyTorch to build an image classifier that recognizes 102 different types of flowers.
While working on this final project, I discovered that I was running into many of the same issues and challenges as my fellow students. As I progressed towards completion, I decided to share some tips and insights that could benefit others in the future.
Over the course of three short posts, I will introduce you to the conceptual basics of how to implement an image classifier — which is an algorithm that can understand the content of an image.
My goal here is not to provide step-by-step instructions, but to help you to understand the overall process. If you are considering studying Machine Learning or Artificial Intelligence, you will have to do similar projects and understand the concepts that I will cover during this series.
What I will explain here is mostly conceptual. You don’t need to know how to write code to be able to follow along. Also, the PyTorch specifics included below are minor, and PyTorch is mainly used as an example.
The first step in the process is to load a pre-trained neural network. In discussing this step, I will explain why you want to “reuse” networks (i.e. use “pre-trained” networks), clarify which parts can and cannot be reused, and provide guidance on how to customize a pre-trained network for your needs.
Loading a pre-trained network
Reusing things is a perfectly reasonable strategy, particularly when those things are well-known, widely recognized standards. In this example, our starting point will be one of the model architectures that torchvision provides.
Our goal here is to load one of these pre-trained networks and replace its classifier with our own. Once done, we will train our classifier.
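To make this concrete, here is a minimal sketch of how one of these architectures can be loaded (I picked VGG16 only because it is the example discussed below, and the pretrained=True argument reflects the older torchvision API, which newer releases replace with a weights parameter):

from torchvision import models

# Download VGG16 with weights that were pre-trained on ImageNet
model = models.vgg16(pretrained=True)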
While the idea is reasonable, I found that it also generates some confusion, because loading a pre-trained network does not save us the effort of training our own classifier.
“So you might wonder, what’s the point of using a pre-trained network?”
As humans, when we see an image, we can identify lines and shapes. Thanks to that, we can relate the image content to something that we have seen before. We want our classifier to be able to do something similar, but an image is not a trivial piece of data. Images are generally made up of thousands of independent pixels, each of which has a color defined by the combination of three values: red, green, and blue.
If we want our classifier to be able to handle that volume of data, we will need to process all that information contained in each image and feed it to the classifier in a format that it can understand. That’s where the pre-trained network comes into play.
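To get a feel for the amount of data involved, here is a small sketch that turns a single image into the kind of tensor these networks expect (the resize, crop, and normalization values are the common ImageNet defaults rather than anything specific to this project, and flower.jpg is just a placeholder file name):

from PIL import Image
from torchvision import transforms

# Resize, crop to 224x224, convert to a tensor of red/green/blue channels,
# and normalize each channel
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open('flower.jpg'))
print(image.shape)    # torch.Size([3, 224, 224])
print(image.numel())  # 150528 values for one image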
These pre-trained networks are primarily composed of a set of feature detectors and a classifier, where the feature detectors are trained to extract the information from each image, and the classifier is trained to understand the input that the feature layers provide.
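In torchvision's VGG-style models, for example, these two parts are exposed as separate attributes, which is exactly what lets us keep one and replace the other:

print(model.features)    # the stack of feature detectors (convolutional layers)
print(model.classifier)  # the fully connected classifier we will replace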
The feature detectors have been trained on ImageNet and are proven to work well. Because of this, we want to keep them as they are. To prevent the feature layers from being modified as we train our classifier, we need to “freeze” them. This small code snippet can help with that:
for param in model.parameters():
    param.requires_grad = False
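If you want to reassure yourself that the freeze worked, a quick sanity check could be:

# Every existing parameter should now be excluded from gradient updates
print(all(not p.requires_grad for p in model.parameters()))  # True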
What about the classifier? Why can’t we reuse it as well? To answer this question, let’s take an architecture — VGG16 for example — and have a look at its default classifier:
(classifier): Sequential(
  (0): Linear(in_features=25088, out_features=4096, bias=True)
  (1): ReLU(inplace)
  (2): Dropout(p=0.5)
  (3): Linear(in_features=4096, out_features=4096, bias=True)
  (4): ReLU(inplace)
  (5): Dropout(p=0.5)
  (6): Linear(in_features=4096, out_features=1000, bias=True)
)
First of all, there is no guarantee that this default classifier would work for us. It is doubtful that its number of layers, layer sizes, activation functions, and dropout values are precisely the best ones for our specific situation.
That this is the case becomes obvious when we look at its last layer, which has an output of 1000 elements: VGG16 was trained to distinguish the 1000 categories of ImageNet. In our case, we are handling 102 different types of flowers, so the output of our classifier must be 102 instead.
From the default classifier in VGG16 above, we can also notice that its input layer has 25088 elements, as this is the output size of the feature detectors in this particular pre-trained model. The input size of our classifier must also match the output of the feature layers.
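Putting those two constraints together, a sketch of a replacement classifier could look like the following (the hidden layer size of 512, the dropout value, and the LogSoftmax output are illustrative choices of mine, not the one required architecture):

from torch import nn

# The input must match the 25088 features produced by the feature detectors,
# and the output must match our 102 flower types
model.classifier = nn.Sequential(
    nn.Linear(25088, 512),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(512, 102),
    nn.LogSoftmax(dim=1),   # pairs with nn.NLLLoss during training
)

Because these new layers are created after the freeze shown earlier, their parameters have requires_grad set to True by default, so the classifier is the only part of the network that will be updated during training.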
Conclusion
What we can see above is that pre-trained networks are incredibly beneficial, as they allow us to focus on the specifics of our use case while reusing well-known generics, in our example the feature detectors that process the images.
We have also learned that the size of our classifier’s output must be the same as the number of different types that we want to be able to identify.
Finally, we have seen that the output of the feature layers, and the input of our custom classifier, must match in size as well.
In my next article, we will explore how to avoid common pitfalls during the training of our classifier, and learn how to tweak the hyperparameters to improve the accuracy of our model.
Was this useful to you? Please comment, and let me know!