If I Can You Can (and you should!)

James Dellinger
13 min read · Oct 29, 2018


This fall I’m participating in fast.ai’s Practical Deep Learning for Coders course that meets for a few hours on one evening each week at USF. We kicked off the first session this past Monday, during which we learned how to code up and train a deep learning classifier that can distinguish between thirty-seven breeds of cats and dogs at an error rate of around five percent.

A few days after our first class, I decided to see if I could just as easily build a classifier of my own, except perhaps one that’s able to do something a little more personally interesting to me than figuring out cat and dog breeds. Growing up in the Bay Area as a kid, one of my favorite activities was going on hikes at the nearby nature reserves and searching for all the eye-catching and colorful bird species that couldn’t be found on my cul-de-sac in Sunnyvale.

Since Blue Jays had always been my favorite bird, I decided I’d try and train a deep learning model to differentiate between five different Blue Jay species.

The first thing one might notice about the above birds is that, at least to the untrained eye, it gets progressively more difficult to distinguish one species from the next. Sure, telling a Blue Jay apart from a Steller’s Jay, or either of those apart from the Scrub-Jays, is easy enough. But what about discriminating amongst the different Scrub-Jays — especially between the California and Woodhouse varieties? If it weren’t for the fact that the Cornell University Lab of Ornithology says so, I’m not sure I could be convinced that they are two different species.

Are these really different species? According to the Cornell Lab of Ornithology: Yes!

I was initially skeptical that I’d be able to train a deep learning model that could come anywhere close to being able to distinguish between such similar looking species. And what’s more, since to the best of my knowledge there isn’t a publicly available imageset of Blue Jays that I could use to train my model, I would have to curate my own by hand. In the interest of time, I decided that I’d collect only 100 images of each of the five Blue Jay species. It would be interesting to see if a grand total of 500 images would be enough to train and validate a decent classifier.

On the Importance of Validation Sets

It’s important to remember that my goal was to build a model that could correctly predict the species of any Blue Jay it sees, at any time. In order to ascertain whether or not my model was making progress in learning patterns that generalized beyond the specific set of images it’s trained on, I would set aside roughly one-fifth of my 500 images to serve as a validation set. Because my model never saw the images in the validation set during training, measuring how accurately my model predicted the Blue Jay species for each image in this set allowed me to simulate how well my model might perform if I were to use it in the real world — say, if I were to take a picture of a Blue Jay in my backyard and then have my model predict its species.

This left only around 400 images, or roughly seventy-five images for each Blue Jay species, that would actually be used for training my model.

Downloading the Images

The Cornell Ornithology Lab’s Macaulay Library has a user-friendly portal that serves up tens of thousands of high quality images of just about any species of bird imaginable. These images are ostensibly crowd-sourced from the global bird-watching community. An educator/researcher-friendly terms-of-service and an API that returns metadata and direct download links for thirty images at a time made it relatively easy for me to download 500 high quality Blue Jay images.

Better than Google image search (for birds, at least)

For each species, I clicked ‘Save Spreadsheet’ to grab the metadata and download links to the thirty highest quality images from 2018. I repeated the process three more times to get thirty images each from 2017 and 2016, and then ten images from 2015. Once I had 100 download links for each of the five species, I just used a slightly modified version of fastai’s useful download_images() method to actually download and store the images.
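For the curious, the download loop looked roughly like the sketch below. It’s not a verbatim copy of my notebook: the folder layout and the spreadsheet’s URL column name are placeholders, and I’m leaning on fastai v1’s download_images(), which takes a text file of URLs plus a destination folder.

```python
from pathlib import Path
import pandas as pd
from fastai.vision import *   # brings in download_images(), among other things

data_path = Path('data/jays')  # placeholder layout: one Macaulay CSV per species

for csv_file in (data_path/'spreadsheets').glob('*.csv'):
    species = csv_file.stem                      # e.g. 'stellers_jay'
    dest = data_path/species
    dest.mkdir(parents=True, exist_ok=True)

    # Pull just the download links out of the Macaulay Library spreadsheet
    # ('image_url' is a stand-in for whatever the real column is called)
    # and write them to a plain text file, one URL per line.
    urls_file = dest/'urls.txt'
    pd.read_csv(csv_file)['image_url'].to_csv(urls_file, index=False, header=False)

    # fastai downloads everything listed in urls_file into dest.
    download_images(urls_file, dest)
```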

Actually Building the Deep Learning Model

Folks who’ve had some experience building deep learning models might expect that at this point my work is only beginning. After all, I’ve only just acquired my imageset. Don’t I need to do things like download a backbone with pre-trained weights, code up a fully-connected network head, tune hyperparameters, and all that other jazz? The truth was, it took only around ten more minutes’ worth of work (only three of which were spent actually coding) until I had a fully-trained classifier that I was well pleased with.

So believe it or not, we’re actually on the home stretch! Let me show you what I did during those three minutes of coding.

I had already downloaded my Blue Jay images to five appropriately named folders — one for each species. The following code was enough to tell fastai where my images were, as well as the names of my five Blue Jay species.
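Here’s a sketch of what that looked like (reconstructed rather than copied verbatim from my notebook; the path is a placeholder, and the lambda simply reads each image’s parent folder name to get its label):

```python
from fastai.vision import *

path = Path('data/jays')                       # placeholder; one sub-folder per species
fnames = get_image_files(path, recurse=True)

data = ImageDataBunch.from_name_func(
    path, fnames,
    label_func=lambda fn: fn.parent.name,      # folder name doubles as the species label
    valid_pct=0.2,                             # hold out ~20% of images for validation
    ds_tfms=get_transforms(),                  # default augmentations (more on these below)
    size=512,                                  # resize/crop everything to 512x512
).normalize(imagenet_stats)
```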

I take responsibility for the lambda. fastai also supports (way more concise) regex patterns, but alas, I didn’t really know regex and the official Python 3 tutorial was, shall we say, less than accessible. (Edit: I should have tried this site.)

Next I did a quick sanity check to make sure that fastai and I were on the same page about the images that were being used.
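In fastai that’s a one-liner (plus a print statement for good measure):

```python
# Show a small grid of labelled training images, and list the class names
# fastai inferred, to confirm the data was read the way I intended.
data.show_batch(rows=3, figsize=(8, 8))
print(data.classes)
```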

All five Blue Jay species represented, and all pics contain birds. Good!

When creating an ImageDataBunch object, fastai automatically resizes and, if necessary, crops all my images into squares of the specified size (512x512 pixels in my case).

Image Augmentations and Why They Matter

Pay special attention to the ds_tfms=get_transforms() parameter. This tells fastai to randomly apply a pre-defined set of image augmentations to my training images. fastai’s default augmentations, which include a range of image flips, zooms, rotations, and brightness adjustments, happen to work well for side-facing images like my Blue Jay images.

In principle, image augmentations make it possible to train a deep learning model on a relatively small imageset without overfitting (discovering patterns that only apply to the training set, but don’t generalize). Applying image augmentations means that my model isn’t only trained on the original set of training images, but also on new images that are modified (flipped, zoomed, rotated, etc.) versions of the original training images. The key is that these modifications are drastic enough so that my model will see the modified images as if they were brand new images, but not so drastic that they would impair my model’s ability to learn the general patterns that tell one Blue Jay species apart from the next.
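For the curious, fastai v1’s defaults, spelled out, look roughly like the call below (the values are what I believe the library ships with, so the bare get_transforms() call used earlier should be equivalent):

```python
# Roughly equivalent to the bare get_transforms() call used earlier.
tfms = get_transforms(
    do_flip=True,       # random horizontal flips
    flip_vert=False,    # no vertical flips (birds are rarely photographed upside down)
    max_rotate=10.0,    # rotate by up to +/- 10 degrees
    max_zoom=1.1,       # zoom in by up to 10%
    max_lighting=0.2,   # brightness/contrast jitter
    max_warp=0.2,       # slight perspective warping
)
```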

Defining a CNN in One Line of Code

Now it was time to code up my actual convolutional neural network. This was fully accomplished in one line of code.
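It looked more or less like this (a sketch using fastai v1’s API; error_rate is the metric the course tracks, so that’s what I show here):

```python
# A ResNet50 backbone pre-trained on ImageNet, plus a new randomly-initialised
# head that outputs one probability per jay species in `data`.
learn = create_cnn(data, models.resnet50, metrics=error_rate)
```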

Yep, that’s it.

In fastai, trainable neural nets are contained inside Learner objects. In the above line of code, I employed transfer learning and defined a CNN that’s built on a ResNet50 base that contains pre-trained weights. fastai then attaches a fully-connected layer that will output a set of five probabilities. These are the probabilities that a particular bird’s image belongs to each of the five Blue Jay species. The species with the highest probability will be the species my model actually predicts as being depicted in a given photo.

Transfer learning basically means that I build a really shallow neural network of my own and attach it to a really big deep neural network that’s already been trained on an extensive dataset. As long as the problem I’m solving isn’t too different from the problem that the “really big deep neural network” was trained to solve, I can expect the performance and expertise of the pre-trained network to transfer well to the problem I’m trying to solve.

In my situation here, ResNet50 is merely the name of a really big deep neural network that was trained to classify images in the ImageNet imageset. ResNet50 is known to perform pretty well at this task, and because ImageNet contains images of many side-facing objects, including plenty of birds, I can expect that ResNet50 already knows enough about birds that I can save a lot of time by building on top of it, rather than building a brand new deep network from scratch.

Finding the Learning Rate

Notice that so far I haven’t tuned a single hyperparameter. fastai chooses default values that are based on best practices that work for the majority of common image classification tasks. All I need to do is find a good learning rate. This is accomplished with the lr_find() method, which runs a short mock training pass while recording and plotting the loss across a range of possible learning rates. The rule of thumb is to choose a learning rate from the part of the graph where the loss is decreasing most steeply, just before the minimum is reached. I judged this point to be at 5e-3 in the graph that follows just below.
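In code:

```python
# Run fastai's learning-rate range test, then plot loss against learning rate.
learn.lr_find()
learn.recorder.plot()
```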

Looks like 5e-3 will make a nice learning rate.

When Simple and Straightforward is Best

After finding my learning rate, it’s now time to train my model. First, another quick aside: notice that fastai automatically handles the train-validation split, as well as the choice of loss-function. I’d manually go in and set the type of loss-function if I were trying to solve a different kind of problem (say, image segmentation), but for image classification, fastai’s default will work fine for me.

The only other reason I might spend some time fiddling with these settings (say, use n-fold cross-validation or a custom loss function — which would be doable, since fastai is built on PyTorch 1.0) is if I were trying to climb up a kaggle leaderboard. But real talk: if I’m just trying to build a production-ready model, having a rock-solid and straightforward implementation that works well enough to meet my users’ needs is always way more important to me than eking out hundredths or thousandths of a point of improvement on some metric.

Training with One-Cycle Learning

Anyways, enough on model philosophy, now it’s time to train!
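The training step itself is one more line, using the learning rate picked from the plot above:

```python
# One cycle of 10 epochs, training only the new head (the backbone stays frozen).
learn.fit_one_cycle(10, max_lr=5e-3)
```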

Thank you Leslie Smith.

You may be wondering what fit_one_cycle is. The short answer is that it’s the thing that’s going to save you money on AWS bills. The slightly longer answer is that it’s a training technique developed by Leslie Smith, a research scientist at the US Naval Research Laboratory, whereby learning rate and momentum vary over the course of one training cycle, with the former increasing and then decreasing, and the latter varying inversely. Adjusting learning rate and momentum in this way allows the model to find a minimum of the loss function much more quickly. Leslie Smith refers to this as super-convergence. If you’re not feeling up to reading the full paper on arXiv, Sylvain Gugger wrote up a very accessible explanation here.

When training a model like the one I’ve built here, fastai by default freezes the weights of all lower layers, leaving only the weights of the top-most layer free to be trained. This makes sense because run-of-the-mill image classification problems will most likely involve the classification of objects that are represented in the ImageNet imageset (or are at least similar to objects whose images are included). Indeed, as I mentioned earlier, since my network’s backbone is a ResNet50 network that’s already been trained to classify images in ImageNet, I can conclude that these layers are already slightly “familiar” with Blue Jays. It’s thus no surprise to me that after only spending ten epochs training my network’s top layer, which had not been pre-trained, I was already able to arrive at a validation error rate of just over 0.09.

Plotting a confusion matrix shows me that my model’s ability to distinguish Blue Jay species was now nearly identical to my own: it could tell most species apart, but it struggled with the difference between California Scrub-Jays and Woodhouse’s Scrub-Jays.
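fastai’s interpretation helper makes this a two-liner:

```python
# Build an interpretation object from the validation set and plot
# which species get mistaken for which.
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
```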

Those pesky Woodhouse’s Scrub-Jays…

I decided to see if I could further improve my model. In order to do this, it was time to unfreeze my network’s lower layers and see if I could fine-tune them enough so that they could (unlike me) tell the difference between a California Scrub-Jay and Woodhouse’s Scrub-Jay.
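Unfreezing, and then re-running the learning rate finder before fine-tuning, looks like this:

```python
# Make the pre-trained ResNet50 layers trainable again, then repeat
# the learning-rate range test to pick rates for fine-tuning.
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()
```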

Time to unfreeze and fine-tune.

After using lr_find() once more to find a nice range of learning rates (where loss is still decreasing) to use to fine-tune my network’s lower layers, I trained for a final cycle of ten epochs.
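The exact slice endpoints below are illustrative rather than the literal values from my notebook, but the shape of the call is the standard fastai v1 pattern: the earliest layers get the smallest learning rate and the head gets the largest.

```python
# Fine-tune the whole network for 10 more epochs with discriminative learning rates.
learn.fit_one_cycle(10, max_lr=slice(1e-5, 1e-4))
```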

5% error rate… not bad!

The error rate for predictions on the validation set decreased to nearly 0.05! Given that I was training on only around seventy-five images per species, I was pleasantly surprised — my initial goal had merely been to get to a 0.10 error rate.

Well, at least it got better at California Scrub-Jays.

The final confusion matrix shows that my model got better at finding more of the Blue Jays and California Scrub-Jays, but still continued to struggle with correctly predicting all of the Woodhouse’s Scrub-Jays. If I were to have continued this exercise, I probably would have tried to curate some more training images at this point.

Wrapping Things Up: Two Lessons Learned

This led me to the first of two final realizations: curating datasets is hugely important.

I hadn’t realized it until I actually tried to create one of my own, but I’d been making the mistake of taking datasets for granted. Whether it was a kaggle competition or a Udacity project, the data had always sorta just been there for me. I never thought much about who made it or how long it may have taken them to do so. The only thing that mattered to me was building a performant model. However, I can now clearly see that I was ignoring the more important half of the solution. The most clever algorithm will solve nothing if there isn’t any data to train it on.

And this leads me to my second and final realization: what datasets have yet to be created? Which problem spaces have yet to be jumped into? With just 500 Blue Jay photos, I’ve quickly built a prototype of a tool that could conceivably enhance the expertise of park rangers or volunteer docents. In this age of continuous budget cuts for national parks and nature reserves, what training resources have been discontinued? How much expertise has been lost? What if park rangers could carry inside their pockets the collective, world-wide human expertise of identifying bird species, encapsulated in a deep learning model inside an app?

On the other side of the same coin, what if park-goers could download the same app and use it to learn how to identify the birds they see, similar to how prior generations might have used Roger Tory Peterson or Audubon bird field guides? Could a product or service like this enhance the experience of nature enthusiasts visiting national parks, similar to how listening to the narrated tour on headphones makes Alcatraz a much more interesting place for me to visit?

I am convinced that there are problem-spaces all around us to which no one has yet thought to apply deep learning. Products and solutions that address these problems will not come down from a guild of deep learning practitioners on high. Rather, it will be when the real people who face these problems every day become aware of deep learning and learn how to apply it that we will begin to see some unimaginable solutions to a host of real-world problems.

My machine learning and deep learning journey almost ended before it ever began back in early January of this year, when I saw the following page on scikit-learn.org. I momentarily and incorrectly assumed I wasn’t “technical” enough to grasp machine learning because I couldn’t immediately understand the mathematical notation that was pasted into articles all over the website.

Pro-tip: that big ‘pi’-looking thing means “multiply.” And no, knowing that didn’t help me get better results with my Naive Bayes classifier.

Now I support math just as much as the next individual, but contrary to the impression given by photos on websites of certain AI startups, you don’t actually have to derive gradient descent from scratch on a whiteboard before you can build a deep learning model that’ll solve a problem you face.

Suffice to say, if I can do this, so can you. And you should. Because the world needs more real people who face real-world problems to begin experimenting with deep learning tools that might just help to solve those problems.

(Please feel free to view the notebook I used to build and train my model. Thanks also to Cornell’s Macaulay Library and to all the birdwatchers who’ve submitted images to its repository. Metadata containing photographer names for each image in my imageset is available in the files in this folder in my GitHub repository for this project.)
