Using Transfer-Learning to Detect Yoga Poses

Nhoral
7 min read · Jan 31, 2019


Photo by Juliette Leufke on Unsplash

Disclaimer: This article is more about a practical use for Transfer Learning, and not an in-depth exploration of neural networks. If any of this interests you, I highly recommend taking Jeremy Howard’s amazing course — Practical Deep Learning for Coders.

Exercises tend to have very distinctive poses, which makes them an excellent use-case for image classification. Yoga is a particularly easy classification problem (my favorite kind) because we aren’t evaluating motion.

Transfer Learning

Training a model from scratch to recognize Yoga poses is certainly doable, but do I have the time and patience for that? Do you? Do any of us, really?

Why not pilfer the work of others and utilize Torchvision? The PyTorch vision library has done the work of translating state-of-the-art architecture into easy-to-use PyTorch modules.

Specifically, we can use a model architecture called ResNet34. The name is a shortened version of Residual Network, a type of architecture that is good at resolving image features and trains well. The 34 refers to the number of layers in the network. It’s a good size to train fast and still get great results.

If I was willing to pay more than 50 cents per hour for my GPU, we could choose a larger architecture and get a superior model. Alas, I am very cheap.

By using the Torchvision model, we have a great architecture for our problem, no assembly required (yet). Even better, since we aren’t constructing our own kooky architecture, we can load pre-trained weights. A pre-trained model is initialized into a known state, borrowing the hours upon hours of training someone else already put into their model.

Thankfully, great and wonderful humans have already trained this architecture on ImageNet, roughly 1.3 million images spanning 1,000 categories. By using these pre-trained model parameters, our model can inherit the knowledge of their world-class classifier.
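As a rough sketch of what that looks like in code, Torchvision hands over the pre-trained weights with a single flag (the variable name here is just for illustration):

```python
import torchvision.models as models

# Download ResNet34 with weights already trained on ImageNet,
# rather than starting from a random initialization.
pretrained_model = models.resnet34(pretrained=True)
```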

Keeping The Good Stuff

The model architecture we are using (ResNet34) is a type of Convolutional Neural Network. This type of neural network processes an image into features, which are resolved from convolutional filters.

From Visualizing Residual Networks: https://arxiv.org/pdf/1701.02362.pdf

The values in these filters are adjusted during training to resolve the features most informative about what you are trying to classify.

Because true angels in mortal form have already trained ResNet34 for an absurdly long time against ImageNet, we can initialize our model into the same state as the model they trained. In practical terms, this means our model is already great at identifying information from images like People, Faces, Limbs, etc.

However, it’s also great at finding things we don’t care about. There’s no sense keeping information around about Dogs, Cats, or a loving embrace. If we get rid of that pesky knowledge about things we don’t care about, we can refine those filters to be even better at what we do care about!
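In raw PyTorch terms, "getting rid of that pesky knowledge" amounts to swapping the final 1000-way ImageNet classifier for a fresh layer sized to our poses. Fast.ai will do this for us automatically later on; this is just a minimal sketch of the idea:

```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet34(pretrained=True)

# Keep the convolutional filters that already resolve general image
# features, but replace the ImageNet classification head with a new,
# randomly initialized layer for our 5 yoga poses.
model.fc = nn.Linear(model.fc.in_features, 5)
```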

Mugging Google Images For Pictures

We have two things going for us. We can utilize Transfer Learning to get a giant head start on classifying Yoga poses. Additionally, I picked a really easy thing to classify — a pose.

The combination of those two things means that we don’t need many images to train. A few hundred in each pose should be sufficient. We can lean on Google to classify our training data for us (by trusting their search algorithm).

Google Image Search for “down dog pose” yoga

The image quality overall is really great, but there are definitely some false positives included.

Down dog pose results that are more complicated / incorrect

Given the quality of the images in general, these exotic or incorrect examples won’t prevent training. Even so, they will hinder training and we’ll see later how to correct the data.

We can take all these image URLs and dump them into a CSV file for each pose:

Each CSV file is just a list of image URLs to download for a pose

Because I know nothing about Yoga, I chose a few poses that came up in the 0.2 microseconds I spent searching:

  • Down Dog
  • (What’s) Up Dog
  • Tree
  • Warrior
  • Low Lunge
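With those pose names as folder labels, fast.ai v1 (the library used throughout this article) can pull everything down from the CSV files. This is only a sketch; the file names, folder layout, and limits are placeholders:

```python
from fastai.vision import download_images, verify_images

poses = ['down_dog', 'up_dog', 'tree', 'warrior', 'low_lunge']

for pose in poses:
    # Each CSV is assumed to contain one image URL per line.
    download_images(f'{pose}.csv', f'data/{pose}', max_pics=300)

for pose in poses:
    # Drop anything that failed to download or can't be opened as an image.
    verify_images(f'data/{pose}', delete=True, max_size=500)
```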

Turning Images Into Training Sausages

To train a model we need some mechanism that can evaluate our inputs against a set of desired outputs and update the model based on its performance. We could make our own, but that seems like a lot of work. Why build a cow when you can get the milk for free?

Instead, we can utilize the CNN Learner from Fast.ai. Not only is this a dead simple way to train a model, it also comes with some handy utilities we will explore later. To use a Learner, we’ll need to provide it a model architecture (which we have) and a DataBunch (which we don’t).

A Fast.ai DataBunch is a hybrid data manager/data loader which encapsulates your training and validation sets (and a lot more that I won’t touch on).

We will train our model in iterations of batches of images, rather than the entirety of all images at once. Like our Learner, a DataBunch has a lot of convenient utility methods built-in. The show_batch method allows us to sanity check our data by visually evaluating a training batch.
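Assuming the downloaded images live in one folder per pose, the DataBunch and the sanity-check batch might be built like this (the path, validation split, and transform settings are illustrative):

```python
from fastai.vision import ImageDataBunch, get_transforms, imagenet_stats

# Build training/validation sets from the folders named after each pose,
# holding out 20% of the images for validation.
data = ImageDataBunch.from_folder(
    'data', train='.', valid_pct=0.2,
    ds_tfms=get_transforms(), size=224
).normalize(imagenet_stats)

# Visually sanity-check a batch of labeled training images.
data.show_batch(rows=3, figsize=(7, 7))
```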

With a DataBunch and our model architecture, we can create a Fast.ai Learner to start training. The create_cnn method will do the heavy lifting of adding the required neural net layers specific to our problem to the core ResNet34 architecture.

I can’t overstate the wealth of wonderful magic that create_cnn is handling for us

We pass error_rate into our create call to specify an additional metric to be printed during training (which we’ll see below).
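Putting it together, creating the Learner is a one-liner with the fast.ai v1 API from early 2019:

```python
from fastai.vision import create_cnn, models
from fastai.metrics import error_rate

# create_cnn grafts a new classification head onto the pre-trained
# ResNet34 body and reports error_rate after each epoch of training.
learn = create_cnn(data, models.resnet34, metrics=error_rate)
```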

Time To Hit The Books

Because we are using a pre-trained model, we want to utilize the existing core architecture, but add on to it for our specific purpose. Those core layers are already tuned to perfection* for the ImageNet problem, which is similar to ours.

However, the new layers we added have random values. They are the undiscovered country: which combinations of ResNet34 features best identify a particular pose still has to be learned. Before we start messing with every value in our model, we can freeze the layers we know are probably pretty close to what we want. This happens by default when you specify that the model is pretrained in the create_cnn call.

If we didn’t do this, the magnitude of how incorrect our new, very dumb layers are currently would cause huge corrections. These large corrections will also impact the core layers, removing some of the information we wanted our model to inherit.

With our pre-trained layers frozen, we can finally start training our pose classifier.
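Training the frozen model is then a single call; the epoch count here is just an example:

```python
# Train only the new head for a few epochs; the pre-trained body
# stays frozen because the model was created from pretrained weights.
learn.fit_one_cycle(4)
```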

Prepare to inversely tie your joy to a number

In just over a minute, we are already at an error rate of 0.33 (or 67% accuracy). Our new layers have rapidly learned to recognize the features of yoga poses. This makes sense, because the architecture we are using is world-class at resolving images into useful features!

You Are No Child Of Mine

Our model is still pretty crappy in 2019 terms; we can definitely do better than 67%. Outside of raw accuracy, we can evaluate the performance of our classification model by looking at a confusion matrix.

It was here I realized low-lunge is probably not a pose, but a characteristic of poses

Not bad for a minute of training, but there is definitely some confusion. Low lunge in particular is very hard to classify correctly.

The lowest hanging fruit to improve performance is to improve the data. We know that we have some incorrect images, so let’s try and fix them. An easy way to find the images that are probably wrong is by looking at which ones had the most incorrect predictions.

Where was the model confident and also wrong?
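Fast.ai’s ClassificationInterpretation wraps both of these views. A minimal sketch, assuming the fastai v1 API:

```python
from fastai.vision import ClassificationInterpretation

interp = ClassificationInterpretation.from_learner(learn)

# Grid of actual vs. predicted pose counts.
interp.plot_confusion_matrix()

# The images where the model was most confident and most wrong.
interp.plot_top_losses(9, figsize=(12, 12))
```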

The Fast.ai ImageCleaner is a widget for Jupyter Notebook that allows us to visually remove or re-classify images. Given a large dataset, manually inspecting each item might be an unrealistic way to fix your data. Even so, this tool allows you to quickly correct the most impactful mistakes.
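Under fast.ai v1, the widget is driven from the top losses of the trained Learner; roughly (the path argument is a placeholder for wherever your images live):

```python
from fastai.widgets import DatasetFormatter, ImageCleaner

# Surface the images the model struggled with most, so the worst
# labeling mistakes can be re-labeled or deleted inside the notebook.
ds, idxs = DatasetFormatter().from_toplosses(learn)
ImageCleaner(ds, idxs, 'data')
```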

With our data better classified we can train again.

71% and getting better, but these are still rookie numbers.

Taking The Training Wheels Off

Remember those layers we froze? Well, it’s about to get very warm, because we want to adjust those values as well. Why mess with values gained from lots of training? Because they are tuned for more general classification than we need.

It’s time to start allowing the model to adjust those features it is capturing to only care about Yoga-related information. That said, we are changing the values of millions of numbers, so it’s good to go slow.

We will use Discriminative Fine-Tuning, which is a fancy way of saying we will train the different layers at different rates. This allows us to keep training our new layers at the same rate, while adjusting our pre-trained layers at a slightly lower rate.
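In code, that is an unfreeze followed by a sliced learning rate, so the earlier layers move more slowly than the new head. The exact rates and epoch count here are illustrative:

```python
# Allow every layer to update, not just the new head.
learn.unfreeze()

# Discriminative fine-tuning: earlier (pre-trained) layers train at a
# smaller learning rate, later layers at a larger one.
learn.fit_one_cycle(10, max_lr=slice(1e-5, 1e-3))
```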

After ten more minutes of training, we are up to an accuracy of 84%. For a classifier we built in 30 minutes, that isn’t bad. I am confident that with better data categorization and training, we could significantly improve that even further.

Test Drive

You can try it yourself here or check out the code on GitHub.
