Transfer Learning Explained

Our monthly analysis on machine learning trends

This post was originally sent as our monthly newsletter about trends in machine learning and artificial intelligence. If you’d like these analyses delivered directly to your inbox, subscribe here!

When it comes to technology, the pace of development tends to speed up with time. Ray Kurzweil called this phenomenon the “Law of Accelerating Returns.” We won’t cover technological history or exponential curves in this newsletter, but there’s a useful analogy for what we will cover: transfer learning.

Like most useful technologies, transfer learning allows one innovation to contribute to many others. Transistors, for example, didn’t have just one application: they were used in radios, calculators, and computers. Why shouldn’t the same concept apply to machine learning models? If you build a really effective image recognition model on a million photos of cats, you might wonder if the model could be useful for something else, maybe even something like medical diagnoses. Transfer learning makes these strange, almost miraculous cross-applications possible.

In fact, having used an analogy above, it’s worth pointing out that transfer learning is itself basically machine learning by analogy. Train a model on one set of data, then apply that model’s insights to a new task with a new set of data (or the same task with a new set of data, or the same data with a new task) — that’s transfer learning in a nutshell. As with a good analogy, the better the correspondence between the two things you’re transferring across, the more successful the result is likely to be. Here we’ll take a closer look at how transfer learning works, why it sometimes fails, and when it makes sense to use it.

Transfer Learning: Making Less Data Cool Again

Imagine you’re trying to build a deep learning model but don’t have much training data. Maybe you’re trying to identify a rare skin disease and only have 100 images. Meanwhile, someone else has trained an image recognition model on 100,000 labeled photos of dogs and has managed to get 96 percent accuracy at classifying different breeds. These tasks don’t seem related, but that doesn’t mean the dog breed classifier is irrelevant.

Because the dog breed classifier was trained on a huge number of images, the weights in its different layers are well tuned to identify all kinds of useful visual features: edges, shapes, and different intensities of light and dark pixels, to name a few. A number of these features might even help you identify the rare skin disease. To find out, you’d just need to remove the final layer of the dog breed classifier (the one that selects which breed has the highest probability given the input) and replace it with a new layer that makes a binary decision as to whether or not the skin disease is present.
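The swap described above can be sketched in a few lines of plain Python. This is a toy, not a real network: the tiny "feature extractor" and the 0.5 threshold are arbitrary stand-ins for pretrained convolutional layers and a trained output layer.

```python
# Toy sketch of the transfer-learning recipe: keep the pretrained
# feature layers, throw away the old head, attach a new one.
# All functions here are illustrative placeholders, not a real model.

def feature_extractor(pixels):
    """Stand-in for the pretrained layers: turns raw pixels
    into a small feature vector (an 'edge' score and brightness)."""
    edges = sum(abs(a - b) for a, b in zip(pixels, pixels[1:]))
    brightness = sum(pixels) / len(pixels)
    return [edges, brightness]

def breed_head(features):
    """The original final layer: picks one of many dog breeds."""
    return "breed_%d" % (int(sum(features)) % 120)

def disease_head(features):
    """The replacement final layer: a binary yes/no decision.
    The 0.5 threshold is an arbitrary placeholder, not a trained value."""
    return features[0] > 0.5

# Transfer learning in miniature: reuse feature_extractor unchanged,
# but route its output to the new head instead of the old one.
image = [0.1, 0.9, 0.2, 0.8]
print(disease_head(feature_extractor(image)))
```

In a real framework you’d freeze the pretrained layers’ weights, replace the final layer, and then train only the new layer (or fine-tune the whole network at a low learning rate) on your 100 images.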

An illustration of the basic flow of transfer learning

Promising Research Topic or Commercial Powerhouse?

There are many promising applications of transfer learning across research domains, for both computer vision and natural language processing (NLP) tasks. Still, you might be wondering whether it can really be used effectively in commercial settings. The short answer is that while it’s not a commercial powerhouse yet, it may well become one in the next few years.

Sebastian Ruder’s graph showing the future trajectory of ML in industry (with insight attributed to Andrew Ng’s 2016 NIPS tutorial)

So what’s a possible current commercial application? If you’re a consumer-facing company, you probably have a huge amount of data about different types of customers. But imagine you’ve now got a new group of customers using your product. Maybe you’ve expanded to a new country or built out new product features that are attracting a novel demographic. The behaviors of this new customer group are unlikely to be fully accounted for by the data you’ve collected previously.

Nevertheless, building a model with all the data you do have could still be beneficial, even if what you’re trying to predict about these new customer behaviors is unique. In other words, you can solve the cold-start problem (not having any labeled data about the new group of customers initially) by leveraging the previous customer data you’ve collected.
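One minimal way to picture this cold-start idea: treat the rate learned from your existing customers as a prior, then blend in labeled examples from the new segment as they trickle in. The numbers and the smoothing weight below are made up for illustration.

```python
# Hedged toy: old-customer data as a prior for a brand-new segment.
# Labels might mean "churned or not"; all values are invented.

old_labels = [0, 1, 0, 0, 1, 0, 1, 0]      # plentiful historical data
prior = sum(old_labels) / len(old_labels)  # rate learned from old customers

def cold_start_estimate(new_labels, prior, prior_weight=8):
    """Smoothed average: with no new data, fall back entirely on the
    prior; as new-customer labels arrive, the estimate shifts toward
    them. prior_weight (arbitrary here) controls how sticky the prior is."""
    n = len(new_labels)
    return (sum(new_labels) + prior_weight * prior) / (n + prior_weight)

print(cold_start_estimate([], prior))         # no new data: equals the prior
print(cold_start_estimate([1, 1, 1], prior))  # early labels pull it upward
```

Real systems would transfer a full model rather than a single rate, but the shape of the trade-off is the same: old data carries you until the new segment’s own data can take over.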

Returning to That Promising Research Aspect

Much of the current research into transfer learning will eventually have downstream commercial applications, so it’s worth briefly taking a look at a recent example.

In NLP, a specific type of transfer learning has become integral to most tasks over the last few years. Word embeddings are numerical representations (technically, dense vectors of real numbers) of words. These representations are usually trained on a large corpus of text (e.g., Twitter or Wikipedia) and can then be used for a variety of other NLP tasks (essentially, you just end up replacing each word in your input with its embedded numerical representation). This is useful because words that show up in similar contexts across the original corpus tend to be grouped together in the embedded space, in turn providing semantic and syntactic information.
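The "grouped together in the embedded space" idea is easy to see with cosine similarity. The three-dimensional vectors below are invented for illustration; real embeddings are trained on a large corpus and typically have hundreds of dimensions.

```python
import math

# Toy embedding table (made-up values). In practice these vectors
# come pretrained from a corpus like Wikipedia, not hand-written.
embeddings = {
    "cat": [0.9, 0.1, 0.2],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.7],
}

def cosine(u, v):
    """Cosine similarity: close to 1 for vectors pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Words that appear in similar contexts sit close together:
print(cosine(embeddings["cat"], embeddings["dog"]))  # relatively high
print(cosine(embeddings["cat"], embeddings["car"]))  # lower

# "Using" the embeddings downstream: swap each word for its vector.
sentence = ["cat", "dog"]
vectors = [embeddings[w] for w in sentence]
```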

Earlier this year, Google released a paper that set out to accomplish a similar outcome, but for sentences and phrases instead of words. This was already possible to some degree thanks to paragraph vectors, but the Google researchers specifically wanted to build models that targeted transfer learning tasks, and which could be used in a wide variety of settings (even going so far as to call their sentence encodings “universal”). Sentence encodings that work across multiple contexts and domains would be a huge boon for NLP research, which has so far proven slightly less amenable to transfer learning than computer vision.
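To make the jump from word to sentence representations concrete, here is the crudest possible baseline: averaging a sentence’s word vectors. This is emphatically not what Google’s encoder does (that is a trained model); the two-dimensional table below is invented purely to show what a fixed-size sentence vector looks like.

```python
# Crude baseline "sentence encoding": the average of word vectors.
# The embedding table is made up; real sentence encoders are trained
# models, not simple averages.
embeddings = {
    "transfer": [0.6, 0.1],
    "learning": [0.5, 0.2],
    "pizza":    [0.0, 0.9],
}

def encode(sentence):
    """Map a sentence to one fixed-size vector by averaging each
    dimension across its words' embeddings."""
    vectors = [embeddings[w] for w in sentence.split()]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

print(encode("transfer learning"))  # one vector, regardless of length
print(encode("pizza"))
```

Averaging throws away word order, which is one reason trained sentence encoders (and, earlier, paragraph vectors) exist: they aim to produce a single vector that preserves more of the sentence’s meaning.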

A heatmap from Google’s Universal Sentence Encoder showing similarity scores between sentences

What This Means For You

Transfer learning isn’t necessarily a panacea for all your machine learning challenges. Still, if you’re trying to solve a new problem (or expanding an old problem to a new domain) and don’t have much data, it’s pretty much the best option currently out there.

Even in instances where you’ve collected a decent amount of data, the amount of time and effort involved in labeling enough data to train a model from scratch may be prohibitively expensive. So long as you’re attending to the possibility of negative transfer, there’s little risk in trying a transfer learning approach in these instances. In fact, it might even end up saving you a substantial amount of time or help you solve an otherwise intractable problem.