Transfer Learning — Intuition to Implementation

Skillcate AI
9 min read · Sep 29, 2022


The next big thing in Machine Learning!!

As humans, we have an inherent ability to transfer knowledge across tasks. The knowledge we acquire while learning one task, we reuse to expedite learning new but related tasks. For example:

  • If you already know how to ride a bicycle, you can learn to ride a motorcycle more easily, with some fine-tuning on controlling the engine power, or
  • If you know how to write by hand, you can learn to type on a keyboard, by fine-tuning your brain to learn the locations of the keys.

So, as humans, we do not learn everything from scratch when we attempt to learn a new task or skill. Rather, we transfer and leverage our knowledge from what we have learnt in the past. And that’s precisely what Transfer Learning is, in the context of us humans.

In the context of machines, Transfer Learning means storing the knowledge gained while solving one problem and applying it to a different but related problem. A typical example would be reusing a model trained for the autonomous driving of cars for the related task of autonomous driving of trucks, with some fine-tuning to account for the difference in vehicle size and weight.

Brief on this learning series

Well, this article is actually the first instalment of a three-part learning series, where we shall:

  1. Understand the intuition behind Transfer Learning,
  2. Deep-dive into Google’s BERT Model, which has achieved superhuman performance in language understanding, and finally
  3. Train (fine-tune) a Fake News Detection Model by transferring learning from the pre-trained BERT Model

Now, let’s continue further on this first part, which is, Transfer Learning Intuition.

Watch the video tutorial instead

If you are more of a video person, go ahead and watch it on YouTube, instead. Make sure to subscribe to my channel to get access to all of my latest content.

Machine Learning has come a long way

Based on the exciting work that has happened in the Machine Learning ecosystem over the last decade, we now have the ability to train highly accurate ML models. In fact, we are now at a stage where, for many tasks, state-of-the-art models perform so well that model quality is no longer the bottleneck for users. For instance:

  • ResNet models trained on the ImageNet dataset have achieved superhuman performance at recognising objects,
  • Google’s BERT Model, trained on the whole of English Wikipedia and the BooksCorpus, has also achieved superhuman performance at understanding the English language,
  • Speech recognition error rates have consistently dropped, and speech input is now more accurate than typing
Source: https://towardsdatascience.com/overview-state-of-the-art-machine-learning-algorithms-per-discipline-per-task-c1a16a66b8bb

And really, the list goes on. This level of maturity has enabled the large-scale deployment of these models to millions of users and their widespread adoption. The majority of these superhuman models are publicly available today, allowing small businesses to bootstrap their big ideas and aspiring data scientists like you to get hands-on and learn.

As a next step, there is a genuine need to efficiently carry these models’ knowledge into lightweight applications. Hence, Transfer Learning is attracting bright minds to solve the challenges ahead.

Why is Transfer Learning considered the next ML frontier?

Andrew Ng, the renowned Stanford professor, AI scientist and co-founder of Coursera, delivered a famous tutorial at the 2016 NIPS conference (Neural Information Processing Systems), where he predicted that, after supervised learning, transfer learning would be the next driver of machine learning’s commercial success.

Well, it is indisputable that the use of Machine Learning in industry has so far been driven mostly by supervised learning. However, supervised learning has its limitations: it demands massive amounts of labeled data, which is expensive in both time and compute resources.

And not just that: large machine learning models also have an environmental impact. This gives us enough reason to democratize the world of Machine Learning by sharing large pre-trained models in the open source and reusing them, resulting in reduced overall compute cost and carbon footprint.

Source: https://huggingface.co/blog/bert-101?text=Earth+can+be+saved+if+humans+%5BMASK%5D.

Now that we have established that Transfer Learning is the need of the hour, let’s try to understand how it actually works.

How does Transfer Learning work?

In the classic supervised learning scenario of machine learning, if we intend to train a model for some Task A, we assume that we are provided with labeled data for that task. We then train a Model A on this dataset and expect it to perform well on unseen data. Let’s say the task here is to detect pedestrians in day-time images.

Source: https://ruder.io/transfer-learning/index.html#fn1

On another occasion, for some other Task B in the same domain of Object Detection, we again require labeled data to train a new Model B. Let’s say the task this time is to detect pedestrians in night-time images. Here, this traditional supervised learning paradigm breaks down if we do not have sufficient labeled data to train a reliable Model B.

With Transfer Learning, we can instead reuse the knowledge gained while solving Task A (detecting pedestrians in day-time images) and apply it to Task B (detecting pedestrians in night-time images), requiring only a small dataset for the re-training.

High-level architectural flow

As an example here, let’s say we have previously trained a CNN model for Task A, which is, vehicle image classification into these multiple categories: car, truck, bicycle, etc. And now, we intend to build a model to just predict the binary classes Car & Truck as Task B.

Source: https://medium.datadriveninvestor.com/introducing-transfer-learning-as-your-next-engine-to-drive-future-innovations-5e81a15bb567

From our understanding of CNNs, the early layers typically learn to detect edges, the middle layers learn shapes and forms, and the final layers learn task-specific features; these final layers are also called the model head.

In Transfer Learning, the early and middle layers from Model A are reused as-is, and only the final layer, the model head, is re-trained or fine-tuned. Because the earlier layers have already learned to recognise objects, we simply retrain the later layers to understand what distinguishes a Car from a Truck. This way, we make use of the knowledge the model gained from the labeled data of source Task A.
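To make this concrete, here is a minimal sketch of the idea in Keras; the framework, the MobileNetV2 backbone and the layer sizes are illustrative assumptions on my part, not something the flow above prescribes. We load a pre-trained CNN, freeze it, and attach a fresh head for the binary Car vs. Truck task.

```python
import tensorflow as tf

# Pre-trained backbone (Model A's early and middle layers), without its original head.
# MobileNetV2 is just an illustrative choice; any ImageNet-pretrained CNN works the same way.
base_model = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=False,        # drop the original 1000-class ImageNet head
    weights="imagenet",
)
base_model.trainable = False  # freeze: keep the learned edge/shape detectors as-is

# New head for Task B: binary Car vs. Truck classification.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = base_model(inputs, training=False)               # keep BatchNorm statistics frozen
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # single unit: P(truck)
model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
# model.fit(car_truck_dataset, epochs=5)  # only the new head's weights get updated
```

Only the new head’s weights change during training here, which is exactly why such a model can learn from a small Car/Truck dataset.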

Why does everyone love Transfer Learning?

Transfer Learning offers a number of advantages, the most important of which are reduced training time, improved performance (in most circumstances), and not needing a large amount of labeled data.

The best part is that a highly accurate model can be generated with fairly little training data using transfer learning, as the model is already pre-trained on a huge source dataset.

Brief on — Types of Transfer Learning

There are different transfer learning strategies and techniques, which can be applied based on the domain, the task at hand, or the availability of data. These are: Inductive Transfer Learning, Transductive Transfer Learning and Unsupervised Transfer Learning.

  • In case the source and target domains are the same, yet the source and target tasks are different from each other, we use Inductive Transfer learning
  • If there are similarities between the source and target tasks, but the corresponding domains are different, then we use Transductive Transfer Learning. In this setting, the source domain has a lot of labeled data, while the target domain has none.
  • Unsupervised Transfer Learning is also very similar to inductive transfer, with a focus on unsupervised tasks in the target domain. Here, the source and target domains are similar, but the tasks are different.

Major applications of Transfer Learning

Transfer Learning has numerous use-cases across NLP, CV & Speech Recognition:

  • NLP is one of the most appealing transfer learning applications, as it solves cross-domain tasks by leveraging the knowledge of pre-trained models that understand linguistic structure. Deep learning models such as BERT, the Universal Sentence Encoder, etc., are used in everyday NLP tasks like next-word prediction, question answering and machine translation
  • In computer vision, transfer learning is commonly used in image recognition, object detection, image noise removal and other image-related tasks, since all of these require the same basic ability to detect patterns in familiar images
  • And in speech recognition, when we say “Alexa” or “Hey Google!”, an AI model originally developed for English speech recognition is busy in the backend processing our commands.

By the way, in the next part of this learning series, I’ll be demonstrating the insane capabilities of the NLP BERT Model to you, which has shown super-human performance in its language understanding. I’m telling you, it will freak you out.
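As a small teaser ahead of that tutorial, a pre-trained BERT can be queried in just a few lines with the Hugging Face transformers library. This is only an illustrative sketch; the model name and the example sentence are my own assumptions, not code from the upcoming project.

```python
from transformers import pipeline

# Load a pre-trained BERT with its masked-language-modelling head.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT ranks the most likely words for the [MASK] token, using context from both sides.
for pred in unmasker("Transfer learning is the next big [MASK] in machine learning."):
    print(f"{pred['token_str']:>12}  (score: {pred['score']:.3f})")
```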

Transfer Learning Implementation

Finally, now let’s go over how transfer learning works in practice.

  • The first step is to decide on the pre-trained model to use as the base, depending on the task. For transfer to work, there must be a strong correlation between the knowledge of the pre-trained source model and the target task domain. For example, for the Fake News Detection Model that we shall be building in the third part of this learning series, we require a pre-trained language model with solid linguistic understanding; we shall be using the BERT Model for this task.
  • As step 2, we create the base model, basically the architecture we chose in the first step, and load its pre-trained network weights.
  • As step 3, we freeze the weights of the starting layers from our pre-trained model. If we don’t do this, we lose all of the previous learning.
  • Then we create new trainable layers. Generally, the feature-extraction layers are the only knowledge we reuse from the base model, so to perform our specialized task we must add additional layers on top of them. We also define a new output layer, as the output of the pre-trained model will almost certainly differ from the output we want from our model.
  • As the last step, we fine-tune the model to improve performance. Fine-tuning entails unfreezing a portion of the base model and training the entire model on the new dataset again at a very low learning rate; the low learning rate improves performance while preventing overfitting (a sketch of this step follows the list).
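Putting the last two steps together, here is a minimal sketch of that fine-tuning stage, continuing the illustrative Keras example from earlier; the number of unfrozen layers and the learning rate are assumptions you would tune for your own task.

```python
# Continuing the earlier (illustrative) sketch: once the new head has converged,
# unfreeze only the top portion of the backbone and re-train at a very low learning rate.
base_model.trainable = True
for layer in base_model.layers[:-20]:    # keep the earliest layers frozen
    layer.trainable = False

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),   # ~100x lower than the initial learning rate
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
# model.fit(car_truck_dataset, epochs=3)  # small, careful updates, so the
#                                         # pre-trained features are not destroyed
```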

Conclusion

Well, this completes our first tutorial, an overview of Transfer Learning.

I’m already so excited about the next tutorial, where I’ll introduce you to the pre-trained BERT model, which was developed by researchers at Google in 2018 and has beaten human benchmarks in comprehending the English language. We shall be using this pre-trained BERT Model to build a Fake News Detection Model with Transfer Learning in the third part of this series.

Brief about Skillcate

At Skillcate, we are on a mission to bring you application-based machine learning education. We launch new machine learning projects every week, so make sure to subscribe to our YouTube channel and hit that bell icon to get notified when our new ML projects go live.

Shall be back soon with a new ML project. Until then, happy learning 🤗!!
