Part 2: Deep Learning for Engineers

Introduction to Transfer Learning

Published in

Bumble Tech

6 min readMay 7, 2020

In this article, we use the concepts we saw in part 1 article to unleash the power of neural networks and the reason for their current popularity, namely their reusability. We learn how we can (partially) automate the feature engineering part and how this opens a new world of possibilities and applications that go way beyond classification tasks. We will reuse all the concepts we have learned in this article.

In particular, we will explore:

Feature Extraction: how neural networks extract features from inputs;
How to use the embedding, which is how NN encode the input
Glove, a library for processing text that has some pre-trained embedding for use.

This will give you a practical understanding of how transfer learning can be used to solve a Natural Language Processing problem without the need for a massive amount of training data.

Neural Networks vs Classic Machine Learning

There is a fundamental difference between the way humans learn and the way machines do. Our clear advantage as humans is our ability to transfer knowledge between different domains. We have techniques that make our models learn tasks, such as image recognition, but generalisation is still a challenge. In this article, we see how we can get one step closer to algorithmic generalisation.

Introduction to Convolutional Neural Networks

Convolutional Neural Networks are a network architecture that relies on local proximity of related features. This way we can work on a small part of the feature set at the time because distant features will not be relevant. For example, if you want to recognise a person by their picture we know that all features that describe a face will be close together (eyes, nose, mouth etc.) therefore we only need to compute a small portion of the image at the time. You can find more details in our previous post (link).

The CNN is smaller than a fully-connected network, as it does not need to find the weights connecting inputs that are distant to each other, but it can still require a lot of time to train.

Two types of Transfer Learning

As we have said in the previous post, hidden layers of neural networks can be used to automatically build features on top of each other. Automatically building features from raw data in this way is quite efficient. For example, if we want to build an image classification algorithm that classifies different types of artwork, we don’t need to hand-craft the features; we can outsource this process to the network.

This also has the advantage of creating a different and compact representation of the main features that are important in our task. These can potentially be used to algorithmically generate new instances of the object of our classification. In our example, we would be able to generate realistic artwork.

Multi-task learning

Multi-task learning (MTL) is a type of algorithm in which a model is trained on different tasks simultaneously. In this way, the network creates weights that are more generic and not tied to a specific task. In general, these models are more adaptable and flexible.

With this task, we will have a few initial shared layers.

The assumption is that training a network on different tasks will enable the network to generalise to more tasks, maintaining a good level of performance. It will achieve this by creating some layers that are more generic.

Feature extraction

Another simpler but usually less effective way of doing TL is to use a network trained on a specific task as a feature extractor. Choosing this method, the feature we will be highly dependent on the task.

But we also know that the features created in different layers follow a hierarchical structure that will learn a high-level representation of the image in 3 different layers:

Lower layer: features are very low-level, i.e. they are quite generic and simple. Examples of lines, edges, or linear relationships. We saw previously how, with one layer, linear relationships can be described.
Second layer: able to capture more complex shapes, such as curves.
Higher layer: features are high-level descriptions of our inputs. Some of these features might be too specific to one task to be reused, in which case they will need to be discarded, or that layer will simply need retraining.

Example

A very common example of transfer learning is embedding in algorithms like word2vec. In this case, a shallow neural network is trained on a simple task, predicting the likelihood of a word in the light of the words around it. The tasks per se are not important but rather to train the network and use the representation from the network for another task.

Let’s see, for example, how a pre-trained representation can be used to classify new groups by topic. For these tasks, we will use Glove.

import numpy as np
import os
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix

We can decide how long the representation will be: Glove gives four vectors with lengths of 50, 100, 200 and 300.

# Deciding which embedding to use
possible_word_vectors = (50, 100, 200, 300)

We will use the smallest vector representation, with length 50.

word_vectors = possible_word_vectors[0]
file_name = f’glove.6B.{word_vectors}d.txt’
filepath = ‘../data/’
pretrained_embedding = os.path.join(filepath, file_name)

Now, for each word, we find the embedding representation and we index it.

embeddings_index = {}
with open(pretrained_embedding, "rb") as f:
    for line in f:
        values = line.split()
        word = values[0].decode("utf-8")
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

Let’s say we want to classify the text as either atheist or related to science fiction.

# Getting the data
cats = [‘alt.atheism’, ‘sci.space’]

To try a multiclass problem it’s possible to use:

cats = [‘alt.atheism’, ‘talk.religion.misc’, ‘comp.graphics’, ‘sci.space’]

We divide it into a test and training set.

newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
newsgroups_test = fetch_20newsgroups(subset='test', categories=cats)
X_train = newsgroups_train['data']
y_train = newsgroups_train['target']
X_test = newsgroups_test['data']
y_test = newsgroups_test['target']

And now we can see the performances:

class EmbeddingVectorizer(object):
"""
Follows the scikit-learn API
Transform each document in the average
of the embeddings of the words in it
"""
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.dim = 50
    
    def fit(self, X, y):
        return self
    
    def transform(self, X):
        """
        Find the embedding vector for each word in the dictionary
        and take the mean for each document
        """
        # Renaming it just to make it more understandable
        documents = X
        embedded_docs = []
        for document in documents:
            # For each document
            # Consider the mean of all the embeddings
            embedded_document = []
            for words in document:
                for w in words:
                    if w in self.word2vec:
                        embedded_word = self.word2vec[w]
                    else:
                        embedded_word = np.zeros(self.dim)
                        embedded_document.append(embedded_word
                        embedded_docs.append(np.mean(embedded_document, axis=0))
        return embedded_docs

Now we can transform each document in the embedding average of its word.

# Creating the embedding
e = EmbeddingVectorizer(embeddings_index)
X_train_embedded = e.transform(X_train)
# Train the classifier
rf = RandomForestClassifier(n_estimators=50, n_jobs=-1)
rf.fit(X_train_embedded, y_train)
X_test_embedded = e.transform(X_test)
predictions = rf.predict(X_test_embedded)

We can then check the Area Under the Curve (AUC) and the decision matrix. The AUC tells us more about our binary classifier: an AUC of 0.5 means that the results are random; the maximum score is 1, and so this score of 0,74 is a decent result.

The confusion matrix tells us how many True Positive (224) and True Negative (306) we had. They are the correct classifications that our algorithm made, while the False Positive (88) and False Negative (95) constitute the mistakes.

print('AUC score: ', roc_auc_score(predictions, y_test))AUC score: 0.7405204936377006
confusion_matrix(predictions, y_test)
Out[268]:
array([[224, 88],
[ 95, 306]])

Conclusion

Despite this being a simple approach we achieved some very acceptable performances, with an AUC of 0.74 and a decent confusion matrix, without having to train a model from scratch — all thanks to transfer learning.

The real power of neural networks is that we manage to automate at least part of the feature engineering. This allows us not only to re-use networks for different tasks but it also opens a lot of additional possibilities that we are only just beginning to discover.

It is one solid step further towards Artificial General Intelligence, systems that learn and are able to transform the knowledge they acquire to different tasks. There is a long way to go to achieve Artificial General Intelligence but we have made a huge step forwards in recent years and transfer learning, the topic of our next post, is a key part of that.