One Shot Learning Using Keras
Abstract
For a neural network to learn features from images and classify them, we need data, and lots of it. It is difficult for a model to learn from only a few samples per class.
The MNIST dataset, for example, has nearly 60,000 training images for the digits 0–9 (10 classes).
We will implement one-shot learning to build a model that makes correct predictions given only a single example of each new class.
Background
As humans, when we are presented with a new object, we quickly pick up its patterns, shape, and other features. When we encounter the same kind of object in the future, we recognize it immediately. We can all relate to such instances. Suppose you saw a blender for the first time in your friend’s kitchen. Later, when you see one in a store, you quickly recognize that it is a blender (regardless of brand) and not some other appliance.
Even though a person has seen an object only once in their life, they can still differentiate that object from others. This is known as one-shot learning.
To quote from this paper: http://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf
“One particularly interesting task is classification under the restriction that we may only observe a single example of each possible class before making a prediction about a test instance. This is called one-shot learning.”
Image Classification vs One shot Learning
To understand the difference, consider a face recognition system designed for a company with 10 employees. With the traditional CNN approach, we would collect enough images for each person (class) and fit a classification model that precisely predicts the person given an image. When implementing face verification, where we match a person’s face against the one in the database, we need to make sure that both images are of the same person.
This approach has a few issues:
- If a new employee joins the firm, we need to go through the same process again and re-train the model. The same applies when an employee leaves the company.
- As the number of employees increases, the dataset grows and it becomes harder to fit the model.
- We may not get sufficient samples for each class. Realistically, a new employee may only provide 1–2 images when joining.
One-shot learning tackles these issues.
Given a pair of images, the model predicts the degree of similarity between them.
Rather than predicting the class of an image, we predict the degree of similarity between pairs of images. This is done by representing each image as an N-dimensional embedding vector generated by the model. When these embedding vectors are projected into 2-D space, the distance between embeddings of similar images is small. Hence, the model can predict whether two images are similar or not.
In face recognition this is typically done with a 128-D embedding vector.
For further information on face recognition using one-shot learning, watch Andrew Ng’s video.
A few applications of one-shot learning:
- Face Recognition , learn more here.
- Drug Discovery, learn more here.
Problem Statement
Our model is given a tiny labelled training (support) set S, which has N examples, each a vector of the same dimension with a distinct label y.
It is also given x_test, the test example it has to classify. Since exactly one example in the support set has the right class, the aim is to correctly predict which y ∈ S matches x_test’s label.
The problem becomes challenging as N increases: we have to compare our test image against N different images and pick the pair with the highest probability as the correct class.
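At prediction time this amounts to an argmax over the support set. A minimal NumPy sketch, where the per-pair `similarities` are assumed to come from a trained similarity model:

```python
import numpy as np

def one_shot_predict(similarities, support_labels):
    """Pick the support-set label whose image is most similar to the test image.

    similarities: array of N scores, one per (test, support) pair.
    support_labels: the N labels of the support set.
    """
    return support_labels[int(np.argmax(similarities))]

# Toy scores for a 4-way support set:
scores = np.array([0.12, 0.91, 0.33, 0.07])
labels = ["alpha", "beta", "gamma", "delta"]
print(one_shot_predict(scores, labels))  # highest score wins -> "beta"
```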
Dataset
I have used the Omniglot dataset by Brenden Lake.
It contains 1623 different handwritten characters from 50 different alphabets (or languages). Each of the 1623 characters was drawn online via Amazon’s Mechanical Turk by 20 different people.
The Omniglot data set contains 50 alphabets. We split these into a background set of 30 alphabets and an evaluation set of 20 alphabets. Only the background set will be used to learn general knowledge about characters (e.g., feature learning, meta-learning, or hyperparameter inference).
One-shot learning results are reported using alphabets from the evaluation set.
To know more about Omniglot dataset, check out Brenden Lake’s repo here.
The dataset has been divided into background/training data (30 alphabets, 60%) and evaluation/test data (20 alphabets, 40%).
Siamese Network
To implement one-shot learning, we will employ a deep Siamese convolutional neural network.
A Siamese neural network (also called a twin neural network) uses the same weights while working on two different input vectors to compute one output.
Due to weight sharing, a Siamese network converges faster than two separate models would.
As you can see, the network generates a 4096-dimensional feature vector for each input image. The two vectors are merged by computing the distance between them, and the result is fed to a fully connected layer with sigmoid activation, which squashes the output probability into [0, 1].
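A sketch of this architecture in Keras, following the layer sizes from the Koch et al. paper (the exact filter sizes and the component-wise L1 distance used here are one common choice; the author's implementation may differ in details):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_embedding_net(input_shape=(105, 105, 1)):
    """Shared convolutional branch that maps an image to a 4096-d vector."""
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(64, (10, 10), activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(128, (7, 7), activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(128, (4, 4), activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(256, (4, 4), activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(4096, activation="sigmoid")(x)
    return Model(inp, x)

def build_siamese(input_shape=(105, 105, 1)):
    twin = build_embedding_net(input_shape)  # same weights applied to both inputs
    left = layers.Input(shape=input_shape)
    right = layers.Input(shape=input_shape)
    # Component-wise distance between the two 4096-d embeddings,
    # fed to a single sigmoid unit that outputs a similarity in [0, 1].
    dist = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([twin(left), twin(right)])
    out = layers.Dense(1, activation="sigmoid")(dist)
    return Model([left, right], out)

model = build_siamese()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Because the twin branch is a single `Model` called on both inputs, its weights are shared automatically, which is what makes the network Siamese.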
Goal
The model will learn a similarity function that measures the degree of similarity between two images. If they are similar, they belong to the same class; if not, they don’t.
Cost Function
Now that the problem has been reduced to simple logistic regression (similar vs. dissimilar), we will use binary cross-entropy as the cost function to minimize the logistic loss.
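For reference, binary cross-entropy over a batch of similarity predictions p against 0/1 labels y is -mean(y·log p + (1-y)·log(1-p)). A NumPy sketch (Keras provides this as the built-in `binary_crossentropy` loss):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean logistic loss; eps keeps log() away from 0."""
    p = np.clip(y_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y_true = np.array([1.0, 0.0])   # same-class pair, different-class pair
y_pred = np.array([0.9, 0.1])   # model's similarity outputs
loss = binary_cross_entropy(y_true, y_pred)  # about 0.105
```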
Training
You can check the code here.
1. Image Pairs
The training set has 964 classes with 20 samples in each class. The model needs 2 images to predict an output. If we trained our model on every possible combination of image pairs, that would be 185,849,560 possible pairs.
We can reduce the number of image pairs. Since every class has E = 20 samples, there are E(E-1)/2 same-class image pairs per class.
If there are C classes, that makes C * E(E-1)/2 same-class image pairs in total. For 964 classes, there will be 183,160 same-class pairs.
We still need different-class image pairs. The Siamese network should be given a 1:1 ratio of same-class and different-class pairs to train on, so we sample 183,160 different-class image pairs at random.
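The pair counts above can be verified directly; `sample_diff_pair` below is a hypothetical helper showing one way to draw a random different-class pair:

```python
import random

C, E = 964, 20  # classes and samples per class in the background set

# Same-class pairs: E choose 2 per class.
same_per_class = E * (E - 1) // 2            # 190
total_same = C * same_per_class              # 183,160

# All possible pairs over the full 964 * 20 = 19,280 images.
n_images = C * E
all_pairs = n_images * (n_images - 1) // 2   # 185,849,560

def sample_diff_pair(rng=random):
    """Draw (class, sample) indices for two images from distinct classes."""
    a, b = rng.sample(range(C), 2)           # two distinct classes
    return (a, rng.randrange(E)), (b, rng.randrange(E))

print(same_per_class, total_same, all_pairs)
```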
2. Image Augmentation
To achieve higher accuracy and avoid overfitting, we augment the images using the scikit-image library.
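A minimal augmentation sketch with scikit-image; the rotation and shift ranges here are illustrative assumptions, not the exact transforms used in the author's code:

```python
import numpy as np
from skimage.transform import rotate, AffineTransform, warp

def augment(image, rng=np.random):
    """Apply a small random rotation and translation to one grayscale image."""
    angle = rng.uniform(-10, 10)            # degrees
    shift = rng.uniform(-2, 2, size=2)      # pixels (x, y)
    rotated = rotate(image, angle, mode="edge")
    shifted = warp(rotated, AffineTransform(translation=shift), mode="edge")
    return shifted

img = np.zeros((105, 105))
out = augment(img)
```

Applying such transforms at batch time effectively multiplies the number of distinct training pairs the network sees.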
3. Data Generator
The training set contains 2 * 183,160 image pairs. We cannot load all of this data into memory at once, so we use a data generator to load the images and apply augmentation for every batch at runtime.
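A batch generator can be sketched as a class exposing `__len__` and `__getitem__`, the same interface `keras.utils.Sequence` expects (NumPy only here; in the real pipeline `pairs` would hold file paths loaded lazily, and `augment_fn` would be the scikit-image augmentation):

```python
import numpy as np

class PairGenerator:
    """Yields batches of image pairs and 0/1 similarity labels."""

    def __init__(self, pairs, labels, batch_size=32, augment_fn=None):
        self.pairs = pairs
        self.labels = np.asarray(labels)
        self.batch_size = batch_size
        self.augment_fn = augment_fn

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.pairs) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        batch = self.pairs[sl]
        left = np.stack([a for a, _ in batch])
        right = np.stack([b for _, b in batch])
        if self.augment_fn is not None:
            left = np.stack([self.augment_fn(x) for x in left])
            right = np.stack([self.augment_fn(x) for x in right])
        return [left, right], self.labels[sl]

# Toy usage with placeholder "images":
pairs = [(np.zeros((105, 105, 1)), np.ones((105, 105, 1))) for _ in range(10)]
gen = PairGenerator(pairs, labels=[0] * 10, batch_size=4)
(x_left, x_right), y = gen[0]
```

Subclassing `tf.keras.utils.Sequence` with exactly these two methods lets the generator be passed straight to `model.fit()`, with Keras handling shuffling and parallel loading.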
4. Model architecture
This architecture represents Siamese Network.
Testing: N-way one shot Classification
To evaluate this model, we will perform N-way one-shot classification, where N is the number of images in the support set (as discussed above).
There is 1 test image for evaluation and a support set of N images of different classes, exactly one of which belongs to the same class as the test image. The model should predict a high probability for the pair of images which belong to the same class and low probabilities for the remaining N-1 pairs.
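The evaluation loop can be sketched as follows; `score_pair` is a stand-in for calling `model.predict` on one (test, support) pair:

```python
import numpy as np

def n_way_accuracy(trials, score_pair):
    """Fraction of trials where the true support image gets the top score.

    Each trial is (test_image, support_images); by convention the image at
    index 0 of the support set is the one from the same class.
    """
    correct = 0
    for test_img, support in trials:
        scores = [score_pair(test_img, s) for s in support]
        correct += int(np.argmax(scores) == 0)
    return correct / len(trials)

# Toy check: similarity = negative absolute difference of scalar "images".
trials = [(0.5, [0.52, 0.1, 0.9, 0.3])]  # support[0] is the true match
acc = n_way_accuracy(trials, lambda a, b: -abs(a - b))
print(acc)  # 1.0
```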
Visualizing Support Set
4-Way one shot classification
10-Way one shot classification
20-Way one shot classification
As shown above, when the number of images in the support set increases, uncertainty increases, and the model is more likely to predict the wrong class.
The Code [Updated: 24-07-2020]
https://github.com/asagar60/Siamese-Neural-Networks-for-One-shot-Image-Recognition