Multimodal Meta Learning with Siamese Network — Metric Space Method!

Juhi Purswani
tech@iiit-gwalior
Jun 12, 2020

Understanding metric space methods of meta learning and experimenting with them on multimodal data!

The conventional notion of deep learning is to train a model on a huge amount of a single type of data to complete a specific task, be it recognizing a speaker from audio or identifying a person from their facial features. But do we humans process information in a similar manner?

We see visuals, hear sounds, and read text to understand the world around us. We identify a person by face, voice, physique, etc. Our brain relates these different information sources and lets us learn new tasks even from a handful of examples. Hence we interact with the world in a multimodal manner and can learn new things with very little data; we don’t really need huge chunks of data for every task.

How can deep learning architectures learn to understand the world in a more human way?

The idea is to train neural nets that can incorporate multiple modalities and can make predictions with small amounts of data. To perform this task, a metric-based method of meta learning can be used: the Siamese Network!

Let’s understand every concept in some detail! But before diving further, do check these prerequisites:

  • Basic pre-processing techniques for audio data
  • Classification techniques with deep learning models

Multimodality Learning — Human Inspired Learning:

As stated, we interact with the world in a multimodal manner. Each of these channels of interaction (text, image, audio) is called a modality.

Combining these heterogeneous information sources to train neural nets is referred to as multimodal learning. Here, instead of labelled uni-modal input data, tuples of data from different modalities are used. Hence the training dataset can be described as follows:

where m is the number of combined modalities, L is the number of classes, and K is the number of multimodal data points (tuples).
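As a rough sketch of this notation (the exact formulation in the original figure may differ), the dataset can be written as

D = \{\, ((x^{(k)}_1, x^{(k)}_2, \ldots, x^{(k)}_m),\ y^{(k)}) \,\}_{k=1}^{K}, \qquad y^{(k)} \in \{1, \ldots, L\}

where x^{(k)}_j is the k-th sample of the j-th modality and y^{(k)} is its class label.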

One method of combining these modalities is the equi-weighted multimodal fusion method.

The method can be implemented with two subtasks:

  • Feature extraction: Using deep learning models to extract the individual features from each modality, as is commonly done with uni-modal data. The architecture of each individual feature extractor can be designed according to the need.
  • Aggregate Network: Using a common neural net architecture, the features extracted in the above step are concatenated. An appropriate activation function is then used to calculate the label probabilities from these fused features.

This method is called equi-weighted multimodal fusion because it gives equal importance to every modality.

Meta Learning:

Undoubtedly, deep learning models have given outstanding results in image, audio and text classification. But if we think about deploying them at an industrial level, things feel quite unrealistic. Let’s say a company wants to use a facial recognition model to identify its team members. What would the deep learning approach be?

Conventionally, it would be to gather various images of the people working in the company and then train the model. But what if a new member joins the company? For that member to be identified by the model, re-training on an updated dataset containing their images would be required. Hence the model is rather static.

Here is where meta learning comes in. Meta learning produces a versatile AI model that can learn to perform various tasks without having to be trained from scratch for each one. It does not learn directly from the data; rather, it learns the process of learning itself: learning to learn!

Metric Space Method:

One method of implementing meta learning is the metric space method. Using this method, a model is trained to identify whether two given data points belong to the same class. For this, deep learning nets are used to extract features from the data, and a similarity score is computed between the pair of data points.

Siamese Network:

It is one of the most widely used metric-space meta learning algorithms. Its objective is to predict whether an input data pair is similar or not.

It comprises two identical neural networks with shared weights and an energy function. The networks individually extract features and generate embedding vectors from the two inputs, and the energy function computes the distance between the extracted embeddings, finally predicting whether the input pair belongs to the same class or not.

Hence, rather than directly learning which data point belongs to which class, this algorithm learns to identify whether two given data points belong to the same class. Considering the facial identification task above, with a Siamese network new team members could easily be identified without any re-training.

Enough of theory, right? So now let’s get our hands dirty! Let’s start by formulating our task and then proceed to the coding part💻.

Our task:

Our task is to combine these concepts of meta learning and multimodal data to train a deep classification network. For the modalities we will be using audio and image data, so each input will be an image-audio pair. Two such image-audio pairs will be provided as input to predict whether they belong to the same class.

Hence the final architecture will be:

  1. Separate neural networks for the image and the audio data to extract their features individually.
  2. These extracted features are fed into a fusion network for aggregation, giving us a multimodal network.
  3. Two copies of this network with shared weights are used to generate the embedding vectors.
  4. The distance between these vectors is calculated using an energy function to predict whether the two input pairs belong to the same class or not.

Dataset:

We will be classifying digits using images of handwritten digits from the MNIST dataset and spoken digits from the Free Spoken Digit Dataset. The audio data consists of sound clips in .wav format and can be downloaded from here.

For the audio part, pre-processing can be done by generating spectrograms of the .wav files. The Python library ‘librosa’ is used for this.
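A minimal sketch of such a helper (the function name and parameters below are my own; the original may differ). With librosa’s default n_fft of 2048, the STFT yields 1025 frequency bins, which matches the (1025, 47, 1) audio input shape used later:

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def wav_to_spectrogram(path, plot=True):
    # Load the .wav file, keeping its original sampling rate
    signal, sr = librosa.load(path, sr=None)
    # Short-time Fourier transform -> log-magnitude spectrogram
    spectrogram = librosa.amplitude_to_db(np.abs(librosa.stft(signal)), ref=np.max)
    if plot:
        librosa.display.specshow(spectrogram, sr=sr, x_axis='time', y_axis='hz')
        plt.colorbar(format='%+2.0f dB')
        plt.title('Spectrogram')
        plt.show()
    return spectrogram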

The above method returns the spectrogram of a given audio file and plots it using the ‘matplotlib’ library.

Energy Function and Loss Function:

To calculate the distance between the generated embeddings, the euclidean distance will be used as the energy function:
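A sketch of this energy function using the Keras backend (the helper names match those used in the model code below; the exact implementation in the original may differ):

from keras import backend as K

def euclidean_distance(vects):
    # Euclidean distance between the two embedding vectors of a pair
    x, y = vects
    sum_square = K.sum(K.square(x - y), axis=1, keepdims=True)
    return K.sqrt(K.maximum(sum_square, K.epsilon()))

def eucl_dist_output_shape(shapes):
    # The Lambda layer outputs a single distance value per pair
    shape1, shape2 = shapes
    return (shape1[0], 1)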

Contrastive loss:

It is used as the loss function. It is generally used when the similarity between two points needs to be measured, which serves our purpose here. It involves a ‘margin’ parameter: the model is not penalized if the true label is 0 (the data points are dissimilar) and the distance between the embeddings is greater than this margin.
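A sketch of contrastive loss in Keras, assuming labels of 1 for similar pairs, 0 for dissimilar pairs and a margin of 1 (the exact values in the original may differ):

from keras import backend as K

def contrastive_loss(y_true, y_pred):
    # y_true: 1 for a similar pair, 0 for a dissimilar pair
    # y_pred: euclidean distance between the two embeddings
    margin = 1.0
    square_pred = K.square(y_pred)
    margin_square = K.square(K.maximum(margin - y_pred, 0))
    return K.mean(y_true * square_pred + (1 - y_true) * margin_square)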

The model:

The implementation of the model architecture is divided into four parts:

  1. Image Feature Extractor: It consists of 2D convolution layers followed by a max pooling layer. ReLU is used as the activation function, and a dropout layer is used to avoid over-fitting.
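A sketch of such an extractor (the layer and filter sizes are illustrative assumptions, not necessarily the original ones); with the (56, 56, 1) input used later it produces (27, 27, 32) feature maps:

from keras.layers import Input, Conv2D, MaxPooling2D, Dropout
from keras.models import Model

def image_feat_network(input_dim_img):
    # Convolutions + max pooling + dropout over the handwritten-digit image
    inp = Input(shape=input_dim_img)                                 # e.g. (56, 56, 1)
    x = Conv2D(32, (3, 3), activation='relu', padding='same')(inp)   # -> (56, 56, 32)
    x = Conv2D(32, (3, 3), activation='relu')(x)                     # -> (54, 54, 32)
    x = MaxPooling2D(pool_size=(2, 2))(x)                            # -> (27, 27, 32)
    x = Dropout(0.25)(x)
    return Model(inp, x)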

The function takes the image dimensions as input and returns a model that produces the image feature tensor.

2. Audio Feature Extractor: It also contains convolution layers, with max pooling layers used for down-sampling.
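A sketch of the audio extractor under the same caveat. The kernel and pool sizes below are one illustrative choice that reduces the (1025, 47, 1) spectrogram to (27, 10, 32), so it can sit next to the (27, 27, 32) image features along axis 2 and give the (27, 37, 32) concatenated shape used later:

from keras.layers import Input, Conv2D, MaxPooling2D, Dropout
from keras.models import Model

def audio_feat_network(input_dim_aud):
    # Convolution + aggressive max pooling to down-sample the spectrogram
    inp = Input(shape=input_dim_aud)                     # e.g. (1025, 47, 1)
    x = Conv2D(32, (3, 8), activation='relu')(inp)       # -> (1023, 40, 32)
    x = MaxPooling2D(pool_size=(37, 4))(x)               # -> (27, 10, 32)
    x = Dropout(0.25)(x)
    return Model(inp, x)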

💡 Tip: We need to concatenate these features, so adjust the parameters such that the output dimensions are the same except along the concat axis, which in our case is axis 2.

3. Fusion: To aggregate the image and audio features, a convolution layer along with max pooling is used. A dense layer is used to get the linear embedding. This fusion network returns the embedding of an input image-audio pair, which is then used to calculate the distance.
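A sketch of the fusion network (again, the filter count and embedding size are illustrative assumptions):

from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense
from keras.models import Model

def multi_modal_network(input_concat_dim):
    # Aggregates the concatenated image-audio features into a single embedding vector
    inp = Input(shape=input_concat_dim)                  # e.g. (27, 37, 32)
    x = Conv2D(64, (3, 3), activation='relu')(inp)
    x = MaxPooling2D(pool_size=(2, 2))(x)
    x = Flatten()(x)
    embedding = Dense(128, activation='linear')(x)       # linear embedding of the fused pair
    return Model(inp, embedding)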

4. Siamese Network: This uses the embeddings from the fusion network. As stated above, the energy function here is the euclidean distance between the two embeddings. The smaller the distance, the higher the probability that the two image-audio pairs belong to the same class. A sigmoid activation function is used to calculate these probabilities.

Let’s understand this code step by step-

First, we need to create the inputs for the two image-audio data pairs; input_dim_img and input_dim_aud are the dimensions of the image and audio data respectively.

img_a = Input(shape=input_dim_img)
img_b = Input(shape=input_dim_img)
aud_a = Input(shape=input_dim_aud)
aud_b = Input(shape=input_dim_aud)

Next, the image feature extractor and the audio feature extractor are called to extract the features from the data:

img_network = image_feat_network(input_dim_img)
feat_img_a = img_network(img_a)
feat_img_b = img_network(img_b)
aud_network = audio_feat_network(input_dim_aud)
feat_aud_a = aud_network(aud_a)
feat_aud_b = aud_network(aud_b)

To use the fusion network, we first need to concatenate the features of each image-audio pair:

concat_a = Concatenate(axis=2)([feat_img_a, feat_aud_a])
concat_b = Concatenate(axis=2)([feat_img_b, feat_aud_b])

The aggregate network is now called on these concatenated vectors with the appropriate dimensions. Two copies of the same network are used so that the weights are shared.

base_network = multi_modal_network(input_concat_dim)
feat_vecs_a = base_network(concat_a)
feat_vecs_b = base_network(concat_b)

For the custom euclidean distance layer, a Lambda layer is used:

distance = Lambda(euclidean_distance, output_shape=eucl_dist_output_shape)([feat_vecs_a, feat_vecs_b])
prediction = Dense(1, activation='sigmoid')(distance)
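The steps above can be wrapped into a single builder so that the call below works. This is a sketch; in particular, the RMSprop optimizer and the input ordering are assumptions on my part:

from keras.layers import Input, Concatenate, Lambda, Dense
from keras.models import Model
from keras.optimizers import RMSprop

def siamese_model(input_dim_img, input_dim_aud, input_concat_dim):
    # Inputs for the two image-audio pairs
    img_a = Input(shape=input_dim_img)
    img_b = Input(shape=input_dim_img)
    aud_a = Input(shape=input_dim_aud)
    aud_b = Input(shape=input_dim_aud)

    # Shared feature extractors and fusion network
    img_network = image_feat_network(input_dim_img)
    aud_network = audio_feat_network(input_dim_aud)
    base_network = multi_modal_network(input_concat_dim)

    concat_a = Concatenate(axis=2)([img_network(img_a), aud_network(aud_a)])
    concat_b = Concatenate(axis=2)([img_network(img_b), aud_network(aud_b)])

    feat_vecs_a = base_network(concat_a)
    feat_vecs_b = base_network(concat_b)

    distance = Lambda(euclidean_distance, output_shape=eucl_dist_output_shape)([feat_vecs_a, feat_vecs_b])
    prediction = Dense(1, activation='sigmoid')(distance)

    model = Model(inputs=[img_a, aud_a, img_b, aud_b], outputs=prediction)
    opt = RMSprop()
    return opt, model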

After the architecture is laid out, compile the model to see a summary of the network.

opt, model = siamese_model(input_dim_img, input_dim_aud, input_concat_dim)
model.compile(loss=contrastive_loss, optimizer=opt)
model.summary()

Configuration of the model:

Train the above neural net with this configuration-

batch_size = 64
epochs = 30
input_concat_dim = (27,37,32)
input_dim_img = (56,56,1)
input_dim_aud = (1025, 47,1)

💡 Tip: While training the neural net, it is always advisable to use early stopping to avoid over-fitting.
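For example, a sketch of the training call with an EarlyStopping callback; the pair arrays and label vector (pairs_img_a, pairs_aud_a, pairs_img_b, pairs_aud_b, labels) are hypothetical names for the output of whatever pair-generation step you use:

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
model.fit([pairs_img_a, pairs_aud_a, pairs_img_b, pairs_aud_b], labels,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2,
          callbacks=[early_stop])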

The complete code for this multimodal meta learning experiment can be found in my repository!

Conclusions:

Training a model with just a few data points makes it more dynamic and easier to deploy; it makes our model more industry-ready. But there is still room for improvement.

  1. The equi-weighted fusion method does not always work well, since in real-life scenarios not all modalities are equally important; this may vary from task to task. Hence a weighted fusion method could be used, according to the needs of the problem statement.
  2. Feature extraction from small images is difficult. We may increase the resolution of the dataset before using it for training.

All criticisms and suggestions are welcome.

Happy Learning!
