Few Shot Learning — A Case Study (3)
In the previous article, we dived deep into few-shot classification via the Relation Network and analyzed it for image classification tasks. In this article, I will analyze the Relation Network for text classification and perform extensive experiments to evaluate its effectiveness with various embedding networks.
The flow of this article:
- Revision on Relation Network
- Different types of possible embedding networks
- Results & Analysis
- Conclusion
Revision
Before moving forward, let’s quickly revise the Relation Network (for more details, please read the previous blog):
- Relation Network contains two sub-networks.
- The first one is the embedding network, which will extract the underlying representation of each input irrespective of the class it belongs to.
- The embedding network will be used to extract the features from support and query sets of data.
- And the second network compares each embedding from the support set with the query embedding and gives the result based on this comparison (a minimal sketch of this comparison step follows the list).
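Below is a minimal PyTorch sketch of that comparison step, assuming sentence embeddings have already been extracted; the module name, dimensions, and architecture are illustrative, not the exact implementation from the repository.

```python
import torch
import torch.nn as nn

EMB_DIM = 768  # assumed embedding size (e.g., the BERT hidden size)

class RelationModule(nn.Module):
    """Scores how well each query embedding matches each support-class embedding."""
    def __init__(self, emb_dim=EMB_DIM, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden),  # concatenated support + query features
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),                    # relation score in [0, 1]
        )

    def forward(self, support_emb, query_emb):
        # support_emb: (n_way, emb_dim) class embeddings from the support set
        # query_emb:   (n_query, emb_dim) embeddings of the query sentences
        n_way, n_query = support_emb.size(0), query_emb.size(0)
        s = support_emb.unsqueeze(0).expand(n_query, -1, -1)  # (n_query, n_way, emb_dim)
        q = query_emb.unsqueeze(1).expand(-1, n_way, -1)      # (n_query, n_way, emb_dim)
        pairs = torch.cat([s, q], dim=-1)                     # pair every query with every class
        return self.net(pairs).squeeze(-1)                    # (n_query, n_way) relation scores
```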
Text embedding networks
Whenever someone hears about word embeddings, word2vec, GloVe, and language models (e.g., BERT, RoBERTa) are the key techniques that come to mind. Therefore, in this section, I’ll use the state-of-the-art method, namely BERT, as a pre-trained base model (via transfer learning) for designing various embedding networks for few-shot text classification. Moreover, I will analyze these possible networks in terms of accuracy and computational complexity. [GitHub]
BERT is one of the most popular and easily available language models. The base architecture of BERT has a total of 12 layers, and the output of each of these layers can be used as a word embedding. The different types of word embeddings that can be extracted from BERT are shown in the figure below; a minimal sketch of how to obtain these per-layer outputs follows.
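For reference, here is a minimal sketch of pulling the per-layer hidden states out of pre-trained BERT with the Hugging Face transformers library; the model name and example sentence are assumptions, not tied to the repository code.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("Few-shot text classification with BERT.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple of 13 tensors (embedding layer + 12 encoder layers),
# each of shape (batch, seq_len, 768); any of them can serve as word embeddings.
last_layer = outputs.hidden_states[-1]
print(len(outputs.hidden_states), last_layer.shape)
```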
Now, to experiment with the Relation Network under different settings, I will use BERT to design several different embedding networks and find the most suitable embedding-extraction method for few-shot text classification.
- Only pre-trained BERT: Here, the sentence embedding is obtained by average- and max-pooling the word representations taken from the last layer of pre-trained BERT (see the pooling sketch after this list).
- Fine-tuning BERT itself: In this method, instead of relying on the pre-trained BERT of method 1, we fine-tune BERT on the few-shot setting and dataset.
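The pooling in method 1 can be sketched as follows; this is a minimal illustration assuming frozen BERT outputs and an attention mask to ignore padding, not the exact repository code.

```python
import torch

def sentence_embedding(last_hidden_state, attention_mask):
    # last_hidden_state: (batch, seq_len, 768) from BERT's last layer
    # attention_mask:    (batch, seq_len) with 1 for real tokens, 0 for padding
    mask = attention_mask.unsqueeze(-1).float()
    mean_pool = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    max_pool, _ = last_hidden_state.masked_fill(mask == 0, float("-inf")).max(dim=1)
    return torch.cat([mean_pool, max_pool], dim=-1)  # (batch, 1536) sentence embedding
```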
Although we have a state-of-the-art method (BERT) for extracting word embeddings, there are two major problems associated with it: first, the high computational complexity of BERT, and second, the difficulty of modifying the word representations at the core level. To avoid these problems, we can introduce one layer on top, through which we modify the word embeddings according to our requirements. Therefore, we will perform two additional experiments:
- Pre-trained BERT + BiLSTM: Here, we apply a bidirectional LSTM on top of pre-trained BERT to modify the word embeddings in such a way that a single LSTM output captures the information from BERT that is relevant for text classification.
- Pre-trained BERT + BiGRU: This experiment is the same as the one above, except that the BiLSTM is replaced with a BiGRU. Ideally, the GRU should outperform the LSTM by a small margin with fewer trainable parameters (a sketch covering both variants follows this list).
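A minimal sketch of such an encoder is shown below, assuming frozen BERT token embeddings as input; the class name and hidden size are illustrative, and passing `nn.GRU` instead of `nn.LSTM` gives the BiGRU variant.

```python
import torch
import torch.nn as nn

class BiRNNEncoder(nn.Module):
    """Bidirectional LSTM/GRU on top of (frozen) BERT token embeddings."""
    def __init__(self, bert_dim=768, hidden=128, rnn_cls=nn.LSTM):
        super().__init__()
        self.rnn = rnn_cls(bert_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, bert_token_embeddings):
        # bert_token_embeddings: (batch, seq_len, bert_dim)
        outputs, _ = self.rnn(bert_token_embeddings)      # (batch, seq_len, 2 * hidden)
        hidden = outputs.size(-1) // 2
        fwd_last = outputs[:, -1, :hidden]                # last step of the forward direction
        bwd_first = outputs[:, 0, hidden:]                # first step of the backward direction
        return torch.cat([fwd_last, bwd_first], dim=-1)   # (batch, 2 * hidden) sentence embedding

# BiGRU variant with fewer trainable parameters than the LSTM:
# encoder = BiRNNEncoder(rnn_cls=nn.GRU)
```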
Results & Analysis
To perform the above-mentioned experiments, the Kaggle news category dataset has been used. In the experimental setting, 50% of the news categories are used for training, 20% for validation, and 30% for testing. Moreover, this analysis is performed in a 5-way 2-shot classification setting.
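For clarity, episode construction in such a setting can be sketched as follows; this is a hypothetical helper, not the repository’s sampling code.

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=2, n_query=5):
    """Sample one N-way K-shot episode from a {category: [sentences]} dictionary."""
    classes = random.sample(list(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        examples = random.sample(data_by_class[cls], k_shot + n_query)
        support += [(text, label) for text in examples[:k_shot]]
        query += [(text, label) for text in examples[k_shot:]]
    return support, query  # support builds the class embeddings, query is scored against them
```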
From the above table, we can observe that fine-tuned BERT outperforms pre-trained BERT by a large margin, and that adding a BiLSTM or BiGRU on top of BERT improves the results considerably. Furthermore, we can see that BERT alone has the highest computational complexity, and the reason behind its relatively poor performance is the pre-training strategy used: BERT has to modify a huge number of parameters, whereas the LSTM/GRU layer extracts the information required for few-shot text classification directly from the word embeddings.
Conclusion
Following are a few concluding points to keep in mind:
- The embedding network is the key component, even for few-shot text classification.
- Fine-tuning BERT improves the results at the cost of computational complexity.
- LSTM/GRU-based layers on top of pre-trained BERT significantly outperform the BERT-only embedding network with much lower computational complexity.
Implementations of all the above experiments, along with the different result plots, are provided in the GitHub repository.
Next, I will implement the same Relation Network for speech classification tasks, so stay tuned! Subscribe here to get notified about upcoming articles and keep up with current research trends.
References:
- Sung, Flood, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M. Hospedales. “Learning to compare: Relation network for few-shot learning.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. 2018.
- Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).