Unveiling Zero-Shot Learning

Ray Wang
May 7, 2023


Introduction

Imagine you are a machine learning scientist tasked with creating a system that can identify and classify tweets based on emerging trends. Under the traditional machine learning framework, we would first need to train our model on labeled data, with tweets as inputs and the corresponding trends as output labels. While this sounds like a simple task, implementing it in real time reveals a significant flaw. Because the social media landscape is ever-changing, it is virtually impossible to constantly re-train a model to stay updated with the latest trends. Furthermore, manually labeling the vast amount of Twitter data is both strenuous and time-consuming. This is where zero-shot learning comes to the rescue.

Zero-shot learning (ZSL) is a state-of-the-art machine learning framework that aims to create models that can infer knowledge about unseen classes by leveraging what they have learned from previously encountered labels. This “self-learning” property allows the model to generalize from limited training instances. In the context of Twitter data, a ZSL model can classify new tweets with unseen labels based on past data and trends. This overcomes the real-time challenge described above, since the model is highly adaptable and able to predict previously unseen labels.
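Before diving into the theory, it is worth noting how accessible ZSL already is in practice. The sketch below uses the zero-shot classification pipeline from the Hugging Face transformers library, which repurposes a natural language inference model for arbitrary labels; the tweet and candidate trend labels are invented for illustration:

```python
from transformers import pipeline

# An NLI-based zero-shot text classifier; no training on our labels needed.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

tweet = "The new foldable phone leak looks wild, but battery life is the big question."
candidate_labels = ["technology", "sports", "politics"]  # hypothetical trends

result = classifier(tweet, candidate_labels)
print(result["labels"][0], result["scores"][0])  # top predicted trend and its score
```

Because the candidate labels are supplied at inference time, we can swap in tomorrow's trending topics without retraining anything.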

In this article, we will delve into the fundamentals of ZSL, explore how it works, and examine its potential use cases. We will primarily focus on Xian et al.’s 2020 paper, “Zero-Shot Learning — A Comprehensive Evaluation of the Good, the Bad and the Ugly” [1], which provides a holistic overview of the current landscape of ZSL research and introduces novel evaluation methodologies. Overall, we will embark on an in-depth exploration of ZSL and discover the intricacies that lie at the heart of this emerging field of machine learning.

Models and Algorithms

Here we will go through some of the major ZSL frameworks. We will explore their structures, functionalities, and potential drawbacks.

Attribute-based ZSL

In the early years of ZSL research, researchers focused on “attribute-based” approaches, such as Direct Attribute Prediction (DAP). DAP involves a two-stage prediction process: during the first stage, the model predicts the input’s attributes; in the second stage, a separate model predicts the class label with the most similar set of attributes. For example, when given labeled dog images, the first model learns features such as the presence of four legs, fur, and a tail. During the second stage, if presented with an unseen image of a cat, the model can identify that the animal has similar attributes. With additional side information, such as a class-attribute association matrix, the model can then predict the image as a cat even without prior exposure. However, this method has a significant limitation, known as “domain shift”: the intermediate attribute-prediction step does not align well with the final task of predicting labels.

In the first stage, DAP learns different attributes about various animals. This knowledge is then used to map an unseen class to the label that has the most similar set of attributes [2].
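To make the two-stage pipeline concrete, here is a minimal numpy sketch of DAP-style inference. The attribute probabilities and the class-attribute matrix are invented for illustration; in practice, the first-stage classifiers would produce the former and side information would supply the latter:

```python
import numpy as np

# Stage 1 (assumed already done): estimated probability that each attribute
# is present in the input image, e.g. [four_legs, fur, stripes].
attr_probs = np.array([0.9, 0.8, 0.1])

# Side information: class-attribute association matrix for *unseen* classes.
# Rows are classes, columns are attributes (1 = class has the attribute).
class_attr = np.array([
    [1, 1, 0],   # cat
    [1, 1, 1],   # tiger
])
class_names = ["cat", "tiger"]

# Stage 2: score each unseen class by how well its attribute signature
# matches the predicted attributes (a simple likelihood-style product).
scores = np.prod(np.where(class_attr == 1, attr_probs, 1 - attr_probs), axis=1)
print(class_names[int(np.argmax(scores))])  # -> "cat"
```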

Embedding-based ZSL

Embedding-based ZSL addresses the “domain shift” limitation of attribute-based ZSL by directly mapping between the input and output spaces. For example, Attribute Label Embedding (ALE) first creates an image embedding space and a label semantic embedding space, then finds a mapping function that connects them. The model can then use this mapping function to infer relationships between unseen inputs and label classes.

In embedding-based ZSL, we create an input image embedding using a CNN and an output semantic embedding using Word2Vec or GloVe. We then find a mapping function that connects the input and output spaces. [3]

In general, embedding-based ZSL involves a training set:

S = {(x_n, y_n), n = 1, ..., N}

and we want to learn a function:

f: X → Y

by minimizing the regularized empirical risk:

(1/N) Σ_{n=1}^{N} L(y_n, f(x_n; W)) + Ω(W)

Here, L(·) is the loss function and Ω(·) is the regularization term [1].

One major difference between ZSL and the traditional ML framework lies in the mapping function:

f(x; W) = argmax_{y ∈ Y} F(x, y; W)

Here F(x, y; W) is a compatibility function that measures the relationship between the embedding space of input x and that of output y. In other words, the goal is to learn the mapping weights W that maximize the compatibility score. For example, ALE uses a linear compatibility function:

F(x, y; W) = θ(x)^T W φ(y)

Here, θ(x) is the input image embedding and φ(y) is the output semantic embedding. The score is essentially a dot product between the projected input embedding and the output embedding: a higher value indicates greater compatibility.
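To tie the pieces together, here is a minimal numpy sketch of the bilinear compatibility score, the argmax prediction rule, and a simplified hinge loss standing in for the empirical risk above (ALE itself uses a weighted approximate ranking loss; all dimensions and embeddings here are invented placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

d_img, d_sem, n_classes = 2048, 300, 50        # assumed embedding dimensions
theta_x = rng.normal(size=d_img)               # CNN image embedding θ(x)
phi = rng.normal(size=(n_classes, d_sem))      # semantic embeddings φ(y), one row per class
W = 0.01 * rng.normal(size=(d_img, d_sem))     # mapping weights to be learned

# Linear (bilinear) compatibility: F(x, y; W) = θ(x)^T W φ(y)
scores = theta_x @ W @ phi.T                   # compatibility with every class at once

# Prediction rule: f(x; W) = argmax_y F(x, y; W)
pred = int(np.argmax(scores))

# A simplified training objective in the empirical-risk form above: a
# multiclass hinge loss plus an L2 regularizer Ω(W).
y_true = 3                                     # placeholder ground-truth class
margins = np.maximum(0.0, 1.0 + scores - scores[y_true])
margins[y_true] = 0.0
loss = margins.sum() + 1e-4 * np.sum(W ** 2)
```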

Similar models that employ a linear compatibility function include Deep Visual Semantic Embedding (DEVISE) and Structured Joint Embedding (SJE); however, these models fail to capture complex, non-linear relationships between the embedding spaces. To address this issue, researchers created models with non-linear compatibility functions, such as Latent Embeddings (LATEM). LATEM combines multiple linear mappings into a piecewise-linear compatibility function, allowing it to capture more intricate relationships and improving performance.

Hybrid Models

So far we have seen that both attribute information and latent embeddings can be used to predict unseen classes, so why not combine the strengths of the two approaches and utilize all the information we have? Hybrid models do exactly this, leveraging both attribute knowledge and semantic embeddings. They use input attributes to extract fine-grained details, while employing semantic embeddings to capture the relationships between class labels. Examples of hybrid models include Semantic Similarity Embedding (SSE), Convex Combination of Semantic Embeddings (CONSE), and Synthesized Classifiers (SYNC), all of which aim to improve ZSL performance by combining multiple sources of information.
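As a concrete example of the hybrid idea, here is a minimal numpy sketch of CONSE-style inference, which embeds an image as a convex combination of seen-class semantic embeddings weighted by classifier confidence. The probabilities and embeddings are random placeholders, and CONSE proper restricts the combination to the top-T most confident seen classes:

```python
import numpy as np

rng = np.random.default_rng(1)
d_sem = 300

# Assumed inputs, invented for illustration:
seen_probs = np.array([0.7, 0.2, 0.1])         # classifier output over seen classes
seen_emb = rng.normal(size=(3, d_sem))         # φ(y) for the seen classes
unseen_emb = rng.normal(size=(5, d_sem))       # φ(y) for the unseen classes

# CONSE-style inference: embed the image as a convex combination of the
# seen classes' semantic embeddings, weighted by classifier confidence.
img_sem = seen_probs @ seen_emb

# Predict the unseen class whose embedding is closest (cosine similarity).
cos = unseen_emb @ img_sem / (
    np.linalg.norm(unseen_emb, axis=1) * np.linalg.norm(img_sem))
pred_unseen = int(np.argmax(cos))
```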

Transductive ZSL

Transductive ZSL methods, such as GFZSL-tran, leverage additional information about unseen data to improve the model’s ability to generalize. For example, besides labeled images of cats and dogs, we might also have access to unlabeled images of other animals, say foxes. This unlabeled data can offer valuable insights into the unseen classes, and the model can improve its performance by extracting this latent information.

Compared to traditional ZSL, transductive ZSL has access to unlabeled testing data [4].
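As a toy illustration of the transductive idea (not the GFZSL-tran algorithm itself), the sketch below refines nearest-prototype classifiers for two hypothetical unseen classes using unlabeled points, in the spirit of an EM-style self-training loop; all data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)

# Initial unseen-class prototypes, e.g. derived from semantic side information.
prototypes = np.array([[0.0, 0.0], [4.0, 4.0]])

# Unlabeled test points drawn around two clusters the model has never seen.
unlabeled = rng.normal(size=(100, 2)) + rng.choice([0.5, 3.5], size=(100, 1))

# Transductive refinement: assign each unlabeled point to its nearest
# prototype, then move each prototype to the mean of its assigned points.
# Repeating this lets the unlabeled pool reshape the decision boundary.
for _ in range(5):
    dists = np.linalg.norm(unlabeled[:, None, :] - prototypes[None], axis=2)
    assign = dists.argmin(axis=1)
    for k in range(len(prototypes)):
        if np.any(assign == k):
            prototypes[k] = unlabeled[assign == k].mean(axis=0)
```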

Datasets

Some of the major datasets used in ZSL research include Attribute Pascal and Yahoo (aPY), Animals with Attributes (AWA1), Caltech-UCSD Birds 200–2011 (CUB), SUN, Animals with Attributes 2 (AWA2), and the large-scale ImageNet. Among these, aPY and AWA1 are coarse-grained, small-to-medium-scale attribute datasets, in which images are labeled with specific attributes. CUB and SUN are fine-grained, medium-scale attribute datasets, and ImageNet is a large-scale dataset without attribute annotations. AWA2 improves on AWA1 by consisting of publicly available images.

Evaluation

When evaluating ZSL model performance, an intuitive choice is the top-1 accuracy, which measures whether the predicted class equals the true label. However, this approach has a significant drawback: majority classes heavily influence the measured performance, while the performance on minority classes becomes insignificant. Therefore, the authors propose a per-class top-1 accuracy, defined as follows:

acc_Y = (1/|Y|) Σ_{y=1}^{|Y|} (# correct predictions in class y) / (# samples in class y)

This metric measures the top-1 accuracy for each class and then averages across all the classes to get the final evaluation metric.

In generalized ZSL, where the model is tested on both seen and unseen classes, the authors suggest using the harmonic mean as the evaluation metric. Unlike the arithmetic mean, the harmonic mean prevents high accuracy on the seen classes from dominating the final metric. It is defined as follows:

H = 2 · acc_{Y^{tr}} · acc_{Y^{ts}} / (acc_{Y^{tr}} + acc_{Y^{ts}})

Here acc_{Y^{tr}} is the per-class accuracy on the seen classes and acc_{Y^{ts}} is the per-class accuracy on the unseen classes.
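Both metrics are straightforward to compute; here is a small numpy sketch with toy labels (invented for illustration):

```python
import numpy as np

def per_class_top1(y_true, y_pred):
    """Average of per-class top-1 accuracies, as in Xian et al. [1]."""
    classes = np.unique(y_true)
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(accs))

def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean H of seen/unseen accuracies for generalized ZSL."""
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# Toy example: class 0 dominates the data, but each class counts equally.
y_true = np.array([0, 0, 0, 0, 1, 2])
y_pred = np.array([0, 0, 0, 0, 1, 0])
print(per_class_top1(y_true, y_pred))   # (1.0 + 1.0 + 0.0) / 3 ≈ 0.667
print(harmonic_mean(0.8, 0.4))          # ≈ 0.533
```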

Results

In the paper, the authors compare the performance of several state-of-the-art ZSL models, including DAP, DEVISE, and GFZSL. The models are tested on prediction tasks over the datasets described earlier: SUN, CUB, AWA1, AWA2, aPY, and ImageNet. The authors evaluate the models under both the standard ZSL setting and the generalized ZSL setting. The results show that embedding-based models like ALE and DEVISE generally outperform the others.

Model performance on each dataset, using the per-class top-1 accuracy described above as the evaluation metric. [1]
Model performance in generalized ZSL, which includes both seen and unseen data in the testing set. Here ts is model performance on unseen data, tr is model performance on seen data, and H is the harmonic mean described above. [1]

Conclusion

In conclusion, zero-shot learning is an innovative ML framework that allows models to predict previously unseen classes without the need for retraining. Throughout this article, we have explored various ZSL models, such as attribute-based, embedding-based, and hybrid models. Additionally, we briefly touched upon some of the benchmark datasets and evaluation methods in ZSL research.

I hope this article has helped you build a fundamental understanding of ZSL and its inner workings. It is a relatively young field of machine learning research, and new algorithms arise every year. With its high adaptability, ZSL can tackle numerous obstacles faced by the ML world today, offering a wealth of applications, especially in NLP and image classification. The next time you are looking to develop a tweet-trend classifier or explore a similar project, I hope ZSL can prove to be a valuable resource in your endeavor.

References

[1] Xian, Y., Akata, Z., Sharma, G., Nguyen, Q., Hein, M., & Schiele, B. (2020). Zero-shot learning — A comprehensive evaluation of the good, the bad and the ugly. arXiv preprint arXiv:2003.04394.

[2] Lampert, C. H., Nickisch, H., & Harmeling, S. (2014). Attribute-Based Classification for Zero-Shot Visual Object Categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3), 453–465.

[3] Akata, Z., Perronnin, F., Harchaoui, Z., & Schmid, C. (2015). Label-Embedding for Image Classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7), 1425–1438.

[4] Sun, X., Gu, J., & Sun, H. (2021). Research progress of zero-shot learning. Applied Intelligence, 51(2), 1–15. https://doi.org/10.1007/s10489-020-02075-7
