From Zero to Hero: Shaking Up the Field of Zero-shot Learning

This article is part of the Academic Alibaba series and is taken from the paper entitled “Transductive Unbiased Embedding for Zero-Shot Learning” by Jie Song, Chengchao Shen, Mingli Song, Yezhou Yang, and Yang Liu. The full paper can be read here.

Alibaba and its research partners are shaking up the field of zero-shot learning (ZSL) with their novel quasi-fully supervised learning (QFSL) model, which tests indicate considerably outperforms other models. In machine learning, zero-shot learning refers to the process by which a machine learns to recognize objects in an image without any labeled training examples of those objects to help with classification. In other words, ZSL aims to help machines categorize objects that they have never seen before. Naturally, this poses a huge challenge for developers. Imagine, for example, trying to identify a snake in a photo without ever having seen one before. While this might seem an impossible task, if the machine is fed a detailed description of a snake (long, legless, scaly), it can recognize the object quickly and accurately. Essentially, this is how ZSL operates.
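The snake example can be made concrete with a small sketch. It assumes that each class is described by a vector of attribute values and that some vision model (not shown) has already turned a photo into attribute scores; the attribute names, class list, and scores below are hypothetical and chosen only for illustration. The image is assigned to the class whose description is most similar to the predicted attributes, even though no labeled snake photos were ever used.

```python
import math

# Hypothetical attribute order: [long, legless, scaly, furry]
CLASS_ATTRIBUTES = {
    "snake": [1.0, 1.0, 1.0, 0.0],
    "lizard": [1.0, 0.0, 1.0, 0.0],
    "cat": [0.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_by_description(predicted_attributes, class_attributes):
    """Return the class whose attribute description best matches the image."""
    return max(
        class_attributes,
        key=lambda name: cosine(predicted_attributes, class_attributes[name]),
    )


# Attribute scores a vision model might output for a photo (assumed values).
image_attributes = [0.9, 0.8, 0.7, 0.1]
label = classify_by_description(image_attributes, CLASS_ATTRIBUTES)
print(label)
```

Cosine similarity is only one possible matching rule; practical ZSL systems typically learn the mapping from images into the description space rather than using hand-set scores.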

Avoiding Bias

One issue that affects many ZSL approaches is bias. Collecting and labeling training data is both labor-intensive and expensive, and it remains difficult to gather enough statistically diverse training images, particularly for rare categories such as endangered species. Models are therefore trained only on the labeled source classes, and images from unlabeled target classes (i.e. objects that have not been seen before) are often misclassified as one of the source classes. This results in poor performance in generalized settings, where test images may come from either source or target classes.

Bias towards source classes in a semantic embedding space

When there are few training images available, or indeed none, existing object recognition models struggle to make correct predictions, and ZSL was developed principally as a means to combat this growing problem.

QFSL — New Solution

To resolve this issue, the Alibaba tech team has developed a straightforward and effective transductive method, quasi-fully supervised learning (QFSL), which assumes that both the labeled source data and the unlabeled target data are available during training.

Overall architecture of the QFSL model

Most ZSL methods map input images only to a few fixed anchor points in the embedding space during training, but the QFSL method also allows input images to be mapped to other points. The labeled source data is projected to the points specified by the source classes in the shared semantic space, building a relationship between the visual and semantic embeddings. Meanwhile, the unlabeled target data is projected to other points, helping to alleviate the problem of bias.

QFSL owes its name to its similarities with conventional fully supervised classification, in which a multi-layer neural network and a classifier are integrated. In the training phase, QFSL learns to classify over both source and target classes, even though there is no labeled data for the target classes. This feature is advantageous, as any labeled data for a target class that becomes available in the future can be used to train the model.
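The training setup described above can be sketched as a loss with two parts. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that images have already been projected into the semantic space, that class scores are dot products with per-class anchor vectors, and that the penalty on unlabeled target images is the negative log of the total probability assigned to target classes. The function names, the `weight` parameter, and the toy anchors are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(embedding, anchors):
    """Dot product of a projected image with every class anchor."""
    return [sum(e * a for e, a in zip(embedding, anchor)) for anchor in anchors]


def source_loss(embedding, anchors, label_index):
    """Cross-entropy over ALL classes (source and target) for a labeled image."""
    probs = softmax(class_scores(embedding, anchors))
    return -math.log(probs[label_index])


def target_bias_loss(embedding, anchors, target_indices):
    """Assumed penalty: small when an unlabeled target image puts its
    probability mass on target classes, large when it drifts to source classes."""
    probs = softmax(class_scores(embedding, anchors))
    return -math.log(sum(probs[i] for i in target_indices))


def total_loss(source_batch, target_batch, anchors, target_indices, weight=1.0):
    """Mean supervised loss on source data plus weighted mean bias penalty."""
    src = sum(source_loss(e, anchors, y) for e, y in source_batch) / len(source_batch)
    tgt = sum(
        target_bias_loss(e, anchors, target_indices) for e in target_batch
    ) / len(target_batch)
    return src + weight * tgt


# Toy setup: classes 0 and 1 are source classes, class 2 is a target class.
ANCHORS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
TARGET_INDICES = [2]
source_batch = [([2.0, 0.0], 0), ([0.0, 2.0], 1)]  # (embedding, label) pairs
target_batch = [[-1.5, -1.0]]                      # unlabeled embeddings
print(total_loss(source_batch, target_batch, ANCHORS, TARGET_INDICES))
```

Because the softmax runs over source and target classes together, the model stays a single classifier, which is what makes the setup resemble fully supervised training.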

Looking Ahead

When tested against various benchmarks, QFSL outperformed existing state-of-the-art ZSL methods by a wide margin in both generalized and conventional settings.

The best result is marked in bold, with second best in blue: QFSL clearly outperforms other methods

These promising results leave the door open for further research, and Alibaba and its research partners are investigating how other aspects of the semantic space, such as word vectors, can be exploited to influence results. Inductive ZSL is another research line to consider, to see whether it can solve the same problems as transductive ZSL.

