Solving the Big Data Challenge? Deep Few-Shot Learning from a Multimodal Perspective

Frederik Pahde, Tassilo Klein, Moin Nabi (ML Research Berlin)

Published in SAP AI Research, Mar 5, 2019

We humans experience the world in a multimodal manner. Every day we learn not just by seeing objects, but also by hearing, tasting, touching and smelling them. Our brain easily connects these different sources of information and can teach us entirely new concepts from only a few stimuli. If your child has already learned what a cat, a dog and a horse look and sound like, it will be able to relate this information and understand what a unicorn is after seeing one only a few times. But how can conventional machine learning algorithms learn to understand the world around us in the same way? How can they be trained to connect and interpret different sources of stimulation with only small amounts of training data?

Instead of training a model on a single modality, incorporating data from other modalities during few-shot model training might be a promising step towards solving the big data challenge, and with it towards radically transforming businesses and reimagining our work life.

Deep Few-Shot Learning — Recap

In a previous blogpost we provided an overview of the general problem setting of few-shot learning. To recap: despite the outstanding results achieved with deep learning models, applying them in businesses is often unrealistic. Today’s deep learning models generally require large amounts of high-quality training data, which stands in stark contrast to the data that companies are able to provide. We refer to this as the big data challenge. Few-shot learning algorithms, that is, algorithms that learn and generalise from only a few data samples, have proven to be a challenging but promising approach to solving companies’ data issues. The applicability and benefit of few-shot learning in a multimodal setting, however, has not yet been explored by research. Most few-shot learning approaches exclusively consider visual datasets, although data is usually available in multiple modalities (think textual data and descriptions). Using a multimodal training approach in few-shot learning scenarios therefore amounts to defining a new research benchmark.

Multimodality — Inspired by human learning

When we think about human learning and teaching, we notice that images are often added to text descriptions to help visualise the main idea, for example in scientific books. Conversely, textual descriptions are added to pictures to make sure the right message is delivered, as with paintings in a museum. Applying the same concept to machine learning, the key assumption is that incorporating multimodal data, that is, images together with precise descriptions of them, supports the identification of highly selective features across modalities. This facilitates model training in few-shot scenarios: novel concepts with little training data in one modality can benefit from previously learned alignments between modalities, because the model has already learned a generalised concept.

To use a specific example, let’s consider shoe classification.

If a novel shoe category is added to a dataset, but only a few visual samples of this particular type of shoe exist, a textual description of the new category can provide additional information about it. This creates a multimodal dataset. If we now learn to align the visual and textual information, meaning how the text describing shoe properties (high heels, elegant, sporty) and colours (red, white) relates to the images, this knowledge can be shared with novel, underrepresented classes. Although the available data for these classes is highly limited, the aligned knowledge gathered from an image and its textual description helps to identify selective parts of the image and to make use of previously seen concepts (such as a red shoe colour). In contrast, using images alone to classify the added shoe category is more likely to produce inaccurate results, simply because the amount and quality of training data is insufficient and the model cannot align and connect different sources of data.

Impacts of multimodal datasets — Image Generation

Having understood the multimodal training approach, you might now wonder about the actual impact of having a multimodal dataset. What if we could create additional, previously non-existent images with the help of the described training approach? And what if those additional images could be used to better train models in challenging business scenarios where only a limited number of samples exists?

Now here is the background! Given the images and their textual descriptions, we aim to generate additional visual training samples that can be used to train more powerful classifiers. In this way, a dataset that comes with visual descriptions can be vastly enlarged by generating new images from those descriptions. This is made possible by the recent rise of models for image generation, such as generative adversarial networks (GANs) and variational autoencoders (VAEs).

Now what are GANs and how do they work?

Simply speaking, these methods train networks to transform random noise into meaningful images. A GAN consists of two adversarial networks: a generator and a discriminator. The two networks are optimised for opposing objectives: the generator must produce new images, and the discriminator must detect them as fake. Through this competition both networks improve during training, until eventually the generator can produce highly realistic images.
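For readers who prefer a formula, the competition between the two networks is commonly written as the minimax objective below. This is the standard textbook formulation of a GAN and is given here only as an illustration; it is not necessarily the exact loss used in our work. The symbols are the usual ones: G is the generator, D the discriminator, x a real image, and z a random noise vector.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim p_{z}(z)} \big[ \log \big( 1 - D(G(z)) \big) \big]
```

The discriminator tries to make V large by assigning high scores to real images and low scores to generated ones, while the generator tries to make V small by producing images that the discriminator scores as real.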

Returning to our specific method, we used a publicly available dataset of bird pictures and so-called conditional GANs to generate images. Conditional GANs are a subgroup of GANs that can be trained to generate images given a certain condition. That means the generator does not only receive random noise as input, as is the case with normal GANs, but is also told under which condition to generate an image. In our context, this condition is an embedding of the textual description of the concept to be learned. To respond to the data issues many businesses have, we developed our method for a typical few-shot learning task. Here, we distinguish base classes, for which many training samples exist, from novel classes, for which the number of training samples is highly limited. In detail, we followed a meta-learning framework in which we learn a generative model on the large amount of data available in the base classes and then use it to learn a classifier for the limited novel class samples. To do so we built a text-conditional GAN, which trains the generator to produce images that the discriminator cannot distinguish from “real” images.
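To make the idea of conditioning more concrete, here is a minimal sketch in plain Python. It assumes that the condition is injected by concatenating the noise vector with the text embedding, which is one common way to condition a generator and not necessarily how our model does it. The function names, the embedding values and the one-layer `toy_generator` are hypothetical and purely illustrative; a real generator would be a trained deep network.

```python
import random


def sample_noise(dim, rng):
    """Draw a noise vector z from a standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def conditional_input(noise, text_embedding):
    """Condition the generator by concatenating noise and text embedding (an assumption)."""
    return list(noise) + list(text_embedding)


def toy_generator(inputs, weights):
    """Hypothetical one-layer generator: each output value is a weighted sum of the inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


rng = random.Random(0)
noise_dim, n_outputs = 4, 6
text_embedding = [0.5, -0.2, 0.1]  # made-up embedding of a textual description

# Random, untrained weights; in practice these would be learned adversarially.
weights = [
    [rng.gauss(0.0, 1.0) for _ in range(noise_dim + len(text_embedding))]
    for _ in range(n_outputs)
]

# Same description, different noise: two different samples for the same condition.
sample_a = toy_generator(conditional_input(sample_noise(noise_dim, rng), text_embedding), weights)
sample_b = toy_generator(conditional_input(sample_noise(noise_dim, rng), text_embedding), weights)
```

The point of the sketch is only that the description stays fixed while the noise changes, so the same text can yield many different samples.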

The general idea of the described approach is to use the training data of the base classes to train a conditional GAN that can generate new images conditioned on a textual description. In our earlier shoe example, this would mean generating images of red high-heeled shoes. Using the few existing real samples together with many generated images, an extended dataset can be composed to alleviate the lack of high-quality training data. This dataset can in turn be used to train an arbitrary classifier, and in this way help solve the big data challenge many businesses are facing today.
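The composition of such an extended dataset can be sketched in a few lines of plain Python. Everything here is a simplified assumption: samples are tiny feature vectors rather than images, `toy_generate` is a hypothetical stand-in for a trained conditional generator, and a nearest-centroid rule plays the role of the "arbitrary classifier". The class names and numbers are invented for illustration.

```python
import random


def build_extended_dataset(real_samples, text_embeddings, generate, n_generated, rng):
    """Combine the few real samples per class with samples generated from its description."""
    extended = {}
    for label, samples in real_samples.items():
        generated = [generate(text_embeddings[label], rng) for _ in range(n_generated)]
        extended[label] = list(samples) + generated
    return extended


def class_centroids(dataset):
    """Compute the mean feature vector of each class."""
    centroids = {}
    for label, samples in dataset.items():
        dim = len(samples[0])
        centroids[label] = [sum(s[i] for s in samples) / len(samples) for i in range(dim)]
    return centroids


def predict(centroids, x):
    """Assign x to the class with the closest centroid (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], x))


def toy_generate(text_embedding, rng):
    """Hypothetical generator stand-in: the text embedding plus small Gaussian noise."""
    return [v + rng.gauss(0.0, 0.1) for v in text_embedding]


rng = random.Random(0)
# One real sample per novel class (a one-shot setting) and one description embedding each.
real = {"red_heels": [[0.9, 0.1]], "white_sneakers": [[0.2, 0.8]]}
text = {"red_heels": [1.0, 0.0], "white_sneakers": [0.0, 1.0]}

extended = build_extended_dataset(real, text, toy_generate, n_generated=20, rng=rng)
centroids = class_centroids(extended)
prediction = predict(centroids, [0.8, 0.2])
```

Any other classifier could be trained on `extended` instead; the nearest-centroid rule is used here only because it fits in a few lines.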

For details about this method see the poster at the bottom of this blogpost or read the full paper on arXiv.

Connecting the dots — Solving the big data challenge

To test our model on the bird dataset mentioned above, we optimised a conditional GAN for each dataset to generate images from their textual descriptions. The results are not real images of birds but generated, highly realistic images for novel classes. We observed that discriminative features described in the additional texts, such as shapes and colours, are present in the generated images. By extending the training set with these generated bird images, instead of using only the limited training data of real images, we observed that accuracy increased by around 42% in one-shot training. Our experiment shows that using additional generated images for training substantially increases the accuracy of the trained model.

Practical application — Improved product catalogues

Ultimately, the presented work has shown that multimodal data in few-shot learning scenarios helps to improve classification results. It demonstrates that learning generative models in a cross-modal fashion facilitates few-shot learning by compensating for the lack of data in novel categories through the generation of new, non-existent data.

In practice, our findings are applicable to diverse industries, and basically to any business struggling with the big data challenge and a limited amount of training data. A huge practical impact might be seen in improved product searches, where businesses could train machine learning models to classify new products despite limited availability of training data. Online retailers could hence adapt classification models to new products, such as seasonal products, with improved accuracy. Because of the similarity of the products, they would not have to face the challenge of providing enough data.

We presented our work on few-shot learning with multimodal data and the conference poster (below) at WACV 2019. The full research paper is available here.

About the author: Frederik Pahde, an M.Sc. student at Humboldt University Berlin, covered the challenges discussed in this post in his thesis.
