Dataset in a day

Roland Meertens
Published in Bumble Tech
9 min read · Nov 28, 2023

A clustering-based approach to create deep learning datasets in a day

Introduction

Understanding what’s happening in an image is an important task, but also a costly one. In the last few years, the field of computer vision has accelerated greatly thanks to advances in neural networks. At Bumble Inc., we see potential value in computer vision for a variety of use cases, such as improving the safety of our platform and providing our members with a better user experience.

The most common way to train these neural networks is by showing them many images with corresponding labels. Unfortunately, this can be costly. Not only does one need to build and train the model, one also wants to run a hyperparameter search over multiple possible network configurations, and of course one needs to find or build a dataset suitable for the task at hand.

Building the dataset is both the most important task and a very time-consuming one. Gathering data, setting up labelling requirements, and of course the labelling itself all take a lot of time and money. This normally leads to trade-offs: either building only a small dataset, or trying to fit existing datasets to your specific use case.

One alternative is of course not to build a dataset at all, and instead use zero-shot learning for your use case. I have argued in the past that this is unreasonably effective and lets you test your use case before even training a model. With zero-shot learning one predicts labels without explicitly training on the classes you are trying to learn. One way to achieve this is with the CLIP model, which is trained to have a strong association between text and images. By looking at the distance between the description of your class and the image, you can run inference without training anything. However, there are some use cases where we need the strongest possible model, which means fine-tuning it on our specific data.
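As a rough illustration, here is a minimal zero-shot classification sketch using the Hugging Face transformers implementation of CLIP. The checkpoint and the class prompts are illustrative choices for this post, not a prescription.

```python
# Minimal zero-shot classification sketch with CLIP (via Hugging Face transformers).
# The checkpoint and the class prompts below are illustrative choices.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a cat", "a photo of a dog", "a photo of a mountain"]
image = Image.open("example.jpg")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```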

Using foundational models for data selection

Foundational models, such as GPT-3 for text and CLIP for images, provide a seemingly amazing understanding of the world around us. If your data represents something that can be found in abundance on the internet, you can immediately start using these models for your use case. However, these models on their own do not provide you with a dataset.

In our experience, foundational models are great at retrieving specific data. This is amazing when you want to find rare or very specific examples. For example, for a self-driving car you might be interested in retrieving examples of people in wheelchairs, or of ambulances and other emergency vehicles. However, we noticed that foundational models do not achieve high classification accuracy when used for zero-shot classification.
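To make that retrieval idea concrete, here is a hedged sketch: embed the images once, embed a text query, and rank images by cosine similarity. The file names, the query, and the choice of checkpoint are assumptions made for the example.

```python
# Text-to-image retrieval sketch: rank pre-computed image embeddings by
# cosine similarity to a text query. Checkpoint, file names, and query are illustrative.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# image_embeddings: (N, D) tensor computed earlier with model.get_image_features(...)
image_embeddings = torch.load("image_embeddings.pt")
image_embeddings = image_embeddings / image_embeddings.norm(dim=-1, keepdim=True)

text_inputs = processor(text=["a person in a wheelchair"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_embedding = model.get_text_features(**text_inputs)
text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)

# After normalisation, cosine similarity reduces to a dot product.
scores = image_embeddings @ text_embedding.T
top_k = scores.squeeze(-1).topk(20).indices  # indices of the 20 most similar images
```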

Clustering in latent space

The big trick behind foundational models is understanding that similar concepts end up close together in the so-called “latent space”. The output of a neural network might be a set of class predictions, but the layer just before that produces a list of numbers: an embedding. There is a logic to these numbers, in that similar concepts will be close together. That way the later layers are able to differentiate between the underlying concepts the network uses for its predictions. Note that the underlying concepts are never explicitly given to the network. We don’t tell it that “a dog has four legs and a furry coat”; it simply learns that some things have legs, some things have a furry coat, and that something which has all of that might be a dog. That also means we don’t really know which aspects are learned, or why certain things end up close together in the embedding space.

The approach we developed at Bumble Inc. hinges on the fact that there is meaning in the latent space of foundational models. Ideally, if we label one image, we would like to immediately label all images similar to it. In this case, “similar” means all the images that sit in the same area of the latent space of a foundational model.
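A minimal sketch of that clustering step, assuming the image embeddings have already been computed with CLIP. The number of clusters is an arbitrary choice, and plain k-means here stands in for whichever clustering algorithm is used in practice; this post does not prescribe one.

```python
# Cluster pre-computed CLIP image embeddings so that one label per cluster
# can be propagated to every image in it. k-means is a stand-in choice here.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

embeddings = np.load("clip_image_embeddings.npy")  # shape (N, D), computed earlier
embeddings = normalize(embeddings)                 # unit vectors, so distances behave like cosine

kmeans = KMeans(n_clusters=1000, random_state=0)   # 1000 clusters is an arbitrary choice
cluster_ids = kmeans.fit_predict(embeddings)       # one cluster id per image

# Labelling cluster 42 as "outdoor activities" would then label all its images at once.
images_in_cluster_42 = np.where(cluster_ids == 42)[0]
```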

Unfortunately, as we already explained, foundational models such as CLIP don’t explicitly tell us which ‘aspect’ of an image is the reason two images are clustered together. For example, look at the two images below. There are multiple aspects for which these images could be ‘close’ together: both are taken in the mountains, both show me doing sports, both are selfies posing with others, and both are photos where people are wearing helmets. In practice we see that if we cluster our data in the CLIP embedding space, we get clusters which are not always the clusters we wanted or expected; while this is an easy task for a human, it isn’t trivial when done automatically at scale. Interestingly enough, we found clusters like ‘people leaning onto things’, ‘people who try to look like angels’, and ‘people posing with a flag’.

This is where CoCa comes in. CoCa is a network which can automatically create captions for images. For example, the above images are captioned as:

  • a man and a woman posing for a picture on a ski slope.
  • a man and a woman standing on top of a mountain.

We can see that the captions are far from perfect, but at this point we don’t really care. We can at least ‘explain’ to a certain extent what is in a photo, and can do so for each photo in a cluster. CoCa is built on top of the embeddings which CLIP generates. This is great for our use-case, as it means that clusters in the CLIP embedding space can be automatically described using CoCa.
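For illustration, captions like the two above can be generated with the open_clip implementation of CoCa; the checkpoint name below is one of its published pretrained weights, and this is a sketch rather than our exact pipeline.

```python
# Generate a caption for a single photo with CoCa (open_clip implementation).
# The checkpoint name is one of open_clip's published CoCa weights; illustrative only.
import torch
import open_clip
from PIL import Image

model, _, transform = open_clip.create_model_and_transforms(
    "coca_ViT-L-14", pretrained="mscoco_finetuned_laion2B-s13B-b90k"
)
model.eval()

image = transform(Image.open("ski_selfie.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    tokens = model.generate(image)

caption = open_clip.decode(tokens[0])
caption = caption.split("<end_of_text>")[0].replace("<start_of_text>", "").strip()
print(caption)  # e.g. "a man and a woman posing for a picture on a ski slope"
```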

However, reading these captions for every single photo is a lot of work. We want a summary of what is happening in each cluster. This is where Bumble’s open-source Buzzwords library comes in. It allows us to take all the captions in a cluster and summarise them. We noticed that this gives a reasonable description of clusters of various kinds. For example, we can assume that the cluster with the buzzwords “mountain skiing goggles” is a cluster of photos taken on a snowy mountain, and that the cluster ‘selfie standing mountains hiking sunglasses group’ is probably a collection of hiking photos.
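Buzzwords has its own API, so as a stand-in the idea can be sketched with simple TF-IDF keyword extraction over the captions of each cluster. This is only an illustration of the concept, not how Buzzwords works internally.

```python
# Illustrative stand-in for summarising clusters: take the top TF-IDF terms of the
# concatenated captions of each cluster. Not the Buzzwords API itself.
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_buzzwords(captions_per_cluster, top_n=5):
    """captions_per_cluster: dict mapping cluster id -> list of CoCa captions."""
    cluster_ids = list(captions_per_cluster)
    documents = [" ".join(captions_per_cluster[c]) for c in cluster_ids]
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(documents)
    vocab = vectorizer.get_feature_names_out()
    summaries = {}
    for row, cluster_id in enumerate(cluster_ids):
        scores = tfidf[row].toarray().ravel()
        top_terms = [vocab[i] for i in scores.argsort()[::-1][:top_n]]
        summaries[cluster_id] = " ".join(top_terms)
    return summaries  # e.g. {17: "mountain skiing goggles snow slope"}
```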

With the above descriptions one can simply label an entire cluster at once. If one needs a classifier for ‘cats vs dogs’, one can immediately search for all clusters whose descriptions contain the keywords ‘cat’ or ‘dog’.
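Continuing the sketch above, searching the cluster summaries for keywords and propagating the resulting tag to every image in the matching clusters could look like this; the variable and function names are hypothetical.

```python
# Tag whole clusters by keyword and propagate the tag to each of their images.
# Variable names continue the sketch above and are hypothetical.
def label_images_by_keyword(summaries, cluster_id_per_image, keywords, tag):
    """summaries: {cluster_id: buzzword string}; cluster_id_per_image: one cluster id per image."""
    matching_clusters = {
        cluster_id
        for cluster_id, summary in summaries.items()
        if any(keyword in summary for keyword in keywords)
    }
    return {
        image_index: tag
        for image_index, cluster_id in enumerate(cluster_id_per_image)
        if cluster_id in matching_clusters
    }

animal_labels = label_images_by_keyword(summaries, cluster_ids, ["cat", "dog"], "animal")
```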

Our experiment

One way we experimented with our dataset was by training a neural network on several classes to determine what is happening inside a photo. We chose relatively broad classes to demonstrate that our dataset captures a wide variety of photo contents. The classes are relevant to what people have in their dating photos and describe what kind of lifestyle they have. The classes we chose were:

  • “Animal”: for photos of pet lovers with their fur babies
  • “Children”: for photos people took with or around children
  • “Food/drink”: for photos taken in bars and restaurants by foodies
  • “Music”: for photos with musical instruments and at concerts
  • “Outdoor activities”: for the ones who like to be outside, for anything from skiing and hiking to lying on the beach
  • “Sport”: for photos of people doing anything from riding a bike outdoors to playing soccer in a hall
  • “Staying in”: for any activity which is performed inside, such as playing boardgames
  • “Vehicles”: for those attached to their car, van, or bike

During training we predict all labels at the same time: we phrase it as a multi-label classification problem and use a binary cross-entropy loss. Note that not every photo has to have one of these labels. In fact, most photos in our dataset don’t have any label attached to them. The most common reason is that they are selfies without the subject of the photo doing anything we could act on.
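A minimal PyTorch sketch of that setup: a ResNet50 backbone with one logit per class and a binary cross-entropy loss on multi-hot targets. The pretrained weights, optimiser, and dummy batch are assumptions made for the example.

```python
# Multi-label classification: ResNet50 with one logit per class and BCE loss.
# Pretrained weights, optimiser, and batch shapes are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

CLASSES = ["animal", "children", "food_drink", "music",
           "outdoor_activities", "sport", "staying_in", "vehicles"]

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, len(CLASSES))  # one logit per class

criterion = nn.BCEWithLogitsLoss()   # sigmoid + binary cross entropy, applied per class
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.randn(8, 3, 224, 224)      # a dummy batch of photos
targets = torch.zeros(8, len(CLASSES))    # multi-hot labels; a photo can have several, or none
targets[0, CLASSES.index("animal")] = 1.0

logits = model(images)
loss = criterion(logits, targets)
loss.backward()
optimizer.step()
```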

The dataset we created contains a million photos, which were gathered by inspecting the clusters mentioned above. We labelled 821 clusters manually, applying the tags above to each of the photos in those clusters. This gives us 163 clusters with at least one tag (some clusters get multiple tags, such as outdoor sports) and 651 clusters which clearly do not fall into any of these categories. Not every photo in the dataset is used during training, though: we only use photos which have at least one of the corresponding categories. This gives us 221,385 images, which we split into a train and test set.

The model we train is a ResNet50, trained for 50 epochs (with early stopping enabled) and evaluated on a hold-out dataset. Above you can see the performance of the model trained on 1 million images. We can see that it learns all classes reasonably well. The hardest class to learn is “staying in”. This is also one of the more diverse classes and contains a very wide range of activities, which makes it hard to generalise.

We also see continuous improvements from adding more data. Although the rule ‘more data = better’ is already a staple of machine learning, it’s good to see that more data from auto-generated clusters also keeps improving the final model’s performance. Note that in this case we don’t need to label extra data to actually get more data. Because we are labelling whole clusters, we can simply gather more unlabelled data, check whether it belongs to any cluster we already labelled, and assign the same label to those images.
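A hedged sketch of that expansion step: embed the new photos, assign each to the nearest labelled cluster centroid, and keep the assignment only when it is close enough. The file names and the similarity threshold are arbitrary placeholders.

```python
# Assign new, unlabelled photos to already-labelled clusters by nearest centroid.
# File names and the 0.85 cosine-similarity threshold are arbitrary placeholders.
import numpy as np
from sklearn.preprocessing import normalize

new_embeddings = normalize(np.load("new_clip_embeddings.npy"))    # (M, D) new photos
centroids = normalize(np.load("labelled_cluster_centroids.npy"))  # (K, D) labelled clusters

similarities = new_embeddings @ centroids.T   # cosine similarity to every centroid
best_cluster = similarities.argmax(axis=1)    # nearest labelled cluster per photo
best_score = similarities.max(axis=1)

confident = best_score >= 0.85                # only keep confident assignments
new_labels = {i: int(best_cluster[i]) for i in np.where(confident)[0]}
```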

This model is useful to us in several ways. We could use it for matching purposes (e.g. suggesting pet lovers to others with the same interest), or for feedback on profiles (e.g. “you say you like dogs, but we don’t see any pets in your photos”).

Privacy by design

A final benefit of this approach is that one can create a dataset of photos without having to look at every single photo. Looking at just a few images from each cluster gives you an idea of what the photos in the cluster represent, and one could even choose not to look at a single image and rely only on the descriptions. Naturally this is beneficial when working with privacy-sensitive photos: there is no need for anyone to look at what is happening inside every single photo if one can simply infer it from the buzzword topics. The job of image moderation can be very emotionally taxing, and simply reading what is happening in an image rather than having to see it goes a long way.

Conclusion

We presented an efficient way to create a large dataset for any computer vision application using unannotated data. Using this approach we have consistently obtained good results, even for tasks where the object to classify can be hard to spot. Although we acknowledge that there will be some noise in the data, we believe there is a large benefit in creating a large dataset in a short amount of time. Additionally, we hope more companies will be inspired to take this approach and continue to improve their processes while protecting the privacy of their user base.
