Build your own celebrity dataset

Published in Toloka Tech, May 17, 2021

One day I got a request from the product team. They urgently needed a neural network that could tell Kirkorov from Face (two different singers).

So I asked our head of neural networks what data format we would need. The answer was two folders (one for each singer), around 500 images per class, all 299x299. Now the challenge was where to find the data. Here’s a good plan for finding datasets:

1. Check public datasets such as ImageNet, COCO, and Open Images.
2. If these sources don’t have the labeled data you need, try searching the web. Go to arxiv.org and look through related papers. If you’re lucky, you’ll find a link to a usable dataset.
3. If both efforts fail, you will need to create a dataset from scratch.

We can safely assume that no one has previously undertaken the task of classifying Kirkorov and Face. We decided to create this dataset ourselves. This is what our data pipeline looked like:

1. Download 1,000 images of Kirkorov and Face.
2. Resize the images.
3. Verify the images with crowd performers in Toloka.

For the download, we used Google Images Download.

Our terminal command looked like this:

In reality, it’s not quite that simple because you can only download 100 images per query, so you need to perform multiple downloads with different settings.
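One way to get past the 100-image limit is to repeat the same query with a different filter each time, for example the color filter. The sketch below only builds the command lines and does not run them. The `--color`, `--prefix`, and `--output_directory` flags are written as I recall them from the google_images_download README, so check them against your installed version. The list of colors and the `downloads` folder are assumptions made for illustration.

```python
import shlex

# Color filters to vary between queries (an assumed choice; any filter works).
COLORS = ["red", "green", "blue", "pink", "black", "white"]


def build_commands(keyword, colors=COLORS, limit=100):
    """Return one googleimagesdownload command line per color filter."""
    commands = []
    for color in colors:
        args = [
            "googleimagesdownload",
            "--keywords", keyword,
            "--limit", str(limit),             # at most 100 images per query
            "--color", color,                  # vary the filter to get new results
            "--prefix", f"{keyword}-{color}",  # keeps files from different queries apart
            "--output_directory", "downloads",
        ]
        commands.append(shlex.join(args))
    return commands


for command in build_commands("Kirkorov"):
    print(command)
```

Each printed line can be pasted into a terminal. Different queries may return some of the same pictures, so expect duplicates.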

We resized the images to 299x299 right away:
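A minimal sketch of such a resize step, assuming Pillow is installed. The text above doesn't say whether the images were cropped or stretched. This version takes the largest centered square before scaling, which is an assumption. The function and folder names are hypothetical.

```python
from pathlib import Path

TARGET = 299  # side length required by the network


def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def resize_folder(src_dir, dst_dir, size=TARGET):
    """Crop every readable image in src_dir to a square and save it at size x size."""
    from PIL import Image  # Pillow; imported lazily so center_crop_box has no dependencies

    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).iterdir()):
        try:
            with Image.open(path) as original:
                square = original.convert("RGB").crop(center_crop_box(*original.size))
                square.resize((size, size)).save(dst / (path.stem + ".jpg"), "JPEG")
        except OSError:
            # Skip anything Pillow cannot read (broken downloads, non-image files).
            continue
```

Calling `resize_folder("downloads/Kirkorov", "resized/Kirkorov")` would process one class folder. Unreadable files are skipped rather than stopping the run.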

This completes the first two steps, and all that’s left is verification. Looking at the downloaded images, we can see that most of them are correct, but there is some garbage.

We needed to check the images quickly, so we went straight to Toloka.

Toloka is a crowdsourcing platform (like a freelance marketplace on steroids) where the requester creates a task, uploads the data for it, and gives it to crowd performers. There are other data labeling platforms like this, but Toloka is probably the most convenient one for this type of task.

How to get started with Toloka:

1. Register as a requester on toloka.ai (you should also register on their sandbox to create tasks and make sure they work properly for the performers before going live).

2. Top up your Toloka account.

3. Go to Projects and click “Create a project”.

4. You can choose from a variety of preset templates. The “Image classification” template worked well for our task.

5. Add instructions for the performers and name the project.

6. Set the input and output parameters: the input is the image URL and the output is a string. The task interface is written in HTML and JavaScript, which is super convenient because it means that we can create an interface page for almost any type of task. We adjusted the HTML template ever so slightly for our project. When you’re ready, save the project.

7. Add a task pool to the project. I didn’t create an additional training pool because I hoped that performers would complete the task correctly on their first try. However, I really recommend creating a training pool. This allows you to:

Filter out performers who didn’t understand the task.
Make sure bots who always choose the same answer aren’t admitted to the task.
Teach performers how to complete the task well.

Name the pool (the name will only be visible to you) and configure the other settings. Here are the settings we used for our task:

Price: $0.01 per task page. Toloka adds a commission of $0.005.
Time for task completion: 10 minutes.
Overlap: 3 (this means that each photo will be shown to three different people).
We turned on “Adult content” because who knows what we downloaded from the internet.
We also activated “Non-automatic acceptance” to make sure we wouldn’t be paying for poorly performed tasks.

Next, set up user filters. Our task was only available to performers from Russia who speak Russian. Each performer has their own rating, and we selected only the ones who have high ratings. We chose the top 80% of performers.

The quality control section is the most important and most challenging part to master. It would take a separate article to go into all the details, but I recommend looking into it yourself. For our project, I chose the fast responses constraint. If a performer responds in under 40 seconds 3 out of 5 times, they get banned automatically.
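To make the rule concrete, here is a small local simulation of it. This is not Toloka's API or its actual implementation. It is only a sketch that checks whether at least 3 of the last 5 response times fall under 40 seconds. The function name and parameters are hypothetical.

```python
from collections import deque


def should_ban(response_times, min_seconds=40, window=5, max_fast=3):
    """Return True once `max_fast` of the last `window` responses took less than `min_seconds`."""
    recent = deque(maxlen=window)  # keeps only the most recent `window` flags
    for seconds in response_times:
        recent.append(seconds < min_seconds)
        if sum(recent) >= max_fast:
            return True
    return False


# Three of the five pages were submitted in under 40 seconds.
print(should_ban([10, 60, 12, 70, 15]))
# Only an occasional fast response, never 3 within any 5 consecutive pages.
print(should_ban([10, 60, 70, 80, 90, 100, 12]))
```

The first call reports a ban and the second does not. A careful performer who is occasionally quick stays below the threshold.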

Now we’re almost at the finish line. Save the pool.

8. Prepare the input data. Here’s how we did it:

In the created pool, we downloaded the sample file.
We uploaded our images to our server so that performers could access them via links.
We inserted the image URLs in the “INPUT:image” column of the file.

Example:

INPUT:image
http://kucev.ru/Kirkorov/Kirkorov-pink_17_ 21787_l_jpg.jpg
http://kucev.ru/Kirkorov/Kirkorov-green_82_ hqdefault_jpg.jpg
http://kucev.ru/Kirkorov/Kirkorov-pink_25_ hqdefault_jpg.jpg

We uploaded the prepared TSV file to Toloka.
Then we set how many images to show to a performer at once.
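The input file from the steps above can also be generated with a short script instead of being edited by hand. This is a sketch under assumptions: `BASE_URL` is a placeholder for wherever the images are hosted, and the file names are made up. Only the `INPUT:image` header comes from the example above.

```python
import csv
import io
from urllib.parse import quote

BASE_URL = "https://example.com/images"  # hypothetical server; replace with your own


def build_tsv(file_names, base_url=BASE_URL):
    """Return TSV text with an INPUT:image header and one image URL per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["INPUT:image"])
    for name in file_names:
        # quote() percent-encodes spaces and other characters that are unsafe in URLs.
        writer.writerow([f"{base_url}/{quote(name)}"])
    return buffer.getvalue()


tsv_text = build_tsv(["Kirkorov/photo_1.jpg", "Face/photo 2.jpg"])
print(tsv_text)
```

Writing `tsv_text` to a `.tsv` file gives something that can be uploaded to the pool. Percent-encoding the file names avoids links that break on spaces.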

If you don’t get any errors after uploading your TSV file, it’s time to launch the pool.

9. You need to test your task yourself to make sure that it works correctly. To do this, you can create another account on Toloka (but this time as a performer). In your requester account, add your newly created performer account to the list of trusted performers.

After launching the pool, you can see your task from the performer account. Click Start to view the task.

This is what our page looked like for the performers:

Make sure that everything works right and there aren’t any bugs.

10. If you are happy with the result, move the project from the sandbox to the live version of Toloka by choosing Export.

In the live Toloka you can see that your project was exported successfully. Open it and double check to make sure everything is correct. Just a warning: once I made a mistake when creating a task and 2,000 performers couldn’t submit their results. You don’t want to repeat my experience!

11. Now you can launch the pool and wait for results to come in.

Our results: It took 20 minutes to get 1,159 photos labeled. Each task page displayed 40 photos, for a total of 29 task pages (1,159 / 40 ≈ 29, rounding up). With an overlap of 3, 87 pages were shown to performers overall (29 x 3 = 87). We paid $0.01 + $0.005 for each task page, so we spent about $1.30 in total (87 x $0.015) to check 1,159 images.
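The same arithmetic as a small helper, handy for estimating a budget before launching a pool. The function name is hypothetical. The numbers are the ones from our pool settings.

```python
import math


def labeling_cost(photos, photos_per_page, overlap, price, commission):
    """Return (unique pages, pages shown to performers, total cost in dollars)."""
    pages = math.ceil(photos / photos_per_page)  # the last page may be partly filled
    shown_pages = pages * overlap                # every page goes to `overlap` performers
    total = shown_pages * (price + commission)
    return pages, shown_pages, total


pages, shown_pages, total = labeling_cost(
    photos=1159, photos_per_page=40, overlap=3, price=0.01, commission=0.005
)
print(pages, shown_pages, f"${total:.3f}")
```

Doubling the overlap or halving the page size doubles the number of paid pages, so these two settings drive the budget.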

12. Finally, double-check the tasks and download the results.

We opened the results file in pandas and grouped by “INPUT:image”.

Using the majority vote method and assuming that the image shows Kirkorov if at least two performers voted this way, we ended up with a total of 634 + 57 = 691 verified images of Kirkorov.
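The aggregation itself is short. We used pandas, but the sketch below uses plain dictionaries to stay dependency-free. It assumes each row of the results file has been reduced to an (image URL, answer) pair. The answer values `kirkorov` and `other` are hypothetical labels, not the ones from our task.

```python
from collections import Counter, defaultdict


def majority_vote(rows, min_votes=2):
    """Return {image: label} for images whose top label received at least `min_votes` votes."""
    votes = defaultdict(Counter)
    for image, label in rows:
        votes[image][label] += 1

    verified = {}
    for image, counts in votes.items():
        label, count = counts.most_common(1)[0]
        if count >= min_votes:  # with an overlap of 3, two votes is a majority
            verified[image] = label
    return verified


rows = [
    ("img_1.jpg", "kirkorov"), ("img_1.jpg", "kirkorov"), ("img_1.jpg", "kirkorov"),
    ("img_2.jpg", "kirkorov"), ("img_2.jpg", "other"), ("img_2.jpg", "kirkorov"),
    ("img_3.jpg", "other"), ("img_3.jpg", "other"), ("img_3.jpg", "kirkorov"),
]
result = majority_vote(rows)
kirkorov_images = [image for image, label in result.items() if label == "kirkorov"]
print(kirkorov_images)
```

Here `img_1.jpg` is accepted unanimously and `img_2.jpg` by two votes out of three, while `img_3.jpg` is rejected as a Kirkorov image.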

Pros and cons of using Toloka in your data pipeline

Toloka offers several advantages for this type of project:

A flexible platform that you can adapt to any task.
Very inexpensive.
A huge number of performers available to label large amounts of data within a relatively short timeframe.
Handy tools for managing performers, which results in high-quality labeling.

Toloka also has one major drawback: you can’t use it for personal data, because performers don’t sign NDAs and sharing such data with them would violate GDPR requirements.

Written by Roman Kucev