How to collect a dataset of 50 thousand images: the Neatsy startup experience

Roman Kucev · Neatsy AI
8 min read · Feb 8, 2021

Hi! I’m Roman Kucev, data scientist at Neatsy, Inc. My team and I are developing an app which helps you choose shoes online and order the correct size.

To bring this idea to life we had to train a neural network, and to train that network we had to collect and mark up 50 thousand photos of feet of different sizes. I’m going to tell you about our experience and why crowdsourcing is an awesome tool for collecting data.

Ask 200 acquaintances for help, collect 3000 photos and get blacklisted twice

We had the following task: collect a dataset of 50 thousand photos to train an algorithm to build a 3D model of a foot automatically and then determine its size, even in poor lighting and on coloured tiles. We drew up filming requirements covering a range of conditions:

1. Lighting: artificial, daylight, low-light;

2. Background: parquet, linoleum, fluffy carpet, coloured tiles (any surface would do);

3. Skin colour: from light to dark;

4. Foot angle: top view, bottom view, side view.

I had recruited people for data labeling before: prior to joining Neatsy I worked at Prisma, where I also had to collect data to train neural networks. At Prisma we were lucky because our developer Vyacheslav Tarasov was teaching at Voronezh State University and had plenty of students who could help. I don’t know how he motivated them, maybe with higher grades or something else, but in the end they sent all the necessary photos and videos and we collected the data successfully.

The Neatsy team didn’t have any access to students, so we found a different way. We posted stories on Instagram, called and texted all our friends, acquaintances and relatives, shamelessly distracted them from their work and asked them for pictures of their feet. We acted like a cross between a religious cult and a network-marketing scheme. Two of our acquaintances even blacklisted us.

Realize that it’s not that easy

Unfortunately, that strategy was a failure. Firstly, we only managed to get 200 people on board because we had a limited number of acquaintances.

Secondly, this method took too much time and effort. We had to contact each person and explain the process to them. Then we had to wait for the data, download it and check it twice. We spent about 20 minutes per person on average. It took us about 8 working days to collect 200 videos. But we needed more than that.

A quick note on why we collected videos rather than photos: a neural network trained on live data ends up with higher quality. We planned to train our network to segment feet in a video stream, which is why still frames taken from videos suited us better than standalone photos.

We collected about 3000 images overall. That was enough to test the hypothesis and build an MVP, but we needed more data to create a product that works everywhere and for everyone.

Get on the right path

After taking some knocks we decided to take a different path and handed the task to Yandex.Toloka. Here are the points that turned out to be most useful for us:

Transfer of copyright. Lawyers at international companies often worry about a clean transfer of intellectual property rights. With Toloka everything was clear and simple: the markup results belong to the client. So rest assured.

Flexible scaling. We like that Toloka works like a marketplace: when we post a task, performers do it; when we don’t, they work on other tasks and expect nothing from us. This is much more convenient than hiring a staff of annotators who may be left without enough work: sooner or later the urgent tasks run out.

A huge number of workers. Tens of thousands of people work on crowdsourcing platforms every day. We calculated that a task that would take our own team of annotators three weeks could be done by crowd performers in a day.

The cost of markup. Looking ahead, I’ll say that we got a huge number of photos in five days and spent less than $100 on it.

Now let’s cut to the chase: here is how we collected the data and what the results were.

Configure the pipeline properly

The main rule of crowdsourcing is to decompose the big task, that is, to split it into a number of small subtasks. We figured out what stages the project would consist of and divided it into four parts.

1. Video collection

First of all, we asked the tolokers to film their feet and provided them with a video example: showing beats telling. In the first version of the instruction we asked them to place a plastic card on the floor, assuming that the card in the frame would improve the accuracy of the 3D scan. It did not, so we dropped the idea, and the final set of requirements looked like this:

  • video duration — 20–60 seconds (we need about 30 seconds of video from one toloker to get 30–50 different frames from each one);
  • video must contain feet without socks and shoes, with pants rolled up to the calves;
  • room should not be too dark;
  • video must be shot from a variety of angles: we’ve instructed to change the height and angle of the camera.

The full published instruction was quite detailed, and for a good reason: the clearer you explain the task, the more accurately performers complete it and the smarter the neural network will be.

We collected 2,472 videos in a span of 2 days and 17 hours. To make the videos as diverse as possible, we set a limit on the number of tasks: one toloker could send us only one video.

The assignment used deferred acceptance: we sent the videos for review to a second group of performers, who verified that they were recorded properly. Only after that did the performers from the first group whose videos were accepted receive their money.

2. Video and photo verification

At this stage, the second group had to determine whether the content sent by the first group of tolokers matched the task. But first, these performers went through training, and only those who handled it well were given access. It worked like this: we asked them to check 30 videos whose answers we already knew and counted how many they got right. If the accuracy was above 85%, we approved them for the main task.

The instruction for this stage opened with a paragraph explaining what the data would be used for: it is very important to tell performers what they are working for. Realizing that their work is not in vain, they engage more actively and perform the tasks better.

It was also important to set up quality control rules: both the quality of the dataset and the motivation of performers depend on them. You identify and block those who perform tasks poorly, and reward those who work properly. To reward good performers, we paid more for tasks that were done better. To filter out careless performers, we used honeypots and blocked those who answered suspiciously fast. Honeypots are “test” tasks for which we already know the correct answers. Outwardly they do not differ from the rest, but the answers to them show how well a toloker performs the task.
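To make the honeypot idea concrete, here is a minimal Python sketch of the check this boils down to: compute each performer’s accuracy on tasks with known answers and flag everyone below a threshold (we used 85% for admission to the main task). The data structures and field names are hypothetical; in a real project the answers would come from Toloka’s export or API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Answer:
    performer_id: str
    label: bool                        # performer's verdict: video is OK / not OK
    is_honeypot: bool                  # True if we know the correct answer in advance
    true_label: Optional[bool] = None  # the known answer, set only for honeypot tasks

def honeypot_accuracy(answers: list[Answer]) -> dict[str, float]:
    """Share of correctly answered known-answer tasks, per performer."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for a in answers:
        if not a.is_honeypot:
            continue
        total[a.performer_id] = total.get(a.performer_id, 0) + 1
        if a.label == a.true_label:
            correct[a.performer_id] = correct.get(a.performer_id, 0) + 1
    return {pid: correct.get(pid, 0) / n for pid, n in total.items()}

def performers_to_block(answers: list[Answer], threshold: float = 0.85) -> set[str]:
    """Performers whose accuracy on known-answer tasks falls below the threshold."""
    return {pid for pid, acc in honeypot_accuracy(answers).items() if acc < threshold}
```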

We accepted 1,507 out of 2,472 videos. For each accepted video we paid $0.025, which comes to about $37.70, plus another $7.41 for reviewing all the videos, so in total we spent about $45 on this stage of the project. Pretty neat in my opinion.

3. Splitting videos into frames

The next step was to split the videos into frames. We did it automatically, without involving tolokers: I used FFmpeg, which processes images and video very quickly. We kept every tenth frame, so from 1,507 videos we got 156,576 frames, three times more images than planned.
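For the curious, here is roughly what that step looks like when driven from Python. This is a sketch rather than our production code, the folder names are made up, and the FFmpeg select filter keeps every tenth frame just as described above.

```python
import subprocess
from pathlib import Path

def extract_every_nth_frame(video_path: Path, out_dir: Path, n: int = 10) -> None:
    """Dump every n-th frame of a video as a JPEG using FFmpeg."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-hide_banner", "-loglevel", "error",
            "-i", str(video_path),
            # keep only frames whose index is a multiple of n
            "-vf", f"select=not(mod(n\\,{n}))",
            # let output timestamps follow the selected frames
            "-vsync", "vfr",
            str(out_dir / f"{video_path.stem}_%05d.jpg"),
        ],
        check=True,
    )

if __name__ == "__main__":
    for video in sorted(Path("videos").glob("*.mp4")):
        extract_every_nth_frame(video, Path("frames"))
```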

But high-quality training of a neural network needs a varied dataset: it should contain pictures that differ noticeably from each other. In our case, the dataset contained many near-duplicate frames. Leaving them in would not have improved quality; it would only have wasted money on markup. So I removed them automatically using the ImageHash library. For each image I computed a perceptual hash, a set of numbers that characterizes the image; the hashing algorithm is designed so that similar pictures get similar hashes and different pictures get different ones. After clustering the hashes, I identified all similar images and kept only one frame per cluster. At the next stage tolokers checked 57 thousand frames.
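Here is a simplified sketch of that deduplication with the ImageHash library. For brevity it uses a greedy pass (keep a frame only if its perceptual hash is far enough from every hash kept so far) instead of the clustering described above, and the distance threshold is illustrative.

```python
from pathlib import Path

from PIL import Image
import imagehash

def deduplicate(frames_dir: Path, max_distance: int = 6) -> list[Path]:
    """Keep roughly one frame per group of visually similar frames."""
    kept_paths: list[Path] = []
    kept_hashes: list[imagehash.ImageHash] = []
    for path in sorted(frames_dir.glob("*.jpg")):
        h = imagehash.phash(Image.open(path))  # 64-bit perceptual hash
        # `h - other` is the Hamming distance between two hashes
        if all(h - kh > max_distance for kh in kept_hashes):
            kept_paths.append(path)
            kept_hashes.append(h)
    return kept_paths

unique_frames = deduplicate(Path("frames"))
print(f"{len(unique_frames)} frames left after deduplication")
```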

4. Verification of frames — the final stage in Toloka

The final stage of work in Toloka was frame verification. The camera can shake during recording, leaving blurry frames in the video. Tolokers helped us discard the blurry frames and keep only the clear ones where the person’s foot is visible.

I managed the whole process in Toloka by myself. The project was quite difficult to set up at first (building the interface, writing the instructions, configuring quality control), but once it was up and running everything went on its own and did not require constant attention.

5. Bonus: additional content collection on Amazon Mechanical Turk

At that time, mostly residents of Russia and the CIS countries worked in Toloka, so we received photos of feet with fairly light skin. We wanted the neural network to handle darker skin too. This problem no longer exists: performers from India and Africa now work in Toloka as well. But back then we had to collect other skin tones through the Amazon Mechanical Turk platform. We created one task in which we asked people to record a video of their feet, upload it to a file hosting service, and send us the link. The assignment again used deferred acceptance, with our intern in charge of the verification. Comparing prices, one video from Amazon cost more than one from Toloka: $0.10 versus $0.025. As for quality, the same thing happened as with Toloka: some performers sent poorly made videos, but we rejected them and did not pay for those tasks.

Look back and draw conclusions

Here are our results:

  • collected: 156,576 frames;
  • actually used: 50,994 images;
  • time spent: 5 days;
  • money spent: $75.

We were afraid that nobody on Toloka would want to perform our strange task. Fortunately, that fear did not come true, and we got what we came for. We also worried that tolokers would cheat and do the tasks carelessly, but with the quality control tools we made sure people sent us exactly what we needed. And most importantly: we collected the necessary data in less than a week!

Recently we finally finished the development and released the application to the public. But that’s a completely different story.
