All the ways to acquire and label image data in 2022

Anton Maltsev
6 min read · Aug 18, 2022


After my last video on dataset generation, I was asked several times how to prepare datasets for computer vision correctly. The last time I wrote a guide on this was about three years ago, and since then many things have changed or become more productive. So I decided to update my collection of general ideas and methods.

Image by the Author through Dall-e 2

People often think there are four ways to collect datasets:

  • Buying datasets
  • Open datasets
  • Collecting datasets
  • Synthetics

These items look distinct, but in reality there is no clear boundary between them, and each one has many sub-items. For example, where does "re-labeling an open dataset" or "collecting open data" belong?

I'll look at dataset collection strategies as individual items and give more examples and structure within them. If there's anything I've forgotten, write in the comments!

Data collection

Organizational part

Photo by Jason Goodman on Unsplash

Technically, you can organize data collection in one of four ways:

  1. Buying data. A lot of data is for sale, usually for popular tasks such as face datasets or license plate datasets. But the rarer or more specialized the task, the harder it is to buy such a dataset.
  2. Hiring a data provider. We have worked with many teams that collect and label data, and we've hired individuals. These datasets are built using the same approaches discussed below, but you outsource this part and don't think about it. The main problem with this approach is that there will not be enough feedback.
  3. Self-collection / collection by your team. I've helped organize data labeling teams within 3–4 companies. But you have to do team management, which is time-consuming.
    In my opinion, it only pays off in two cases: either you have LOTS of data (like Tesla), or you have very particular data. This is something I see a lot in medicine.
  4. Synthetics. We'll talk more about synthetics below. I'm not a big fan of them: good synthetics usually take more time than collecting a real dataset. But there are tasks where they make sense.

Technical part

Before you read this part, a little disclaimer: we will talk about parsing sites and data. This is a sensitive topic, and it is regulated differently in different countries. Before using the methods presented here, don't forget to consult your lawyers.

Let’s talk about basic approaches to data collection:

Open datasets. There are thousands of datasets available on the Internet: faces, cars, tracking, pizza, hot dogs, that kind of thing. It seems to me that everyone who has been in the profession for 3–4 years has helped create at least one such dataset; I think even I've posted 3–4 of them. You can find a lot of them.
Not all such datasets are labeled as well as they should be. Sometimes they need to be re-labeled.
An example of a good dataset is Open Images.

Parsing sites. Open datasets flow smoothly into "parsing sites." For example, Open Images is partially parsed from Flickr, and I know many companies that parsed data from Flickr. Let's systematize the ways you can parse:

  • Parsing by already existing tags (Flickr, stock photo sites, etc.)
  • Downloading images in bulk and filtering them with another model's classifications
  • Selenium. Writing a script that browses pages and saves images. Such approaches are used for social networks.
  • Manual parsing (Google search, etc.). This also lets you filter for precisely your type of data.
Image by the Author through Dall-e 2
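As a toy illustration of the tag-based approach, here is a hedged sketch of filtering parsed image metadata by tags; the record layout and URLs below are made up for the example, not taken from any real API:

```python
# Minimal sketch: keep only parsed records whose tags match what we want.
# Record format is hypothetical; real APIs (Flickr etc.) return richer metadata.

def filter_by_tags(records, wanted_tags):
    """Keep records whose tag set intersects the wanted tags (case-insensitive)."""
    wanted = {t.lower() for t in wanted_tags}
    return [r for r in records if wanted & {t.lower() for t in r["tags"]}]

records = [
    {"url": "https://example.com/1.jpg", "tags": ["dog", "park"]},
    {"url": "https://example.com/2.jpg", "tags": ["pizza"]},
    {"url": "https://example.com/3.jpg", "tags": ["Dog", "couch"]},
]

dog_images = filter_by_tags(records, ["dog"])
print([r["url"] for r in dog_images])
```

In a real pipeline this filter would run before the (slow, rate-limited) download step, so you only fetch images you actually want.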

A few examples:

  1. The most famous example is the Kaggle contest on recognizing which camera sensor a photo was taken with. At the time, participants were parsing all the images with the relevant tags from the original devices (mainly from Flickr).
  2. I know three face recognition companies that parse VK and Instagram to collect faces for training neural networks.
  3. The open-source Dall-e (ruDall-e) has an interesting leakage that tells you what was in the training dataset:
Image generated by OpenSource ruDall-e

Parsing open cameras. There are sites where you can find ready-made video broadcasts, for example, YouTube or Ivideon TV. And then everything is classic: either you collect the whole video, or you collect frames by triggers.

I've seen data downloaded from YouTube by tags, e.g., from sports match recordings.
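The trigger idea can be sketched in a few lines. Here is a minimal motion trigger, assuming frames arrive as NumPy arrays; in a real pipeline they would come from something like OpenCV's VideoCapture:

```python
import numpy as np

# Minimal sketch of trigger-based frame collection from a stream:
# keep a frame only when it differs enough from the last kept one.

def collect_triggered(frames, threshold=10.0):
    """Return frames whose mean absolute difference from the last
    kept frame exceeds the threshold (the first frame is always kept)."""
    kept = []
    last = None
    for frame in frames:
        if last is None or np.abs(frame.astype(float) - last.astype(float)).mean() > threshold:
            kept.append(frame)
            last = frame
    return kept

# Toy stream: mostly static frames with one sudden change.
static = np.zeros((4, 4), dtype=np.uint8)
changed = np.full((4, 4), 200, dtype=np.uint8)
stream = [static, static, changed, changed]
print(len(collect_triggered(stream)))  # the first frame plus the change
```

This kind of filter keeps the collected dataset from being dominated by thousands of near-identical frames of an empty scene.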

Collecting from production. This is the most challenging part; I would need 2–3 separate articles to cover all the variations. There is usually too much data, and it's hard to find enough diverse examples. Or the data is hard to get. Building an automatic collection system properly is unique to each project. And in many ways, that's an amazing piece of the product.

Gathering through crowdsourcing platforms or hiring actors. These are websites where you can generate tasks for workers. For example, this dataset was collected entirely through a crowdsourcing platform.

Synthetic data

Image by the Author through Dall-e 2

Separately, we should talk about synthetics.

  • I know a lot of projects where a complete synthesis of the environment was built on top of 3D editors such as Unity or Unreal. An excellent example of such a system is Omniverse from Nvidia.
    But remember the downsides! This kind of synthesis is costly: even the most straightforward systems take several months to develop.
  • Simple synthetics unique to the task. Sometimes it is possible to make synthetics in a simple way. For example, here is an excellent example of generating datasets for license plate detection and recognition: you can train a basic detector by pasting plate numbers onto random pictures.
    It is necessary to understand that such synthetics will give worse quality than training on an actual dataset. But it takes only 1–2 days to write, and sometimes it is a good starting point.
  • Generation by another neural network. It seems to me that this approach is the future. I made a video on this topic.
  • Augmentations. We should not forget that a small dataset can be stretched with augmentations. The simple way is through Albumentations. The hard way is through some custom-made augmentation like Pose Transfer.
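The paste-onto-random-backgrounds trick for a basic detector can be sketched roughly like this; grayscale NumPy arrays stand in for real images, and the "plate" here is just a bright rectangle rather than a rendered number:

```python
import numpy as np

# Minimal sketch of synthetic detection data: paste a patch onto a random
# background at a random position and record its bounding box as the label.

rng = np.random.default_rng(0)

def paste_patch(background, patch):
    """Paste `patch` at a random position; return the image and an
    (x, y, w, h) bounding box."""
    H, W = background.shape
    h, w = patch.shape
    y = rng.integers(0, H - h + 1)
    x = rng.integers(0, W - w + 1)
    out = background.copy()
    out[y:y + h, x:x + w] = patch
    return out, (int(x), int(y), w, h)

bg = rng.integers(0, 255, size=(64, 64), dtype=np.uint8)
plate = np.full((8, 24), 255, dtype=np.uint8)  # stand-in for a rendered plate
img, box = paste_patch(bg, plate)
```

A real generator would add perspective distortion, blur, and varied backgrounds, but even this crude version gives you unlimited labeled samples for bootstrapping.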

Labeling

Photo by Raymond Rasmusson on Unsplash

After the dataset is collected, it has to be marked up somehow. Let’s go over the basic approaches:

Use ready-made labels from open-source datasets. Very often, the data you collect is already labeled.

Outsourced labeling. As with data collection, many companies help with outsourced labeling.

Semi-automatic labeling. Even if a model is not working perfectly, it can help you with labeling. This functionality is often built into labeling services (RoboFlow, Supervisely, etc.). But it's not hard to do it in CVAT as well.
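A minimal sketch of the semi-automatic idea: keep confident model predictions as pre-annotations and queue the rest for a human. The `fake_model` below is a stand-in for any real classifier, not part of any service's API:

```python
# Minimal sketch of confidence-gated pre-annotation for human review.

def preannotate(images, model, min_conf=0.8):
    """Split images into (accepted pre-annotations, needs human review)."""
    accepted, needs_review = [], []
    for name in images:
        label, conf = model(name)
        if conf >= min_conf:
            accepted.append((name, label))
        else:
            needs_review.append(name)
    return accepted, needs_review

def fake_model(name):
    # Stand-in for a trained classifier: (label, confidence).
    return ("cat", 0.95) if "cat" in name else ("dog", 0.4)

acc, review = preannotate(["cat_01.jpg", "img_02.jpg"], fake_model)
```

The point is that annotators only verify or correct the model's guesses instead of labeling from scratch, which is usually several times faster.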

Using a priori information. Sometimes the labels can be obtained as part of an employee's normal workflow. For example, use doctors' diagnoses as the labels. We often use such approaches to control robots, where the control signal is the annotation.

Manual labeling. 90% of labeling tasks (even if outsourced) will be done by plain manual annotation. And there are many approaches to doing it:

  • High-tier annotation tools, such as V7, Supervisely, etc.
  • Simple annotation tools, like CVAT or LabelStudio.
  • Self-written ones. For many tasks, writing your own labeling software may still make sense if you have non-standard data or some internal environment.
    Sometimes you can take someone else's annotation tool and tweak it a bit. For example, when skeleton recognition models first started to appear, many people built their own annotators for their skeleton models on top of existing tools.
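If you do write your own annotator, you still want a standard export format so other tools can read the result. Here is a hedged sketch of dumping boxes into a simplified COCO-like JSON; real COCO files have more required fields (licenses, `area`, `iscrowd`, etc.):

```python
import json

# Minimal sketch of exporting boxes from a self-written annotator in a
# simplified COCO-like layout.

def to_coco(image_boxes, category="plate"):
    """image_boxes: {filename: [(x, y, w, h), ...]} -> COCO-like dict."""
    images, annotations = [], []
    for img_id, (filename, boxes) in enumerate(image_boxes.items(), start=1):
        images.append({"id": img_id, "file_name": filename})
        for box in boxes:
            annotations.append({
                "id": len(annotations) + 1,
                "image_id": img_id,
                "bbox": list(box),  # COCO convention: [x, y, width, height]
                "category_id": 1,
            })
    return {
        "images": images,
        "annotations": annotations,
        "categories": [{"id": 1, "name": category}],
    }

dataset = to_coco({"a.jpg": [(10, 20, 30, 40)], "b.jpg": []})
print(json.dumps(dataset, indent=2)[:60])
```

Sticking close to an existing format means CVAT, FiftyOne, and most training pipelines can ingest your annotations without custom converters.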

I hope this little guide helps you structure the different approaches a bit and find the way that works for you.

You can always ask me questions on my LinkedIn (or subscribe and read other stories). And don't miss my YouTube channel.
