How to Create Synthetic Dataset for Computer Vision (Keypoint Detection)

A simple and quick way to generate a large dataset with the help of Python, OpenCV, Numpy and Albumentations

Alex P
15 min read · Jan 25, 2022

Training a keypoint detection model, like Keypoint RCNN, requires a dataset which contains images with objects of interest and annotations (text files with the coordinates of the objects’ keypoints and bounding boxes).

For example, in the image below you can see keypoints and bounding boxes visualized. Each object (glue tube) has two keypoints (head and tail).

Example of image with keypoints and bounding boxes

The more images a dataset contains, the better the model will train, because it will see more examples during the training process. A dataset with 200+ images is OK. A dataset with 1000+ images is much better. An excellent dataset contains 5000+ images.

Note that the dataset should not just contain a lot of images; all the images should also be as varied as possible. Objects of interest in those images should be mixed with other objects, presented in different environments, with different backgrounds, in different positions, etc.

One way to create a dataset is to create it manually. It means that we take a lot of photos, like the photo above, and annotate them by hand. This approach gives the best data, because all the photos are real, but creating such a dataset takes a lot of time.

The other way is to create a synthetic dataset automatically. With this approach, cropped photos of objects of interest are randomly scaled, rotated and added to a background by a Python script, and the annotations are created by the same script. This way we create images which are not real photos, but the objects in them look completely realistic.

An example of an image from a synthetic dataset is below:

Example of an image from synthetic dataset with keypoints and bounding boxes

The automated process lets us spend much less time creating a dataset compared to the manual process. For example, generating 1000 synthetic images with annotations can take less than an hour. That’s much faster than taking 1000 varied photos and annotating them by hand.

Below, I’ll describe all the steps of creating a synthetic dataset for keypoint detection.

I’ll show how to create a synthetic dataset with glue tubes for training Keypoint RCNN. To do this, we need the following data:

  • cropped photos and masks of the object of interest (glue tubes) in various positions + coordinates of keypoints (the 1st keypoint is the head, the 2nd keypoint is the tail) for every photo;
  • background images (just different photos from the internet);
  • cropped photos and masks of different objects (cars, chairs, guitars, etc.) which will be used as background noise to make background more complex.

I took 14 photos of a glue tube and created masks for them:

Photos and masks of objects of interest

I’ve also created 14 JSON files with coordinates of the keypoints (head and tail) for each photo. Besides coordinates, the JSON files contain the visibility of the keypoints. That is, each glue tube has 2 keypoints, head and tail, which are described in [x, y, visibility] format. All keypoints are visible (i.e. visibility=1) in this dataset.
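For example, the keypoints file for photo 5.jpg contains something like this (the exact key name may differ in the archive):

```json
{"keypoints": [[979, 103, 1], [132, 594, 1]]}
```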

I’ve collected 60 images which will be used as background. Take a look at some of these images:

Images which will be used as a background

I’ve also collected 107 images of different objects which will be used as background noise. These can be literally any objects which are not glue tubes:

Objects which will be used as background noise

Download a zip archive with the data described above from here.

Update. You can also check this short video tutorial to see how to create a mask of an object with the help of Photoshop.

Here is how to use the downloaded data to create a synthetic scene:

  • First, we will randomly choose a background image from the folder bg/ and resize it to, for example, 1920x1080.
  • Second, we will randomly pick a background noise object from the folder bg_noise/. Then we’ll randomly resize, rotate and add it to the background image.
  • We will repeat the second step several times.
  • Third, we will randomly pick an object of interest from the folder images/. Then we’ll randomly resize, rotate and add it to the background image on top of the background noise objects from the previous step.
  • We will repeat the third step several times.

The resulting random composition of objects is a synthetic scene.

A synthetic dataset consists of many such synthetic scenes.

Let’s create a script which builds such a synthetic dataset.

1. Imports

Create a new notebook in Jupyter Notebook.

First, we need to import the necessary modules:
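A minimal set of imports that covers everything used in this article looks like this:

```python
import os
import json
import time

import cv2
import numpy as np
import matplotlib.pyplot as plt
import albumentations as A
from tqdm import tqdm
```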

2. Paths to files

Unzip the downloaded data archive to the folder data/ and create lists with paths to images, masks and keypoints:
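A sketch of this cell, with a small list_files() helper added for brevity:

```python
def list_files(folder):
    # Lexicographic sorting is why '10.jpg' comes right after '1.jpg'
    # in the lists printed below.
    return sorted(os.path.join(folder, f) for f in os.listdir(folder))

files_imgs = list_files(os.path.join('data', 'images'))
files_masks = list_files(os.path.join('data', 'masks'))
files_keypoints = list_files(os.path.join('data', 'keypoints'))
files_bg_imgs = list_files(os.path.join('data', 'bg'))
files_bg_noise_imgs = list_files(os.path.join('data', 'bg_noise', 'images'))
files_bg_noise_masks = list_files(os.path.join('data', 'bg_noise', 'masks'))

print("The first five files from the sorted list of object images:", files_imgs[:5])
print("The first five files from the sorted list of object masks:", files_masks[:5])
print("The first five files from the sorted list of object keypoints:", files_keypoints[:5])
print("The first five files from the sorted list of background images:", files_bg_imgs[:5])
print("The first five files from the sorted list of background noise images:", files_bg_noise_imgs[:5])
print("The first five files from the sorted list of background noise masks:", files_bg_noise_masks[:5])
```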

Take a look at the output to better understand the structure of the created lists:

The first five files from the sorted list of object images: ['data\images\1.jpg', 'data\images\10.jpg', 'data\images\11.jpg', 'data\images\12.jpg', 'data\images\13.jpg']

The first five files from the sorted list of object masks: ['data\masks\1.png', 'data\masks\10.png', 'data\masks\11.png', 'data\masks\12.png', 'data\masks\13.png']

The first five files from the sorted list of object keypoints: ['data\keypoints\1.json', 'data\keypoints\10.json', 'data\keypoints\11.json', 'data\keypoints\12.json', 'data\keypoints\13.json']

The first five files from the sorted list of background images: ['data\bg\bg_1.jpg', 'data\bg\bg_10.jpg', 'data\bg\bg_11.jpg', 'data\bg\bg_12.jpg', 'data\bg\bg_13.jpg']

The first five files from the sorted list of background noise images: ['data\bg_noise\images\1.png', 'data\bg_noise\images\10.jpg', 'data\bg_noise\images\100.jpg', 'data\bg_noise\images\101.png', 'data\bg_noise\images\102.png']

The first five files from the sorted list of background noise masks: ['data\bg_noise\masks\1.png', 'data\bg_noise\masks\10.png', 'data\bg_noise\masks\100.png', 'data\bg_noise\masks\101.png', 'data\bg_noise\masks\102.png']

Later, our script will randomly pick object images from these lists, resize them, apply augmentations and add them to a background.

3. Images, masks and keypoints

There are several types of masks:

  • The original mask is a mask where the object area is filled with black (0,0,0) and the background area is filled with white (255,255,255).
  • A boolean mask is a mask where the object area is filled with True and the background area is filled with False.
  • A binary mask is a mask where the object area is filled with 1 and the background area is filled with 0.

For the purpose of this script, we will convert original masks to binary masks.

Here we define a function get_img_and_mask() which returns the image of the object in OpenCV format and the mask of the object in binary format:
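A minimal sketch, following the mask convention described above (object area black, background white):

```python
def get_img_and_mask(img_path, mask_path):
    # Read the photo and convert from OpenCV's default BGR to RGB.
    img = cv2.imread(img_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # In the original mask the object area is black (0) and the
    # background is white (255), so "mask == 0" selects the object.
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    mask_b = mask == 0              # boolean mask
    mask = mask_b.astype(np.uint8)  # binary mask: 1 on the object, 0 elsewhere

    return img, mask
```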

We will also define a function visualize_single_img_with_keypoints() which displays an image with an object of interest and draws the object’s keypoints on it:
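A possible implementation (circle size and color are arbitrary choices):

```python
def visualize_single_img_with_keypoints(img, keypoints, title=None):
    img_vis = img.copy()
    # Draw a filled circle for every visible keypoint.
    for x, y, visibility in keypoints:
        if visibility == 1:
            cv2.circle(img_vis, (x, y), 12, (255, 0, 0), thickness=-1)

    plt.figure(figsize=(10, 10))
    plt.imshow(img_vis)
    if title:
        plt.title(title)
    plt.axis('off')
    plt.show()
```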

3.1. Object of interest (with keypoints)

Let’s look at how the get_img_and_mask() function works:
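For example, with photo 5.jpg:

```python
img_path = os.path.join('data', 'images', '5.jpg')
mask_path = os.path.join('data', 'masks', '5.png')

print("Image file:", img_path)
print("Mask file:", mask_path)

img, mask = get_img_and_mask(img_path, mask_path)
print("\nShape of the image of the object:", img.shape)
print("Shape of the binary mask:", mask.shape)
```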

Output:

Image file: data\images\5.jpg
Mask file: data\masks\5.png

Shape of the image of the object: (735, 1111, 3)
Shape of the binary mask: (735, 1111)

Note that the width of the image is 1111 and the height is 735. Also, the image has 3 channels. That’s why the shape of the image is (735, 1111, 3). The binary mask has the same width and height, but only one channel. That’s why the shape of the binary mask is (735, 1111).

Let’s visualize keypoints on this image:
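The keypoints are loaded from the matching JSON file (the 'keypoints' key is the schema assumption from above):

```python
with open(os.path.join('data', 'keypoints', '5.json')) as f:
    keypoints = json.load(f)['keypoints']

print("Keypoints:", keypoints)
visualize_single_img_with_keypoints(img, keypoints)
```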

Output:

Keypoints: [[979, 103, 1], [132, 594, 1]]

In the image above, the first keypoint (head) has x coordinate 979, y coordinate 103, and visibility=1. The second keypoint (tail) has x coordinate 132, y coordinate 594, and visibility=1.

3.2. Background noise object (without keypoints)

Let’s get the image and mask of a random noise object with the get_img_and_mask() function:
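For example, with noise object 18:

```python
img_path = os.path.join('data', 'bg_noise', 'images', '18.jpg')
mask_path = os.path.join('data', 'bg_noise', 'masks', '18.png')

print("Image file:", img_path)
print("Mask file:", mask_path)

img_bg_noise, mask_bg_noise = get_img_and_mask(img_path, mask_path)
print("\nShape of the image of the object:", img_bg_noise.shape)
print("Shape of the binary mask:", mask_bg_noise.shape)
```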

Image file: data\bg_noise\images\18.jpg
Mask file: data\bg_noise\masks\18.png

Shape of the image of the object: (1280, 1073, 3)
Shape of the binary mask: (1280, 1073)

4. Resizing background images

The images which will be used as backgrounds have different sizes. For example: 2114x1398, 3456x5184, 1920x1440, 3264x4080, etc. Some of them are horizontal (width > height), others are vertical (height > width).

But we may want all images in the synthetic dataset to have fixed dimensions: 1920x1080 for horizontal ones, and 1080x1920 for vertical ones. To achieve this, we will resize background images with the help of the resize_img() function:
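A sketch that matches the behavior described below:

```python
def resize_img(img, desired_max, desired_min=None):
    h, w = img.shape[:2]
    longest, shortest = max(h, w), min(h, w)

    longest_new = desired_max
    if desired_min:
        shortest_new = desired_min
    else:
        # Keep the aspect ratio when the shortest side isn't fixed.
        shortest_new = int(shortest * (desired_max / longest))

    if h > w:
        h_new, w_new = longest_new, shortest_new
    else:
        h_new, w_new = shortest_new, longest_new

    # cv2.resize takes the target size as (width, height).
    return cv2.resize(img, (w_new, h_new))
```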

Let’s look at how this function works:
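For example (whichever background photo you pick, the logic is the same; this one happens to be vertical):

```python
img_bg = cv2.imread(files_bg_imgs[0])
img_bg = cv2.cvtColor(img_bg, cv2.COLOR_BGR2RGB)
print("Shape of the original background image:", img_bg.shape)

img_bg_resized_1 = resize_img(img_bg, desired_max=1920)
img_bg_resized_2 = resize_img(img_bg, desired_max=1920, desired_min=1080)
print("Shape of the resized background image (desired_max=1920, desired_min=None):", img_bg_resized_1.shape)
print("Shape of the resized background image (desired_max=1920, desired_min=1080):", img_bg_resized_2.shape)
```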

Output:

Shape of the original background image: (3068, 2454, 3)
Shape of the resized background image (desired_max=1920, desired_min=None): (1920, 1535, 3)
Shape of the resized background image (desired_max=1920, desired_min=1080): (1920, 1080, 3)

You can see that the function finds out which side of the image (width or height) is the longest and resizes the image to desired_max along the longest side. If desired_min is not set, the shortest side is resized proportionally; otherwise, the image is resized to desired_min along the shortest side.

5. Resizing and transforming objects of interest (with keypoints)

The function resize_transform_obj() for resizing and transforming objects is similar to the function for resizing background images, but it has some additions.

Function resize_transform_obj() resizes the image of the object and the binary mask of the object. Also, transforms from the albumentations library can be passed to the function as an argument. Coordinates of keypoints are adjusted during resizing and transforms as well.
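Here is a sketch of both pieces. A.RandomRotate90 is my guess for the rotation (a 90-degree rotation would explain the swapped height and width in the output below), and the brightness/contrast limits are illustrative:

```python
def resize_transform_obj(img, mask, longest_min, longest_max,
                         transforms=None, keypoints=None):
    # Pick a random target size for the longest side and scale the
    # shortest side proportionally.
    h, w = mask.shape[0], mask.shape[1]
    longest, shortest = max(h, w), min(h, w)
    longest_new = np.random.randint(longest_min, longest_max + 1)
    shortest_new = int(shortest * (longest_new / longest))
    if h > w:
        h_new, w_new = longest_new, shortest_new
    else:
        h_new, w_new = shortest_new, longest_new

    if keypoints:
        # Albumentations tracks (x, y) pairs, so the visibility flag is
        # stripped here and restored after the transforms.
        kp_params = A.KeypointParams(format='xy', remove_invisible=False)
        kps_xy = [kp[:2] for kp in keypoints]

        resize = A.Compose([A.Resize(h_new, w_new)], keypoint_params=kp_params)
        t = resize(image=img, mask=mask, keypoints=kps_xy)
        img_t, mask_t, kps_t = t['image'], t['mask'], t['keypoints']
        if transforms:
            t = transforms(image=img_t, mask=mask_t, keypoints=kps_t)
            img_t, mask_t, kps_t = t['image'], t['mask'], t['keypoints']

        keypoints_t = [[int(x), int(y), kp[2]] for (x, y), kp in zip(kps_t, keypoints)]
        return img_t, mask_t, keypoints_t

    t = A.Resize(h_new, w_new)(image=img, mask=mask)
    img_t, mask_t = t['image'], t['mask']
    if transforms:
        t = transforms(image=img_t, mask=mask_t)
        img_t, mask_t = t['image'], t['mask']
    return img_t, mask_t

# Rotation plus narrow-range brightness/contrast changes for the
# objects of interest; keypoint_params lets keypoints pass through.
transforms_obj = A.Compose([
    A.RandomRotate90(p=1),
    A.RandomBrightnessContrast(brightness_limit=0.1, contrast_limit=0.15, p=1),
], keypoint_params=A.KeypointParams(format='xy', remove_invisible=False))
```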

In the code above, a complex transform transforms_obj is defined. This transform rotates the image and changes contrast & brightness in a narrow range. It will be used to transform objects of interest.

It’s possible to add more options to the transform. Read the albumentations documentation to find out how to do it.

Let’s look at how the function resize_transform_obj() works:
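For example (the size limits are arbitrary):

```python
img_t, mask_t, keypoints_t = resize_transform_obj(img, mask,
                                                  longest_min=600,
                                                  longest_max=1000,
                                                  transforms=transforms_obj,
                                                  keypoints=keypoints)
print("Shape of the image of the transformed object:", img_t.shape)
print("Shape of the transformed binary mask:", mask_t.shape)
```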

Output:

Shape of the image of the transformed object: (983, 650, 3)
Shape of the transformed binary mask: (983, 650)

You’ve seen this image and mask earlier, but now the shape of the image is (983, 650, 3) instead of (735, 1111, 3). Also, the image is rotated and the brightness is higher than before. This is how transforms work.

Let’s visualize keypoints on the transformed image:
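The same helper from step 3 works here:

```python
visualize_single_img_with_keypoints(img_t, keypoints_t)
```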

6. Resizing and transforming background noise objects (without keypoints)

Here we will define a function resize_transform_bg_obj() which transforms noise objects. The difference between resize_transform_obj() and the new function is that the new function doesn’t transform keypoints, because background noise objects don’t have any.
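A sketch that simply delegates to resize_transform_obj(), plus a wide-range transform (the exact augmentations and their limits are illustrative):

```python
def resize_transform_bg_obj(img, mask, longest_min, longest_max, transforms=None):
    # Same resizing logic as resize_transform_obj(), just without keypoints.
    return resize_transform_obj(img, mask, longest_min, longest_max,
                                transforms=transforms, keypoints=None)

# Wide-range transforms for noise objects: rotations, flips, blur,
# color shifts and brightness/contrast changes.
transforms_bg_obj = A.Compose([
    A.RandomRotate90(p=1),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.Blur(blur_limit=15, p=0.5),
    A.RGBShift(r_shift_limit=50, g_shift_limit=50, b_shift_limit=50, p=1),
    A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.3, p=1),
])
```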

In the code above, a complex transform transforms_bg_obj is defined. This transform rotates the image, flips it, adds blur, and changes colors, contrast & brightness in a wide range. It will be used to transform background noise objects.

Let’s look at how the function resize_transform_bg_obj() works:
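For example:

```python
img_bg_noise_t, mask_bg_noise_t = resize_transform_bg_obj(img_bg_noise, mask_bg_noise,
                                                          longest_min=600, longest_max=1000,
                                                          transforms=transforms_bg_obj)
plt.imshow(img_bg_noise_t)
plt.axis('off')
plt.show()
```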

You’ve seen this image and mask earlier, but now the image is rotated and the brightness is higher than before.

7. Adding objects of interest (with keypoints) to background

7.1. Adding one object

Here we’ll define a function add_obj() which adds an object of interest to the background:
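A simplified sketch. It assumes (x, y) is the top-left corner of the pasted object and that both are non-negative (the keypoint shift in the outputs below is consistent with this convention); anything sticking out past the right or bottom edge is clipped:

```python
def add_obj(img_comp, mask_comp, keypoints_comp, img, mask, keypoints, x, y, idx):
    h_comp, w_comp = img_comp.shape[:2]
    h, w = img.shape[:2]

    # Clip the pasted region so it stays inside the composition.
    h_part = min(h, h_comp - y)
    w_part = min(w, w_comp - x)

    mask_b = mask[:h_part, :w_part] == 1
    mask_rgb_b = np.stack([mask_b, mask_b, mask_b], axis=2)

    # Copy object pixels over the background where the mask is 1,
    # and write the object's id into the composition mask there.
    roi = img_comp[y:y + h_part, x:x + w_part]
    img_comp[y:y + h_part, x:x + w_part] = np.where(mask_rgb_b,
                                                    img[:h_part, :w_part],
                                                    roi)
    mask_comp[y:y + h_part, x:x + w_part] = np.where(mask_b, idx,
                                                     mask_comp[y:y + h_part, x:x + w_part])

    # Shift the object's keypoints into composition coordinates.
    if keypoints is not None:
        keypoints_comp.append([[kp[0] + x, kp[1] + y, kp[2]] for kp in keypoints])

    return img_comp, mask_comp, keypoints_comp
```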

Function add_obj() returns the image composition (background + added objects), the mask composition (composition of masks of the added objects) and the keypoints composition (list of keypoints of the added objects).

We will also define a function visualize_composition_with_keypoints() which displays a composition of objects of interest and draws the objects’ keypoints:
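A possible implementation; the optional bboxes argument is an addition that will come in handy in step 10:

```python
def visualize_composition_with_keypoints(img_comp, keypoints_comp, bboxes=None):
    img_vis = img_comp.copy()

    if bboxes is not None:
        for x_min, y_min, x_max, y_max in bboxes:
            cv2.rectangle(img_vis, (x_min, y_min), (x_max, y_max), (0, 255, 0), 4)

    for keypoints in keypoints_comp:
        for x, y, visibility in keypoints:
            if visibility == 1:
                cv2.circle(img_vis, (x, y), 12, (255, 0, 0), thickness=-1)

    plt.figure(figsize=(16, 9))
    plt.imshow(img_vis)
    plt.axis('off')
    plt.show()
```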

Let’s add a glue tube to the background:
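The placement coordinates here are arbitrary (the keypoints output below corresponds to x=100, y=100):

```python
img_bg = cv2.imread(files_bg_imgs[5])   # an arbitrary horizontal background
img_bg = cv2.cvtColor(img_bg, cv2.COLOR_BGR2RGB)
img_bg = resize_img(img_bg, desired_max=1920, desired_min=1080)
h, w = img_bg.shape[:2]

img_comp = img_bg.copy()
mask_comp = np.zeros((h, w), dtype=np.uint8)  # mask of the empty composition
keypoints_comp = []

img_comp, mask_comp, keypoints_comp = add_obj(img_comp, mask_comp, keypoints_comp,
                                              img, mask, keypoints,
                                              x=100, y=100, idx=1)

# Show the image composition and its mask side by side.
plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1); plt.imshow(img_comp); plt.axis('off')
plt.subplot(1, 2, 2); plt.imshow(mask_comp); plt.axis('off')
plt.show()
```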

Output:

The initial composition here is the background image img_bg.

The array mask_comp = np.zeros((h,w), dtype=np.uint8) is the mask of the initial composition. Since the initial composition is just a background image without any objects on it, its mask contains only zeros.

When the glue tube is added to img_bg, its mask is added to mask_comp by overwriting the initial values with 1 in those pixels which correspond to the added glue tube in the image composition. We assigned the number 1 to the mask of the added glue tube by passing the parameter idx=1 to the function add_obj().

The right picture above shows the composition mask: zeros are marked in dark purple, ones are marked in yellow.

Let’s look at the keypoints:

Output:

Keypoints: [[[1079, 203, 1], [232, 694, 1]]]

Let’s add a transformed glue tube:
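This time the tube is resized and transformed first, and pasted with idx=2 (coordinates again arbitrary):

```python
img_t, mask_t, keypoints_t = resize_transform_obj(img, mask,
                                                  longest_min=600, longest_max=800,
                                                  transforms=transforms_obj,
                                                  keypoints=keypoints)
img_comp, mask_comp, keypoints_comp = add_obj(img_comp, mask_comp, keypoints_comp,
                                              img_t, mask_t, keypoints_t,
                                              x=900, y=200, idx=2)

plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1); plt.imshow(img_comp); plt.axis('off')
plt.subplot(1, 2, 2); plt.imshow(mask_comp); plt.axis('off')
plt.show()
```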

Output:

This time the initial composition img_comp already contains one glue tube, so the mask of the initial composition mask_comp contains the numbers 0 and 1.

When one more glue tube is added to the composition, this tube’s mask is added to mask_comp by overwriting the initial values with 2 in those pixels which correspond to the added tube in the image composition. This time we assigned the number 2 to the mask of the added glue tube by passing the parameter idx=2 to the function add_obj().

The right picture above shows the composition mask: zeros are marked in dark purple, ones are marked in a mix of blue and green, twos are marked in yellow.

Let’s look at the keypoints:

Output:

8. Adding noise objects (without keypoints) to background

8.1. Adding one object

Here we’ll define a function add_bg_obj() which adds a noise object to the background. To understand how this function works in detail, I recommend reading the article “Adding Objects to Image in Python”.
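A sketch that reuses add_obj() (noise objects just have no keypoints to track):

```python
def add_bg_obj(img_comp, mask_comp, img, mask, x, y, idx):
    img_comp, mask_comp, _ = add_obj(img_comp, mask_comp, [],
                                     img, mask, None, x, y, idx)
    return img_comp, mask_comp
```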

Function add_bg_obj() returns the image composition (background + added objects) and the mask composition (composition of masks of the added objects).

Let’s see how it works by adding a chair to the background:
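Assuming the chair is noise object 18 from earlier (coordinates and sizes are arbitrary):

```python
img_comp_bg = cv2.imread(files_bg_imgs[5])
img_comp_bg = cv2.cvtColor(img_comp_bg, cv2.COLOR_BGR2RGB)
img_comp_bg = resize_img(img_comp_bg, desired_max=1920, desired_min=1080)
mask_comp_bg = np.zeros(img_comp_bg.shape[:2], dtype=np.uint8)

img_chair, mask_chair = resize_transform_bg_obj(img_bg_noise, mask_bg_noise,
                                                longest_min=600, longest_max=600)
img_comp_bg, mask_comp_bg = add_bg_obj(img_comp_bg, mask_comp_bg,
                                       img_chair, mask_chair,
                                       x=600, y=300, idx=1)
```

Passing transforms=transforms_bg_obj to resize_transform_bg_obj() in the same snippet produces the transformed variant shown next.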

Output:

Let’s add a transformed chair:

Output:

8.2. Adding many objects

We want a dataset with backgrounds as varied as possible. Varied backgrounds are good for the training process of the keypoint detection neural network. But we have only 60 background images, which is not much if we are going to create a dataset of 1000 or more images.

To make backgrounds more varied, we will randomly add noise objects.

Noise objects will be added with the function create_bg_with_noise():
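A sketch of the function (the exact sampling details are a judgment call):

```python
def create_bg_with_noise(files_bg_imgs, files_bg_noise_imgs, files_bg_noise_masks,
                         bg_max=1920, bg_min=1080, max_objs_to_add=60,
                         longest_bg_noise_min=200, longest_bg_noise_max=1000,
                         blank_bg=False):
    if blank_bg:
        img_bg = np.ones((bg_min, bg_max, 3), dtype=np.uint8) * 255
    else:
        idx = np.random.randint(len(files_bg_imgs))
        img_bg = cv2.imread(files_bg_imgs[idx])
        img_bg = cv2.cvtColor(img_bg, cv2.COLOR_BGR2RGB)
        img_bg = resize_img(img_bg, bg_max, bg_min)

    mask_bg = np.zeros(img_bg.shape[:2], dtype=np.uint8)
    h, w = img_bg.shape[:2]

    num_objs = np.random.randint(1, max_objs_to_add + 1)
    for i in range(1, num_objs + 1):
        idx = np.random.randint(len(files_bg_noise_imgs))
        img, mask = get_img_and_mask(files_bg_noise_imgs[idx],
                                     files_bg_noise_masks[idx])
        img, mask = resize_transform_bg_obj(img, mask,
                                            longest_bg_noise_min,
                                            longest_bg_noise_max,
                                            transforms=transforms_bg_obj)
        # Pick a random position where the object still fits.
        x = np.random.randint(max(w - img.shape[1], 1))
        y = np.random.randint(max(h - img.shape[0], 1))
        img_bg, mask_bg = add_bg_obj(img_bg, mask_bg, img, mask, x, y, i)

    return img_bg
```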

Here is the description of parameters:

  • files_bg_imgs is a list with paths to background images;
  • files_bg_noise_imgs is a list with paths to noise object images;
  • files_bg_noise_masks is a list with paths to noise object masks;
  • bg_max and bg_min are the target sizes of the longest and the shortest sides of the background image;
  • max_objs_to_add is the maximum number of noise objects to be added to the background;
  • longest_bg_noise_min and longest_bg_noise_max are the minimum and maximum sizes of the longest side of noise objects; longest_bg_noise_max should be less than bg_min, and longest_bg_noise_min should be at least 30;
  • blank_bg should be True if we want the background to be plain white instead of a random image.

Let’s look at how this function works if we set a white background:
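For instance, with up to 20 noise objects:

```python
img_bg = create_bg_with_noise(files_bg_imgs, files_bg_noise_imgs, files_bg_noise_masks,
                              max_objs_to_add=20, blank_bg=True)
plt.figure(figsize=(16, 9))
plt.imshow(img_bg)
plt.axis('off')
plt.show()
```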

Output:

Noise objects added to the background randomly. The background here is plain white.

This time we will choose a random image for the background:
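The only difference is the blank_bg flag:

```python
img_bg = create_bg_with_noise(files_bg_imgs, files_bg_noise_imgs, files_bg_noise_masks,
                              max_objs_to_add=20)
plt.figure(figsize=(16, 9))
plt.imshow(img_bg)
plt.axis('off')
plt.show()
```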

Output:

Noise objects added to the background randomly. The background here is a random image.

Note that each time the function create_bg_with_noise() is called, we get a new composition of noise objects, because they are chosen and placed over the background randomly.

9. Controlling degree of overlapping

A newly added object of interest can partially overlap previously added objects of interest in the composition. Sometimes it can cover a significant part of another object, like 60% or 70% of its area, or even hide it completely. We don’t want this to happen.

We might want to control the degree of overlap and keep it below 20% or 30%. Or we might want our objects of interest not to overlap at all.

Let’s define a function check_overlapping() which checks whether any of the previously added objects is overlapped by more than the overlap_degree threshold:
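A compact sketch based on the description below:

```python
def check_overlapping(mask_comp, obj_areas, overlap_degree=0.3):
    # obj_areas[i] is the original (un-overlapped) area of the object
    # with id i+1; the newly added object is not in the list, so it is
    # not checked against itself.
    for idx, area_orig in enumerate(obj_areas, start=1):
        area_visible = np.count_nonzero(mask_comp == idx)
        if area_visible / area_orig < 1 - overlap_degree:
            return False  # this object lost too much of its visible area
    return True
```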

After a new object is added to the composition, this function compares the areas of the non-overlapped parts of the previously added objects with the original areas of those objects. If any of the previously added objects is overlapped by more than overlap_degree, the function returns False. If all of the previously added objects are overlapped by no more than overlap_degree, or not overlapped at all, the function returns True.

The parameter mask_comp is the composition of masks after a new object is added.

The parameter obj_areas is a list of the objects’ original areas, in order of their addition, as if they were not overlapped. This list shouldn’t include the newly added object when it is passed to the check_overlapping() function.

10. Creating synthetic composition

Here we will define a function create_composition() which creates a synthetic composition of objects:
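A sketch that ties the previous pieces together. It reads from the global file lists created in step 2 and uses the same 'keypoints' JSON key assumed earlier:

```python
def create_composition(img_comp_bg, max_objs=5, overlap_degree=0.2,
                       longest_min=300, longest_max=700,
                       max_attempts_per_obj=10):
    img_comp = img_comp_bg.copy()
    h, w = img_comp.shape[:2]
    mask_comp = np.zeros((h, w), dtype=np.uint8)
    keypoints_comp = []
    obj_areas = []

    obj_idx = 1
    for _ in range(1 + np.random.randint(max_objs)):
        for _ in range(max_attempts_per_obj):
            file_idx = np.random.randint(len(files_imgs))
            img, mask = get_img_and_mask(files_imgs[file_idx],
                                         files_masks[file_idx])
            with open(files_keypoints[file_idx]) as f:
                keypoints = json.load(f)['keypoints']

            img, mask, keypoints = resize_transform_obj(img, mask,
                                                        longest_min, longest_max,
                                                        transforms=transforms_obj,
                                                        keypoints=keypoints)
            x = np.random.randint(max(w - img.shape[1], 1))
            y = np.random.randint(max(h - img.shape[0], 1))

            # Keep copies so an over-overlapping placement can be undone.
            img_prev, mask_prev = img_comp.copy(), mask_comp.copy()
            kps_prev = list(keypoints_comp)

            img_comp, mask_comp, keypoints_comp = add_obj(img_comp, mask_comp,
                                                          keypoints_comp,
                                                          img, mask, keypoints,
                                                          x, y, obj_idx)
            if check_overlapping(mask_comp, obj_areas, overlap_degree):
                obj_areas.append(np.count_nonzero(mask == 1))
                obj_idx += 1
                break
            # Too much overlap: roll back and try again.
            img_comp, mask_comp, keypoints_comp = img_prev, mask_prev, kps_prev

    return img_comp, mask_comp, keypoints_comp
```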

Here is the description of parameters:

  • img_comp_bg is the background to which objects of interest will be added.
  • max_objs is the maximum number of objects to be added.
  • longest_min and longest_max are the minimum and maximum sizes of the longest side of objects of interest.
  • overlap_degree is the threshold for how much a newly added object of interest may overlap any of the earlier added objects. If at least one of the objects is overlapped too much, the function reverts to the previous composition and tries to add the object again.
  • max_attempts_per_obj is the number of attempts the function makes to add an object without overlapping other objects by more than the threshold defined by overlap_degree.

This function returns:

  • img_comp: image with added objects of interest. In our case the objects of interest are glue tubes.
  • mask_comp: composition of masks of added objects. Background pixels have value 0, pixels of the first added object have value 1, pixels of the second added object have value 2, etc.
  • keypoints_comp: list of keypoints of added objects.

It’s possible to obtain a bounding box for each object from its mask. We will define a function create_bboxes_from_mask_comp() which returns the coordinates of the bounding boxes of the objects of interest in the form of a list:
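A sketch using np.where():

```python
def create_bboxes_from_mask_comp(mask_comp):
    bboxes = []
    # Object ids in the composition mask start at 1.
    for idx in range(1, int(mask_comp.max()) + 1):
        ys, xs = np.where(mask_comp == idx)
        if len(xs) > 0:
            bboxes.append([int(xs.min()), int(ys.min()),
                           int(xs.max()), int(ys.max())])
    return bboxes
```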

Now we are ready to generate a synthetic composition and visualize it (here we set overlap_degree=0, so glue tubes don’t overlap at all):
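For example (max_objs and the noise count are arbitrary):

```python
img_bg = create_bg_with_noise(files_bg_imgs, files_bg_noise_imgs, files_bg_noise_masks,
                              max_objs_to_add=15)
img_comp, mask_comp, keypoints_comp = create_composition(img_bg,
                                                         max_objs=4,
                                                         overlap_degree=0,
                                                         max_attempts_per_obj=10)
plt.figure(figsize=(16, 9))
plt.imshow(img_comp)
plt.axis('off')
plt.show()
```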

Output:

Synthetic composition of objects (glue tubes are added randomly above background)

Let’s visualize keypoints and bounding boxes:
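Using the bboxes argument of the visualization helper:

```python
bboxes = create_bboxes_from_mask_comp(mask_comp)
print("Keypoints:", keypoints_comp)
visualize_composition_with_keypoints(img_comp, keypoints_comp, bboxes)
```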

Output:

Keypoints: [[[473, 652, 1], [266, 72, 1]], [[1564, 716, 1], [1571, 283, 1]], [[862, 423, 1], [1164, 745, 1]]]
Keypoints and bounding boxes

11. Creating and saving synthetic dataset

We have written the Python code which creates synthetic images and masks. Now we will write a function which also creates annotations for the images.

First, create the folders dataset/train/images/, dataset/train/annotations/, dataset/valid/images/, dataset/valid/annotations/ where the function generate_dataset() will save images and annotations.

Here is the function which creates a dataset:
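A sketch of generate_dataset(); the per-image parameters are illustrative, and the printed messages mirror the output below:

```python
def generate_dataset(imgs_number, folder='dataset', split='train'):
    time_start = time.time()
    for j in tqdm(range(imgs_number)):
        img_bg = create_bg_with_noise(files_bg_imgs,
                                      files_bg_noise_imgs,
                                      files_bg_noise_masks,
                                      max_objs_to_add=20)
        img_comp, mask_comp, keypoints_comp = create_composition(
            img_bg, max_objs=4, overlap_degree=0.2, max_attempts_per_obj=10)

        # OpenCV writes BGR, so convert back before saving.
        img_comp_bgr = cv2.cvtColor(img_comp, cv2.COLOR_RGB2BGR)
        cv2.imwrite(os.path.join(folder, split, 'images', f'{j}.jpg'),
                    img_comp_bgr)

        annotations = {'bboxes': create_bboxes_from_mask_comp(mask_comp),
                       'keypoints': keypoints_comp}
        with open(os.path.join(folder, split, 'annotations', f'{j}.json'), 'w') as f:
            json.dump(annotations, f)

    time_elapsed = int(time.time() - time_start)
    print(f"Generation of {imgs_number} synthetic images is completed. "
          f"It took {time_elapsed} seconds, or {time_elapsed / imgs_number:.1f} seconds per image")
    print(f"Images are stored in '{os.path.join(folder, split, 'images')}'")
    print(f"Annotations are stored in '{os.path.join(folder, split, 'annotations')}'")
```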

Let’s create a dataset of 1000 training images and 200 validation images:
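Two calls do the job:

```python
generate_dataset(1000, folder='dataset', split='train')
generate_dataset(200, folder='dataset', split='valid')
```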

Output:

100%|██████████████████████████████████████████████████████████████████████████████| 1000/1000 [17:13<00:00,  1.03s/it]
Generation of 1000 synthetic images is completed. It took 1033 seconds, or 1.0 seconds per image
Images are stored in 'dataset\train\images'
Annotations are stored in 'dataset\train\annotations'
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [03:19<00:00, 1.00it/s]
Generation of 200 synthetic images is completed. It took 199 seconds, or 1.0 seconds per image
Images are stored in 'dataset\valid\images'
Annotations are stored in 'dataset\valid\annotations'

Now we have a synthetic dataset and are ready to train a keypoint detection model!

In my case, it took about 20 minutes to generate a dataset of 1200 images on a PC with an Intel Core i7-10700K processor and 32GB of RAM. One synthetic image was generated in about 1 second.

I also took 23 photos with different compositions of glue tubes and annotated them by hand. We can use these real photos to test the quality of the keypoint detection model after training.

Here you can download the whole dataset of 1000 synthetic training images, 200 synthetic validation images, and 23 real test images.

Here is a GitHub repository and notebook with all the steps described above.

12. Example from synthetic dataset

Let’s look at a random image from the generated synthetic dataset:

Example of an image from synthetic dataset

Here is what the related JSON file with annotations looks like:

{"bboxes": [[1257, 475, 1901, 603], [199, 154, 637, 463]], "keypoints": [[[1318, 530, 1], [1874, 547, 1]], [[249, 413, 1], [597, 198, 1]]]}

We can see that there are two glue tubes in the image. The annotation file for this image contains the coordinates of two bounding boxes. There are also the coordinates of two keypoints for each glue tube (thus, there are four keypoints in total in this image).

Let’s visualize the annotations:
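A short snippet that reads an image and its annotation file and reuses the visualization helper (the file number here is an arbitrary pick):

```python
img_path = os.path.join('dataset', 'train', 'images', '42.jpg')
ann_path = os.path.join('dataset', 'train', 'annotations', '42.json')

img = cv2.cvtColor(cv2.imread(img_path), cv2.COLOR_BGR2RGB)
with open(ann_path) as f:
    ann = json.load(f)

visualize_composition_with_keypoints(img, ann['keypoints'], ann['bboxes'])
```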

Example of an image from synthetic dataset with visualized keypoints and bounding boxes

The keypoints and bounding boxes are in the right places. This gives us visual confirmation that our script works properly.

13. Training and testing keypoint detection model

I’ve used the generated synthetic dataset to train a Keypoint RCNN model.

Next, I used the trained model to detect keypoints of glue tubes in a real-time video stream from a camera. Here is the result:

Other in-depth articles about computer vision:

How to Train a Custom Keypoint Detection Model with PyTorch: tutorial on how to fine-tune Keypoint RCNN.

How to Create Synthetic Dataset for Object Detection: a simple and quick way to generate a large dataset with the help of Python, OpenCV, Numpy and Albumentations.

How to Train an Ensemble of Convolutional Neural Networks for Image Classification: tutorial on how to create an ensemble of DenseNet161, ResNet152 and VGG19 for classification of TinyImageNet.


Alex P

Machine learning engineer, computer vision enthusiast