How to Create Synthetic Dataset for Computer Vision (Object Detection)

A simple and quick way to generate a large dataset with the help of Python, OpenCV, Numpy and Albumentations

Alex P
16 min read · Jan 10, 2022

Training an object detection model, like YOLOv5, requires a dataset which contains images with objects of interest and annotations (text files with coordinates of objects’ bounding boxes).

For example, in the image below you can see visualized bounding boxes. Each bounding box indicates an object of interest belonging to a specific class: battery (red), lightbulb (green), padlock (blue).

Example of image with bounding boxes

The more images the dataset contains, the better the model will train, because it will see more examples during the training process. A dataset with 200+ images is OK, a dataset with 1000+ images is much better, and an excellent dataset contains 5000+ images.

Note that the dataset should not just contain a lot of images; all the images should be as varied as possible. Objects of interest in those images should be mixed with other objects, presented in different environments, with different backgrounds, in different positions, etc.

One way is to create the dataset manually. That means taking a lot of photos, like the photo above, and annotating them by hand. This approach gives the best data, because all the photos are real, but building such a dataset takes a lot of time.

The other way is to create a synthetic dataset automatically. With this approach, cropped photos of objects of interest are randomly scaled, rotated and added to backgrounds by a Python script. Annotations are created by the same script. This way we create images which are not exactly real photos, but the objects and backgrounds in these images are 100% realistic.

An example of an image from the synthetic dataset is below:

Example of an image from the synthetic dataset: initial background photo (top left), background photo with automatically added objects (bottom left), bounding boxes of automatically added objects (bottom right), masks of automatically added objects (top right).

The automated process takes much less time than the manual one. For example, generating 1000 synthetic images with annotations can take less than an hour, which is much faster than taking 1000 varied photos and annotating them by hand.

Below, I'll describe all the steps of creating a synthetic dataset for object detection.

I'll show how to create a synthetic dataset with batteries, lightbulbs and padlocks for training YOLOv5. To do this, we need the following data:

  • cropped photos and masks of the objects of interest (batteries, lightbulbs, padlocks) in various positions;
  • background images (just different photos from the internet);
  • cropped photos and masks of different objects (cars, chairs, guitars, etc.) which will be used as background noise to make background more complex.

I took 26 photos of batteries, 23 photos of lightbulbs and 21 photos of padlocks, and created masks for these objects:

Photos and masks of objects of interest

I collected 30 images which will be used as backgrounds. Take a look at some of them:

Images which will be used as a background

I also collected 107 images of different objects which will be used as background noise. These can be literally any objects that are not a battery, a lightbulb or a padlock:

Objects which will be used as background noise

Download a zip archive with the data described above from here.

Update. You can also check this short video tutorial to see how to create an object mask with the help of Photoshop.

Here is how to use the downloaded data to create a synthetic scene:

  • First, we will randomly choose a background image from the folder bg/ and resize it to, for example, 1920x1080.
  • Second, we will randomly pick a background noise object from the folder bg_noise/. Then we'll randomly resize, rotate and add it to the background image.
  • We will repeat the second step several times.
  • Third, we will randomly pick an object of interest from the folders battery/, lightbulb/, padlock/. Then we'll randomly resize, rotate and add it to the background image on top of the background noise objects from the previous step.
  • We will repeat the third step several times.

The obtained random composition of objects is a synthetic scene.

A synthetic dataset consists of many such synthetic scenes.

Let's write a script which creates such a synthetic dataset.

1. Imports

Create a new notebook in Jupyter Notebook.

First, we need to import the necessary modules:
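
Here is a minimal set of imports that covers everything used below (this assumes the opencv-python, numpy, albumentations, matplotlib and tqdm packages are installed):

import os
import time

import cv2
import numpy as np
import albumentations as A
import matplotlib.pyplot as plt
from tqdm import tqdm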

2. Paths to files

Unzip the downloaded archive into the folder data/ and create lists with paths to the images and masks:
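
A sketch of this step, assuming the folder layout of the unzipped archive. The dictionary obj_dict (including the 'longest_min' and 'longest_max' bounds discussed below) is reconstructed from how it is referenced later in the article:

obj_dict = {
    1: {'folder': "battery",   'longest_min': 150, 'longest_max': 800},
    2: {'folder': "lightbulb", 'longest_min': 150, 'longest_max': 800},
    3: {'folder': "padlock",   'longest_min': 150, 'longest_max': 800},
}

PATH_MAIN = "data"

# Sorted lists of image and mask paths for each class of objects of interest
for k in obj_dict.keys():
    folder_name = obj_dict[k]['folder']
    obj_dict[k]['images'] = sorted(
        os.path.join(PATH_MAIN, folder_name, "images", f)
        for f in os.listdir(os.path.join(PATH_MAIN, folder_name, "images")))
    obj_dict[k]['masks'] = sorted(
        os.path.join(PATH_MAIN, folder_name, "masks", f)
        for f in os.listdir(os.path.join(PATH_MAIN, folder_name, "masks")))

# Sorted lists of background images and noise object images/masks
files_bg_imgs = sorted(
    os.path.join(PATH_MAIN, "bg", f)
    for f in os.listdir(os.path.join(PATH_MAIN, "bg")))
files_bg_noise_imgs = sorted(
    os.path.join(PATH_MAIN, "bg_noise", "images", f)
    for f in os.listdir(os.path.join(PATH_MAIN, "bg_noise", "images")))
files_bg_noise_masks = sorted(
    os.path.join(PATH_MAIN, "bg_noise", "masks", f)
    for f in os.listdir(os.path.join(PATH_MAIN, "bg_noise", "masks")))

print("The first five files from the sorted list of battery images:", obj_dict[1]['images'][:5])
print("The first five files from the sorted list of battery masks:", obj_dict[1]['masks'][:5])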

Take a look at the output to better understand the structure of the created lists:

The first five files from the sorted list of battery images: ['data\battery\images\1.png', 'data\battery\images\10.png', 'data\battery\images\11.png', 'data\battery\images\12.png', 'data\battery\images\13.png']

The first five files from the sorted list of battery masks: ['data\battery\masks\1.png', 'data\battery\masks\10.png', 'data\battery\masks\11.png', 'data\battery\masks\12.png', 'data\battery\masks\13.png']

The first five files from the sorted list of background images: ['data\bg\bg_1.jpg', 'data\bg\bg_10.jpg', 'data\bg\bg_11.jpg', 'data\bg\bg_12.jpg', 'data\bg\bg_13.jpg']

The first five files from the sorted list of background noise images: ['data\bg_noise\images\1.png', 'data\bg_noise\images\10.jpg', 'data\bg_noise\images\100.png', 'data\bg_noise\images\101.jpg', 'data\bg_noise\images\102.png']

The first five files from the sorted list of background noise masks: ['data\bg_noise\masks\1.png', 'data\bg_noise\masks\10.png', 'data\bg_noise\masks\100.png', 'data\bg_noise\masks\101.png', 'data\bg_noise\masks\102.png']

Later, our script will have a block of code which randomly picks an object image from these lists, resizes it, applies augmentations, and adds it to the background.

Also, to set the lower and upper bounds for resizing object images, we set 'longest_min': 150, 'longest_max': 800 for each object of interest in the dictionary obj_dict. It means that the longest side of the object image will be not less than 150px and not more than 800px. You can set other numbers, but I would suggest the lower bound to be at least 30, and the upper bound should be less than both the height and the width of the background.

3. Images and masks

There are several types of masks:

  • The original mask is a mask where the object area is filled with black (0,0,0) and the background area is filled with white (255,255,255).
  • A boolean mask is a mask where the object area is filled with True and the background area is filled with False.
  • A binary mask is a mask where the object area is filled with 1 and the background area is filled with 0.

For the purpose of this script, we will convert original masks to binary masks.

Here we define a function which returns the image of the object in OpenCV format and the mask of the object in binary format:
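
A sketch of such a function. It relies on the convention described above: in the original masks the object area is black, so the object pixels are the ones equal to zero:

def get_img_and_mask(img_path, mask_path):
    # Read the object photo and convert from BGR (OpenCV default) to RGB
    img = cv2.imread(img_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # Read the original mask; the object area is black (0,0,0)
    mask = cv2.imread(mask_path)

    mask_b = mask[:, :, 0] == 0        # boolean mask: True where the object is
    mask = mask_b.astype(np.uint8)     # binary mask: 1 where the object is

    return img, mask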

Let's see how this function works:
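
For example, with the first padlock photo (the exact indices and paths are illustrative):

img_path = obj_dict[3]['images'][0]
mask_path = obj_dict[3]['masks'][0]
img, mask = get_img_and_mask(img_path, mask_path)

print("Image file:", img_path)
print("Mask file:", mask_path)
print("Shape of the image of the object:", img.shape)
print("Shape of the binary mask:", mask.shape)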

Output:

Image file: Data\Padlock\images\1.png
Mask file: Data\Padlock\masks\1.png
Shape of the image of the object: (962, 847, 3)
Shape of the binary mask: (962, 847)

Note that the width of the image is 847, and the height is 962. Also, the image has 3 channels. That's why the shape of the image is (962, 847, 3). The binary mask has the same width and height, but only one channel, so its shape is (962, 847).

4. Resizing background images

Images which will be used as backgrounds come in different sizes, for example 2114x1398, 3456x5184, 1920x1440, 3264x4080, etc. Some of them are horizontal (width > height), others are vertical (height > width).

But we may want all images in the synthetic dataset to have fixed dimensions: 1920x1080 for horizontal ones and 1080x1920 for vertical ones. To achieve this, we will resize background images with the help of the resize_img() function:
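
A sketch of resize_img(), using an albumentations Resize transform for the actual resizing:

def resize_img(img, desired_max, desired_min=None):
    h, w = img.shape[0], img.shape[1]

    longest, shortest = max(h, w), min(h, w)
    longest_new = desired_max
    if desired_min:
        shortest_new = desired_min
    else:
        # Keep the aspect ratio if desired_min is not set
        shortest_new = int(shortest * (desired_max / longest))

    if h > w:
        h_new, w_new = longest_new, shortest_new
    else:
        h_new, w_new = shortest_new, longest_new

    # Resize to the new height and width
    transform_resize = A.Resize(h_new, w_new, p=1)
    return transform_resize(image=img)["image"]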

Let's see how this function works:
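
A call along these lines (which background file you pick is illustrative):

img_bg = cv2.imread(files_bg_imgs[0])
print("Shape of the original background image:", img_bg.shape)

img_bg_r1 = resize_img(img_bg, desired_max=1920)
print("Shape of the resized background image (desired_max=1920, desired_min=None):", img_bg_r1.shape)

img_bg_r2 = resize_img(img_bg, desired_max=1920, desired_min=1080)
print("Shape of the resized background image (desired_max=1920, desired_min=1080):", img_bg_r2.shape)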

Output:

Shape of the original background image: (3068, 2454, 3)
Shape of the resized background image (desired_max=1920, desired_min=None): (1920, 1535, 3)
Shape of the resized background image (desired_max=1920, desired_min=1080): (1920, 1080, 3)

You can see that the function finds out which side of the image (width or height) is the longest and resizes the image to desired_max along the longest side. If desired_min is not set, the shortest side is resized proportionally; otherwise the image is resized to desired_min along the shortest side.

5. Resizing and transforming objects

The function resize_transform_obj() for resizing and transforming objects is similar to the function for resizing background images, but with some additions.

resize_transform_obj() resizes both the image of the object and its binary mask. Also, transforms from the albumentations library can be passed to the function as an argument:
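
A sketch of resize_transform_obj() together with the two transforms. The specific augmentations and their ranges below are assumptions; the idea is an aggressive pipeline for noise objects and a mild one for objects of interest:

def resize_transform_obj(img, mask, longest_min, longest_max, transforms=False):
    h, w = mask.shape[0], mask.shape[1]

    # Pick a random target size for the longest side of the object
    longest, shortest = max(h, w), min(h, w)
    longest_new = np.random.randint(longest_min, longest_max)
    shortest_new = int(shortest * (longest_new / longest))

    if h > w:
        h_new, w_new = longest_new, shortest_new
    else:
        h_new, w_new = shortest_new, longest_new

    # Resize the image and its binary mask together
    transformed = A.Resize(h_new, w_new, p=1)(image=img, mask=mask)
    img_t, mask_t = transformed["image"], transformed["mask"]

    # Optionally apply extra augmentations to both image and mask
    if transforms:
        transformed = transforms(image=img_t, mask=mask_t)
        img_t, mask_t = transformed["image"], transformed["mask"]

    return img_t, mask_t

# Aggressive transform for background noise objects
transforms_bg_obj = A.Compose([
    A.RandomRotate90(p=1),
    A.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.07, p=1),
    A.Blur(blur_limit=(3, 15), p=0.5),
])

# Mild transform for objects of interest
transforms_obj = A.Compose([
    A.RandomRotate90(p=1),
    A.RandomBrightnessContrast(brightness_limit=0.1, contrast_limit=0.15, p=1),
])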

In the code above two complex transforms are defined:

  • transforms_bg_obj rotates the image, adds blur, and changes colors, contrast and brightness over a wide range. This aggressive transform will be used for background noise objects.
  • transforms_obj rotates the image and changes contrast and brightness within a narrow range. This mild transform will be used for objects of interest.

It's possible to add more options to the transforms. Read the albumentations documentation to find out how.

Let's see how the function resize_transform_obj() works:
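
Using the padlock image and mask from the earlier step (the size bounds here are illustrative):

img_t, mask_t = resize_transform_obj(img, mask, longest_min=300, longest_max=400,
                                     transforms=transforms_obj)

print("Shape of the image of the transformed object:", img_t.shape)
print("Shape of the transformed binary mask:", mask_t.shape)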

Output:

Shape of the image of the transformed object: (335, 381, 3)
Shape of the transformed binary mask: (335, 381)

You've seen this image and mask earlier, but now the shape of the image is (335, 381, 3) instead of (962, 847, 3). Also, the image is rotated and the brightness is higher than before. This is how the transforms work.

6. Adding object to background

Here we'll define the function add_obj() which adds an object to the background. To understand how this function works in detail, I recommend reading the article "Adding Objects to Image in Python".

The function add_obj() returns the image composition (background + added objects), the mask composition (composition of masks of the added objects), and the mask of the last added object.
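
A simplified sketch of add_obj(). For brevity it only handles placements where the object's top-left corner stays inside the background; the full version, covered in the linked article, also clips objects sticking out over the top and left edges:

def add_obj(img_comp, mask_comp, img, mask, x, y, idx):
    # (x, y) is where the center of the object will be placed;
    # idx is the number written into mask_comp at the object's pixels
    h_comp, w_comp = img_comp.shape[0], img_comp.shape[1]
    h, w = img.shape[0], img.shape[1]

    x = x - int(w / 2)
    y = y - int(h / 2)

    # Clip the part of the object sticking out over the bottom/right edges
    h_part = h - max(0, y + h - h_comp)
    w_part = w - max(0, x + w - w_comp)

    mask_b = mask == 1
    mask_rgb_b = np.stack([mask_b, mask_b, mask_b], axis=2)

    # Copy the object's pixels onto the composition where the mask is True
    img_comp[y:y+h_part, x:x+w_part, :] = (
        img_comp[y:y+h_part, x:x+w_part, :] * ~mask_rgb_b[:h_part, :w_part, :]
        + (img * mask_rgb_b)[:h_part, :w_part, :])

    # Write idx into the mask composition at the object's pixels
    mask_comp[y:y+h_part, x:x+w_part] = (
        mask_comp[y:y+h_part, x:x+w_part] * ~mask_b[:h_part, :w_part]
        + (idx * mask_b)[:h_part, :w_part])

    mask_added = mask[:h_part, :w_part]
    return img_comp, mask_comp, mask_added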

Let's see how it works by adding a padlock to the background:
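
For instance (the coordinates and size bounds here are illustrative):

img_bg = cv2.imread(files_bg_imgs[0])
img_bg = cv2.cvtColor(img_bg, cv2.COLOR_BGR2RGB)
img_bg = resize_img(img_bg, desired_max=1920, desired_min=1080)

h, w = img_bg.shape[0], img_bg.shape[1]
mask_comp = np.zeros((h, w), dtype=np.uint8)  # mask of the initial composition

img, mask = get_img_and_mask(obj_dict[3]['images'][0], obj_dict[3]['masks'][0])
img, mask = resize_transform_obj(img, mask, longest_min=300, longest_max=400,
                                 transforms=transforms_obj)

img_comp, mask_comp, _ = add_obj(img_bg, mask_comp, img, mask, x=800, y=600, idx=1)

plt.figure(figsize=(15, 10))
plt.subplot(1, 2, 1); plt.imshow(img_comp)
plt.subplot(1, 2, 2); plt.imshow(mask_comp)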

Output:

The initial composition here is the background image img_bg.

The array mask_comp = np.zeros((h,w), dtype=np.uint8) is the mask of the initial composition. Since the initial composition is just a background image without any objects, its mask contains only zeros.

When the padlock is added to img_bg, its mask is added to mask_comp by overwriting the initial values with 1 at those pixels which correspond to the added padlock in the image composition. We've chosen the number 1 for the mask of the added padlock by passing the parameter idx=1 to the function add_obj().

The right picture above shows the composition mask: zeros are marked in dark purple, ones are marked in yellow.

Let's add a padlock one more time:
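
Again with illustrative coordinates, but now with idx=2:

img, mask = get_img_and_mask(obj_dict[3]['images'][1], obj_dict[3]['masks'][1])
img, mask = resize_transform_obj(img, mask, longest_min=300, longest_max=400,
                                 transforms=transforms_obj)

img_comp, mask_comp, _ = add_obj(img_comp, mask_comp, img, mask, x=1200, y=400, idx=2)

plt.figure(figsize=(15, 10))
plt.subplot(1, 2, 1); plt.imshow(img_comp)
plt.subplot(1, 2, 2); plt.imshow(mask_comp)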

Output:

This time the initial composition img_comp already contains one padlock, so the mask of the initial composition mask_comp contains the numbers 0 and 1.

When one more padlock is added to the composition, its mask is added to mask_comp by overwriting the initial values with 2 at those pixels which correspond to the added padlock in the image composition. This time we chose the number 2 for the mask of the added padlock by passing the parameter idx=2 to the function add_obj().

The right picture above shows the composition mask: zeros are marked in dark purple, ones in a mix of blue and green, twos in yellow.

7. Adding noise objects to background

We want a dataset with backgrounds as varied as possible, because varied backgrounds are good for training an object detection neural network. But we have only 30 background images, which is not much if we are going to create a dataset of 1000 or more images.

To make backgrounds more varied, we will randomly add noise objects.

Noise objects will be added with the function create_bg_with_noise():
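
A sketch of create_bg_with_noise(), built from the functions defined above; the default parameter values are assumptions:

def create_bg_with_noise(files_bg_imgs, files_bg_noise_imgs, files_bg_noise_masks,
                         bg_max=1920, bg_min=1080, max_objs_to_add=60,
                         longest_bg_noise_min=200, longest_bg_noise_max=1000,
                         blank_bg=False):
    if blank_bg:
        # Plain white canvas instead of a photo
        img_comp_bg = np.ones((bg_min, bg_max, 3), dtype=np.uint8) * 255
    else:
        idx = np.random.randint(len(files_bg_imgs))
        img_bg = cv2.imread(files_bg_imgs[idx])
        img_bg = cv2.cvtColor(img_bg, cv2.COLOR_BGR2RGB)
        img_comp_bg = resize_img(img_bg, bg_max, bg_min)

    h, w = img_comp_bg.shape[0], img_comp_bg.shape[1]
    mask_comp_bg = np.zeros((h, w), dtype=np.uint8)  # throwaway mask

    # Add a random number of randomly transformed noise objects
    for i in range(1 + np.random.randint(max_objs_to_add)):
        idx = np.random.randint(len(files_bg_noise_imgs))
        img, mask = get_img_and_mask(files_bg_noise_imgs[idx], files_bg_noise_masks[idx])
        img_t, mask_t = resize_transform_obj(img, mask,
                                             longest_bg_noise_min, longest_bg_noise_max,
                                             transforms=transforms_bg_obj)
        # Keep the object's center far enough from the top/left edges
        x = np.random.randint(img_t.shape[1] // 2, w - 1)
        y = np.random.randint(img_t.shape[0] // 2, h - 1)
        img_comp_bg, _, _ = add_obj(img_comp_bg, mask_comp_bg, img_t, mask_t, x, y, i + 1)

    return img_comp_bg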

Here is the description of parameters:

  • files_bg_imgs is a list with paths to background images;
  • files_bg_noise_imgs is a list with paths to noise objects images;
  • files_bg_noise_masks is a list with paths to noise objects masks;
  • bg_max and bg_min are the target sizes of the longest and the shortest sides of background image;
  • max_objs_to_add is the maximum number of noise objects to be added to the background;
  • longest_bg_noise_min and longest_bg_noise_max are the minimum and maximum sizes of the longest side of noise objects. longest_bg_noise_max should be less than bg_min, longest_bg_noise_min should be at least 30.
  • blank_bg should be True if we want the background to be plain white instead of a random image.

Let's see how this function works if we set a white background:
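
A call along these lines (the parameter values are illustrative):

img_bg = create_bg_with_noise(files_bg_imgs, files_bg_noise_imgs, files_bg_noise_masks,
                              max_objs_to_add=20, blank_bg=True)

plt.figure(figsize=(15, 10))
plt.imshow(img_bg)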

Output:

Noise objects added randomly to the background. The background here is plain white.

This time we will choose a random image for the background:

Output:

Noise objects added randomly to the background. The background here is a random image.

Note that each time the function create_bg_with_noise() is called, we get a new composition of noise objects, because they are chosen and placed over the background randomly.

8. Controlling degree of overlapping

A newly added object of interest can partially overlap previously added objects of interest in the composition. Sometimes it can cover a significant part of another object, like 60% or 70% of its area, or even hide it completely. We don't want this to happen.

We might want to control the degree of overlap and keep it below 20% or 30%. Or we might want our objects of interest not to overlap at all.

Let's define the function check_areas() which checks whether any of the previously added objects is overlapped by more than the overlap_degree threshold:
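
A sketch of check_areas(). It relies on the fact that object numbers in mask_comp grow monotonically, so the background is 0 and the largest number is the object that has just been added:

def check_areas(mask_comp, obj_areas, overlap_degree=0.3):
    # IDs present in the mask: skip background (0) and the newly added
    # object (the largest ID); the rest are previously added objects
    obj_ids = np.unique(mask_comp)[1:-1]

    # If a previously added object is completely covered, its ID is gone
    if len(obj_ids) != len(obj_areas):
        return False

    # Compare the still-visible area of each object with its original area
    for idx, obj_id in enumerate(obj_ids):
        area_visible = np.count_nonzero(mask_comp == obj_id)
        if area_visible / obj_areas[idx] < 1 - overlap_degree:
            return False

    return True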

After a new object is added to the composition, this function compares the areas of the non-overlapped parts of the previously added objects with their original areas. If any of the previously added objects is overlapped by more than overlap_degree, the function returns False. If all of the previously added objects are overlapped by not more than overlap_degree, or not overlapped at all, the function returns True.

The parameter mask_comp is the composition of masks after a new object is added.

The parameter obj_areas is a list of the objects' original areas, in order of their addition, as if they were not overlapped. This list shouldn't include the newly added object when it is passed to the check_areas() function.

9. Creating synthetic composition

Here we will define the function create_composition() which creates a synthetic composition of objects:
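
A sketch of create_composition(), wired together from the functions above; the roll-back logic is the important part:

def create_composition(img_comp_bg, max_objs=15, overlap_degree=0.2,
                       max_attempts_per_obj=10):
    img_comp = img_comp_bg.copy()
    h, w = img_comp.shape[0], img_comp.shape[1]
    mask_comp = np.zeros((h, w), dtype=np.uint8)

    obj_areas = []    # original (non-overlapped) areas, in order of addition
    labels_comp = []  # classes of added objects, in order of addition

    num_objs = np.random.randint(max_objs) + 1
    i = 1  # number written into mask_comp for the next object

    for _ in range(num_objs):
        obj_idx = np.random.randint(len(obj_dict)) + 1  # random class: 1, 2 or 3

        for _ in range(max_attempts_per_obj):
            file_idx = np.random.randint(len(obj_dict[obj_idx]['images']))
            img, mask = get_img_and_mask(obj_dict[obj_idx]['images'][file_idx],
                                         obj_dict[obj_idx]['masks'][file_idx])
            img_t, mask_t = resize_transform_obj(img, mask,
                                                 obj_dict[obj_idx]['longest_min'],
                                                 obj_dict[obj_idx]['longest_max'],
                                                 transforms=transforms_obj)
            x = np.random.randint(img_t.shape[1] // 2, w - 1)
            y = np.random.randint(img_t.shape[0] // 2, h - 1)

            # Try adding the object; roll back if any earlier object
            # gets overlapped by more than overlap_degree
            img_prev, mask_prev = img_comp.copy(), mask_comp.copy()
            img_comp, mask_comp, mask_added = add_obj(img_comp, mask_comp,
                                                      img_t, mask_t, x, y, i)
            if check_areas(mask_comp, obj_areas, overlap_degree):
                obj_areas.append(np.count_nonzero(mask_added))
                labels_comp.append(obj_idx)
                i += 1
                break
            else:
                img_comp, mask_comp = img_prev, mask_prev

    return img_comp, mask_comp, labels_comp, obj_areas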

Here is the description of parameters:

  • img_comp_bg is a background to which objects of interest will be added.
  • max_objs is the maximum number of objects to be added.
  • overlap_degree is the threshold which defines whether a randomly added object of interest overlaps any of the earlier added objects of interest by more than allowed. If at least one of the objects of interest is overlapped too much, the function rolls back to the previous composition and tries to add the object again.
  • max_attempts_per_obj is the number of attempts the function makes to add an object without overlapping other objects by more than the threshold defined by overlap_degree.

This function returns:

  • img_comp: image with the added objects of interest. In our case the objects of interest are batteries, lightbulbs and padlocks.
  • mask_comp: composition of masks of added objects. Background pixels have value 0, pixels of the first added object have value 1, pixels of the second added object have value 2, etc.
  • labels_comp: numerical representation of the classes of the added objects. For example, if objects were added in the order [lightbulb, battery, padlock, padlock, lightbulb, padlock, battery], then the array of labels would be [2, 1, 3, 3, 2, 3, 1]. This mapping of classes to numbers is defined in obj_dict at the very beginning of the script.
  • obj_areas: list of the objects' areas, in order of their addition, as if they were not overlapped.

Let’s generate a synthetic composition:
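
A call along these lines (the parameter values are illustrative):

img_comp_bg = create_bg_with_noise(files_bg_imgs, files_bg_noise_imgs,
                                   files_bg_noise_masks, max_objs_to_add=20)
img_comp, mask_comp, labels_comp, obj_areas = create_composition(
    img_comp_bg, max_objs=15, overlap_degree=0.2, max_attempts_per_obj=10)

plt.figure(figsize=(15, 10))
plt.imshow(img_comp)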

Output:

Synthetic composition of objects (batteries, lightbulbs and padlocks are added randomly above the background)

Here you can see batteries, lightbulbs and padlocks, but it’s not always easy to find them all quickly.

Let’s look at the mask of this synthetic composition:

Output:

Composition of masks of added objects

If you look at the mask composition, you can easily find all the objects. Here you can see 2 batteries, 3 lightbulbs and 4 padlocks in the composition.

Let’s look at the array of labels:

Output:

Labels (classes of the objects) on the composition in order of object's addition: [3, 1, 2, 2, 3, 2, 3, 1, 3]

Here you can see that the first added object was a padlock (class 3), then a battery (class 1) was added, and so on…

Let's also compare the original areas of the objects (without overlapping) with their areas in the composition:
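
A sketch of this comparison, counting the still-visible pixels of each object in mask_comp:

obj_ids = np.unique(mask_comp)[1:]  # object IDs present in the mask (skip background)

print("Degree of how much area of each object is overlapped:")
for idx, obj_id in enumerate(obj_ids):
    visible_area = np.count_nonzero(mask_comp == obj_id)
    print(visible_area / obj_areas[idx])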

Output:

Degree of how much area of each object is overlapped:
0.8688065237500786
0.8778115434707346
1.0
1.0
1.0
1.0
1.0
1.0
1.0

Here we see that the first added object is overlapped by 1 - 0.869 = 13.1%, and the second added object is overlapped by 1 - 0.878 = 12.2%. Both values are less than the overlap threshold of 0.2, which was passed to the function create_composition() as the parameter overlap_degree.

Also, we can see that the first added object is a padlock (the first element of the labels_comp array is 3) and the second added object is a battery (the second element is 1). If we look at the mask composition again, we can see that one padlock and one battery are overlapped by lightbulbs. This is a visual confirmation that our script works properly.

Let’s also draw bounding boxes for each added object:
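
A sketch of drawing the boxes, deriving each box from the object's pixels in mask_comp (the per-class colors are illustrative):

colors = {1: (255, 0, 0), 2: (0, 255, 0), 3: (0, 0, 255)}  # battery, lightbulb, padlock

img_with_boxes = img_comp.copy()
obj_ids = np.unique(mask_comp)[1:]
for i, obj_id in enumerate(obj_ids):
    pos = np.where(mask_comp == obj_id)
    xmin, xmax = np.min(pos[1]), np.max(pos[1])
    ymin, ymax = np.min(pos[0]), np.max(pos[0])
    cv2.rectangle(img_with_boxes, (int(xmin), int(ymin)), (int(xmax), int(ymax)),
                  colors[labels_comp[i]], thickness=4)

plt.figure(figsize=(15, 10))
plt.imshow(img_with_boxes)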

Output:

Bounding boxes

You can see that it's possible to obtain a bounding box for each object from its mask. In the image above, each class has its own color (red for batteries, green for lightbulbs, blue for padlocks).

10. Annotations in YOLO format

We have written the part of the Python script which creates synthetic images and masks. Now we will write the part which creates annotations for the images.

The YOLO format requires annotations to be stored as txt files: one txt file per image, both with the same name. Each txt file consists of several lines; one line corresponds to one bounding box and consists of five numbers: object_class x_center y_center width height.

The first number, object_class, is the number of the object's class. The YOLO format requires object classes to start from 0.

The other four numbers are the coordinates of the bounding box in x_center y_center width height format. The coordinates must be normalized (from 0 to 1): divide x_center and width by the background image width, and y_center and height by the background image height.

Here is the function which creates annotations for a synthetic scene:
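
A sketch of such a function (the name create_yolo_annotations is an assumption). It derives each bounding box from the object's pixels in mask_comp and normalizes the coordinates:

def create_yolo_annotations(mask_comp, labels_comp):
    comp_h, comp_w = mask_comp.shape[0], mask_comp.shape[1]

    obj_ids = np.unique(mask_comp)[1:]  # skip background (0)

    annotations_yolo = []
    for i, obj_id in enumerate(obj_ids):
        yolo_class = labels_comp[i] - 1  # YOLO classes start from 0

        # Bounding box of the object's pixels
        pos = np.where(mask_comp == obj_id)
        xmin, xmax = np.min(pos[1]), np.max(pos[1])
        ymin, ymax = np.min(pos[0]), np.max(pos[0])

        # Normalized center coordinates, width and height
        xc = (xmin + xmax) / 2 / comp_w
        yc = (ymin + ymax) / 2 / comp_h
        w = (xmax - xmin) / comp_w
        h = (ymax - ymin) / comp_h

        annotations_yolo.append([yolo_class,
                                 round(xc, 5), round(yc, 5),
                                 round(w, 5), round(h, 5)])

    return annotations_yolo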

The function returns a list of annotations, one per object present in mask_comp. Let's see how it works:
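
For example:

annotations_yolo = create_yolo_annotations(mask_comp, labels_comp)
for line in annotations_yolo:
    print(' '.join(str(el) for el in line))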

Output:

2 0.66042 0.78472 0.18021 0.4287
0 0.28802 0.30139 0.18333 0.58056
1 0.84297 0.82593 0.2224 0.3463
1 0.74557 0.18194 0.1151 0.225
2 0.73385 0.55556 0.04792 0.17778
1 0.24844 0.08472 0.15937 0.16944
2 0.60547 0.40463 0.07656 0.25556
0 0.38333 0.79028 0.21875 0.28241
2 0.21589 0.70602 0.08802 0.31389

Once again, it's important to note that the numeric classes of objects here start from 0 instead of 1. In the array labels_comp, 1 corresponds to battery, 2 to lightbulb and 3 to padlock. But in annotations the classes must start from 0 (a YOLO format requirement), so we decrease each number by one: in the annotations, 0 corresponds to battery, 1 to lightbulb and 2 to padlock.

11. Creating and saving synthetic dataset

YOLOv5 requires training images and annotations to be stored in the folders train/images/ and train/labels/. The validation part of the dataset should be stored in the folders valid/images/ and valid/labels/.

Here is the function which creates a dataset:
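
A sketch of generate_dataset(), assuming the folder layout described above:

def generate_dataset(imgs_number, folder='dataset', split='train'):
    time_start = time.time()

    for j in tqdm(range(imgs_number)):
        # Random background with noise objects, objects of interest on top
        img_comp_bg = create_bg_with_noise(files_bg_imgs, files_bg_noise_imgs,
                                           files_bg_noise_masks, max_objs_to_add=20)
        img_comp, mask_comp, labels_comp, _ = create_composition(
            img_comp_bg, max_objs=15, overlap_degree=0.2, max_attempts_per_obj=10)

        # Save the image (convert back to BGR for cv2.imwrite)
        img_path = os.path.join(folder, split, 'images', '{}.jpg'.format(j))
        cv2.imwrite(img_path, cv2.cvtColor(img_comp, cv2.COLOR_RGB2BGR))

        # Save the YOLO annotations, one line per bounding box
        annotations_yolo = create_yolo_annotations(mask_comp, labels_comp)
        label_path = os.path.join(folder, split, 'labels', '{}.txt'.format(j))
        with open(label_path, "w") as f:
            for line in annotations_yolo:
                f.write(' '.join(str(el) for el in line) + '\n')

    time_total = round(time.time() - time_start)
    time_per_img = round(time_total / imgs_number, 1)
    print("Generation of {} synthetic images is completed. It took {} seconds, "
          "or {} seconds per image".format(imgs_number, time_total, time_per_img))
    print("Images are stored in '{}'".format(os.path.join(folder, split, 'images')))
    print("Annotations are stored in '{}'".format(os.path.join(folder, split, 'labels')))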

Now create the folders dataset/train/images/, dataset/train/labels/, dataset/valid/images/ and dataset/valid/labels/, where the function generate_dataset() will save the images and annotations.

Let's create a dataset of 1000 training images and 200 validation images:
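
With the generate_dataset() sketch above, that is:

generate_dataset(1000, folder='dataset', split='train')
generate_dataset(200, folder='dataset', split='valid')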

Output:

100%|████████████████████████████████████████████████████████████████████████████| 1000/1000 [1:04:37<00:00,  3.88s/it]
Generation of 1000 synthetic images is completed. It took 3878 seconds, or 3.9 seconds per image
Images are stored in 'dataset\train\images'
Annotations are stored in 'dataset\train\labels'
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [12:15<00:00, 3.68s/it]
Generation of 200 synthetic images is completed. It took 735 seconds, or 3.7 seconds per image
Images are stored in 'dataset\valid\images'
Annotations are stored in 'dataset\valid\labels'

Great! Now we have a synthetic dataset and are ready to train an object detection model!

In my case, it took about 1 hour 20 minutes to generate a dataset of 1200 images on a laptop with an Intel Core i7-6700HQ processor and 8GB of RAM while some other tasks were running. One synthetic image was generated in less than 4 seconds.

I also took 43 photos with different compositions of batteries, lightbulbs and padlocks, and annotated them by hand. We can use these real photos to test the quality of the object detection model after training.

Here you can download the whole dataset of 1000 synthetic training images, 200 synthetic validation images, and 43 real test images.

Here is a GitHub repository and notebook with all the steps described above.

12. Training and testing YOLOv5 model

I used the generated dataset to train a YOLOv5x6 model in Google Colab with the following hyperparameters: image size of 1280, 4 images per batch, 10 epochs.

After training, I tested the model on the real photos. The results were pretty good (P is precision, R is recall, mAP is mean average precision):

    Class  Images  Labels      P      R  mAP@.5  mAP@.5:.95
      all      43     354  0.976  0.944   0.956       0.883
  Battery      43     133  0.944   0.88   0.895       0.774
Lightbulb      43     110  0.985  0.991   0.995       0.949
  Padlock      43     111      1   0.96   0.978       0.926

Let’s look at several test photos with detected objects:

I intentionally made some photos with partially overlapped objects, and some photos with objects in complex environments, to make it harder for the model to identify the objects of interest. But the model trained on the synthetic dataset recognized the objects in these photos quite well.

13. Importance of noise objects in synthetic scene

I want you to pay attention to this test photo:

You can see here that two erasers are identified as batteries. The model probably found some common features between erasers and batteries (the shape plus the presence of text) and mistook the erasers for batteries.

It's possible to avoid such situations by adding erasers as noise objects while generating the synthetic dataset. Then, during training, the model could adjust its weights to not mistake erasers for batteries.

Moreover, the more different kinds of noise objects we add, the better the model learns not to pay attention to them, which means fewer false positive detections.

That's why it's important to add noise objects to the background while generating synthetic scenes.

Ok, now you know how to generate a synthetic dataset for object detection.

You can substitute batteries, lightbulbs and padlocks with your own objects of interest and generate a dataset for your needs.

You can also change the format of the annotations if you need to create a dataset not for YOLOv5 but for some other object detection model.

Other in-depth articles about computer vision:

How to Create Synthetic Dataset for Keypoint Detection: a simple and quick way to generate a large dataset with the help of Python, OpenCV, Numpy and Albumentations.

How to Train a Custom Keypoint Detection Model with PyTorch: tutorial on how to fine-tune Keypoint RCNN.

How to Train an Ensemble of Convolutional Neural Networks for Image Classification: tutorial on how to create an ensemble of DenseNet161, ResNet152 and VGG19 for classification of TinyImageNet.
