How to create a custom dataset for Computer Vision

Sandali Thissera
Sysco LABS Sri Lanka
8 min read · Nov 29, 2022



Computer vision is a sub-field of Artificial Intelligence that trains computers and systems to extract information from images and videos. Common computer vision tasks include object detection, image classification, facial recognition, image segmentation, feature matching, and pattern identification. Computer vision simplifies work for humans in applications such as self-driving cars, disease diagnosis, and many more.

Object detection is one of the most popular areas of Computer Vision. Computer vision is mainly applied when we try to solve real-world problems using images and videos, and many scenarios call for custom object detection, such as manufacturing, e-commerce, and aviation.

Data annotation makes up a major part of the machine learning pipeline and is the main step of dataset creation. However, data annotation is a manual task and consumes a lot of time. Many open-source data annotation tools are available, such as:

  • CVAT (Computer Vision Annotation Tool)
  • LabelImg
  • Label Studio
  • VGG Image Annotator
  • COCO Annotator
  • Make Sense

Annotation formats supported by these tools include:

  • YOLO
  • ImageNet
  • TFRecords
  • Pascal VOC
  • VGGFace2
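
These formats mostly differ in how a bounding box is written down. As a rough illustration, the sketch below converts between the Pascal VOC convention (pixel corners xmin, ymin, xmax, ymax) and the YOLO convention (normalized center x, center y, width, height). The function names are made up for this example and are not part of any annotation tool.

```python
def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert Pascal VOC pixel corners to YOLO normalized center/size values."""
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height


def yolo_to_voc(x_center, y_center, width, height, img_w, img_h):
    """Convert YOLO normalized center/size values back to pixel corners."""
    xmin = (x_center - width / 2.0) * img_w
    ymin = (y_center - height / 2.0) * img_h
    xmax = (x_center + width / 2.0) * img_w
    ymax = (y_center + height / 2.0) * img_h
    return xmin, ymin, xmax, ymax


if __name__ == "__main__":
    # A 200x200 pixel box inside a 640x480 image.
    yolo_box = voc_to_yolo(100, 50, 300, 250, 640, 480)
    print(yolo_box)
    print(yolo_to_voc(*yolo_box, 640, 480))
```

In practice the annotation tool performs this conversion when you pick an export format, so a helper like this is mainly useful for checking exported files by hand.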

In this article, we will discuss how to create a custom dataset for object detection using CVAT.

Steps in Object Detection for Computer Vision.

1. Organize your workspace/training file

├─ models/
│  ├─ community/
│  ├─ official/
│  ├─ orbit/
│  ├─ research/
│  └─ …
└─ workspace/
   └─ training_demo/
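
If you prefer to create the workspace from a script, a minimal sketch could look like the following. Only workspace/training_demo, models, pre-trained-models and exported-models appear in this article; the annotations and images sub-folders are assumptions made for this example, so rename them to suit your project.

```python
from pathlib import Path

# Sub-folders to create inside workspace/training_demo.
# "annotations" and "images/*" are assumed names, not taken from the article.
SUBFOLDERS = [
    "annotations",
    "images/train",
    "images/test",
    "models",
    "pre-trained-models",
    "exported-models",
]


def create_workspace(root):
    """Create workspace/training_demo and its sub-folders under `root`."""
    base = Path(root) / "workspace" / "training_demo"
    for sub in SUBFOLDERS:
        # exist_ok=True makes the function safe to run more than once.
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base
```

Calling create_workspace(".") from the project root would create the folders next to the models/ directory shown above.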

2. Prepare/annotate image datasets

Image collection and Labeling

Before starting the image labeling, we need to collect our dataset. Here, we use food images as our dataset and identify the items that need labeling. CVAT (Computer Vision Annotation Tool) is a popular free and open-source interactive image and video annotation tool originally developed by Intel. It can be used online or installed on a local machine.

Here we discuss the simpler option: using the CVAT online web-based platform.

  • Annotating using CVAT online

First, create an account on the CVAT website.

Figure 1.1 — CVAT website

Once you create an account and log in, you will be redirected to the CVAT annotation tool, which looks as follows:

Figure 1.2 — CVAT annotation tool user interface
  • Steps for annotating images.

The following diagram shows the steps that need to be followed to annotate the images.

Installing CVAT locally.

The following prerequisites are needed:

  1. WSL for Windows (Install WSL for windows)
  2. Docker desktop or Docker engine with Docker Compose (Docker)
  3. Google Chrome
  4. Git

The following steps need to be followed to install CVAT locally.

After installing, you can follow the same procedure as the diagram above.
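
The pipeline configuration later in this article expects separate training and testing TFRecord files, so the annotated images are usually split into two sets before conversion. The sketch below shows one possible way to do that. The function names, the 10% test fraction, and the assumption that every image has an .xml file with the same base name are all choices made for this example.

```python
import random
import shutil
from pathlib import Path


def split_stems(stems, test_fraction=0.1, seed=42):
    """Shuffle file base names reproducibly and return (train, test) lists."""
    stems = sorted(stems)
    random.Random(seed).shuffle(stems)
    n_test = max(1, round(len(stems) * test_fraction)) if stems else 0
    return stems[n_test:], stems[:n_test]


def copy_split(src_dir, dst_dir, stems, image_ext=".jpg"):
    """Copy each image and its matching .xml annotation into dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for stem in stems:
        for ext in (image_ext, ".xml"):
            shutil.copy2(Path(src_dir) / (stem + ext), dst / (stem + ext))
```

A fixed seed keeps the split stable between runs, which makes it easier to compare models trained on the same data.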

3. Generating TFRecords

Images and their annotations need to be converted to TFRecord format for the object detection models. Here is sample code for converting the .xml annotation files to TFRecord. Reference: .xml to TFRecord conversion

import os
import glob
import pandas as pd
import io
import xml.etree.ElementTree as ET
import argparse

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' # Suppress TensorFlow logging (1)
import tensorflow.compat.v1 as tf
from PIL import Image
from object_detection.utils import dataset_util, label_map_util
from collections import namedtuple

# Initiate argument parser
parser = argparse.ArgumentParser(
    description="Sample TensorFlow XML-to-TFRecord converter")
parser.add_argument("-x",
                    "--xml_dir",
                    help="Path to the folder where the input .xml files are stored.",
                    type=str)
parser.add_argument("-l",
                    "--labels_path",
                    help="Path to the labels (.pbtxt) file.", type=str)
parser.add_argument("-o",
                    "--output_path",
                    help="Path of output TFRecord (.record) file.", type=str)
parser.add_argument("-i",
                    "--image_dir",
                    help="Path to the folder where the input image files are stored. "
                         "Defaults to the same directory as XML_DIR.",
                    type=str, default=None)
parser.add_argument("-c",
                    "--csv_path",
                    help="Path of output .csv file. If none provided, then no file will be "
                         "written.",
                    type=str, default=None)

args = parser.parse_args()

# Fall back to the annotation folder when no image folder is given
if args.image_dir is None:
    args.image_dir = args.xml_dir

label_map = label_map_util.load_labelmap(args.labels_path)
label_map_dict = label_map_util.get_label_map_dict(label_map)

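def xml_to_csv(path):
    """Collect every bounding box from the .xml files in `path` into a DataFrame."""
    xml_list = []
    for xml_file in glob.glob(path + '/*.xml'):
        tree = ET.parse(xml_file)
        root = tree.getroot()
        filename = root.find('filename').text
        width = int(root.find('size').find('width').text)
        height = int(root.find('size').find('height').text)
        for member in root.findall('object'):
            bndbox = member.find('bndbox')
            value = (filename,
                     width,
                     height,
                     member.find('name').text,
                     int(bndbox.find('xmin').text),
                     int(bndbox.find('ymin').text),
                     int(bndbox.find('xmax').text),
                     int(bndbox.find('ymax').text),
                     )
            xml_list.append(value)
    column_name = ['filename', 'width', 'height',
                   'class', 'xmin', 'ymin', 'xmax', 'ymax']
    xml_df = pd.DataFrame(xml_list, columns=column_name)
    return xml_df

def class_text_to_int(row_label):
    # Look up the integer id of a class name in the label map
    return label_map_dict[row_label]

def split(df, group):
    # Group the rows so that each image comes with all of its boxes
    data = namedtuple('data', ['filename', 'object'])
    gb = df.groupby(group)
    return [data(filename, gb.get_group(x)) for filename, x in zip(gb.groups.keys(), gb.groups)]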

def create_tf_example(group, path):
    with tf.gfile.GFile(os.path.join(path, '{}'.format(group.filename)), 'rb') as fid:
        encoded_jpg = fid.read()
    encoded_jpg_io = io.BytesIO(encoded_jpg)
    image = Image.open(encoded_jpg_io)
    width, height = image.size

    filename = group.filename.encode('utf8')
    image_format = b'jpg'
    xmins = []
    xmaxs = []
    ymins = []
    ymaxs = []
    classes_text = []
    classes = []

    # Box coordinates are stored normalized to the [0, 1] range
    for index, row in group.object.iterrows():
        xmins.append(row['xmin'] / width)
        xmaxs.append(row['xmax'] / width)
        ymins.append(row['ymin'] / height)
        ymaxs.append(row['ymax'] / height)
        classes_text.append(row['class'].encode('utf8'))
        classes.append(class_text_to_int(row['class']))

    tf_example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(filename),
        'image/source_id': dataset_util.bytes_feature(filename),
        'image/encoded': dataset_util.bytes_feature(encoded_jpg),
        'image/format': dataset_util.bytes_feature(image_format),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
        'image/object/class/label': dataset_util.int64_list_feature(classes),
    }))
    return tf_example

def main(_):
    writer = tf.python_io.TFRecordWriter(args.output_path)
    path = os.path.join(args.image_dir)
    examples = xml_to_csv(args.xml_dir)
    grouped = split(examples, 'filename')
    for group in grouped:
        tf_example = create_tf_example(group, path)
        writer.write(tf_example.SerializeToString())
    writer.close()
    print('Successfully created the TFRecord file: {}'.format(args.output_path))
    if args.csv_path is not None:
        examples.to_csv(args.csv_path, index=None)
        print('Successfully created the CSV file: {}'.format(args.csv_path))

if __name__ == '__main__':
    tf.app.run()
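
To make the normalization step in the converter easier to follow, here is a small standalone sketch that parses one Pascal VOC style annotation and divides the pixel coordinates by the image size, which is what the script does before writing each box. The sample XML, the file name in it, and the function name are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# A made-up annotation in the Pascal VOC layout that the converter reads.
SAMPLE_XML = """<annotation>
  <filename>burger_001.jpg</filename>
  <size><width>640</width><height>480</height></size>
  <object>
    <name>burgers</name>
    <bndbox><xmin>64</xmin><ymin>48</ymin><xmax>320</xmax><ymax>240</ymax></bndbox>
  </object>
</annotation>"""


def normalized_boxes(xml_text):
    """Return one dict per object with coordinates scaled to the 0..1 range."""
    root = ET.fromstring(xml_text)
    width = int(root.find("size/width").text)
    height = int(root.find("size/height").text)
    boxes = []
    for obj in root.findall("object"):
        bndbox = obj.find("bndbox")
        boxes.append({
            "class": obj.find("name").text,
            "xmin": int(bndbox.find("xmin").text) / width,
            "ymin": int(bndbox.find("ymin").text) / height,
            "xmax": int(bndbox.find("xmax").text) / width,
            "ymax": int(bndbox.find("ymax").text) / height,
        })
    return boxes


if __name__ == "__main__":
    for box in normalized_boxes(SAMPLE_XML):
        print(box)
```

Because the values are relative to the image size, the stored boxes stay valid even if the images are resized during training.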

Then create the label map. TensorFlow requires a label map, which maps each of the used labels to an integer value. The label map is used by both the training and detection processes. It will look like this:

item {
  id: 1
  name: 'burgers'
}

item {
  id: 2
  name: 'chicken_wings'
}

item {
  id: 3
  name: 'mac_and_cheese'
}
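
For more than a handful of classes, writing the label map by hand gets tedious. The following sketch generates the same text from a list of class names. The function name is hypothetical; ids start at 1 because the Object Detection API reserves 0 for the background class.

```python
def build_label_map(class_names):
    """Return label map text with one item block per class, ids starting at 1."""
    items = []
    for idx, name in enumerate(class_names, start=1):
        items.append("item {\n  id: %d\n  name: '%s'\n}\n" % (idx, name))
    return "\n".join(items)


if __name__ == "__main__":
    # Class names taken from the label map shown above.
    print(build_label_map(["burgers", "chicken_wings", "mac_and_cheese"]))
```

The returned string can be written to a .pbtxt file and passed to the converter as the labels path.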

After finishing all of these steps, you will have created the custom dataset for object detection training.

4. Configure a simple training pipeline

a. Download pre-trained model

You can download any pre-trained model you want. As an example, we will use SSD ResNet50 V1 FPN 640x640. After downloading the model, the folder should look as follows:

├─ …
├─ pre-trained-models/
│  └─ ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/
│     ├─ checkpoint/
│     ├─ saved_model/
│     └─ pipeline.config
└─ …

b. Configure the training pipeline

Then we need to make some changes in the pipeline.config file.

  • First, change the number of classes to match our number of labels.

num_classes: 1 # Set this to the number of different label classes

  • Then give a path to the checkpoint of a pre-trained model

fine_tune_checkpoint: <path_to_checkpoint_of_pre_trained_model>

  • Give the paths to the label map file and to the TFRecord files

train_input_reader {
  label_map_path: <path_to_label_map> # Path to label map file
  tf_record_input_reader {
    input_path: <path_to_annotation_tfrecord_file> # Path to training TFRecord file
  }
}

eval_input_reader {
  label_map_path: <path_to_label_map> # Path to label map file
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: <path_to_annotation_tfrecord_file> # Path to testing TFRecord
  }
}
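
These edits can also be scripted. The sketch below applies them to the config text with regular expressions. It is only an illustration and makes several assumptions: the values in the file are quoted strings, train_input_reader appears before eval_input_reader, and all paths used in the example are hypothetical. Editing the file by hand works just as well.

```python
import re

# A shortened stand-in for a downloaded pipeline.config (assumed layout).
SAMPLE_CONFIG = '''model { ssd { num_classes: 90 } }
train_config {
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED"
  fine_tune_checkpoint_type: "classification"
}
train_input_reader {
  label_map_path: "PATH_TO_BE_CONFIGURED"
  tf_record_input_reader { input_path: "PATH_TO_BE_CONFIGURED" }
}
eval_input_reader {
  label_map_path: "PATH_TO_BE_CONFIGURED"
  tf_record_input_reader { input_path: "PATH_TO_BE_CONFIGURED" }
}
'''


def update_pipeline_config(text, num_classes, checkpoint, label_map,
                           train_record, test_record):
    """Return config text with the class count and paths filled in."""
    # Function replacements avoid backslash escapes being interpreted,
    # which matters for Windows-style paths.
    text = re.sub(r"num_classes:\s*\d+",
                  lambda m: "num_classes: %d" % num_classes, text)
    text = re.sub(r'fine_tune_checkpoint:\s*"[^"]*"',
                  lambda m: 'fine_tune_checkpoint: "%s"' % checkpoint, text)
    text = re.sub(r'label_map_path:\s*"[^"]*"',
                  lambda m: 'label_map_path: "%s"' % label_map, text)
    # The first input_path is assumed to belong to the training reader,
    # the second to the evaluation reader; any further ones are left alone.
    records = iter([train_record, test_record])
    text = re.sub(r'input_path:\s*"[^"]*"',
                  lambda m: 'input_path: "%s"' % next(records, None)
                  if False else _next_or_keep(records, m), text)
    return text


def _next_or_keep(records, match):
    """Use the next record path if one is left, else keep the original text."""
    path = next(records, None)
    return match.group(0) if path is None else 'input_path: "%s"' % path


if __name__ == "__main__":
    print(update_pipeline_config(
        SAMPLE_CONFIG, 3,
        "pre-trained-models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0",
        "annotations/label_map.pbtxt",
        "annotations/train.record",
        "annotations/test.record"))
```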

5. Train a model and monitor its progress

Then train the model using the following command.

python <training_script> --model_dir=models/my_ssd_resnet50_v1_fpn --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config

6. Export the resulting model and use it to detect objects.

For exporting the model, we can use the export script in models/research/object_detection/. Copy the file into our training_demo folder, then run the following command to export the model.

python .\<export_script> --input_type image_tensor --pipeline_config_path .\models\my_ssd_resnet50_v1_fpn\pipeline.config --trained_checkpoint_dir .\models\my_ssd_resnet50_v1_fpn\ --output_directory .\exported-models\my_model

After executing this command you will have the following folder structure under the exported-models folder.

├─ …
├─ exported-models/
│  └─ my_model/
│     ├─ checkpoint/
│     ├─ saved_model/
│     └─ pipeline.config
└─ …

How to use trained models to detect objects.

After exporting the trained model, we use the saved checkpoint to detect objects.

First, prepare some test images to test the trained model.

Then, use a function to load the image from its location.

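import csv

import cv2 as cv
import numpy as np
import tensorflow as tf
from object_detection.utils import visualization_utils as viz_utils

# detect_fn, category_index, load_image_into_numpy_array and get_response
# are assumed to be defined elsewhere; they are not shown in this article.

def load_image(path_to_image):
    image_np1 = load_image_into_numpy_array(path_to_image)

    # The input needs to be a tensor, convert it using `tf.convert_to_tensor`.
    input_tensor1 = tf.convert_to_tensor(image_np1)

    # The model expects a batch of images, so add an axis with `tf.newaxis`.
    input_tensor1 = input_tensor1[tf.newaxis, ...]

    detectionss = detect_fn(input_tensor1)

    # All outputs are batched tensors.
    # Convert to numpy arrays, and take index [0] to remove the batch dimension.
    # We're only interested in the first num_detections.
    num_detectionss = int(detectionss.pop('num_detections'))
    detectionss = {key: value[0, :num_detectionss].numpy()
                   for key, value in detectionss.items()}
    detectionss['num_detections'] = num_detectionss

    # detection_classes should be ints.
    detectionss['detection_classes'] = detectionss['detection_classes'].astype(np.int64)

    image_np_with_detectionss = image_np1.copy()

    # The drawing step is missing from the original listing. The call below is
    # the usual Object Detection API helper; its threshold values are assumptions.
    viz_utils.visualize_boxes_and_labels_on_image_array(
        image_np_with_detectionss,
        detectionss['detection_boxes'],
        detectionss['detection_classes'],
        detectionss['detection_scores'],
        category_index,
        use_normalized_coordinates=True,
        max_boxes_to_draw=200,
        min_score_thresh=.30,
        agnostic_mode=False)

    # Name of the highest-ranked detection
    foodname = str(category_index[detectionss['detection_classes'][0]]['name'])
    cv.imshow("non", image_np_with_detectionss)
    cv.waitKey(0)  # keep the window open until a key is pressed
    food_names = foodname

    # read csv file
    with open(r'path_to_csv', 'r') as file:
        csvreaders = csv.reader(file, delimiter=',')
        for rows in csvreaders:
            if food_names in rows[0]:
                ingredients_to_strs = str(rows[1])
                ingredientss = ingredients_to_strs.split(",")
                for products in ingredientss:
                    x = get_response(products.lower())


def main():
    load_image(r'path_to_test_image')  # placeholder path to a test image


if __name__ == "__main__":
    main()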
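
The listing above takes the first detection without looking at its score. As a possible refinement, the sketch below keeps only detections above a confidence threshold and returns them sorted by score. It assumes the detections dictionary has the detection_classes, detection_scores and detection_boxes keys produced by the Object Detection API; the function name, the 0.5 threshold and the sample values are invented for this example.

```python
def confident_detections(detections, category_index, min_score=0.5):
    """Return (label, score, box) tuples with score >= min_score, best first."""
    results = []
    for cls, score, box in zip(detections["detection_classes"],
                               detections["detection_scores"],
                               detections["detection_boxes"]):
        if score >= min_score:
            label = category_index[int(cls)]["name"]
            results.append((label, float(score), list(box)))
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Plain lists stand in for the numpy arrays returned by the model.
    sample = {
        "detection_classes": [1, 3, 2],
        "detection_scores": [0.91, 0.42, 0.77],
        "detection_boxes": [[0.1, 0.1, 0.5, 0.5],
                            [0.0, 0.0, 1.0, 1.0],
                            [0.2, 0.2, 0.4, 0.4]],
    }
    labels = {
        1: {"id": 1, "name": "burgers"},
        2: {"id": 2, "name": "chicken_wings"},
        3: {"id": 3, "name": "mac_and_cheese"},
    }
    for label, score, box in confident_detections(sample, labels):
        print(label, score, box)
```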

Here is an example of running object detection on an image. Figure 1.3 is the test image, and Figure 1.4 shows the detection with a bounding box and label.

Figure 1.3 — Test image
Figure 1.4 — Detected image with label and prediction

Best practices

  1. Draw the bounding box so that it fits the entire object while keeping it as tight as possible.
  2. Make complete annotations along with the label.
  3. Try to draw the bounding box in full view.
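
Some of these practices can be checked automatically after export. The sketch below flags boxes that are inverted, extend outside the image, or are suspiciously small. The function name, the messages and the 2 pixel minimum are assumptions chosen for this example, not rules from CVAT.

```python
def check_box(xmin, ymin, xmax, ymax, img_w, img_h, min_size=2):
    """Return a list of problems found for one pixel-coordinate bounding box."""
    problems = []
    if xmin >= xmax or ymin >= ymax:
        problems.append("empty or inverted box")
    if xmin < 0 or ymin < 0 or xmax > img_w or ymax > img_h:
        problems.append("box extends outside the image")
    if (xmax - xmin) < min_size or (ymax - ymin) < min_size:
        problems.append("box is smaller than min_size pixels")
    return problems


if __name__ == "__main__":
    print(check_box(10, 10, 100, 100, 640, 480))   # a well-formed box
    print(check_box(600, 10, 700, 100, 640, 480))  # crosses the right edge
```

An empty list means no problem was found; it does not guarantee that the box is tight around the object, which still needs a visual check.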

Other useful features of CVAT

  • Interactors

Available DL models from this category can be used to label any object.

  • Detectors

Detectors are used to annotate one frame automatically.

  • Trackers

Trackers are used to annotate objects with bounding boxes. As with Interactors, the available models can be used to annotate any object.

Limitations of CVAT

  • Frequent performance issues.
  • No data versioning.
  • Limited annotator performance insights.
  • No task workflows.

Finally, preparing a dataset and configuring the pipeline is as simple as the steps above. Always save your work!

So it’s simple! Let us know what you think as well.

Happy learning!