Training Custom Object Detector Step by Step

Asmaa Mirkhan

Published in

İÜC Bilgisayar Kulübü

11 min read · Apr 22, 2020

The original version of this post is available here at dl.asmaamir.com ✨

🌱 Introduction

✨ The TensorFlow Object Detection API is a powerful tool that lets us build custom object detectors by fine-tuning pre-trained models, even if we don’t have a strong AI background or deep TensorFlow knowledge.
💁‍♀️ Building on pre-trained models saves a lot of time and labor, because we reuse models that may have been trained for weeks on very powerful machines. This principle is called Transfer Learning.
🗃️ As a data set, I will show you how to use the OpenImages data set and how to convert its data to a TensorFlow-friendly format.

🚩 Development Pipeline

👩‍💻 Environment Preparation
🖼️ Image acquiring
🤹‍♀️ Image Organization
🤖 Model Selecting
👩‍🔧 Model Configuration
👶 Training
👮‍♀️ Evaluation
👒 Model Exporting
📱 Converting to tflite

- 🤕 If you get errors while following the instructions, check out the 🐞 Common Issues section at the end of the article
- 🤯 If your model isn’t learning as you expect, check out the model debugging notes here (beta 🧪)

👩‍💻 Environment Preparation

🔸 Environment Info

🐍 Python version: 3.7
🔥 TensorFlow version: 1.15

🥦 Conda env Setting

🔮 Create new env

🥦 Install Anaconda
💻 Open cmd and run:

# conda create -n <ENV_NAME> python=<REQUIRED_VERSION>
conda create -n tf1 python=3.7

⚡ Activate the new env

# conda activate <ENV_NAME>
conda activate tf1

🔽 Install Packages

💥 GPU vs CPU Computing

🚀 Installing TensorFlow

# for gpu
conda install tensorflow-gpu=1.15
# for cpu
conda install tensorflow=1.15

📦 Installing other packages

conda install pillow Cython lxml jupyter matplotlib
conda install -c anaconda protobuf

🤖 Downloading models repository

🤸‍♀️ Cloning from GitHub

A repository that contains the utilities required for the training and evaluation processes
Open CMD on the E disk and run:

# note that every time you open CMD you have
# to activate your env again by running:
# under E:\>
conda activate tf1
git clone https://github.com/tensorflow/models.git
cd models/research

🧐 I assume that you are running your commands under E disk.

🔃 Compiling Protobufs

💻 Windows

# under (tf1) E:\models\research>
for /f %i in ('dir /b object_detection\protos\*.proto') do protoc object_detection\protos\%i --python_out=.

🐧 Linux

# under /models/research
$ protoc object_detection/protos/*.proto --python_out=.

📦 Compiling Packages

#under (tf1) E:\models\research>
python setup.py build
python setup.py install

🚩 Setting Python Path Temporarily

💻 Windows

#under (tf1) E:\models\research>
set PYTHONPATH=E:\models\research;E:\models\research\slim

🐧 Linux

# under /models/research
$ export PYTHONPATH=`pwd`:`pwd`/slim

👮‍♀️ Every time you open CMD you have to set PYTHONPATH again
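Because PYTHONPATH has to be set again in every new CMD session, it is easy to forget. The following is a minimal sketch (not part of the original setup) that reports which of the two required entries are missing. The function name `missing_pythonpath_entries` is hypothetical, and the research directory is passed in by the caller.

```python
import os


def missing_pythonpath_entries(research_dir, pythonpath=None):
    """Return the required entries that are absent from PYTHONPATH."""
    if pythonpath is None:
        pythonpath = os.environ.get("PYTHONPATH", "")

    def norm(path):
        # Normalize so trailing separators (and case on Windows) do not matter.
        return os.path.normcase(os.path.normpath(path))

    present = {norm(p) for p in pythonpath.split(os.pathsep) if p}
    # The two entries used in this article: research and research/slim.
    required = [research_dir, os.path.join(research_dir, "slim")]
    return [p for p in required if norm(p) not in present]
```

Calling `missing_pythonpath_entries(r"E:\models\research")` in the activated env returns an empty list when both entries are set.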

👩‍🔬 Installation Test

🧐 Check that everything is set up correctly

💻 Command

# under (tf1) E:\models\research>
python object_detection/builders/model_builder_tf1_test.py

🎉 Expected Output

Ran 17 tests in 0.833s

OK (skipped=1)

🖼️ Image Acquiring

👮‍♀️ Directory Structure

🏗️ I assume that you have created a structure like:

|___ models
|___ demo
    |___ annotations
    |___ eval
    |___ images
    |___ inference
    |___ OIDv4_ToolKit
    |___ OpenImagesTool
    |___ pre_trained_model
    |___ scripts
    |___ training

🤖 models: the repo here

📄 annotations: will contain generated .csv and .record files

👮‍♀️ eval: will contain results of evaluation

🖼️ images: will contain image data set

▶️ inference: will contain exported models after training

🔽 OIDv4_ToolKit: the repo here (OpenImages Downloader)

👩‍🔧 OpenImagesTool: the repo here (OpenImages Organizer)

👩‍🏫 pre_trained_model: will contain the files of the TensorFlow model that we will retrain

👩‍💻 scripts: will contain scripts that we will use for pre-processing and training processes

🚴‍♀️ training: will contain the checkpoints generated during training
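If you prefer not to create the folders by hand, the structure can be sketched with a few lines of Python. This is only an illustration: the function name `create_demo_structure` is hypothetical, and the two tool folders are left out because `git clone` creates them later.

```python
from pathlib import Path

# Folder names taken from the structure above; OIDv4_ToolKit and
# OpenImagesTool are created later by `git clone`, so they are not listed.
DEMO_SUBFOLDERS = [
    "annotations", "eval", "images", "inference",
    "pre_trained_model", "scripts", "training",
]


def create_demo_structure(root):
    """Create <root>/demo and its empty working folders; return the demo path."""
    demo = Path(root) / "demo"
    for name in DEMO_SUBFOLDERS:
        # exist_ok=True makes the call safe to repeat.
        (demo / name).mkdir(parents=True, exist_ok=True)
    return demo
```

For the layout used in this article you would call `create_demo_structure("E:/")`.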

🚀 OpenImages Dataset

🕵️‍♀️ You can get images in various ways
👩‍🏫 I will show the process of organizing the OpenImages data set
🗃️ OpenImages is a huge data set that contains annotated images of 600 object classes
🔍 You can explore images by category from here

🎨 Downloading By Category

OIDv4_ToolKit is a tool that we can use to download the OpenImages data set by category and by set (test, train, validation)

💻 To clone and build the project, open CMD and run:

(tf1) E:\demo>git clone https://github.com/EscVM/OIDv4_ToolKit.git
(tf1) E:\demo>cd OIDv4_ToolKit
(tf1) E:\demo\OIDv4_ToolKit>pip install -r requirements.txt

⏬ To start downloading by category:

# python main.py downloader --classes <OBJECT_LIST> --type_csv <TYPE>
# TYPE: all | test | train | validation
(tf1) E:\demo\OIDv4_ToolKit>python main.py downloader --classes Apple Orange --type_csv validation

👮‍♀️ If an object name consists of two words, join them with '_', e.g. Bell_pepper
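When downloading many categories, building the command by hand gets error-prone. Below is a small sketch that assembles the argument list from plain class names and applies the underscore rule. The helper name `build_downloader_command` is hypothetical; the flags are the ones shown in the command above.

```python
def build_downloader_command(classes, type_csv="validation"):
    """Build the argument list for the OIDv4_ToolKit downloader."""
    allowed = {"all", "test", "train", "validation"}
    if type_csv not in allowed:
        raise ValueError(f"type_csv must be one of {sorted(allowed)}")
    # Multi-word class names are written with underscores, e.g. Bell_pepper.
    names = [c.strip().replace(" ", "_") for c in classes]
    return ["python", "main.py", "downloader",
            "--classes", *names, "--type_csv", type_csv]
```

The returned list can be passed to `subprocess.run(...)` with the OIDv4_ToolKit folder as the working directory.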

🤹‍♀️ Image Organization

🔮 OpenImagesTool

👩‍💻 OpenImagesTool is a tool that converts OpenImages images and annotations to a TensorFlow-friendly structure.
🙄 OpenImages annotations are downloaded as .txt files in a format like: <OBJECT_NAME> <XMIN> <YMIN> <XMAX> <YMAX>, which is not compatible with the TensorFlow workflow used here, which expects the VOC annotation format
💫 To do that conversion, we can do the following

💻 To clone and build the project, open CMD and run:

(tf1) E:\demo>git clone https://github.com/asmaamirkhan/OpenImagesTool.git
(tf1) E:\demo>cd OpenImagesTool/src

💻 Applying Organizing

🚀 Now, we will convert the images and annotations that we have downloaded and save them to the images folder

# under (tf1) E:\demo\OpenImagesTool\src>
# python script.py -i <INPUT_PATH> -o <OUTPUT_PATH>
python script.py -i E:\demo\OIDv4_ToolKit\OID\Dataset -o E:\demo\images

👩‍🔬 OpenImagesTool adds validation images to the training set by default. If you want to disable this behavior, add the -v flag to the command.
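To make the conversion concrete, here is a minimal sketch of turning annotation lines in the `<OBJECT_NAME> <XMIN> <YMIN> <XMAX> <YMAX>` format into a VOC-style XML element. It is not OpenImagesTool's implementation. The function names are hypothetical, and the image width and height are assumed to be known by the caller.

```python
import xml.etree.ElementTree as ET


def parse_annotation_line(line):
    """Split '<OBJECT_NAME> <XMIN> <YMIN> <XMAX> <YMAX>' into its parts."""
    # Split from the right so that names containing spaces stay intact.
    name, xmin, ymin, xmax, ymax = line.strip().rsplit(maxsplit=4)
    return name, float(xmin), float(ymin), float(xmax), float(ymax)


def to_voc_xml(filename, width, height, lines):
    """Build a VOC-style <annotation> element from annotation lines."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    for line in lines:
        name, xmin, ymin, xmax, ymax = parse_annotation_line(line)
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = name
        box = ET.SubElement(obj, "bndbox")
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"),
                              (xmin, ymin, xmax, ymax)):
            # Pixel coordinates are written as rounded integers here.
            ET.SubElement(box, tag).text = str(int(round(value)))
    return root
```

`ET.ElementTree(root).write("image.xml")` would then save the element next to the image.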

🏷️ Creating Label Map

⛓️ label_map.pbtxt is a file that maps object names to their corresponding IDs
➕ Create a label_map.pbtxt file under the annotations folder and open it in a text editor
🖊️ Write your object names and IDs in the following format

item {
  id: 1
  name: 'Hamster'
}
item {
  id: 2
  name: 'Apple'
}

👮‍♀️ id: 0 is reserved for the background, so don't use it
🐞 Related error: ValueError: Label map id 0 is reserved for the background label
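For more than a handful of classes, the label map can be generated instead of typed. The sketch below produces text in the format shown above, with IDs starting at 1. The function name `build_label_map` is hypothetical.

```python
def build_label_map(class_names):
    """Return label_map.pbtxt text with IDs starting at 1 (0 is reserved)."""
    items = []
    for class_id, name in enumerate(class_names, start=1):
        items.append(f"item {{\n  id: {class_id}\n  name: '{name}'\n}}")
    return "\n".join(items) + "\n"
```

Writing the result of `build_label_map(["Hamster", "Apple"])` to annotations/label_map.pbtxt reproduces the example above.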

🏭 Generating CSV Files

🔄 Now we have to convert the .xml files to a .csv file
🔻 Download the xml_to_csv.py script and save it under the scripts folder
💻 Open CMD and run:

👩‍🔬 Generating train csv file

# under (tf1) E:\demo\scripts>
python xml_to_csv.py -i E:\demo\images\train -o E:\demo\annotations\train_labels.csv

👩‍🔬 Generating test csv file

# under (tf1) E:\demo\scripts>
python xml_to_csv.py -i E:\demo\images\test -o E:\demo\annotations\test_labels.csv
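To show roughly what such a conversion involves, here is a sketch that reads one VOC annotation and emits one CSV row per object. It is not the xml_to_csv.py script itself. The column order (filename, width, height, class, xmin, ymin, xmax, ymax) is an assumption, and the function names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed column order; the real xml_to_csv.py may use a different one.
COLUMNS = ["filename", "width", "height", "class",
           "xmin", "ymin", "xmax", "ymax"]


def xml_to_rows(xml_text):
    """Return one CSV row per <object> in a VOC annotation."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    rows = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        rows.append([filename, width, height, obj.findtext("name")]
                    + [int(box.findtext(tag)) for tag in COLUMNS[4:]])
    return rows


def rows_to_csv(rows):
    """Serialize rows, with a header line, to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()
```

One image with several objects produces several rows that share the same filename, which is why later checks count unique filenames rather than rows.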

👩‍🏭 Generating TF Records

🙇‍♀️ Now, we will generate the tfrecords that will be used in the training process
🔻 Download the generate_tfrecords.py script and save it under the scripts folder

👩‍🔬 Generating train tfrecord

# under (tf1) E:\demo\scripts>
# python generate_tfrecords.py --label_map=<PATH_TO_LABEL_MAP>
# --csv_input=<PATH_TO_CSV_FILE> --img_path=<PATH_TO_IMAGE_FOLDER>
# --output_path=<PATH_TO_OUTPUT_FILE>
python generate_tfrecords.py --label_map=E:\demo\annotations\label_map.pbtxt --csv_input=E:\demo\annotations\train_labels.csv --img_path=E:\demo\images\train --output_path=E:\demo\annotations\train.record

👩‍🔬 Generating test tfrecord

# under (tf1) E:\demo\scripts>
python generate_tfrecords.py --label_map=E:\demo\annotations\label_map.pbtxt --csv_input=E:\demo\annotations\test_labels.csv --img_path=E:\demo\images\test --output_path=E:\demo\annotations\test.record

🤖 Model Selecting

🎉 The TensorFlow Object Detection Zoo provides a lot of pre-trained models
🕵️‍♀️ Models differ in accuracy and speed; select the model that suits your priorities
💾 Select a model, extract it and save it under the pre_trained_model folder
👀 Check out my notes here to get insight into the differences between popular models

👩‍🔧 Model Configuration

⏬ Downloading config File

😎 We have downloaded the model (pre-trained weights), but now we have to download the configuration file that contains the training parameters and settings
👮‍♀️ Every model in the TensorFlow Object Detection Zoo has a configuration file, presented here
💾 Download the config file that corresponds to the model you have selected and save it under the training folder

👩‍🔬 Updating config File

You have to update the following lines:

# number of classes
num_classes: 1 # set it to the total number of classes you have

# path of pre-trained checkpoint
fine_tune_checkpoint: "E:/demo/pre_trained_model/ssd_mobilenet_v1_quantized_300x300_coco14_sync_2018_07_18/model.ckpt"

# path to train tfrecord
tf_record_input_reader {
  input_path: "E:/demo/annotations/train.record"
}

# number of images that will be used in evaluation process
eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  # I suggest setting it to the total size of the test set to get accurate results
  num_examples: 11193
}

eval_input_reader: {
  tf_record_input_reader {
    # path to test tfrecord
    input_path: "E:/demo/annotations/test.record"
  }
  # path to label map
  label_map_path: "E:/demo/annotations/label_map.pbtxt"
  # set it to true if you want to shuffle test set at each evaluation
  shuffle: false
  num_readers: 1
}

🤹‍♀️ If you give the whole test set to the evaluation process, shuffling won’t affect the results; it will only show you different examples on TensorBoard

👶 Training

🎉 Now we have done all preparations
🚀 Let the computer start learning
💻 Open CMD and run:

# under (tf1) E:\models\research\object_detection\legacy>
# python train.py --train_dir=<DIRECTORY_TO_SAVE_CHECKPOINTS>
# --pipeline_config_path=<PATH_TO_CONFIG_FILE>
python train.py --train_dir=E:/demo/training --pipeline_config_path=E:/demo/training/ssd_mobilenet_v1_quantized_300x300_coco14_sync.config

🕐 This process will take a long time (You can take a nap 🤭, but a long nap 🙄)
🕵️‍♀️ While the model is being trained, you will see loss values on CMD
✋ You can stop the process when the loss reaches a good value (under 1)
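Single loss values jump around from step to step, so judging "under 1" from one line can be misleading. The sketch below parses log lines of the form printed by train.py (the same format as the log excerpt in the Loss Exploding section) and averages the most recent values. The function names and the window size of 100 are assumptions chosen for illustration.

```python
import re

# Matches lines such as:
# INFO:tensorflow:global step 440: loss = 0.9312 (0.405 sec/step)
LOSS_LINE = re.compile(r"global step (\d+): loss = ([0-9.eE+-]+)")


def parse_losses(log_lines):
    """Return (step, loss) pairs found in training log lines."""
    pairs = []
    for line in log_lines:
        match = LOSS_LINE.search(line)
        if match:
            pairs.append((int(match.group(1)), float(match.group(2))))
    return pairs


def mean_recent_loss(pairs, window=100):
    """Average the loss over the last `window` logged steps."""
    recent = [loss for _, loss in pairs[-window:]]
    return sum(recent) / len(recent) if recent else float("nan")
```

If `mean_recent_loss(parse_losses(lines))` stays under 1 for a while, that is a steadier signal than a single low value.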

👮‍♀️ Evaluation

🎳 Evaluating Script

🤭 After the training process is done, let’s give the model an exam to see how well (or badly 🙄) it is doing
🎩 The following command runs the model on the whole test set and then prints the results, so that we can do error analysis.
💻 To do so, open CMD and run:

# under (tf1) E:\models\research\object_detection\legacy>
# python eval.py --logtostderr --pipeline_config_path=<PATH_TO_CONFIG_FILE>
# --checkpoint_dir=<DIRECTORY_OF_CHECKPOINTS> --eval_dir=<DIRECTORY_TO_SAVE_EVAL_RESULTS>
python eval.py --pipeline_config_path=E:/demo/training/ssd_mobilenet_v1_quantized_300x300_coco14_sync.config --checkpoint_dir=E:/demo/training --eval_dir=E:/demo/eval

👀 Visualizing Results

✨ To analyze the results more easily, we can view them as charts and images in TensorBoard
💻 Open CMD and run the commands below:

👩‍🏫 Training Values Visualization

🧐 Here you can see graphs of loss, learning rate and other values
🤓 And much more (You can investigate tabs at the top)
😋 It is feasible to use it while training (and exciting 🤩)

# under (tf1) E:\> 
tensorboard --logdir=E:/demo/training

👮‍♀️ Evaluation Values Visualization

👀 Here you can see images from your test set with the corresponding predictions
🤓 And much more (You can inspect tabs at the top)
❗ You must use this after running evaluation script

# under (tf1) E:\>
tensorboard --logdir=E:/demo/eval

🔍 See the visualized results on localhost:6006
🧐 You can also inspect the numerical values in the report printed on the terminal. Example result:

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.708
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.984
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.868
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.289
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.623
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.767
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.779
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.781
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.781
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.300
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.703
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.824
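The IoU values in this report refer to Intersection over Union, the overlap between a predicted box and a ground-truth box. A prediction counts as correct at IoU=0.50 when the overlap ratio is at least 0.5, and IoU=0.50:0.95 averages over thresholds from 0.50 to 0.95. A minimal sketch of the computation, assuming boxes given as (xmin, ymin, xmax, ymax):

```python
def iou(box_a, box_b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped at 0 when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Two 10x10 boxes shifted by half their width overlap with IoU 1/3, so that prediction would not count as a match at the 0.50 threshold.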

🎨 If you want a metric report for each class, you have to change the evaluation protocol to Pascal metrics by setting metrics_set in the .config file:

eval_config: {
...
metrics_set: "weighted_pascal_voc_detection_metrics"
...
}

👒 Model Exporting

🔧 After the training and evaluation processes are done, we have to convert the model to a format that we can use
🦺 For now, we only have checkpoints, so we have to export a .pb file
💻 So, open CMD and run:

# under (tf1) E:\models\research\object_detection>
# python export_inference_graph.py --input_type image_tensor
# --pipeline_config_path <PATH_TO_CONFIG_FILE>
# --trained_checkpoint_prefix <PATH_TO_LAST_CHECKPOINT>
# --output_directory <PATH_TO_SAVE_EXPORTED_MODEL>
python export_inference_graph.py --input_type image_tensor --pipeline_config_path=E:/demo/training/ssd_mobilenet_v1_quantized_300x300_coco14_sync.config --trained_checkpoint_prefix E:/demo/training/model.ckpt-16438 --output_directory E:/demo/inference/ssd_v1_quant
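The number in model.ckpt-16438 is the training step of the checkpoint, so it will differ for your run. Here is a small sketch for finding the latest prefix in the training folder. It assumes each checkpoint is saved with a `model.ckpt-<step>.index` file, and the function name `latest_checkpoint_prefix` is hypothetical.

```python
import re
from pathlib import Path


def latest_checkpoint_prefix(training_dir):
    """Return the 'model.ckpt-<step>' prefix with the highest step, or None."""
    pattern = re.compile(r"^model\.ckpt-(\d+)\.index$")
    steps = []
    for path in Path(training_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            steps.append(int(match.group(1)))
    if not steps:
        return None
    # The prefix is the file name without the .index/.meta/.data-* suffix.
    return (Path(training_dir) / f"model.ckpt-{max(steps)}").as_posix()
```

The returned string is what you would pass as the checkpoint prefix in the export command.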

If you are using SSD and planning to convert it to tflite later, you have to run:

# under (tf1) E:\models\research\object_detection>
# python export_tflite_ssd_graph.py --input_type image_tensor
# --pipeline_config_path <PATH_TO_CONFIG_FILE>
# --trained_checkpoint_prefix <PATH_TO_LAST_CHECKPOINT>
# --output_directory <PATH_TO_SAVE_EXPORTED_MODEL>
python export_tflite_ssd_graph.py --input_type image_tensor --pipeline_config_path=E:/demo/training/ssd_mobilenet_v1_quantized_300x300_coco14_sync.config --trained_checkpoint_prefix E:/demo/training/model.ckpt-16438 --output_directory E:/demo/inference/ssd_v1_quant

📱 Converting to tflite

💁‍♀️ If you want to use the model in mobile apps or on embedded devices that support tflite, you have to convert the .pb file to a .tflite file

📙 About TFLite

📱 TensorFlow Lite is TensorFlow’s lightweight solution for mobile and embedded devices.
🧐 It enables on-device machine learning inference with low latency and a small binary size.
😎 TensorFlow Lite uses many techniques for this such as quantized kernels that allow smaller and faster (fixed-point math) models.
📍 Official site

🍫 Converting Command

💻 To convert the model, open CMD and run:

# under (tf1) E:\>
# toco --graph_def_file=<PATH_TO_PB_FILE>
# --output_file=<PATH_TO_SAVE> --input_shapes=<INPUT_SHAPES>
# --input_arrays=<INPUT_ARRAYS> --output_arrays=<OUTPUT_ARRAYS>
# --inference_type=<QUANTIZED_UINT8|FLOAT> --change_concat_input_ranges=<true|false>
# --allow_custom_ops
# args for QUANTIZED_UINT8 inference
# --mean_values=<MEAN_VALUES> --std_dev_values=<STD_DEV_VALUES>
toco --graph_def_file=E:\demo\inference\ssd_v1_quant\tflite_graph.pb --output_file=E:\demo\tflite\ssd_mobilenet.tflite --input_shapes=1,300,300,3 --input_arrays=normalized_input_image_tensor --output_arrays=TFLite_Detection_PostProcess,TFLite_Detection_PostProcess:1,TFLite_Detection_PostProcess:2,TFLite_Detection_PostProcess:3 --inference_type=QUANTIZED_UINT8 --mean_values=128 --std_dev_values=128 --change_concat_input_ranges=false --allow_custom_ops
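The mean_values and std_dev_values arguments describe how the uint8 input is mapped back to real numbers: real_value = (quantized_value - mean_value) / std_dev_value. With 128 for both, pixel values 0..255 map to roughly -1..1. A one-function sketch of that arithmetic (the name `dequantize` is hypothetical):

```python
def dequantize(quantized_value, mean_value=128, std_dev_value=128):
    """Map a uint8 input value to the real value the model sees."""
    return (quantized_value - mean_value) / std_dev_value
```

So a black pixel (0) becomes -1.0, mid-gray (128) becomes 0.0, and white (255) becomes 127/128.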

🐞 Common Issues

🥅 nets module issue

ModuleNotFoundError: No module named 'nets'

This means that PYTHONPATH is not set correctly. Try running:

# under (tf1) E:\models\research>
set PYTHONPATH=E:\models\research;E:\models\research\slim

🗃️ tf_slim module issue

ModuleNotFoundError: No module named 'tf_slim'

This means that the tf_slim module is not installed. Try running:

# under (tf1) E:\models\research>
pip install tf_slim

🗃️ Allocation error

2020-08-11 17:44:00.357710: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit:                 10661327
InUse:                 10656704
MaxInUse:              10657688
NumAllocs:                 2959
MaxAllocSize:           3045064

For me, it was fixed by reducing batch_size in the .config file; the right value depends on your computational resources

train_config: {
....
batch_size: 128
....
}

❗ no such file or directory error

train.py tensorflow.python.framework.errors_impl.notfounderror no such file or directory

🙄 For me it was a typo in train.py command
📍 Related discussion 1
📍 Related discussion 2

🤯 LossTensor is inf issue

LossTensor is inf or nan. : Tensor had NaN values

👀 Related discussion is here; this is commonly an annotation problem
🙄 Maybe there are some bounding boxes outside the image boundaries
🤯 The solution for me was reducing the batch size in the .config file

🙄 Ground truth issue

The following classes have no ground truth examples

👀 Related discussion is here
👩‍🔧 For me it was a misspelling in the label_map file
🙄 Pay attention to lowercase and uppercase letters
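Spelling and case mismatches between the CSV files and the label map are easy to miss by eye. The sketch below lists class names that appear in the annotations but have no exact match in the label map. It assumes the label map uses the `name: '...'` format shown earlier, and the function names are hypothetical.

```python
import re


def label_map_names(label_map_text):
    """Extract the class names from label_map.pbtxt text."""
    # \b keeps fields such as display_name from matching.
    return set(re.findall(r"\bname:\s*['\"]([^'\"]+)['\"]", label_map_text))


def unknown_classes(csv_classes, label_map_text):
    """Return annotation class names with no exact (case-sensitive) match."""
    known = label_map_names(label_map_text)
    return sorted(set(csv_classes) - known)
```

An empty result means every annotated class has an entry; anything returned (for example `hamster` versus `Hamster`) is a candidate for this error.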

🏷️ labelmap issue

ValueError: Label map id 0 is reserved for the background label

👮‍♀️ id: 0 is reserved for the background, so we cannot use it for objects
🆔 Start IDs from 1

🔦 No Variable to Save issue

ValueError: No variables to save

👀 Related solution is here
👩‍🔧 Adding the following line to .config file solved the problem

train_config: {
...
fine_tune_checkpoint_type:  "detection"
...
}

🧪 pycocotools module issue

ModuleNotFoundError: No module named 'pycocotools'

👀 Related discussion is here
👩‍🔧 Applying the downloading instructions provided here solved the problem for me (on Windows 10)

🥴 pycocotools type error issue

pycocotools typeerror: object of type cannot be safely interpreted as an integer.

👩‍🔧 I solved the problem by editing the following lines in the cocoeval.py script of the pycocotools package (by adding an int cast)
👮‍♀️ Make sure that you are editing the package in your env, not in another env.

self.iouThrs = np.linspace(.5, 0.95, int(np.round((0.95 - .5) / .05)) + 1, endpoint=True)
self.recThrs = np.linspace(.0, 1.00, int(np.round((1.00 - .0) / .01)) + 1, endpoint=True)

💣 Loss Exploding

INFO:tensorflow:global step 440: loss = 2106942657570782838784.0000 (0.405 sec/step)
INFO:tensorflow:global step 440: loss = 2106942657570782838784.0000 (0.405 sec/step)
INFO:tensorflow:global step 441: loss = 7774169971762292326400.0000 (0.401 sec/step)
INFO:tensorflow:global step 441: loss = 7774169971762292326400.0000 (0.401 sec/step)
INFO:tensorflow:global step 442: loss = 25262924095336287830016.0000 (0.404 sec/step)
INFO:tensorflow:global step 442: loss = 25262924095336287830016.0000 (0.404 sec/step)

🙄 For me there were 2 problems:

First:

Some of the annotations were wrong and overflowed the image (e.g. xmax > width)
I could check that by inspecting the .csv file

Second:

The learning rate in the .config file was too big (the default value was big 🙄)
The following values are valid and were tested on mobilenet_ssd_v1_quantized (not very good 🙄)

learning_rate: {
  cosine_decay_learning_rate {
    learning_rate_base: .01
    total_steps: 50000
    warmup_learning_rate: 0.005
    warmup_steps: 2000
  }
}
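To get a feel for what these four numbers do, the schedule can be sketched in plain Python: the rate rises linearly from warmup_learning_rate to learning_rate_base during the warmup steps, then follows a cosine curve down to zero at total_steps. This is an approximation written for illustration, not the TensorFlow implementation, and the function name is hypothetical.

```python
import math


def cosine_decay_with_warmup(step, learning_rate_base=0.01, total_steps=50000,
                             warmup_learning_rate=0.005, warmup_steps=2000):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        slope = (learning_rate_base - warmup_learning_rate) / warmup_steps
        return warmup_learning_rate + slope * step
    if step > total_steps:
        return 0.0
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * learning_rate_base * (1.0 + math.cos(math.pi * progress))
```

With the values above, the rate starts at 0.005, peaks at 0.01 at step 2000, and is back at about 0.005 halfway through the decay phase (step 26000).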

🥴 Getting convolution Failure

Error : Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

It may be a CUDA version incompatibility issue
For me it was a memory issue, and I solved it by adding the following line to the train.py script

os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

📦 Invalid box data error

raise ValueError('Invalid box data. data must be a numpy array of '
ValueError: Invalid box data. data must be a numpy array of N*[y_min, x_min, y_max, x_max]

🙄 For me it was a data error: test_labels.csv contained some invalid values like: file123.jpg,134,63,3,0,0,-1029,-615
🔖 So, it was a labeling issue; fixing these lines solved the problem
👀 Related discussion
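Rows like the one above can be found automatically before generating the tfrecords. The following sketch scans CSV text and reports boxes with negative coordinates, boxes that exceed the image size, and empty or inverted boxes. It assumes a header with the columns filename, width, height, class, xmin, ymin, xmax, ymax, and the function name is hypothetical.

```python
import csv
import io


def find_invalid_boxes(csv_text):
    """Return (line_number, reason) for rows whose box cannot be valid."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 is the header, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        width, height = int(row["width"]), int(row["height"])
        xmin, ymin = int(row["xmin"]), int(row["ymin"])
        xmax, ymax = int(row["xmax"]), int(row["ymax"])
        if min(xmin, ymin, xmax, ymax) < 0:
            problems.append((line_number, "negative coordinate"))
        elif xmax > width or ymax > height:
            problems.append((line_number, "box exceeds image size"))
        elif xmin >= xmax or ymin >= ymax:
            problems.append((line_number, "empty or inverted box"))
    return problems
```

Reading test_labels.csv into a string and passing it to `find_invalid_boxes` gives the line numbers to fix.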

🔄 Image with id added issue

raise ValueError('Image with id {} already added.'.format(image_id))
ValueError: Image with id 123.png already added.

☝ It is an issue in the .config file caused by setting num_examples to a value greater than the total number of test images in the test directory

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  num_examples: 1265 # <--- this value was greater than total test images
}
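A quick way to pick a safe value is to count the distinct images in the test CSV, since one image can have several rows (one per box). A minimal sketch, assuming the CSV has a filename column; the function names are hypothetical.

```python
import csv
import io


def count_test_images(csv_text):
    """Count distinct image files listed in a labels CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return len({row["filename"] for row in reader})


def check_num_examples(num_examples, csv_text):
    """Return True when num_examples does not exceed the image count."""
    return num_examples <= count_test_images(csv_text)
```

If `check_num_examples` returns False for the value in your .config file, lower num_examples to the count returned by `count_test_images`.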


Written by Asmaa Mirkhan