Comparing baselines for an Object Detection task

Published in

Hexacta Engineering

7 min read · May 21, 2020

Predicted boxes on the left, real (ground-truth) boxes on the right.

We came across the Global Wheat Detection competition, so we decided to gather some baseline projects based on neural networks that solve the object detection problem with state-of-the-art results, without assuming previous knowledge of the topic. This is what we collected and tested.

PyTorch Baseline

PyTorch is an open source machine learning framework, widely used by researchers. Its official page has very good tutorials organized by topic: vision, audio and text. The competition also has a baseline you can start working from. Torchvision is a package that lets you work with computer vision in PyTorch, and importing a few of its modules is enough to start developing a solution.

Torchvision provides a Faster R-CNN architecture for object detection:

from torchvision.models.detection import fasterrcnn_resnet50_fpn

# load a model pre-trained on COCO
model = fasterrcnn_resnet50_fpn(pretrained=True)

The COCO pre-trained model works with 90 category IDs plus a background class, so we must adapt the architecture to our needs and then fine-tune it. The next picture shows this:

If you look at the red rectangles, you will see how the box_head.fc7 and box_predictor.* layers are connected. The green rectangles show the outputs of box_predictor: 91 class scores (probabilities) and 364 box coordinates ([x1, y1, x2, y2] for each class). To understand it more deeply, we recommend reading some of the theory listed below in General References.

Without going into more detail, we must edit the last part of the architecture, usually known as the “head”, to accept fewer classes.

from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# define number of classes
num_classes = 2  # 1 class (wheat) + background

# get number of input features
# for the classifier to plug the new last part
in_features = model.roi_heads.box_predictor.cls_score.in_features

# replace the pre-trained head with a new one
model.roi_heads.box_predictor = FastRCNNPredictor(
  in_features,
  num_classes
)
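The output sizes of the head follow from simple arithmetic: one score per class and four coordinates per class. The sketch below only illustrates that relationship for the pre-trained head and for the two-class wheat head; `predictor_output_sizes` is a hypothetical helper written for this post, not part of torchvision.

```python
def predictor_output_sizes(num_classes):
    """Return (number of class scores, number of box coordinates)."""
    coords_per_box = 4  # x1, y1, x2, y2
    return num_classes, num_classes * coords_per_box


# Pre-trained COCO head: 91 classes, background included
print(predictor_output_sizes(91))  # (91, 364)

# Wheat head: 1 class (wheat) + background
print(predictor_output_sizes(2))   # (2, 8)
```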

From here on, the path to take is the same as for any other PyTorch model.

Pros

It allows us to edit the neural net with Python code, thanks to PyTorch’s “immediate execution” nature.
It allows exporting to ONNX, an open standard for machine learning interoperability.
It allows modifying its behaviour using a clean API instead of a CLI with config files.

Drawbacks

Its flexibility means a steeper learning curve for newcomers.
Currently, only one neural network architecture for object detection is officially supported, though there are others maintained by the community.

YOLOv3 —You Only Look Once

The YOLO family is a set of algorithms developed by Redmon with Darknet, a neural network framework written in C. It is known as one of the fastest algorithms for object detection. Furthermore, its weights are shared on the official page. We chose a PyTorch implementation built by Ultralytics, but we could have built it from scratch. In the reference section, we have attached an example for the Wheat Detection Competition.

File train.csv has the following content:

+-----------+-------+--------+----------------------------+
| image_id  | width | height |            bbox            |
+-----------+-------+--------+----------------------------+
| b6ab77fd7 |  1024 |   1024 | [834.0, 222.0, 56.0, 36.0] |
+-----------+-------+--------+----------------------------+
....

The bbox feature gives the location and size of a wheat head, formatted as x, y, width and height, where x and y are the top-left corner of the box. An image can have many bboxes.

We must read the train.csv that is provided for the competition:

import pandas as pd
train_df = pd.read_csv(DIR_INPUT + '/train.csv')
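When the CSV is loaded, each bbox cell would typically arrive as a string such as "[834.0, 222.0, 56.0, 36.0]" rather than as four numbers, while the later snippets use separate x, y, w and h columns. A minimal sketch of that split, assuming the string format shown in the table above (`parse_bbox` is a hypothetical helper name, not something from the competition code):

```python
import ast


def parse_bbox(bbox_string):
    """Turn a '[x, y, w, h]' string into a tuple of four floats."""
    values = ast.literal_eval(bbox_string)
    if len(values) != 4:
        raise ValueError(f"expected 4 values, got {len(values)}")
    x, y, w, h = (float(v) for v in values)
    return x, y, w, h


x, y, w, h = parse_bbox("[834.0, 222.0, 56.0, 36.0]")
print(x, y, w, h)  # 834.0 222.0 56.0 36.0
```

With pandas, the same function could be applied to the column with `train_df['bbox'].apply(parse_bbox)` and the resulting tuples unpacked into the x, y, w and h columns.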

Images and boxes need a specific layout, where each image and its boxes file are named after the image_id, for instance:

../wheat/images/b6ab77fd7.jpg # image
../wheat/labels/b6ab77fd7.txt # label

The *.txt files contain the class number followed by each bbox, normalized. Normalization is achieved by dividing x_center and width by the image width, and y_center and height by the image height. We only detect wheat heads without distinguishing species, so this is a single-class detection task.

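# x, y become the normalized box center; w, h the normalized size
# (assumes the x, y, w, h columns were already extracted from bbox)
train_df['x'] = (train_df['x']+train_df['w']/2)/train_df['width']
train_df['y'] = (train_df['y']+train_df['h']/2)/train_df['height']
train_df['w'] = train_df['w']/train_df['width']
train_df['h'] = train_df['h']/train_df['height']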

Once a bboxes file such as b6ab77fd7.txt is saved, it might contain:

# b6ab77fd7.txt example
0 0.84179 0.234375 0.054687 0.03515
...
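The post does not show how the label files are written, so here is a minimal sketch under stated assumptions: boxes are given in pixels as top-left x, y, width and height, the class is always 0, and there is one text file per image_id. The function names, the six-decimal formatting and the second box in the sample data are illustrative choices, not part of the original code.

```python
import os
import tempfile
from collections import defaultdict


def to_yolo_line(x, y, w, h, img_w, img_h, class_id=0):
    """Format one pixel-space box (top-left x, y, width, height) as a label line."""
    x_center = (x + w / 2) / img_w
    y_center = (y + h / 2) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {w / img_w:.6f} {h / img_h:.6f}"


def write_label_files(rows, labels_dir):
    """rows: iterable of (image_id, img_w, img_h, x, y, w, h).

    Writes one <image_id>.txt per image and returns the sorted image ids.
    """
    lines_per_image = defaultdict(list)
    for image_id, img_w, img_h, x, y, w, h in rows:
        lines_per_image[image_id].append(to_yolo_line(x, y, w, h, img_w, img_h))
    os.makedirs(labels_dir, exist_ok=True)
    for image_id, lines in lines_per_image.items():
        with open(os.path.join(labels_dir, f"{image_id}.txt"), "w") as f:
            f.write("\n".join(lines) + "\n")
    return sorted(lines_per_image)


rows = [
    ("b6ab77fd7", 1024, 1024, 834.0, 222.0, 56.0, 36.0),  # row from train.csv above
    ("b6ab77fd7", 1024, 1024, 100.0, 200.0, 50.0, 40.0),  # made-up box, illustration only
]
labels_dir = os.path.join(tempfile.mkdtemp(), "labels")  # stand-in for ../wheat/labels
written = write_label_files(rows, labels_dir)
print(written)
print(to_yolo_line(834.0, 222.0, 56.0, 36.0, 1024, 1024))
```

The first printed label line corresponds to the b6ab77fd7.txt example above, up to rounding.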

For training, the framework needs a *.data file that points to the lists of images to use, so we first create the train and validation lists:

from sklearn.model_selection import train_test_split

# train and validation hold the image ids that
# will be used for train and validation respectively.
train, validation = train_test_split(
     train_df.image_id.unique(),
     test_size=0.2
)

def save_selected_images(items, stage):
   # This method saves a file with
   # the path of each image in :items
   files = pd.DataFrame(items, columns=['filename'])

   # building absolute_name path for each image_id.
   files['path'] = './data/wheat/images/'
   files['extension'] = '.jpg'
   files['absolute_name'] = files.path +\
       files.filename +\
       files.extension

   # save file
   files.absolute_name.to_csv(
       f'yolov3/data/wheat_{stage}.txt',
       header=False,
       index=False,
       sep=' '
   )

save_selected_images(train, 'train')
save_selected_images(validation, 'validation')

Next, create wheat.data with the following:

classes=1
train=./data/wheat_train.txt
valid=./data/wheat_validation.txt
names=./data/wheat_1cls.names
backup=backup/
eval=wheat
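The names file referenced above is not shown in the post. In Darknet-style setups it is a plain text file with one class name per line, and the number of lines should match `classes=`. A minimal sketch, assuming the single class is simply called "wheat" and using a temporary directory as a stand-in for the real data folder:

```python
import os
import tempfile

class_names = ["wheat"]  # assumed label; the post does not state the actual name

data_dir = tempfile.mkdtemp()  # stand-in for the data/ folder used by the repo
names_path = os.path.join(data_dir, "wheat_1cls.names")
with open(names_path, "w") as f:
    f.write("\n".join(class_names) + "\n")

# Sanity check: `classes=` in wheat.data should equal the number of names
with open(names_path) as f:
    names_on_disk = [line.strip() for line in f if line.strip()]
print(len(names_on_disk), names_on_disk)
```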

In the training step, we choose which YOLO flavor to use through its weights and config. We tried Tiny YOLO, a lightweight model:

python train.py \
  --data data/wheat.data \
  --cfg cfg/yolov3-tiny-1cls.cfg \
  --weights weights/yolov3-tiny.weights
# ... set more hyperparameters if you need ...

For testing and prediction, we invoke test.py and detect.py respectively, with similar parameters. The complete code can be found here.

Pros

It doesn’t need a region of interest (ROI) stage, so it is faster for real-time detection.
Its weights are open source, so anyone can implement it with any framework.
The Ultralytics implementation is based on a CLI with config files, which makes it easier to get started.

Drawbacks

It only predicts over a limited number of bounding boxes.
Although the CLI is convenient, fine-tuning the model is harder to follow than with an API-based approach.
Even though it is implemented in PyTorch, it is not officially supported by it.
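To put a rough number on the “limited number of bounding boxes” point: a YOLOv3-style detector predicts a fixed set of candidate boxes, determined by the input size, the strides of its output layers and the number of anchors per grid cell. The sketch below assumes the commonly used defaults (strides 32, 16 and 8 for YOLOv3, strides 32 and 16 for the tiny variant, 3 anchors per cell) and a 416×416 input; the cfg file you actually train with may differ.

```python
def count_candidate_boxes(img_size, strides, anchors_per_cell=3):
    """Count the boxes predicted before any filtering such as NMS."""
    total = 0
    for stride in strides:
        cells_per_side = img_size // stride
        total += cells_per_side * cells_per_side * anchors_per_cell
    return total


print(count_candidate_boxes(416, strides=(32, 16, 8)))  # 10647
print(count_candidate_boxes(416, strides=(32, 16)))     # 2535
```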

Tensorflow — Object Detection API

TensorFlow offers its own framework for object detection tasks. It would have been friendlier if a tutorial were available on the official Keras page, as an advanced computer vision case. In addition, Keras does not offer utilities to build a Faster R-CNN as PyTorch does, so it lacks object detection capabilities out of the box.

In YOLO we need the center point and the size of each box. This framework instead needs four values, x_min, y_min, x_max and y_max, the same as PyTorch with Faster R-CNN.

train_df['xmin'] = train_df['x']
train_df['ymin'] = train_df['y']
train_df['xmax'] = train_df['xmin'] + train_df['w']
train_df['ymax'] = train_df['ymin'] + train_df['h']

The TensorFlow OD API has many pre-trained models, and we encourage you to select one and experiment! We selected ssd_mobilenet_v1, pre-trained on the COCO dataset.

As input, the architecture ingests a TFRecord, so we must convert our data into this format:

import os
import tensorflow as tf

def writeTfRecord(df, filename):
    writer = tf.compat.v1.python_io.TFRecordWriter(
        os.path.join(DIR, filename)
    )
    # split and create_tf_record are helpers not shown here
    grouped = split(df, 'filename')
    for group in grouped:
        tf_example = create_tf_record(
            group,
            'global-wheat-detection/train'
        )
        writer.write(tf_example.SerializeToString())
    writer.close()
    print("Done.")

writeTfRecord(train_df, 'train.record')
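The create_tf_record helper is not shown in the post. One piece such a helper typically needs is box coordinates scaled to [0, 1], since the Object Detection API's TFRecord examples conventionally store xmin, xmax, ymin and ymax normalized by the image width and height. A minimal sketch of that step, assuming pixel-space x, y, w, h boxes as in train.csv (`normalized_corners` is a hypothetical name, not an API function):

```python
def normalized_corners(boxes, img_w, img_h):
    """boxes: list of (x, y, w, h) in pixels, (x, y) being the top-left corner.

    Returns four parallel lists (xmins, xmaxs, ymins, ymaxs) with values in [0, 1].
    """
    xmins, xmaxs, ymins, ymaxs = [], [], [], []
    for x, y, w, h in boxes:
        xmins.append(x / img_w)
        xmaxs.append((x + w) / img_w)
        ymins.append(y / img_h)
        ymaxs.append((y + h) / img_h)
    return xmins, xmaxs, ymins, ymaxs


xmins, xmaxs, ymins, ymaxs = normalized_corners(
    [(834.0, 222.0, 56.0, 36.0)], img_w=1024, img_h=1024
)
print(xmins, xmaxs, ymins, ymaxs)
```

These lists would then go into the `image/object/bbox/xmin`, `image/object/bbox/xmax`, `image/object/bbox/ymin` and `image/object/bbox/ymax` float features of each example.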

We must fine-tune, so we start training with the following command:

python model_main.py \
   --logtostderr \
   --model_dir=training \
   --pipeline_config_path=training/ssd_mobilenet_v1_coco.config

If TensorBoard is installed, we can watch the training:

tensorboard --logdir=training/

This tool is really useful for monitoring the performance of your experiment!

Pros

There are many network architectures available to test, such as SSD and Faster R-CNN, with different backbones such as ResNet, Inception or MobileNet, so switching between models is really straightforward, at least with the default configuration.
The documentation explains how to run them in cloud environments.
It’s possible to download the full API code with trained models and quickly run predictions on your own images locally.

Drawbacks

It is harder to set up the environment compared to the other approaches: you have to install utilities such as protobuf manually and read a huge amount of documentation.
Even though the steps to train on a custom dataset are few, there’s no proper structure to perform them, so it becomes a handmade job, surfing through many code files. An interface to wrap this up is really needed.
The whole project is under research, so there are many cases in which the documentation is outdated. Besides, it does not offer support for TensorFlow 2.x, so a downgrade to a 1.x version is needed in order to re-train.
Like Ultralytics, it is a framework. It feels like you are losing control over what happens, as with a black box.

This post was written jointly by Julián Gutiérrez Ostrovsky, Nico Gallinal and me.