Object Tracking – MLT solution for the METI / NEDO Edge AI competition

Jul 21, 2020


by Yoovraj Shinde, Alisher Abdulkhaev, Naveen Kumar, Hajime Kato, Benjamin Ioller

The Ministry of Economy, Trade and Industry (METI) / New Energy and Industrial Technology Development Organization (NEDO) Japan recently organized the 3rd Edge AI Competition (Algorithm Contest 2), addressing the increasing demand to develop innovative edge computing solutions that leverage Deep Learning to accelerate IoT and smart city applications.

The first competition focused on developing object detection and segmentation algorithms; the second was dedicated to implementing algorithms on FPGAs (Field Programmable Gate Arrays). This third competition was about object tracking, focusing on accuracy, model size and inference time.

Task description

For each video captured by the vehicle front camera, assign a rectangular bounding box = (x1, y1, x2, y2) to every object to be predicted, and give the same unique object ID to the same object throughout the video. Multiple bounding boxes may be assigned to each frame. A bounding box is specified by four coordinates, with the upper-left corner of the image as the origin (0, 0): the upper-left coordinate of the object area (x1, y1) and the lower-right coordinate (x2, y2).

However, the objects to be evaluated (objects to be inferred) are limited to those that satisfy all of the following:
・There are two categories: “Car” and “Pedestrian”
・Objects that appear in 3 or more frames of a video (the frames do not have to be consecutive)
・Objects with a bounding box of 1024 pix² or more

*Even if an object appears in three or more frames of the video, it is not evaluated unless its bounding box is 1024 pix² or larger in at least three of those frames; in that case, even the frames where the object is 1024 pix² or larger are excluded from evaluation.
*Please note that a detected object that does not meet these conditions is counted as a false detection.
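
As an illustration, this eligibility rule can be sketched as a simple filter. The `annotations` structure and the function name are hypothetical, not taken from the competition code:

```python
from collections import defaultdict

def eligible_object_ids(annotations, min_area=1024, min_frames=3):
    """Return IDs of objects that have a bounding box of at least
    `min_area` pix^2 in at least `min_frames` frames.
    `annotations` maps frame index -> list of (object_id, x1, y1, x2, y2)."""
    frame_counts = defaultdict(int)
    for boxes in annotations.values():
        for obj_id, x1, y1, x2, y2 in boxes:
            if (x2 - x1) * (y2 - y1) >= min_area:
                frame_counts[obj_id] += 1
    return {obj_id for obj_id, n in frame_counts.items() if n >= min_frames}
```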

Source: Signate

Gold Medal and 3rd prize for MLT Team

The MLT team (Yoovraj Shinde, Alisher Abdulkhaev, Naveen Kumar, Hajime Kato and Benjamin Ioller) developed an object detection and object tracking system and won a gold medal and the 3rd prize in the competition. The code and documentation were open-sourced and are available on our MLT GitHub.

System description

Our solution is composed of two sequential modules: Object detection and object tracking.

[1] Object Detection

The first step consists of detecting objects in the frame. We chose RetinaNet with a ResNet-101 CNN backbone as our model architecture: as a single-stage detector it is very efficient on dense, small-scale objects while keeping inference fast enough for edge device applications. To get a better understanding of the architecture, please refer to the annotated paper “RetinaNet: Focal Loss for Dense Object Detection (ICCV 2017)” by Alisher Abdulkhaev.

One specificity of our approach was batch augmentation at inference time: we run detection on augmented copies of each image and finally apply Non-Maximum Suppression (NMS) across all the detected objects. We tried out several augmentation techniques (dark-bright / crop side / …), and based on these experiments our final submission used right-left flip augmentation.
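
A minimal sketch of this flip-based augmentation at inference time, assuming a `detect` callable that returns `(boxes, scores)` with boxes in `(x1, y1, x2, y2)` format. The helper names and the plain NumPy NMS are illustrative, not our competition implementation:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy Non-Maximum Suppression. boxes: (N, 4) as (x1, y1, x2, y2)."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou < iou_thresh]
    return keep

def detect_with_flip_tta(detect, image):
    """Run the detector on the image and its horizontal flip,
    then merge all detections with NMS."""
    w = image.shape[1]
    boxes, scores = detect(image)
    fboxes, fscores = detect(image[:, ::-1])
    # Map flipped-image coordinates back: x' = w - x, swapping x1/x2.
    unflipped = np.stack([w - fboxes[:, 2], fboxes[:, 1],
                          w - fboxes[:, 0], fboxes[:, 3]], axis=1)
    all_boxes = np.concatenate([boxes, unflipped])
    all_scores = np.concatenate([scores, fscores])
    keep = nms(all_boxes, all_scores)
    return all_boxes[keep], all_scores[keep]
```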

Finally, we implemented some heuristics related to bounding box filtering based on our dataset exploration. Our final submission used filtering based on the image position.

All training related information is summarized in ObjectDetectionTraining.md.

[2] Tracking

We formalized the tracking problem as a maximum weighted matching problem for objects in two adjacent frames and solved it using the Hungarian Algorithm.
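
A minimal sketch of this matching step, using SciPy's `linear_sum_assignment` as the Hungarian solver; the `max_cost` gating threshold is an illustrative assumption, not our tuned value:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_objects(cost_matrix, max_cost=1.0):
    """Match previous-frame objects (rows) to current-frame detections
    (columns) by minimising total cost with the Hungarian algorithm.
    Pairs whose cost exceeds `max_cost` are rejected (left unmatched)."""
    rows, cols = linear_sum_assignment(cost_matrix)
    return [(r, c) for r, c in zip(rows, cols) if cost_matrix[r, c] <= max_cost]
```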

For the matching costs, we used several features:

  • position
  • size
  • image similarity (histogram)
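
These cues can be combined into a single matching cost, for example as below. The weights and the normalisation constants are illustrative placeholders, not our tuned competition values; histograms are assumed L1-normalised:

```python
import numpy as np

def matching_cost(box_a, box_b, hist_a, hist_b,
                  w_pos=1.0, w_size=1.0, w_app=1.0, img_diag=1500.0):
    """Combine position, size, and appearance cues into one matching cost.
    Boxes are (x1, y1, x2, y2); lower cost means a more likely match."""
    ax = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    pos = np.hypot(ax[0] - bx[0], ax[1] - bx[1]) / img_diag   # centre distance
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    size = abs(area_a - area_b) / max(area_a, area_b)          # size mismatch
    # Total variation distance between normalised histograms, in [0, 1].
    app = 0.5 * np.abs(np.asarray(hist_a) - np.asarray(hist_b)).sum()
    return w_pos * pos + w_size * size + w_app * app
```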

Our tracker keeps the full history of each track and estimates each object's position in the next frame by linear or quadratic regression.
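
Such a motion estimate can be sketched with a simple polynomial fit over the track history (the function name and data layout are hypothetical):

```python
import numpy as np

def predict_next_center(history, degree=1):
    """Predict an object's next (x, y) centre by fitting a polynomial
    over time (linear for degree=1, quadratic for degree=2).
    `history` is a list of (x, y) centres, one per past frame."""
    history = np.asarray(history, dtype=float)
    # Fall back to the last observed position if the track is too short to fit.
    if len(history) <= degree:
        return tuple(history[-1])
    t = np.arange(len(history))
    fx = np.polyfit(t, history[:, 0], degree)
    fy = np.polyfit(t, history[:, 1], degree)
    t_next = len(history)
    return float(np.polyval(fx, t_next)), float(np.polyval(fy, t_next))
```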

Then, the tracker matches objects that are close in position and similar in size and appearance. During matching, the tracker also accounts for objects appearing and disappearing.

To handle this, we add virtual objects: a real object matched to a virtual one is regarded as newly appeared or disappeared. Disappeared objects are kept in the tracker for a while and can still be matched to detections in subsequent frames.
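
One way to realise such virtual objects is to pad the cost matrix before running the Hungarian solver, so every previous object can "disappear" and every new detection can "appear". The birth/death costs below are illustrative, not our tuned values:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_with_virtual(cost, birth_cost=0.8, death_cost=0.8):
    """Pad the cost matrix with virtual rows/columns: a previous object
    matched to a virtual column has disappeared, and a current detection
    matched to a virtual row has newly appeared."""
    n_prev, n_cur = cost.shape
    big = cost.max() + birth_cost + death_cost + 1.0  # effectively forbidden
    padded = np.full((n_prev + n_cur, n_cur + n_prev), big)
    padded[:n_prev, :n_cur] = cost
    # Previous object i may match its own virtual column -> disappeared.
    padded[np.arange(n_prev), n_cur + np.arange(n_prev)] = death_cost
    # Current detection j may match its own virtual row -> newly appeared.
    padded[n_prev + np.arange(n_cur), np.arange(n_cur)] = birth_cost
    # Virtual-to-virtual pairings are free so the assignment stays feasible.
    padded[n_prev:, n_cur:] = 0.0
    rows, cols = linear_sum_assignment(padded)
    matches = [(r, c) for r, c in zip(rows, cols) if r < n_prev and c < n_cur]
    disappeared = [r for r, c in zip(rows, cols) if r < n_prev and c >= n_cur]
    appeared = [c for r, c in zip(rows, cols) if r >= n_prev and c < n_cur]
    return matches, disappeared, appeared
```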

Improvements and lessons learned


There is a lot of room for improvement:

  • Better tuning or new batch augmentation
  • Pedestrian classifier to reduce false positives (developed, but not tuned enough to be used in the submission)
  • Automation pipeline for hyperparameter tuning

Lessons learned

  • The public score does not always reflect the private score, which makes tuning difficult
  • Running a dummy inference in the load_model method reduces inference time
  • Heuristics are very valuable for increasing the score
  • We should have spent more time cleaning the dataset
  • When you think you’re done with data exploration, you’re probably not: create a summary and a clean report.
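
The dummy-inference trick mentioned above can be sketched as a warm-up pass. The `predict` interface and input shape are assumptions; the point is that lazy initialisation (graph building, memory allocation) is paid at load time rather than on the first timed frame:

```python
import numpy as np

def warm_up(model, input_shape=(1, 512, 512, 3)):
    """Run one dummy forward pass so one-time initialisation happens
    inside load_model instead of during the first measured inference.
    `model` is assumed to expose a `predict(array)` method."""
    model.predict(np.zeros(input_shape, dtype=np.float32))
    return model
```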


Data

Please download the data from the competition page into the data/ folder:

(Note that Signate may not make the dataset publicly available.)

After setting up the Signate CLI:

cd data/
signate download --competition-id=256


This folder also contains our DataExploration notebook.

Source code

Please refer to src/README.md for explanation of our source code.

Submission — Evaluation

The Evaluation folder contains Signate's code for running a local evaluation.

Run the following command to generate the sample_submit folder.

bash generate_mlt_submission.sh

To test the submission instance, run the following:

bash test_submit.sh
cd sample_submit; pwd
python src/main.py

Message to the organizers

Thank you to Signate for hosting this exciting competition. Our team MLT is based in Tokyo, and we are really interested in edge devices and AI applications at the edge. Having the opportunity to work on a dataset that was “made in Tokyo” is really motivating and gives the feeling of working on a real-world project, compared to other competitions. On top of that, MOT (Multiple Object Tracking) is a fast-growing and evolving research topic, and certainly very challenging due to its complexity and application difficulty. Lastly, the sense of working together as a team towards one goal was a continuous source of motivation.

Yoovraj Shinde, Alisher Abdulkhaev, Naveen Kumar, Hajime Kato, Benjamin Ioller