Real Time Object Tracking System using Thermal Camera
Light intensity is very low at night, making it difficult for humans to see and recognize the objects around them. To see well at night, humans need additional tools: lighting devices such as streetlights and flashlights, or more advanced technologies such as night vision or thermal imaging cameras.
But how does a thermal camera work?
All objects emit a heat signature, also known as infrared energy. A thermal imager detects and measures the infrared energy of objects, then converts that infrared data into an electronic image showing the surface temperature of the object, which can be processed further.
Building the product
In this project, I was assigned to a team to develop a real-time person tracking system that works at night. We used a FLIR Lepton as the thermal imaging camera, an NVIDIA Jetson Nano as the single-board computer for processing, and two servo motors to rotate the camera so it could follow the target. The main objective of the product is to recognize whether there is a person in the frame while adjusting the camera position to track that person in real time.
We built the tracking system in several steps.
- First, we researched state-of-the-art object detection algorithms. Of all the options, we chose the YOLO object detection architecture.
- Then, we collected a dataset of people in thermal images and trained our model on it.
- After that, we deployed the model to the NVIDIA Jetson Nano and assembled it with the other components.
- Finally, we tested and calibrated the product to achieve better performance.
You Only Look Once (YOLO)
You Only Look Once (YOLO) is one of the best-known state-of-the-art object detection algorithms. The algorithm is a breakthrough because it frames object detection as a regression problem: the image is divided into a grid of smaller cells, and each cell is associated with bounding boxes, the probability that an object is present, and the probability of each class.
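As a rough illustration of the grid idea (not Darknet's actual code), a cell-relative prediction can be mapped back to whole-image coordinates. The function and its arguments below are ours, for illustration only:

```python
# Illustrative sketch of YOLO's grid idea: each grid cell predicts a
# box center relative to itself; decoding maps it back to normalized
# whole-image coordinates.

def decode_cell_prediction(cell_x, cell_y, grid_size, tx, ty, w, h):
    """Map a cell-relative box center (tx, ty in [0,1]) and normalized
    width/height to whole-image normalized coordinates."""
    cx = (cell_x + tx) / grid_size
    cy = (cell_y + ty) / grid_size
    return cx, cy, w, h

# A box centered in cell (6, 6) of a 13x13 grid sits at the image center.
print(decode_cell_prediction(6, 6, 13, 0.5, 0.5, 0.2, 0.4))
# (0.5, 0.5, 0.2, 0.4)
```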
We used the newest version of YOLO at the time, YOLOv4, and because we were going to deploy it on a single-board computer, we used the tiny variant. Below is a graph comparing benchmarks of object detection algorithms in terms of processing time, or FPS (frames per second).
The advantage of YOLO over other object detection algorithms is its much faster processing time: it processes the entire image in a single forward pass. We need an algorithm with fast processing (high FPS) because our tracking system works in real time.
In this project, we used the Darknet framework to run the YOLO architecture. Darknet is an artificial neural network (ANN) framework that is often used for deep learning projects. It is written in the C and CUDA programming languages. Its advantages include fast processing, easy installation on almost any device, and the choice of running on either CPU or GPU.
If you want more detailed information about Darknet and YOLOv4, you can read Aleksey Bochkovskiy’s blog here.
We had to collect the dataset manually because thermal-image datasets are still rare and difficult to find on the internet. The dataset was collected by recording videos of people with the FLIR Lepton 3.5 camera. We varied the background locations and used as many different people as we could to prevent overfitting, where the model would only work in certain cases. We also varied gestures, such as facing backwards, raising hands, and squatting, to improve the performance of the model.
After the dataset had been collected, we labeled every image to annotate the position of each person in it. The labels must follow the YOLO format. The following is an example of an image from the dataset along with its label.
For each image in the dataset, there is a .txt label file with the same name as the image file, containing the annotation information for the objects in that image. These are the YOLO annotation rules we followed:
- The number of lines in the label file indicates the number of annotated objects in the image.
- The first column with a value of 0 is the class index of the detected object, namely human.
- The second and third columns are the coordinates of the center point of the bounding box (x,y).
- The fourth and fifth columns are the width and height of the bounding box created (w,h).
- The second to fifth columns hold decimal values between 0 and 1 because they are normalized to the image size, which is 416x416 pixels.
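To make the format concrete, here is a small sketch (our own helper, not part of YOLO or Darknet) that parses one label line and converts the normalized values back to pixels for a 416x416 image:

```python
# Sketch: parse one YOLO label line and convert its normalized
# coordinates back to pixels for a 416x416 image.

def parse_yolo_label(line, img_w=416, img_h=416):
    """Turn one label line into (class_id, x, y, w, h) in pixels,
    where (x, y) is the bounding-box center."""
    cls, x, y, w, h = line.split()
    return (int(cls),
            float(x) * img_w, float(y) * img_h,
            float(w) * img_w, float(h) * img_h)

# "0 0.5 0.5 0.25 0.75" = one person box centered in the image
print(parse_yolo_label("0 0.5 0.5 0.25 0.75"))
# (0, 208.0, 208.0, 104.0, 312.0)
```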
We used the graphical image annotation tool created by Tzu Ta Lin so that we could label hundreds of images more easily and quickly. You can access it from his GitHub page here.
After collecting the dataset, the next step was to upload it to the training repository and divide it into three folders: training, validation, and testing. The ratio used in this project is 7:2:1, i.e. 679 images for training, 194 images for validation, and 97 images for testing.
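A split like ours can be sketched in a few lines of Python; the file names and the deterministic seed below are assumptions for illustration, not our actual repository layout:

```python
# Sketch: split a list of image paths into train/val/test sets with a
# 7:2:1 ratio, using a seeded shuffle so the split is reproducible.
import random

def split_dataset(paths, ratios=(0.7, 0.2, 0.1), seed=42):
    paths = list(paths)
    random.Random(seed).shuffle(paths)      # deterministic shuffle
    n = len(paths)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

train, val, test = split_dataset([f"img_{i:03d}.jpg" for i in range(970)])
print(len(train), len(val), len(test))  # 679 194 97
```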
After that, we trained the model on the dataset. The goal is to obtain weights, the multiplier coefficients on the neurons of the neural network. Training was conducted on Google Colaboratory to get good GPU resources and speed up the process. The Darknet framework also supports training, so Darknet was chosen as the code base for this step. Below is the command to run the training process with Darknet in a terminal.
./darknet detector train data/obj.data cfg/custom-yolov4-tiny-detector.cfg yolov4-tiny.conv.29 -dont_show -map
Code snippet explanation:
- The first words in the snippet above, darknet detector, are the default commands for invoking the detection features of the Darknet framework.
- The train command is given to train the model.
- The first argument, data/obj.data, is a file containing all the paths to the dataset that have been partitioned into the previous three sections.
- The second argument, cfg/custom-yolov4-tiny-detector.cfg, contains the architectural configuration of tiny YOLOv4.
- The third argument, yolov4-tiny.conv.29, is a file of pre-trained weights, trained up to a certain layer to perform convolution and feature extraction.
- The last two flags, -dont_show and -map, suppress the image output during training and periodically chart the mAP (mean Average Precision) metric.
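For reference, data/obj.data is a small text file in Darknet's standard format; the paths below are examples, not our exact repository layout:

```
classes = 1
train = data/train.txt
valid = data/valid.txt
names = data/obj.names
backup = backup/
```

Here, train and valid point to text files listing the image paths of each partition, names holds the class names (just "person" in our case), and backup is where Darknet saves intermediate weights.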
Our custom model performed well, achieving an mAP of 96.61%. We then downloaded the weights file and deployed it to the Jetson Nano for a live test.
If you want to learn more about the training process, I’ve included a blog from Roboflow about training YOLOv4 on a custom dataset in the references. It helped us a lot while we were working on this project.
Results and live testing
Here is a video demo of the product.
The system succeeded in detecting a person with a high confidence level (over 90%). It also continuously adjusts the camera position by rotating the servos so that the targeted person remains in the frame.
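The camera-centering logic can be sketched as a simple proportional controller; the gain, the frame size, and the rotation directions below are illustrative assumptions, not our exact control code:

```python
# Sketch of the tracking idea: nudge the pan/tilt servos in proportion
# to how far the detected box center is from the frame center.

FRAME_W, FRAME_H = 160, 120   # FLIR Lepton 3.5 resolution
KP = 0.05                     # assumed proportional gain, degrees per pixel

def servo_step(box_cx, box_cy, pan, tilt):
    """Return updated (pan, tilt) angles that move the camera toward
    the target, clamped to a 0-180 degree servo range. The signs of
    the corrections depend on how the servos are mounted."""
    err_x = box_cx - FRAME_W / 2
    err_y = box_cy - FRAME_H / 2
    pan = min(180, max(0, pan - KP * err_x))
    tilt = min(180, max(0, tilt + KP * err_y))
    return pan, tilt

# Target slightly right of center: pan angle changes, tilt stays put.
print(servo_step(100, 60, pan=90, tilt=90))  # (89.0, 90)
```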
We also tested its FPS (frames per second) performance by grabbing the FPS data from the log while the program was running. The FPS value is calculated as one divided by the execution time of each frame. There are hundreds of recorded FPS samples, so we wrote a Python script to extract and plot them, as shown below.
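A minimal version of such an extraction script might look like this; the "FPS:" log-line format here is an assumption, since the real detector's output may differ:

```python
# Sketch of the FPS-extraction step: pull FPS values out of the
# detector's log text and find the most common (rounded) value.
import re
from collections import Counter

def extract_fps(log_text):
    """Return all FPS values found in lines like 'FPS: 14.2'."""
    return [float(m) for m in re.findall(r"FPS:\s*([\d.]+)", log_text)]

log = "FPS: 14.0\nFPS: 13.9\nFPS: 14.1\nFPS: 1.2\nFPS: 14.0\n"
values = extract_fps(log)
most_common = Counter(round(v) for v in values).most_common(1)[0]
print(most_common)  # (14, 4): 14 FPS appears 4 times
```

The extracted values can then be plotted as a histogram (we used matplotlib for this) to see the distribution.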
From the graph above, it can be seen that the most frequent FPS value is 14. There are times when the value drops to 13 or close to 1, but these appear to be noise from the Jetson Nano or the camera, since they occur far less often than 14. This also shows that the YOLOv4 algorithm is the right choice for real-time object detection.