Building an object detection pipeline for edge inference

Madhur Zanwar · Published in Eumentis · Feb 15, 2024 · 9 min read
Object detection in action

This article is the first in a series of four on object detection on edge devices. In this article, we'll talk about:

  • Constraints on edge devices.
  • Optimizing accuracy and inference time when selecting a model for a defined task on an edge device.

Our goal

We have trained an object detection model on our custom dataset on the web, using the YOLOv5 framework. Our goal is to:

  1. Convert the model into a mobile-optimized format.
  2. Deploy it on a mobile device.
  3. Implement the necessary pre-processing and post-processing steps.
  4. Optimize inference time and prediction accuracy as per the constraints on mobile devices.

The major constraints on mobile devices are processing power and memory capacity.

Challenge

In our use case, the object to be detected was much smaller than the image. Running inference on the entire image (without tiling) gave very poor accuracy, so we divided the image into tiles of 640x640 pixels and ran inference on each tile separately. For an image of size 4800x4800 pixels this produced 49 tiles, which in turn multiplied processing time 49-fold. Our challenge was to minimize processing time without compromising much on accuracy.
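
Here is a minimal sketch of the tiling idea, assuming non-overlapping square crops; the function name is illustrative, not our production code:

```python
# Minimal tiling sketch: split a large image into fixed-size square
# crops and run inference on each crop independently.
import numpy as np

def tile_image(image: np.ndarray, tile: int = 640):
    """Yield (x, y, crop) for every full tile, left-to-right, top-to-bottom."""
    h, w = image.shape[:2]
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            yield x, y, image[y:y + tile, x:x + tile]

# A 4800x4800 image yields 7 x 7 = 49 full 640x640 tiles
# (the remaining border is handled by the padding strategy described later).
n_tiles = sum(1 for _ in tile_image(np.zeros((4800, 4800, 3), dtype=np.uint8)))
print(n_tiles)  # 49
```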

Running the model on a mobile device

This is one of the main objectives of our article. We aim to ensure that our object detection model can run on both Android and iOS devices using our chosen framework, PyTorch Mobile. Below, we outline some of the key aspects:

  1. First, we had to convert our web-based model from the .pt format to the mobile-compatible .ptl format (a post on this is coming soon). Direct conversion from .pt to .ptl did not work: we faced several errors while running the converted model on a mobile device. As a solution, we converted the .pt model to TorchScript and then from TorchScript to the .ptl format. This indirect conversion worked, and the .ptl model ran successfully on mobile (a sketch of the two-step conversion follows this list).
  2. In doing the above, we also set up the react-native-pytorch-core library, which allowed us to run inference on images captured from the camera or stored in the device gallery. This library provided basic functions such as reading, capturing, and saving images, converting images to tensors, and loading models.
  3. The react-native-pytorch-core library could not run directly on Android and required modifications to the core React Native files. (Link to modifications on android) We made these modifications and were able to run it successfully on Android. On iOS, however, no changes were required, and it ran smoothly out of the box.
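
Below is a minimal sketch of the two-step conversion, assuming the TorchScript file has already been produced on the web side (YOLOv5's export.py can emit one from a .pt checkpoint); the paths are placeholders:

```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

# Step 1: load the TorchScript model exported on the web side.
scripted = torch.jit.load("best.torchscript")

# Step 2: apply mobile optimizations and save in the lite-interpreter
# (.ptl) format that PyTorch Mobile loads on Android and iOS.
optimized = optimize_for_mobile(scripted)
optimized._save_for_lite_interpreter("best.ptl")
```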

Pre- and post-processing

On the web, all the pre- and post-processing was done with NumPy and opencv-python: OpenCV loads an image and converts it into a NumPy array for further processing. On mobile, the library we used, react-native-pytorch-core, converts an image into a tensor after reading it. We did not find any React Native compatible libraries for algebraic operations on tensors, and even react-native-pytorch-core offered only limited tensor-manipulation options, none of which were of use to us. Ultimately, we took a different approach and used react-native-photo-manipulator to perform our pre-processing steps of cropping and overlaying (for producing tiled images) directly on images rather than tensors. For post-processing, we wrote our own implementation of algorithms like non-maximum suppression (NMS) and IoU thresholding to obtain the desired output data; a sketch of that logic follows.
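
Here is a Python sketch of the post-processing logic we reimplemented by hand (our actual mobile implementation was written in JavaScript); boxes are assumed to be in (x1, y1, x2, y2) format, each with a confidence score:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thres=0.45):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # discard boxes overlapping the kept box above the IoU threshold
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thres]
    return keep
```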

Processing time

As our processing time had increased 49-fold, it became unacceptable on mobile devices. In our initial effort to reduce it, we implemented a strategy based on the aspect ratio: we padded the original image with blank white so that the result was a square whose dimensions were a multiple of our tile size of 640 pixels. The padding also placed the original image at the center of the overall image. Because our dataset typically had the objects centered within the image, making it practically impossible to find them near the borders, we decided to discard the first and last row and column of tiles. This adjustment reduced the number of tiles from 49 to 35, decreasing overall inference time. A sketch of the padding step follows.
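
A minimal sketch of the padding step, with illustrative names (which border tiles can safely be skipped depends on the dataset):

```python
import math
import numpy as np

def pad_to_tile_multiple(image: np.ndarray, tile: int = 640) -> np.ndarray:
    """Pad with blank white to a square whose side is a multiple of `tile`,
    keeping the original image centered."""
    h, w = image.shape[:2]
    side = max(math.ceil(h / tile), math.ceil(w / tile)) * tile
    canvas = np.full((side, side, 3), 255, dtype=np.uint8)  # blank white
    top, left = (side - h) // 2, (side - w) // 2
    canvas[top:top + h, left:left + w] = image  # original image centered
    return canvas
```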

However, after processing about five tiled images, the app crashed: memory usage accumulated with every tile, and eventually no memory was left for subsequent tiled images.

After carefully investigating the cause, we switched our JavaScript engine to Hermes (from the default Android and iOS engines). Hermes handled memory in a better way: at the end of each cycle, the memory used by the app was released for use by the next cycle.

We were now able to test our pipeline and get a processing time for a single image.

The above processing time was still not acceptable. Our target was a processing time of 1 minute or less, which led us to look for ways to reduce it further. This could be done in any of the following ways:

  1. Use parallel processing on the device.
  2. Increase the tile size to reduce the number of tiles to process. Total processing time = time to process a single tile × total number of tiles (see the sketch after this list).
  3. Downgrade the model to a smaller version, reducing the time to process a single tile. We were using YOLOv5 large (yolov5-l); medium (yolov5-m) and small (yolov5-s) versions are available.
  4. Upgrade to the YOLOv8 series, which also includes medium (yolov8-m), small (yolov8-s), and nano (yolov8-n) versions.
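
A back-of-the-envelope sketch of the trade-off behind option 2; the per-tile times below are placeholders, not our measurements:

```python
def total_time(per_tile_s: float, n_tiles: int) -> float:
    """Total processing time = time per tile * number of tiles."""
    return per_tile_s * n_tiles

# A larger tile takes longer individually, but there are far fewer of them,
# e.g. 35 tiles at 640x640 vs 15 tiles at 960x960 (hypothetical timings):
print(total_time(per_tile_s=2.0, n_tiles=35))  # 70.0 s
print(total_time(per_tile_s=3.0, n_tiles=15))  # 45.0 s
```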

We attempted parallel processing first. We tried two approaches, both aimed at calling the JavaScript function that runs the model in parallel. Both failed: even though the function calls were issued in parallel, the processing still happened serially. We put this path on hold.

Before moving on to the other approaches, we compared the yolov5-l mobile model with our web-based model. We wanted assurance that the mobile model performs with accuracy similar to the web model. This was important to establish before trying approaches 2, 3, and 4, which could reduce accuracy. For testing future models, yolov5-l would serve as the comparison baseline. The results are presented in subsequent sections.

We went ahead with approach 2. By increasing the tile size to 960x960 pixels, only 15 tiles needed processing, reducing overall time compared to the previous 35 tiles per image.

The processing of 15 tiles, each of resolution 960x960, did not substantially change the accuracy compared to the previous processing of 35 tiles, each of resolution 640x640.

So, as expected, there was a significant time reduction when we increased the tile size. But even this processing time was in the unacceptable range, and we had to reduce it further. The next planned trial was the YOLOv8 series, assessing the time and accuracy of the medium, small, and nano versions.

Since we did not have a custom-trained YOLOv8 model in any of the architecture sizes, we first had to train one on the web and then port it to mobile. Before investing time in training, we wanted to check whether the smaller YOLOv8 models take less time than the YOLOv5 large (yolov5-l) model, so we used pre-trained yolov8-m, yolov8-s, and yolov8-n models to evaluate their performance. All of them reduced processing time compared to yolov5-l. For YOLOv5 models, the difference in processing time between a pre-trained and a custom-trained model is about 2 seconds for a complete image (all tiles); we assumed the same for YOLOv8 and hence decided to train yolov8-m, yolov8-s, and yolov8-n models. A minimal training sketch follows, after which we present the time taken by the custom YOLOv8 models on Android and iOS.
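
A minimal sketch of training and exporting the three YOLOv8 variants with the ultralytics package; the dataset path and epoch count are placeholders:

```python
from ultralytics import YOLO

for size in ("m", "s", "n"):
    model = YOLO(f"yolov8{size}.pt")              # start from pretrained weights
    model.train(data="our_dataset.yaml", epochs=100, imgsz=640)
    model.export(format="torchscript")            # then convert to .ptl as before
```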

Model results

While trying several model versions to lower processing time, we also needed to keep evaluating accuracy. Our use case had a single class, which we split into 4 categories based on the number of objects detected (an illustrative sketch follows).
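
An illustrative sketch of this bucketing; the category names and boundaries below are hypothetical, since ours came from the specifics of the use case:

```python
def categorize(n_objects: int) -> str:
    """Map a detected-object count to one of 4 categories.
    Boundaries here are hypothetical placeholders."""
    if n_objects == 0:
        return "none"
    if n_objects <= 5:
        return "low"
    if n_objects <= 20:
        return "medium"
    return "high"
```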

Since we were going to evaluate several models, and processing time is significant, we decided to test intermediary models on only 20 images; if a model version's accuracy was not acceptable, we would not consider it for subsequent trials. Note: the final model would be tested on all 210 test images.

Trial results on the 20 images are presented below. From this data we inferred that increasing the tile size did not considerably impact model accuracy. Models that had already been tested on the full test set of 210 images were used for comparison below; for a fair comparison, we used their results on the selected 20 images, not on the 210.

As the yolov8-m (medium) model did not show a drop in accuracy but instead a slight increase, we concluded that the YOLOv8 series produced better results. Moreover, the smaller YOLOv8 versions (yolov8-s and yolov8-n) reduced inference time further. We therefore decided to test yolov8-s (small) and yolov8-n (nano) on our full test set of 210 images.

Below are the results of comprehensive testing of the yolov8-n model on an iOS device using the 210-image dataset, and their comparison with the yolov8-n Android results.

We found slight variations in the results between Android and iOS. In our research, we discovered that Android Studio converts portrait images to landscape during testing, which might introduce minor differences in the Android results compared to the original form of the image. However, even when we tested 15 landscape images on both platforms, some differences persisted, indicating that factors beyond image orientation were also contributing to the variations. Tests on various Android hardware devices also yielded slight differences, suggesting that the underlying hardware contributes as well.

These discrepancies did not have a significant impact on our tested parameters, as indicated in the table above. We also ran an additional trial to check whether multiple runs on the same set of images, on the same operating system (OS), would yield similar results; they did.

To strike a balance between time and accuracy, we opted for the yolov8-n model. Though yolov8-m had higher accuracy, the time it took (1m 51s) was almost double our required threshold (1m). Yolov8-n was much quicker, and its accuracy was comparable to its web counterpart. After selecting yolov8-n, we did a lot of trial and error on our custom dataset and decided to keep the threshold at 0.12 (a sketch of this kind of sweep follows).
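
A small sketch of that trial-and-error sweep, assuming the 0.12 value refers to the detection confidence threshold applied to the model's output scores; the detections and candidate thresholds below are placeholders:

```python
def keep_above(detections, threshold):
    """Keep only detections whose confidence score clears the threshold."""
    return [d for d in detections if d["score"] >= threshold]

# Placeholder detections; in practice these come from the model's output.
detections = [{"box": (0, 0, 10, 10), "score": s} for s in (0.05, 0.13, 0.40)]
for t in (0.05, 0.10, 0.12, 0.15, 0.25):
    kept = keep_above(detections, t)
    print(f"threshold={t:.2f} -> {len(kept)} detections kept")
```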

Stay tuned for our next post…!
