Investigating video transcoding configs vs bandwidth consumption for drone-detection computer vision models on an Unmanned Ground Vehicle (UGV).

Published in d*classified · May 14, 2023 · 13 min read

High-resolution videos are great but consume too much bandwidth — sometimes too much for edge robots to handle. A team of budding defence engineers investigated optimal video transcoder settings that achieve decent drone-detection results without soaking up bandwidth. Tests were done onboard a limited-compute Unmanned Ground Vehicle (UGV); the findings contribute towards configuring UGV sensors to assist in intelligence-gathering missions in unsafe or unknown territories. The team comprised Neo Hao Jun, Ng Jia Wei and Tan Yu Yao; they were mentored by engineers from the Land Systems Programme Centre — Jeremy Tian, Elizabeth Ng, and Benjamin Eu. If you’d like to embark on similar adventures in tinkering, we’d love to have you onboard our Young Defence Scientist Programme (YDSP).

Photo by Josie Weiss on Unsplash

TLDR: Unmanned Ground Vehicles (UGVs) are often deployed to assist in exploration of hazardous or unfamiliar terrain. To achieve this, UGVs need to employ a Simultaneous Localisation and Mapping (SLAM) algorithm to map out unknown areas. Next, the UGV’s onboard camera relays a video feed to a central receiver, where it undergoes processing via a computer vision model. This helps the UGV identify and classify potential targets of interest. A common challenge for robotics developers is to optimize how much information is relayed to receivers (e.g. another robot or a central controller) given bandwidth limitations. You might encounter the same relatable situation when trying to stream a 4K YouTube video with sketchy reception. The team identified eight video transcoding configurations (one per combination of resolution and frame rate) that achieved an approximate 99% detection success rate with minimal bandwidth utilization. This was used to support a computer vision model for a drone-detection mission, attaining a mean Average Precision (mAP) of 0.437 when considering Intersection Over Union (IOU) thresholds of 50% to 95% overlap. Furthermore, the majority of the detections exhibit a confidence score of 0.4 or higher, underscoring the efficacy of our solution.

-

A primer on Unmanned Ground Vehicles (UGVs)

Unmanned Ground Vehicles (UGVs) have emerged as a game-changer in both military and civilian applications worldwide. They have demonstrated their worth in critical roles during military land operations and combat, as well as across many humanitarian missions and disaster relief operations. These vehicles are either operated via onboard Artificial Intelligence (AI) or teleoperated by human operators. The deployment of UGVs offers numerous advantages, particularly in areas where there are risks to human safety, such as radioactive sites, terrorist hotspots, or uncharted territories without prior maps (e.g. a partially collapsed building that looks very different from its floor plan). As part of security patrol teams, UGVs could be deployed with sensors to detect potential threats — such as drones — with a greater field of view and sensitivity than human senses can achieve.

Rescue robots could be deployed for urban disaster sites as first responders. Photo by Carl Kho on Unsplash

Selecting and setting up our UGV test rig

We need a UGV that is small, light and intelligent enough to seek out targets like drones.

UGVs are typically grouped into weight classes: small (4.5 to 90 kg), medium (100 to 250 kg), large (250 to 500 kg), very large (500 to 1,000 kg), and extremely large (> 1,000 kg). We assessed small UGVs (SUGVs) to be the most suitable for gathering intelligence in hazardous or unexplored areas.

SUGVs are preferred because of their ability to access tight spaces, providing a comprehensive map of the area, and their relatively low cost, which allows multiple vehicles to be deployed simultaneously for increased coverage. We picked an Educational Robot Kit as an example of a compute-constrained robotics test platform: equipped with four mecanum wheels, it can carry a large payload and manoeuvre omnidirectionally.

Mecanum wheels. Photo by Boitumelo Phetla on Unsplash

The Educational Robot Kit was controlled via an application, which posed a challenge when unsupported sensors and peripherals had to be integrated onto it, or when more complex algorithms had to be implemented, both of which would be the case in this project. To overcome this, we interfaced a Raspberry Pi to the robot with a compatible software development kit (SDK) — think of this as upgrading the brains of the robot — extending the capabilities of the robot kit through connection with external peripherals such as the Intel® RealSense™ D435i and T265, which allowed us to run Simultaneous Localization and Mapping (SLAM) and Computer Vision (CV) algorithms. Now we’re ready to rumble!

With an upgraded onboard compute, our UGV levels up to perform autonomous navigation

In a nutshell, the ROS2 wrapper for Intel RealSense cameras provides ROS2 nodes for easy access to data from these cameras, which can be used with the ROS-enabled Google Cartographer for real-time SLAM. The resulting map is used for robot navigation with obstacle avoidance and dynamic path planning via Nav2. The camera feed also serves as the video stream that is transcoded and passed to the object detection model. Navigation information is sent back to the robot via a ROS2 message received by the Raspberry Pi.
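
For a concrete picture of how the camera data becomes available to the rest of the stack, here is a minimal sketch of a ROS2 node that taps the RealSense colour stream. The topic name and queue depth are assumptions based on the realsense2_camera defaults and may need adjusting to your launch configuration.

```python
# Minimal sketch: a ROS2 node that taps the RealSense colour stream.
# The topic name is an assumption based on realsense2_camera defaults.
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image


class CameraTap(Node):
    def __init__(self):
        super().__init__('camera_tap')
        # The RealSense ROS2 wrapper typically publishes colour frames here.
        self.subscription = self.create_subscription(
            Image, '/camera/color/image_raw', self.on_frame, 10)

    def on_frame(self, msg: Image):
        # Frames received here can be handed to the transcoding/CV pipeline.
        self.get_logger().info(
            f'Frame received: {msg.width}x{msg.height}, encoding={msg.encoding}')


def main():
    rclpy.init()
    node = CameraTap()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```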

Connecting the Raspberry Pi to the Robot Kit via the USB connection mode.

Step 1 — Transcoding the UGV Stream

Our video streams from the UGV were broadcast to a transcoder through the Real-Time Streaming Protocol (RTSP). We used Wowza Streaming Engine as the transcoding engine because of its accessible interface and its ability to take input and output streams through many different protocols, including RTSP. RTSP itself was selected because of its wide usage in Internet Protocol (IP) cameras, which allows for easy scalability.
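
As an illustration of the publishing side, here is a hedged sketch of pushing the Pi’s camera feed to an RTSP ingest point by wrapping FFmpeg in Python. The device path, bitrate and the Wowza URL are placeholders; the actual application name and credentials depend on how Streaming Engine is configured.

```python
# Minimal sketch: publish the UGV camera feed to an RTSP ingest point with FFmpeg.
# The device path and RTSP URL are placeholders for illustration only.
import subprocess

RTSP_URL = 'rtsp://WOWZA_HOST:554/live/ugv'  # placeholder ingest URL

cmd = [
    'ffmpeg',
    '-f', 'v4l2', '-i', '/dev/video0',           # capture from the Pi's camera device
    '-c:v', 'libx264',                           # H.264 encode on the Pi
    '-preset', 'ultrafast', '-tune', 'zerolatency',
    '-b:v', '800k',                              # cap the bitrate to save bandwidth
    '-f', 'rtsp', RTSP_URL,                      # publish over RTSP
]
subprocess.run(cmd, check=True)
```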

To identify which transcoding configuration is optimal, we adjusted the video stream’s bitrate for each resolution (160p, 240p, 360p, 720p) and each frame rate setting (10 fps, 30 fps). We then rated each output stream, after it had been annotated with bounding boxes from the CV model, against the following criteria: latency, the approximate performance of the model, and average bandwidth usage. The optimal transcoding configuration at each resolution and frame rate setting was the one that resulted in an approximate performance of 99% from the CV model at the lowest bitrate possible. The CV model’s performance is assessed based on the stability of its detection of unattended bags: if an unattended bag is detected in 99% of consecutive test frames streamed from the UGV, the model is deemed to have an approximate performance of 99%.
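
To make the “approximate performance” criterion concrete, the sketch below computes detection stability over a streamed test clip. The per-frame data structure is hypothetical; the idea is simply the fraction of consecutive frames in which the target is detected.

```python
# Minimal sketch of the stability check: the fraction of consecutive test frames
# in which the target (here, an unattended bag) is detected.
# `detections_per_frame` is a hypothetical list of per-frame detection lists.

def detection_stability(detections_per_frame):
    """Return the fraction of frames containing at least one target detection."""
    if not detections_per_frame:
        return 0.0
    hits = sum(1 for dets in detections_per_frame if len(dets) > 0)
    return hits / len(detections_per_frame)

# A transcoding configuration passes if stability >= 0.99 over the test clip.
frames = [[{'label': 'bag'}], [], [{'label': 'bag'}]]  # toy data
print(f'approximate performance: {detection_stability(frames):.2%}')
```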

We then tested two different combinations of streams on a 2x2 video wall to see how the bitrate changes when all four screens are displayed at the same time, compared to when only one screen is displayed. All streams were taken with a stationary camera and no movement in the captured video.

Step 2 — Training and Assessing the CV Model

We picked YOLOv5 as our object detection model, known for its speed, high accuracy, and ease of installation and use (note: YOLOv8 has since been released!). It is pre-trained on the Common Objects in Context (COCO) dataset, a large-scale, challenging, and high-quality object detection, segmentation, and captioning dataset.

Our CV model builds upon YOLOv5, ingesting the transcoded RTSP stream to obtain raw frames for inference. For this project, the CV model is trained to detect unattended bags in an unknown territory, though it can easily be re-trained for alternative use cases when the need arises.

Specifically, using a basic unattended-bag framework, the model compares the pixel distance between identified humans and bags, alerting the user if it picks out a bag that is beyond a set distance from the surrounding individuals. When an alert needs to be sent out, the CV model annotates the bounding boxes of unattended bags onto the original stream. To allow the operator to verify the detection made by the model, bounding boxes of humans and other bags are also drawn, but less conspicuously. These annotated frames are then rebroadcast to RTSP through FFmpeg for ease of further processing.
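
A minimal sketch of this heuristic is shown below, assuming COCO-pretrained YOLOv5 weights loaded via torch.hub, the stock COCO bag classes, and a centroid-distance threshold picked purely for illustration; the actual pipeline may use a different distance measure and threshold.

```python
# Minimal sketch of the unattended-bag heuristic. The bag classes and the
# pixel-distance threshold below are assumptions for illustration only.
import math
import torch

model = torch.hub.load('ultralytics/yolov5', 'yolov5s')  # COCO-pretrained weights

BAG_CLASSES = {'backpack', 'handbag', 'suitcase'}
UNATTENDED_PX = 300  # hypothetical pixel-distance threshold


def centroid(row):
    return ((row['xmin'] + row['xmax']) / 2, (row['ymin'] + row['ymax']) / 2)


def find_unattended_bags(frame):
    """Return bag detections farther than UNATTENDED_PX from every detected person.

    `frame` can be a file path, PIL image or numpy array accepted by YOLOv5.
    """
    df = model(frame).pandas().xyxy[0]
    people = [centroid(r) for _, r in df.iterrows() if r['name'] == 'person']
    bags = [r for _, r in df.iterrows() if r['name'] in BAG_CLASSES]
    unattended = []
    for bag in bags:
        bx, by = centroid(bag)
        if all(math.hypot(bx - px, by - py) > UNATTENDED_PX for px, py in people):
            unattended.append(bag)
    return unattended
```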

As an extension to the unattended bag detection model, we also trained the YOLOv5 model on a custom rotary-winged UAV dataset made by Mehdi Özel. The training produced a fairly consistent and robust UAV detection model, which was assessed using a self-labelled test dataset built by picking out 50 frames from online footage of drones.

With multiple resulting output streams from the CV model, it is important to display the consolidated streams neatly on a video wall for observation. We opted for the Milestone XProtect Smart Wall, which enables us to manually tweak the settings for the video streams depending on the context and available bandwidth at that moment.

Step 3 — Dig deeper into hardware controls

The Educational Robot Kit did not allow us access to the low-level controls that would be useful in modifying the UGV’s behaviour for specialised missions. What do defence engineers do when presented with such a challenge? DIG DEEPER.

Photo by Magnus Engø on Unsplash

We pulled out the technical manuals and identified a CAN bus port on the robot, which could be tapped to send signals to individual components on the robot rather than running our code through the original equipment manufacturer (OEM)’s frameworks. On a side note, if you’d like to understand how interesting CAN bus can get, check out this good resource. Don’t try this at home on your parents’ cars.

Back to business. Through a dash of engineering grit, we managed to receive, interpret and send individual commands through the robot kit’s CAN network using the Raspberry Pi 4B and Waveshare’s RS485 CAN HAT, and mapped the CAN IDs out in Table 1:
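
Below is a hedged sketch of how frames can be read from and written to the robot’s CAN network with the python-can library, assuming the RS485 CAN HAT is brought up as the SocketCAN interface can0. The arbitration ID, payload and bitrate are placeholders; the real IDs are the ones mapped out in Table 1.

```python
# Minimal sketch: sniff and send CAN frames from the Raspberry Pi, assuming the
# RS485 CAN HAT is exposed as SocketCAN interface `can0`
# (e.g. after `sudo ip link set can0 up type can bitrate 1000000`).
# The arbitration ID and payload below are placeholders, not the real mapping.
import can

with can.Bus(interface='socketcan', channel='can0') as bus:
    # Sniff a few frames to work out which IDs belong to which module.
    for _ in range(5):
        msg = bus.recv(timeout=1.0)
        if msg is not None:
            print(f'ID=0x{msg.arbitration_id:X} data={msg.data.hex()}')

    # Send a command frame to a specific module (placeholder ID and payload).
    cmd = can.Message(arbitration_id=0x201,
                      data=[0x01, 0x00, 0x00, 0x00],
                      is_extended_id=False)
    bus.send(cmd)
```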

Table 1: CAN ID mapped to onboard sensors and modules

Step 4 — Run and test SLAM Algorithm

We ran our SLAM algorithm and it worked as expected. We used Google Cartographer for ROS, which generated an output map similar to the one shown below. Using the map generated by the SLAM stack, our navigation system enables the robot to be controlled by a remote operator through inputting desired waypoints, enabling semi-autonomous data gathering in areas of interest.
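
For illustration, a minimal waypoint command using the nav2_simple_commander helper might look like the sketch below. The frame name, coordinates and the startup check are assumptions; they depend on how Cartographer and Nav2 are launched on the robot.

```python
# Minimal sketch of operator-specified waypoint navigation on top of Nav2.
# Coordinates are placeholders in the frame of the Cartographer-generated map.
import rclpy
from geometry_msgs.msg import PoseStamped
from nav2_simple_commander.robot_navigator import BasicNavigator

rclpy.init()
navigator = BasicNavigator()
# Waits for Nav2 to come up; may need adjusting if localisation is not the default.
navigator.waitUntilNav2Active()

goal = PoseStamped()
goal.header.frame_id = 'map'
goal.header.stamp = navigator.get_clock().now().to_msg()
goal.pose.position.x = 2.0   # placeholder waypoint in map coordinates
goal.pose.position.y = 1.5
goal.pose.orientation.w = 1.0

navigator.goToPose(goal)
while not navigator.isTaskComplete():
    pass  # feedback or telemetry could be reported here

print('Waypoint result:', navigator.getResult())
rclpy.shutdown()
```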

Our UGV executing the route in purple as it moved through an unknown space

Step 5 — Testing various transcoding settings

We tested different resolution settings and decided on 160p, 240p, 360p and 720p as the most suitable. Too high of a resolution is unnecessary and requires too much bandwidth, while too low of a resolution will result in the CV model being unable to detect unattended bags.

Through our tests, we found the optimal transcoding configurations that achieve an approximate CV model performance of 99%, as noted in Table 2.

Table 2: Optimal configurations for various resolution and fps settings.

To work within bandwidth limitations, latency and quality must be balanced against the scale of streaming, as these factors pull in opposite directions. Higher quality allows for more accurate detection by the CV model but comes with a bandwidth penalty, making it more difficult to scale up the number of streams. Conversely, a lower-quality video stream has lower latency and is easier to scale up, but makes it harder for the CV model or the human operator to detect objects of interest. The same trend is seen when the number of streams is increased, as shown in Table 3.

Table 3: Combinations of streams on a 2 x 2 video wall.

The best configuration for a video wall depends on its intended use. For instance, if more than 4 screens are to be displayed simultaneously, using a lower resolution such as 240p can help reduce bandwidth consumption. If the video does not have a lot of movement, choosing the 10 fps setting can greatly decrease bandwidth usage, though it may cause slightly greater latency.

Step 6 — Testing our onboard CV model in a drone-detection mission

We deployed our onboard CV model against a test data set for drone detection, and used mean Average Precision (mAP) and F1 score as performance metrics.

The mean Average Precision (mAP) is a standard metric used to analyse the accuracy of an object-detection model. It is the mean of the Average Precision (AP), where AP averages the precision obtained each time a new positive sample is recalled.

The F1 score is the harmonic mean of precision (the fraction of predicted positives that are correct) and recall (the fraction of actual positives that are detected), and conveys the balance between the two. It is a popular performance metric for classification systems and is generally more useful than accuracy, especially in cases with an uneven class distribution.
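
As a small worked example of these definitions (with purely illustrative counts):

```python
# Worked example of precision, recall and F1. Counts are illustrative only.
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)   # correct detections / all detections made
    recall = tp / (tp + fn)      # correct detections / all actual targets
    return 2 * precision * recall / (precision + recall)

# e.g. 45 true positives, 5 false positives, 10 missed targets
print(round(f1_score(tp=45, fp=5, fn=10), 3))  # -> 0.857
```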

Graph of F1 score against Confidence

The F1 score of the model at a specific confidence level denotes how consistent the model is at detecting the target object at that confidence level. Our model achieved a mAP50 of 0.971 and a mAP50–95 of 0.437, i.e. a mAP of 0.437 when considering Intersection Over Union (IOU) thresholds from 50% to 95% overlap, a performance that is substantial for a model trained on fewer than 2,000 labelled images. As shown by the graph above, the biggest drop in the F1 score occurs at a confidence of about 0.4, meaning that a majority of the detections have a confidence of 0.4 or higher.
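
For context on the mAP50–95 figure, AP is computed at IOU thresholds from 0.50 to 0.95 (the COCO convention) and averaged; a minimal IOU helper with illustrative boxes is sketched below.

```python
# Minimal Intersection-over-Union (IOU) helper. Boxes are (xmin, ymin, xmax, ymax)
# in pixels; the example boxes are illustrative only.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# e.g. a prediction that overlaps the ground-truth box reasonably well
print(round(iou((10, 10, 110, 110), (30, 20, 130, 120)), 2))  # ~0.56
```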

Precision-Recall Curve

The precision-recall curve shows the trade-off between precision and recall for different thresholds. A large area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. Our model demonstrates high levels of recall and precision, with the precision only dipping to 0.6 at a recall of 0.95, meaning that even at an operating point that recovers 95% of the drones, the model maintains a precision of at least 60%. This is partly because the model was occasionally unable to detect the drone due to its small size and indistinguishable features, as shown below.

Examples of drone detection by our CV model

Concluding notes

1. Limitations & Future Work

The CV model has limitations in detecting all types of bags accurately due to the constraints of the COCO database. Additionally, the model is only reliable when the bag is upright and directly facing the camera. This can pose a challenge when the UGV cannot adjust the camera angle, and may result in missing unattended bags.

Future work can explore the use of 3D space identification to overcome this challenge. This involves using an additional camera to track the object’s x, y, and z coordinates, even when it is partially blocked by obstacles. By doing so, the tracking ability of the UGV can be enhanced, and unattended bags can be detected more effectively in areas where the camera’s field of view is limited.

2. Recommendations for transcoder configurations

The choice of stream configurations depends on the specific operating conditions, and there are various transcoding options available to the human operator to adjust the stream settings and optimize bandwidth usage depending on the situation. Essentially, if there are many video feeds being viewed simultaneously, the resolution can be reduced as all the streams will be compressed into one screen with a fixed resolution. Similarly, if the scene is mostly static or the objects are moving slowly, sacrificing some frame rate is acceptable, as a small delay in reaction time is insignificant.

However, the available bandwidth remains the most critical factor in determining the appropriate stream configuration. If there is insufficient bandwidth, some stream quality must be compromised to ensure that the stream remains continuous and smooth even under such conditions.

3. Real-life applicability of CV model

The computer vision (CV) model generated from the drone dataset of around 1,500 images was fair given the available project time, but could become even more dependable and consistent with additional training images and accurate labels. Further image augmentation could be done to train the model on drone detection under more complex environments (e.g. adverse weather and poor ambient light). Since the video streams are sent to a central receiver instead of being analysed onboard the UGVs, the CV model can be updated or replaced with more advanced models supported by additional resources and power, enhancing target detection by performing more comprehensive processing on each frame at the receiver.

References:

Glenn Jocher, Ayush Chaurasia, Alex Stoken, Jirka Borovec, NanoCode012, Yonghye Kwon, Kalen Michael, TaoXie, Jiacong Fang, imyhxy, Lorna, 曾逸夫(Zeng Yifu), Colin Wong, Abhiram V, Diego Montes, Zhiqiang Wang, Cristi Fati, Jebastin Nadar, Laughing, … Mrinal Jain. (2022). ultralytics/yolov5: v7.0 — YOLOv5 SOTA Realtime Instance Segmentation (v7.0). Zenodo. https://doi.org/10.5281/zenodo.7347926

Grunnet-Jepsen, A., Harville, M., Fulkerson, B., Piro, D., Brook, S. & Radford, J. (n.d.). An Introduction to Intel® RealSense™ Visual SLAM and the T265 Tracking Camera (Version 1.0). Intel Corporation. https://www.intelrealsense.com/download/9275/?-1818208019.1672666677.

Intel Corporation. (2018, 16 April). Unattended Baggage Detection Using Deep Neural Networks in Intel® Architecture. https://www.intel.com/content/www/us/en/developer/articles/technical/unattended-baggage-detection-using-deep-neural-networks-in-intel-architecture.html

Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., … & Zitnick, C. L. (2014, September). Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision (pp. 740–755). Springer, Cham.

Lindholm, V. (2022). Unmanned Ground Vehicles in Urban Military Operations: A case study exploring what the potential end users want.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Schmidt, P. (2019). Intel® RealSense™ Tracking Camera T265 and Intel® RealSense™ Depth Camera D435 — Tracking and Depth (Revision 001). Edited by James Scaife Jr., Michael Harville, Slavik Liman, Adam Ahmed, Intel Corporation. https://www.intelrealsense.com/wp-content/uploads/2019/11/Intel_RealSense_Tracking_and_Depth_Whitepaper_rev001.pdf?_ga=2.257078728.299532539.1672835296-1818208019.1672666677
