On-Device AI Optimization — Leveraging Driving Data to Gain an Edge

Published in Engineering@Nauto, Sep 28, 2020

The Nauto AI platform is focused on improving driver safety. Our devices monitor driver behavior, vehicle movement, external traffic conditions, and other contextual data to help actively prevent dangerous situations before they happen.

At the heart of our platform, machine learning powers everything from detecting tailgating and driver distraction to predicting imminent collisions. As a result, the capability of our models directly impacts the quality of our safety offerings. For instance, object detection model accuracy determines how early our Predictive Collision system is able to alert, while distraction model accuracy determines how effectively we’re able to keep a driver’s eyes on the road. In many situations a one percent improvement in accuracy can make the difference between a near miss and a deadly collision.

My goal as an ML engineer is to maximize the predictive performance of our models.

Understanding the Latency-Accuracy Trade-off

However, that objective isn’t always so simple for an aftermarket AI platform like ours. For starters, we must fit all of our models and algorithms onto a computing platform the size of a smartphone, and then be able to run them together in real-time. In order to keep our safety offerings economically accessible, we’re also limited by the class of processors we’re able to fit in our devices. For these reasons, we must not only optimize our models for predictive performance, but also for computational efficiency.

An internal Nauto study evaluating object detection performance on driving data. Each trial combines a feature extractor (color-coded) with an input resolution and a meta-architecture such as SSD, R-FCN, or Faster R-CNN. R-FCN and Faster R-CNN trials contain an extra independent variable, region proposal count, which ranges from 50 to 300. Latency is measured as inference time on a server-side NVIDIA GPU (acting as a proxy for on-device performance).

This is where the latency/accuracy trade-off curve comes into play. In the field of computer vision there is generally a positive correlation between the accuracy of a model and the amount of time it takes to make a prediction. For instance, one may look to improve accuracy by increasing input resolution or expanding the depth of the network, but that will also likely increase inference latency. With the requirement of running real-time models on mobile processors, our ultimate goal is to maximize accuracy within a specific compute constraint.

In the figure above, for example, the constraint might be an inference latency of under 50ms. Given a search space that spans different feature extractors, meta-architectures, and input resolutions, we may employ two primary tools to achieve our goal. First, we can try to improve the accuracy of an efficient model that already satisfies our compute constraint (blue). Or, we may seek to speed up an accurate but heavier model through performance optimizations (green).

In practice, the two are not always orthogonal; improving along one axis may negatively impact the other. But sometimes that may still be a trade-off we are willing to make. For instance, if increasing input resolution leads to significantly better performance on small objects like pedestrians, we may willingly accept the speed penalty, or alternatively look for other performance optimizations to recoup the loss.

The bottom line is that, because of the accuracy/latency relationship, the two approaches are complementary: improving our models along either axis gives us more wiggle room to improve along the other, landing us closer to our ultimate goal.
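The goal stated above, maximizing accuracy within a fixed compute constraint, can be sketched in a few lines of Python. This is only an illustration: the `Trial` type, the trial names, and every number below are invented, and none of them come from the internal study.

```python
from typing import NamedTuple, Optional, Sequence


class Trial(NamedTuple):
    """One point on the latency/accuracy plot (all values hypothetical)."""
    name: str
    latency_ms: float
    map_score: float


def best_under_budget(trials: Sequence[Trial], budget_ms: float) -> Optional[Trial]:
    """Return the most accurate trial whose latency is under the budget."""
    feasible = [t for t in trials if t.latency_ms < budget_ms]
    # max() with default=None handles the case where nothing fits the budget.
    return max(feasible, key=lambda t: t.map_score, default=None)


# Made-up trials for illustration only.
trials = [
    Trial("ssd_small_300", latency_ms=22.0, map_score=0.21),
    Trial("ssd_small_512", latency_ms=41.0, map_score=0.27),
    Trial("rfcn_large_600", latency_ms=95.0, map_score=0.34),
]

best = best_under_budget(trials, budget_ms=50.0)
print(best.name)  # ssd_small_512
```

Both tools described above change the inputs to this selection: accuracy work raises `map_score` for trials already inside the budget, while performance work lowers `latency_ms` for trials currently outside it.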

It is also worth mentioning that more recently, a technique known as Neural Architecture Search (NAS) has tremendously aided in the development of efficient, mobile CNNs. Simply put, NAS helps automate the model design process by employing search tools such as reinforcement learning or supernetworks to discover optimal architectures. When targeted towards specific compute platforms, it can generate models not only optimized for accuracy but also for computational efficiency.

Leveraging the Domain Characteristics of Driving Data

When I began my journey at Nauto a year ago, I was commissioned to replace our existing object detector with a more efficient model. After some research and experimentation, I arrived at a new architecture that was able to achieve an accuracy improvement of over 40% mAP* relative to our current detector, while running almost twice as fast. The massive improvement comes largely thanks to the mobile-targeted NAS design framework pioneered by works such as MnasNet and MobileNetV3.

*mAP (mean average precision) is a common metric for evaluating the predictive performance of object detectors.

Relative to our current model, the new detector reduces device inference latency by 43.4% and improves mAP by 42.7%.

Informed Channel Reduction

However, the most interesting improvements surfaced as I looked for ways to further push the boundary of the latency/accuracy curve. During my research I came across an intriguing finding by the authors of Searching for MobileNetV3, a new state-of-the-art classification backbone for mobile devices. They discovered that when adapting the model for the task of object detection, they were able to reduce the channel counts of the final layers by a factor of 2 with no negative impact on accuracy.

The underlying idea was simple: MobileNetV3 was originally optimized to classify the 1000 classes of the ImageNet dataset, while the object detection benchmark, COCO, only contains 90 output classes. Identifying a potential redundancy in layer size, the authors were able to achieve a 15% speedup without sacrificing a single percentage point of mAP.

Compared to popular benchmark datasets like ImageNet (1000) and COCO (90), the driving data we work with at Nauto consists of a minuscule number of distinct object classes.

Intrigued, I wondered if I could take this optimization further. In our perception framework we are only interested in detecting a handful of classes such as vehicles, pedestrians, and traffic lights — in total amounting to a fraction of the 90–1000 class datasets used to optimize state-of-the-art architectures. So I began to experiment with reducing the late stage layers of my detector by factors of 4, 8, and all the way up to 32 and beyond. To my surprise, I found that after applying aggressive channel reduction I was able to reduce latency by 22%, while also improving accuracy by 11% mAP relative to the published model.

My original hope was to achieve a modest inference speedup with limited negative side-effects; I never expected to actually see an improvement in accuracy. One possible explanation is that while the original architecture was optimal for the diverse 90-class COCO dataset, it is overparameterized for the relatively uniform road scenes experienced by our devices. In other words, removing redundant channels may have improved overall accuracy in a similar way to how dropout and weight decay help prevent overfitting.

At any rate, this optimization illustrates how improving along one axis of the latency/accuracy curve can impact performance in the other. In this case, however, the unintentional side-effect was positive. In fact, we broke the general rule of the trade-off by making a simultaneous improvement in both dimensions.

Applying aggressive channel reduction to the late-stage layers of the detector resulted in a 22% speedup and an 11% improvement in mAP relative to the baseline model.
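To make the channel-reduction idea concrete, here is a minimal sketch of how late-stage channel counts might be divided by a reduction factor. The rounding helper, the multiple-of-8 convention, the example channel counts, and the feature-map size are all assumptions for illustration; the post does not specify which layers were reduced or how the counts were rounded.

```python
from typing import List, Sequence


def make_divisible(value: float, divisor: int = 8) -> int:
    """Round to the nearest multiple of `divisor`, never going below `divisor`."""
    return max(divisor, int(value + divisor / 2) // divisor * divisor)


def reduce_channels(channels: Sequence[int], factor: int, divisor: int = 8) -> List[int]:
    """Divide each channel count by `factor`, keeping hardware-friendly sizes."""
    return [make_divisible(c / factor, divisor) for c in channels]


def pointwise_macs(height: int, width: int, c_in: int, c_out: int) -> int:
    """Multiply-accumulate count of a 1x1 convolution on a height x width map."""
    return height * width * c_in * c_out


# Hypothetical late-stage layer: a 1x1 conv from 160 to 960 channels on a 10x10 map.
late_stage = [160, 960]
reduced = reduce_channels(late_stage, factor=4)
print(reduced)  # [40, 240]

before = pointwise_macs(10, 10, *late_stage)
after = pointwise_macs(10, 10, *reduced)
print(before // after)  # 16
```

Because a 1x1 convolution's cost scales with the product of its input and output channels, dividing both by 4 cuts that layer's multiply-accumulates by 16. The end-to-end latency gain is much smaller, since the rest of the network is unchanged and measured latency does not track operation counts exactly.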

Task-specific Data Augmentation

The success I had with channel reduction motivated me to look for other ways to leverage the uniqueness of driving data. Something that immediately came to mind was a study done by a former colleague of mine at my previous company, DeepScale. Essentially, he found that conventional data augmentation strategies like random flip and random crop**, while generally effective at reducing overfitting, can actually hurt performance on driving data. For his application, simply removing the default augmentors resulted in a 13% improvement in accuracy.

**Random flip selects images at random to be flipped (typically across the vertical axis). Random crop selects images to be cropped and resized back to original resolution (effectively zooming in).
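As a rough sketch of what random flip does to the labels, the snippet below mirrors bounding boxes across the vertical axis with probability `p`. The function names, the `(x_min, y_min, x_max, y_max)` pixel box format, and the default `p=0.5` are assumptions for illustration, not the API of any particular training framework.

```python
import random
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def hflip_box(box: Box, img_w: float) -> Box:
    """Mirror one box across the vertical axis of an image of width img_w."""
    x_min, y_min, x_max, y_max = box
    # The old right edge becomes the new left edge, and vice versa.
    return (img_w - x_max, y_min, img_w - x_min, y_max)


def random_hflip(boxes: List[Box], img_w: float, p: float = 0.5,
                 rng: Optional[random.Random] = None) -> Tuple[List[Box], bool]:
    """Flip all boxes with probability p; also report whether a flip happened."""
    rng = rng or random.Random()
    if rng.random() < p:
        return [hflip_box(b, img_w) for b in boxes], True
    return list(boxes), False


# A vehicle on the left side of a 1280-pixel-wide frame ends up on the right.
print(hflip_box((100, 50, 300, 200), img_w=1280))  # (980, 50, 1180, 200)
```

The image pixels would be mirrored in the same step. Setting `p=0` disables the augmentor entirely.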

Again, the underlying idea is simple: while benchmark datasets like COCO and ImageNet contain a diverse collection of objects captured by various cameras from many different angles, driving data is comparatively uniform. In most applications the camera positions are fixed, the intrinsics are known, and the image composition will generally consist of the sky, the road, and a few objects. By introducing randomly flipped and zoomed-in images, you may be teaching your model to generalize to perspectives it will never actually experience in real life. This type of overgeneralization can be detrimental to overall accuracy, particularly for mobile models where predictive capacity is already limited.

Initially, I had adopted the augmentation scheme used by the original authors of my model. This included the standard horizontal flipping and cropping. I began my study by simply removing the random flip augmentor and retraining my model. As I had hoped, this single change led to a noticeable improvement in accuracy: about 4.5% relative mAP. (It must be noted that while we do operate in fleets around the world, including left-hand-traffic countries like Japan, my model was targeted for US deployment.)

In the default scheme, random crop (top) will often generate distorted, zoomed-in images that compromise object proportions and exclude important landmarks. Random horizontal flip (bottom), while not as obviously harmful, dilutes the training data with orientations the model will never see in production (US). The constrained-crop augmentor takes a more conservative approach; its outputs more closely resemble the viewing angles of real world Nauto devices.

I then shifted my focus to random crop. By default, the selected crop was required to have an area between 10% and 100% of the image, and an aspect ratio of 0.5 to 2.0. After examining some of the augmented data, I quickly discovered two things: first, many of the images were so zoomed-in that they excluded important context clues like lane-markers; and second, many of the objects were noticeably distorted in instances where a low aspect ratio crop was resized back to model resolution.

I was tempted at first to remove random crop entirely as my colleague had, but I realized there is one important difference between Nauto and full stack self-driving companies. Because we’re deployed as an aftermarket platform in vehicles ranging from sedans to 18-wheelers, our camera position varies significantly across fleets and individual installations. My hypothesis was that a constrained, less-aggressive crop augmentor would still be beneficial as a tool to reflect such a distribution.

I began experimenting by fixing the aspect ratio to match the input resolution and raising the minimum crop size. After a few iterations, I found that a constrained augmentor using a fixed ratio and minimum crop area of 50% improved accuracy by 4.4% mAP relative to the default cropping scheme. To test my hypothesis, I also repeated the trial with random crop completely removed. Unlike in my colleague's study, the no-augmentation scheme actually reduced mAP by 5.3% (1% worse than baseline), confirming that conservative cropping can still be beneficial in applications where camera position varies across vehicles.
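A constrained crop of this kind might be sampled as in the sketch below. It assumes the source image already has the same aspect ratio as the model input, so scaling both sides by the same factor keeps the ratio fixed. The function name and signature are hypothetical, and clipping the bounding boxes to the crop is left out.

```python
import random
from typing import Optional, Tuple


def sample_constrained_crop(img_w: int, img_h: int, min_area_frac: float = 0.5,
                            rng: Optional[random.Random] = None) -> Tuple[int, int, int, int]:
    """Sample (x0, y0, crop_w, crop_h) with the image's own aspect ratio.

    The crop area is drawn uniformly from [min_area_frac, 1.0] of the image.
    """
    rng = rng or random.Random()
    area_frac = rng.uniform(min_area_frac, 1.0)
    # Scaling both sides by sqrt(area_frac) preserves the aspect ratio.
    side = area_frac ** 0.5
    crop_w = max(1, round(img_w * side))
    crop_h = max(1, round(img_h * side))
    # randint is inclusive on both ends, so the crop always fits in the image.
    x0 = rng.randint(0, img_w - crop_w)
    y0 = rng.randint(0, img_h - crop_h)
    return x0, y0, crop_w, crop_h


x0, y0, w, h = sample_constrained_crop(1280, 720, min_area_frac=0.5,
                                       rng=random.Random(0))
print(x0, y0, w, h)
```

Because the crop keeps the original proportions, resizing it back to model resolution zooms in without stretching objects, and the 50% floor limits how much surrounding context can be cut away.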

The final scheme (no-flip, constrained-crop) in total yields a 9.1% relative improvement over the original baseline (flip, crop) and a 10.2% improvement over no augmentation at all.

The baseline augmentation scheme (grey) consists of random horizontal flip and random crop (aspect ratio ∈ [0.5, 2.0] and area ∈ [0.1, 1.0]). Removing random flip improved mAP by 4.5%. From there, removing random crop reduced mAP by 5.3% (-1% relative to baseline). Using a constrained crop (fixed ratio, area ∈ [0.25, 1.0]) improved mAP by 7.9% relative to baseline. And finally, the most constrained crop (fixed ratio, area ∈ [0.5, 1.0]) resulted in the largest improvement: 9.1% relative to baseline.
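The relative figures above can be cross-checked with a little arithmetic. The sketch below normalizes the baseline mAP to 1.0 and assumes the reported relative changes compose multiplicatively; the variable names are only for illustration.

```python
baseline = 1.0                  # flip + default crop, normalized
no_flip = baseline * 1.045      # removing random flip: +4.5% vs. baseline
no_aug = no_flip * (1 - 0.053)  # then removing random crop: -5.3% vs. no-flip
final = baseline * 1.091        # no-flip + constrained crop: +9.1% vs. baseline


def rel_change_pct(new: float, old: float) -> float:
    """Relative change of `new` over `old`, in percent, rounded to one decimal."""
    return round((new / old - 1) * 100, 1)


print(rel_change_pct(no_aug, baseline))  # -1.0
print(rel_change_pct(final, no_flip))    # 4.4
print(rel_change_pct(final, no_aug))     # 10.2
```

Under that assumption the numbers are mutually consistent: no augmentation lands about 1% below baseline, the constrained crop adds about 4.4% on top of the no-flip model, and the final scheme sits about 10.2% above no augmentation.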

Data-Driven Anchor Box Tuning

I’ll wrap it up with one more interesting finding. The majority of today’s object detection architectures form predictions based on a set of default anchor boxes. These boxes (also sometimes called priors) typically span a range of scales and aspect ratios in order to better detect objects of various shapes and sizes.

**SSD default anchor boxes.** Liu, Wei, et al. “SSD: Single Shot MultiBox Detector.” Lecture Notes in Computer Science (2016): 21–37.

At this point, I was focusing my efforts on improving the core vehicle detector that drives our forward collision warning system (FCW). While sifting through our data, I couldn’t help but once again notice its uniformity compared to competition benchmarks; overall image composition aside, the objects themselves seemed to fall into a very tight distribution of shapes and sizes. So I decided to take a deeper look at the vehicles in our dataset.

**Object distribution of FCW dataset.** Scale is calculated for each object as bounding box height relative to image height (adjusted by object and image aspect ratios). The average object is relatively small, with a median scale of 0.057 and a 99th percentile of 0.31. Objects are also generally square, with a median aspect ratio of 1.02 and 99th percentile of 1.36.

As it turns out, the majority of objects are relatively square, with more than 96% falling between aspect ratios of 0.5 and 1.5. This actually makes a lot of sense in the context of FCW, as the most relevant objects will generally be the rear profiles of vehicles further ahead on the road. The size distribution follows more of a long-tail distribution, but even so, the largest objects occupy less than three-fourths of the image in either dimension, while 99% occupy less than a third.
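A per-object analysis like this one could be sketched as follows. The scale formula is an assumption: it reads "height relative to image height, adjusted by object and image aspect ratios" as the square root of the normalized box area, which is the convention SSD-style anchors use. The boxes in the example are invented.

```python
import math
from statistics import median
from typing import Sequence, Tuple


def box_stats(box_w: float, box_h: float, img_w: float, img_h: float) -> Tuple[float, float]:
    """Return (scale, aspect_ratio) for one bounding box.

    scale = (box_h / img_h) * sqrt(object_ratio / image_ratio), which
    simplifies to the square root of the normalized box area (assumed reading).
    """
    aspect = box_w / box_h
    scale = math.sqrt((box_w / img_w) * (box_h / img_h))
    return scale, aspect


def fraction_within(values: Sequence[float], lo: float, hi: float) -> float:
    """Fraction of values that fall inside the closed interval [lo, hi]."""
    return sum(lo <= v <= hi for v in values) / len(values)


# Invented (width, height) boxes in a 1280x720 frame, for illustration only.
boxes = [(60, 58), (40, 41), (120, 110), (300, 260), (25, 24)]
stats = [box_stats(w, h, 1280, 720) for w, h in boxes]
scales = [s for s, _ in stats]
aspects = [a for _, a in stats]

print(median(scales), median(aspects))
print(fraction_within(aspects, 0.5, 1.5))
```

Statistics such as the median scale, the high percentiles, and the share of boxes within an aspect-ratio band are the kind of numbers that can then guide which anchor scales and ratios are worth keeping.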

Once again, I went back to reevaluate my initial assumptions. Up until now I had adopted the default set of anchor boxes used by the original authors, which ranged in scale between 0.2 and 0.9, using aspect ratios of ⅓, ½, 1, 2, and 3. While this comprehensive range makes sense for general-purpose object detection tasks like COCO, I wondered if I would again be able to find redundancy in the context of autonomous driving.

I began by experimenting with a tighter range of aspect ratios, including {½, 1, 1½} and {¾, 1, 1¼}. Surprisingly, the largest gain in both speed and accuracy came simply from using square anchors only, which effectively cut the total anchor count by a factor of 5. I then turned my attention to box sizes, realizing that the default range of [0.2, 0.9] overlapped with less than 5% of the objects in my dataset. Shrinking the anchor sizes to better match the object distribution yielded another modest improvement.

In total, the new anchor boxes yielded an almost 20% inference speedup and a 2% relative mAP improvement across all object classes, sizes, and shapes.

The baseline model uses anchor boxes with scales ∈ [0.2, 0.9] and aspect ratios ∈ {⅓, ½, 1, 2, 3}. Simply removing all but the square boxes resulted in a speedup of 18.5% with no negative impact on accuracy. Further tuning the boxes to match the scale range of the object distribution resulted in a modest 2.1% relative gain in mAP.
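For reference, the effect of dropping the non-square anchors on the anchor count can be sketched with the SSD-style shape formula, where an anchor of scale s and aspect ratio a has width s·√a and height s/√a. The six evenly spaced scale levels are an assumption, and the extra square anchor that the original SSD adds per level is left out for simplicity.

```python
import math
from typing import List, Sequence, Tuple


def linear_scales(s_min: float, s_max: float, num_levels: int) -> List[float]:
    """Evenly spaced anchor scales from s_min to s_max, one per feature level."""
    if num_levels == 1:
        return [s_min]
    step = (s_max - s_min) / (num_levels - 1)
    return [s_min + step * k for k in range(num_levels)]


def anchor_shapes(scales: Sequence[float],
                  ratios: Sequence[float]) -> List[Tuple[float, float]]:
    """(width, height) per scale/ratio pair, relative to image size; ratio = w / h."""
    return [(s * math.sqrt(a), s / math.sqrt(a)) for s in scales for a in ratios]


scales = linear_scales(0.2, 0.9, num_levels=6)  # six levels is an assumption
default_anchors = anchor_shapes(scales, [1 / 3, 1 / 2, 1, 2, 3])
square_anchors = anchor_shapes(scales, [1])

print(len(default_anchors), len(square_anchors))  # 30 6
```

Each anchor shape is replicated at every spatial location of its feature map, so a 5x reduction in shapes means 5x fewer box predictions for the detection head and the post-processing step to handle. Matching the scales to the data would then amount to changing the `s_min` and `s_max` arguments.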

Note: while the benchmarks within each optimization study are conducted in controlled experiments, a number of factors changed between individual studies. I chose not to present a cumulative improvement from start to finish in the interest of keeping this post short and focused on the 3 major optimizations.

Impact

In the very beginning I mentioned how improving computational performance is complementary to maximizing accuracy. Thanks to the speedup achieved through the new architecture and all the optimizations, we were able to significantly increase our model input resolution without exceeding our original compute constraints.

The hope is that in addition to boosting overall accuracy, a higher resolution will significantly improve our ability to detect small objects like pedestrians or distant vehicles — a crucial factor for collision prevention.

Final Thoughts

My intention for this blog was not to create a guide for squeezing more mAP out of mobile CNNs. Many of the improvements I found are very specific to Nauto’s current use case, and will likely not transfer to other vision problems. They may even become irrelevant as our own object detection requirements evolve.

Images from the popular benchmark COCO dataset compared to typical road scenes faced by Nauto devices.

Rather, the takeaway I hope to leave you with is that any real-world application of machine learning is defined by a set of characteristics that makes the problem distinct. By identifying these traits, you may find surprising places to make improvements. In my case, I found a discrepancy in object distribution and scene composition between Nauto’s data and the benchmark data used to develop the base model. As a result, I was able to make simple tweaks that led to simultaneous improvements in speed and accuracy, breaking the general law of the trade-off.

The optimizations waiting for you may not be immediately obvious, and may even go against common sense. Large networks, data augmentation, a comprehensive set of anchor boxes — these design choices are widely adopted across academic research and commercial applications of ML, and rightfully so. They are well-proven to enhance the generalization and predictive capacity of computer vision models. But more likely than not, your specific problem will have quirks that set it apart from the rest. As long as you’re willing to challenge the general practices that have worked well for others, you may find surprising ways to gain an edge.