Teaching a Computer to See

Richard Switzer
9 min read · Apr 18, 2022


This was a blog post I wrote back in 2020 about a project I ran at TWG. We never got around to publishing while I was there (thanks COVID), but it was a fun project with lots of interesting challenges. Special thanks to Ashun Shah and Ben Wendt who did the heavy lifting, Brett Hagman at Rogue Robotics for the custom boards and OCI (formerly OCE) for supporting this project.

Clockwise from top left: 3D prototype of the enclosure, board schematics, circuit board design, vehicle occupancy mannequin challenge, finished circuit boards, functional prototype

The Problem

The Ontario Ministry of Transportation began implementing High-Occupancy Vehicle (HOV) lanes on 400-series highways in 2001 to encourage carpooling and reduce congestion for vehicles with multiple passengers. Enforcement of HOV lanes is currently the responsibility of highway patrol. When a solo driver is spotted using an HOV lane, a traffic officer has to pull the vehicle over to issue a $110 ticket. Paradoxically, the act of enforcing faster traffic lanes can actually slow them down, not to mention the safety issue of stopped vehicles on the side of a busy highway. As Ontario considers the use of HOV (and HOT — High-Occupancy Toll) lanes throughout the province, the Ministry turned to the Ontario Centres of Excellence (now the Ontario Centre of Innovation, or OCI) for a more efficient, scalable enforcement solution. OCE supports the development and commercialization of new technologies at academic institutions and private sector companies, and through an OCE program called the Small Business Innovation Challenge, we were awarded funding to build a better mousetrap (pun intended).

We are certainly not the first company to propose an automated solution to this problem. Companies like Xerox, Siemens, 3M and a dozen others have developed systems — which can cost well over $100M — that use roadside cameras to capture images of passing vehicles and, based on one or two images, figure out how many people are in the vehicle. One Canadian company has even proposed flying drones alongside the highway to catch HOV cheaters.

With OCE’s support, TWG took a fresh approach to the problem. The available commercial solutions relied on images captured from outside the vehicle, and our initial research found that bright sun and inclement weather could significantly limit their effectiveness. Additionally, these systems relied on one or two stationary cameras capturing images of vehicles at highway speeds, leaving a fraction of a second to make an accurate determination. Even Xerox’s $100M system advises that every potential infraction be reviewed by a person before a ticket is issued — adding friction and cost to an already expensive system.

Our solution called for an in-car device that would be unaffected by the outside elements. It needed to be inexpensive, accurate and, importantly, protect users’ privacy and data by following Privacy by Design principles. While we’ve developed a complete system architecture and mobile app to support our device, this case study focuses specifically on the technical challenge of building standalone hardware capable of running a convolutional neural network to provide a highly accurate method of occupancy detection.

Solution Overview

So how do you teach a computer to see? More to the point, how do you get a computer to understand what it’s seeing? This is the field of computer vision (CV), and it relies on machine learning (ML) to teach the computer how to draw conclusions from the visual data it’s presented with. Just as with humans, learning is a process that takes place over time — typically referred to as training — with the output being an algorithm. Ever-improving algorithms can make increasingly accurate determinations from the data they are presented with — what types of objects appear in visual data, for example. Well-trained computer systems with sophisticated cameras and sensors can perform complex tasks like driving a car. Fortunately for us, our challenge was much less complex.

Although we later abandoned this approach to lane detection, we initially used a smartphone to develop a quick proof-of-concept for a device that determined if the vehicle was in an HOV lane or not based on the shape and angle of HOV lane markings

In essence, we are feeding a computer images from the inside of a car and asking how many people it ‘sees’. We then tell the computer which answers were right and which were wrong, and have it try again. With each run through the images, the application does more of whatever produced the correct answers and less of whatever produced wrong ones. By repeating this process with more and more example images, a CV system can ‘learn’ to produce extremely accurate results, especially in narrow use cases such as identifying types of fruit (when it’s only being shown images of fruit) or counting the number of occupants in a car (where it’s looking for human faces). In machine learning circles, this system of object identification is known as a Convolutional Neural Network, or CNN.
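
To make that feedback loop concrete, here is a minimal sketch of what supervised training looks like in code. It uses PyTorch with a toy model and random stand-in data; the five-class output (zero to four occupants) and everything else in the sketch are illustrative, not our production code.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 200 random 64x64 RGB 'images' with occupancy labels 0-4.
images = torch.randn(200, 3, 64, 64)
labels = torch.randint(0, 5, (200,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

# Toy classifier; a real system would use a convolutional network here.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 128),
    nn.ReLU(),
    nn.Linear(128, 5),  # five classes: 0-4 occupants
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):                        # repeated passes over the images
    for batch_images, batch_labels in train_loader:
        logits = model(batch_images)           # the computer's 'answers'
        loss = loss_fn(logits, batch_labels)   # compare against the right answers
        optimizer.zero_grad()
        loss.backward()                        # do more of what worked...
        optimizer.step()                       # ...and less of what didn't
```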

For our earliest hardware prototype we experimented with a few different imaging technologies. Our initial approach was based on RF (radio frequency) sensing, which used a combination of RF transmitters and receivers to scan nearby objects. RF has the advantage of not requiring line-of-sight (it can ‘see through’ most objects), and it performed well at identifying the number of occupants. However, the resolution of the images it produced was not sufficient for our purposes — people appeared as fuzzy, colourful blobs, and the sensor had difficulty distinguishing a human passenger from a large dog, or your dry cleaning hanging in the back seat.

RF reflections were fairly accurate at detecting objects and movement, but the level of detail was not sufficient for training a machine learning algorithm

Thermal cameras proved to be an intriguing option. Their output, as seen below, provides a good balance of privacy while still capturing human-recognizable images: it’s easy to tell how many people are in the picture, but difficult to identify who they are. Ultimately we settled on a standard RGB (photo) sensor, opting for Intel’s RealSense camera, which also includes a depth sensor providing an additional layer of data: how far visible objects are from the camera. This would be useful, for example, in ensuring our face detection algorithm couldn’t be tricked by a photo printout of someone’s face (the Face ID feature on iPhones uses a similar method).

While not ideal from a privacy standpoint, RGB proved to be the simplest imaging technology for development purposes, and we knew we could always switch to another imaging method in the future. Any imaging technology that provides sufficient resolution, be it LiDAR, thermal imaging or laser-based ‘time-of-flight’, could be used to train a CNN to provide accurate occupancy detection.
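
For the curious, grabbing aligned colour and depth frames from a RealSense camera takes only a few lines with Intel’s pyrealsense2 SDK. This is a minimal sketch; the resolution, frame rate and depth scaling below are assumptions for illustration, not our actual capture settings.

```python
import numpy as np
import pyrealsense2 as rs

# Stream 640x480 colour and depth at 30 fps (assumed settings).
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)
align = rs.align(rs.stream.color)  # map depth pixels onto the colour image

try:
    frames = align.process(pipeline.wait_for_frames())
    color = np.asanyarray(frames.get_color_frame().get_data())  # H x W x 3, uint8
    depth = np.asanyarray(frames.get_depth_frame().get_data())  # H x W, uint16 depth units
    # Squash depth into 0-255 (crude scaling) so it can ride along as a 4th channel.
    depth_8bit = np.clip(depth / 32, 0, 255).astype(np.uint8)
    rgbd = np.dstack((color, depth_8bit))  # H x W x 4 array for the network
finally:
    pipeline.stop()
```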

Teaching a Computer to See

We decided that a convolutional neural network was the best way to get good occupancy predictions from our image data. Convolutional neural nets are a type of deep neural network that has been the state of the art in image classification since at least 2012. We expected that pairing a convolutional net with image data plus an extra dimension representing depth (provided by the RealSense depth camera) would give excellent results when predicting occupancy. Applying these image sets to our neural net model very quickly delivered promising results, predicting the correct number of occupants over 96% of the time.

A convolutional neural network is a deep learning technique that detects the shapes and characteristics of an image. With lots of images to work with, a CNN will develop the ability to identify the visual characteristics of human features such as eyes, mouths and ears. Since we were using an RGB + depth camera, the CNN we developed was able to recognize not just the visual patterns of faces but, using the extra channel of depth data, their 3D shapes as well. If we switched to a thermal camera, we would train a similar CNN that recognizes a new set of features, teaching itself which head/body shapes and heat patterns to look for based on successful matches of previous images. A mannequin in the passenger seat that might fool an RGB camera would create a uniform, distinctly non-human heat signature on a thermal camera.
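
We never published our model code, but a small CNN that accepts four input channels (red, green, blue and depth) might look like the following sketch. The layer sizes and the 64x64 input are illustrative assumptions, not our actual architecture.

```python
import torch
import torch.nn as nn

class OccupancyCNN(nn.Module):
    """Toy RGB-D occupancy classifier; illustrative only."""
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=3, padding=1),  # 4 channels: R, G, B, depth
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 32x32 -> 16x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, num_classes),        # assumes 64x64 input frames
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = OccupancyCNN()
rgbd_batch = torch.randn(8, 4, 64, 64)   # a batch of RGB-D frames
occupancy_logits = model(rgbd_batch)     # shape: (8, 5)
```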

Having validated the effectiveness of our approach, we then moved our camera setup into actual vehicles to begin collecting a new in-vehicle data set. Initially our image capture setup was a mounted RealSense camera connected to a MacBook, which allowed our engineering team to capture roughly 80 sets of data, each consisting of 100 in-car images.

Each set represented a single driver or a different combination of vehicle and occupants across various age groups, genders and ethnicities, giving us a data set representative of drivers in Ontario. We also tested edge cases, including a day spent driving around Toronto with a mannequin in the passenger seat.

The Hardware

While our machine learning team continued working with image sets to train and improve our occupancy detection algorithm, we began to prototype the actual hardware. Considering our dashboard device would theoretically need to be produced in mass quantities, we constrained ourselves to low-cost, ‘off the shelf’ components. Aside from the aforementioned RealSense camera, we selected Nvidia’s Jetson Nano as our compute platform. Not dissimilar to the popular Raspberry Pi single-board computers, the $99 Nano boasts an impressive 128-core GPU, which makes it extremely fast at video processing and machine learning applications.

Nvidia’s Jetson Nano sells for around $100

As anyone working in machine learning, deep learning or data science can tell you, the more data you can collect, the better. So our initial device prototype was designed with a dual purpose — it provides on-demand occupancy detection, but it also collects image data for training purposes. With these early prototypes we had a very portable device capable of collecting dozens of data sets per day across multiple vehicles.

Our ‘Version 1’ device runs Linux and automatically boots up when connected to a 12V power supply. LEDs provide system status and occupancy count, and a single button allows an operator to toggle the device through various operating modes. When in default mode, the V.1 device runs an occupancy detection script every 5 minutes and displays the perceived occupancy number using the LED display (the images and corresponding metadata are also saved to the Jetson’s microSD card for further analysis). When in training mode, the user manually inputs the number of occupants, and the device then collects data sets at regular intervals, again writing the images and metadata to the Jetson’s memory card.
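
The firmware itself isn’t something we published, but the default-mode behaviour described above boils down to a simple loop. In the sketch below, the camera, model and leds objects and the microSD mount point are hypothetical stand-ins for the real interfaces.

```python
import json
import time
from pathlib import Path

CAPTURE_INTERVAL = 5 * 60            # default mode: detect every 5 minutes
DATA_DIR = Path("/media/microsd")    # hypothetical mount point for the Jetson's card

def default_mode_loop(camera, model, leds):
    """Sketch of the V.1 default-mode loop; all three interfaces are hypothetical."""
    while True:
        rgbd = camera.capture()                         # grab an RGB-D frame
        count = model.predict_occupancy(rgbd)           # CNN inference on the Jetson GPU
        leds.display(count)                             # show the perceived occupancy
        stamp = time.strftime("%Y%m%d-%H%M%S")
        camera.save(rgbd, DATA_DIR / f"{stamp}.npz")    # keep the frame for analysis
        metadata = {"timestamp": stamp, "occupancy": int(count)}
        (DATA_DIR / f"{stamp}.json").write_text(json.dumps(metadata))
        time.sleep(CAPTURE_INTERVAL)
```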

Next Steps

While this post focuses specifically on the occupancy detection device, there is much more to the overall system architecture. We have developed a customer-facing mobile app for managing the device, processing payments and resolving disputes. There are wireless protocols for connecting the device to the broader system infrastructure, and a validated method for identifying vehicles travelling in HOV/HOT lanes using inexpensive roadside sensors communicating over ITS/RTTT standards.

In the near term we plan to continue building up our data sets and refining our CNN algorithms throughout 2020, focusing on edge cases and experimenting with additional imaging technologies. As sophisticated technologies like LiDAR drop rapidly in price (Apple’s new iPad Pro features a LiDAR sensor), and as the performance of inexpensive RF and time-of-flight sensors improves, we expect the number of viable imaging options to increase.

While TWG has worked on a number of hardware/IoT projects, this one was a rare opportunity to develop the hardware and software in tandem from scratch. OCE’s funding model allowed us room to test and explore a number of key technologies along the way and pursue the optimal solution rather than work to a fixed deliverable that hadn’t been validated.

We look forward to continuing to develop the platform in 2020, and to posting more details soon!

-originally posted February 2020
