Since the invention of the automobile, manufacturers have steadily added safety features and improved car design with the goal of keeping drivers safer on the road. Automotive manufacturers have spent millions of dollars researching safety improvements for seatbelts, tires, and nearly every car part imaginable. Despite all of this investment, driving remains substantially more dangerous than alternatives such as air travel. According to the National Safety Council, approximately 40,000 people died in automotive accidents in the United States alone in 2018. By comparison, only ~500 deaths resulting from plane crashes were recorded globally in 2018; that’s 80 times fewer deaths than car crash fatalities in the US alone.
There are many factors that likely contribute to this drastic difference, such as the far more demanding training and certification required of commercial pilots compared to motor vehicle drivers, and the number of cars on the road versus planes in the air. Even so, the difference in mortality remains, in my opinion, pretty staggering.
Today, companies such as Tesla, Volvo, and many others are investing in technologies such as artificial intelligence that seek to address the root cause of the issue, and the one we’ve previously ignored: human performance behind the wheel.
Perhaps the most exciting advancement in automotive technology since the invention of the car itself is the birth of self-driving, or autonomous, vehicles. Once the stuff of science fiction, self-driving cars have been made possible through the power of Convolutional Neural Networks (CNNs) and other modeling techniques. But before we hand over the keys entirely, there are three major challenges we must overcome.
> 1. Obtain the data (sensors, cameras)
> 2. Process the data (AI — we’ll focus on this part)
> 3. Act on the data (drive the car)
To highlight a very small sliver of the technological challenges faced by companies like Tesla in bringing autonomous cars to the public, we’ll focus exclusively on a simplified application of step 2: processing image data in order to determine what it is we’re “looking at”. To accomplish this goal, I used a dataset containing 100,000 labeled images spanning 43 types of street signs to create supervised deep learning models that utilized CNNs and were capable of classifying new images of street signs pretty reliably (more on that later). For a great summary of what all that means, check out this series of articles by Dhanoop Karunakaran.
When discussing this type of technology, it is important to note the potential ramifications of a mistake. In the case of self-driving cars, a misidentified stop sign or pedestrian could easily cost a life. In a formal study, the significance of misclassifying each sign might be evaluated and given a weight. In this way, a new model could be trained to “err on the side of caution” when classifying more “critical” street signs. And while that all sounds great on paper, the ethics behind these types of choices must also be evaluated seriously. If our weighted inputs caused our new model to make fewer mistakes when classifying stop signs, for example, but performed slightly worse when classifying pedestrian crossing signs, would that be “acceptable”? With all that said, for the purposes of this post, we’ll focus more on the technology itself.
So, just how do you teach a computer to identify different objects? To touch on that, we’ll begin with a simple example. As humans, when we look at the picture below, we see a cute dog with black and grey fur standing on a speckled rug.
Without thought, we know that the subject of the image isn’t a horse, car, or cat. Even when viewing an example of a heavily edited photo I took of the same dog, as humans we still have no difficulty in identifying that the subject of the image is a dog, even though he is blue. Take a look:
Computers, unsurprisingly, have no notion of what a dog is, much less what one looks like. Instead, computers see images as stacked matrices called tensors, in which each number corresponds to the intensity of a specific pixel in one of the image’s color channels.
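To make that concrete, here’s a minimal sketch (using a made-up 2x2 RGB “image” rather than the dog photo) of what such a tensor looks like in NumPy:

```python
import numpy as np

# A tiny 2x2 RGB "image": height x width x color channels.
# Each number is one pixel's intensity in one channel (0-255).
image = np.array([
    [[255, 0, 0], [0, 255, 0]],   # red pixel, green pixel
    [[0, 0, 255], [30, 30, 30]],  # blue pixel, dark grey pixel
], dtype=np.uint8)

print(image.shape)   # (2, 2, 3): rows, columns, color channels
print(image[0, 0])   # [255 0 0]: the top-left pixel, pure red
```

A real photograph works exactly the same way, just with far more pixels: the tensor is simply bigger.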
Using convolutional neural networks, we are able to train models to extract image features such as edges, shapes, and colors for use in image classification. Using supervised learning models, we can pass thousands of images of dogs through a model in order to “teach the computer” what a dog looks like. So, while a computer may not appreciate how cute my dog Oliver is, by applying different filters to his portrait, it can determine that Oliver’s facial structure, color, and shape share similarities with other dogs, and that he is very likely to be a dog as well.
The same techniques can be used to classify any number of target classes, like street signs. To illustrate this process, we’ll use an extremely “light” (read: weak) CNN that I created using the Keras library in Python, and explain a bit more about what’s going on “under the hood” during image classification. If you’d like to see the code I used to generate these examples, view the more robust CNNs I trained, or explore some of the other steps required to train CNNs (image pre-processing, exploratory data analysis, etc.), be sure to check out my code here.
In this post, however, we’ll use the very basic CNN summarized by the image below to shed some light into the way a computer is able to “see” and “learn”.
So, what does this mean? Why are we doing it and how does it help us to teach a computer to classify street signs?
First, we have our input images, all with dimensions of 32x32x3, or more simply, images that are 32 pixels in height and width, with 3 color channels (red, green, and blue). Because these inputs differ from those of other neural networks, where model inputs are vectors, we’ll need to massage our input images in order to reliably classify them using Deep Learning. For this example, I selected a random image from the dataset. Let’s take a look at the example, as well as display how the image is evaluated by our model:
# Selecting and displaying an example image from the training set
import matplotlib.pyplot as plt

example = data['x_train'][2]   # a single 32x32x3 image tensor
plt.imshow(example)
As I mentioned previously, computers “see” images as stacked matrices called tensors. Let’s take a look using the same example image:
# Displaying the image tensor for the same example image
how_the_computer_sees_it = data['x_train'][2]
print(how_the_computer_sees_it)
Now that we’ve taken a look at the model inputs, let’s take a closer look at the process:
Following the figure above, the first layer applied to the input image is a convolutional layer. In this step, we apply a convolutional kernel to our image, effectively sliding a smaller 3x3 pixel filter over the input image and computing the dot product between the filter and each patch of pixels it covers. In the case above, we’ve chosen to apply 32 filters in this convolutional step, so the layer outputs 32 feature maps, one per filter.
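To sketch what that sliding dot product looks like in practice, here’s a minimal NumPy example using a hypothetical hand-crafted 3x3 vertical-edge kernel (a real CNN learns its kernel values during training rather than having them specified by hand):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a kernel over a 2-D image (no padding, stride 1),
    taking the dot product at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            output[i, j] = np.sum(window * kernel)
    return output

# A 5x5 single-channel image with a vertical edge down the middle
image = np.array([
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)

# A hand-crafted 3x3 vertical-edge kernel: responds strongly where
# the left side of the window differs from the right side
kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
], dtype=float)

result = convolve2d(image, kernel)
print(result.shape)  # (3, 3): the 5x5 input shrinks by kernel_size - 1
print(result)        # nonzero values mark where the edge was found
```

Positions whose window straddles the edge produce large-magnitude outputs, while uniform regions produce zeros; that output grid is one “feature map”.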
Next, a pooling layer is applied to reduce the dimensionality of the feature maps (and therefore the number of parameters in the model), sliding a 2x2 window over each map and keeping only the maximum value in each “window” or “chunk” of our parsed image.
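The pooling step can be sketched in a few lines of NumPy; here, a hypothetical 4x4 feature map is reduced to 2x2 by keeping only the maximum of each non-overlapping 2x2 window:

```python
import numpy as np

# A made-up 4x4 feature map (e.g. one channel of a convolution's output)
feature_map = np.array([
    [1, 3, 2, 0],
    [4, 6, 1, 2],
    [7, 2, 9, 1],
    [0, 5, 3, 8],
])

# 2x2 max pooling with stride 2: reshape into 2x2 blocks,
# then take the maximum within each block
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 2]
#  [7 9]]
```

Three quarters of the values are discarded, but the strongest response in each neighborhood survives, which is exactly what the downstream layers care about.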
Finally, in order to massage our image input into a vectorized format compatible with the dense layers of our neural network we’ll use to classify our images, we flatten the image input into an array.
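Putting the three layers together, a minimal Keras sketch of a CNN like the one described above might look as follows (the exact hyperparameters here are illustrative rather than those of my trained models; the 43-unit output layer matches the 43 sign classes):

```python
from tensorflow import keras
from tensorflow.keras import layers

# A minimal CNN mirroring the layers described above:
# convolution -> max pooling -> flatten -> dense classifier
model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),                # 32x32 RGB sign images
    layers.Conv2D(32, (3, 3), activation="relu"),  # 32 filters, 3x3 kernel
    layers.MaxPooling2D((2, 2)),                   # 2x2 max pooling
    layers.Flatten(),                              # vectorize for dense layer
    layers.Dense(43, activation="softmax"),        # one output per sign class
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

From here, training is a single call to `model.fit` on the labeled images and their class indices.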
While we can see the highlighting of edges and shapes in the image above, our outputs don’t appear very “clear” to the human eye. Despite this, even this very simple CNN classified images of street signs from the dataset relatively well (~80% accuracy). Following the creation and evaluation of this baseline model, I trained additional, tuned CNNs that achieved accuracies above 94%. You can find my code and conclusions at this link.
In a real world context, the task at hand is much more complex than simply classifying signs. In the context of self-driving, objects of many types, from many angles, must all be processed, classified, and acted upon extremely reliably at high speeds, all with human lives hanging in the balance.
Among the most significant factors to remember when considering the results of our experiment, and of others that use neural networks to classify image data, is the quality of the images used to train the models. While using very small images greatly reduces the time and resources required to train deep learning models, it also very likely hampers classification performance. This challenge is amplified greatly when models must account for many more factors, all while controlling a vehicle moving at high speed. This detail highlights the immense challenges faced by companies like Tesla, and the significance of Tesla’s recent acquisition of DeepScale, a company focused on reducing the resources required to classify image data obtained from sensors in self-driving cars.