This post explores my work on the second project of the Udacity Self-Driving Car Nanodegree program. My goal is to help other students and professionals who are in the early phases of building their intuition in machine learning (ML) and artificial intelligence (AI).
With that said, please keep in mind that I am a product manager by trade (not an engineer or data scientist). So, what follows is meant to be a semi-technical yet approachable explanation of the ML concepts and algorithms in this project. If anything covered below is inaccurate or if you have constructive feedback, I’d love to hear from you.
The goal of this project is to build a neural network that recognizes traffic signs in Germany.
Udacity’s benchmark for the project is to achieve at least 93% accuracy (on the validation set). However, my personal goal was to surpass the human-level performance benchmark of 98.8% accuracy identified in this paper by Mrinal Haloi from the Indian Institute of Technology.
The basic steps of the project are as follows:
- Load the data set provided by Udacity
- Explore, summarize and visualize the data set
- Design, train and test a model architecture
- Use the model to make predictions on new images
- Analyze the softmax probabilities of the new images
- Summarize the results with a written report
My code and a detailed view of the outputs for each step are outlined here in this Jupyter notebook.
Data Summary & Exploration
Throughout this section, I use the Numpy, Pandas, and Matplotlib libraries to explore and visualize the traffic signs data set.
Data Size & Shape
- Size of training set: 34,799 (67%)
- Size of the validation set: 4,410 (9%)
- Size of test set: 12,630 (24%)
- Shape of a traffic sign image: (32, 32, 3)
- Number of unique classes/labels: 43
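As a quick sanity check on the percentages above, the split sizes can be turned into shares of the full data set. This is a minimal sketch that uses only the counts listed in this section; the variable names are illustrative and not taken from the notebook.

```python
# Split sizes exactly as listed above.
split_sizes = {"train": 34799, "validation": 4410, "test": 12630}

total = sum(split_sizes.values())

# Share of each split, as a percentage of all images.
split_share = {name: 100 * n / total for name, n in split_sizes.items()}

for name, share in split_share.items():
    print(f"{name:>10}: {split_sizes[name]:>6} images ({share:.1f}%)")
```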
Before designing the neural network, I felt it was important to visualize the data in various ways to gain some intuition for what the model will “see.” This not only informs the model structure and parameters, but it also helps me determine what types of preprocessing operations should be applied to the data (if any).
There are a few fundamental ways I used visualizations to inform my decisions:
- Inspect a sample of images. Do the images correspond with the expected number of color channels? For example, if `channels=3`, then the images should appear in color (RGB), not grayscale. How clear are the images? Is there anything that makes the signs hard to recognize (e.g. bad weather, darkness, glare, occlusions)?
- Review a sample of the labels. Do the labels make sense? Do they accurately correspond with images in the data set?
- Create a histogram showing the distribution of classes/labels. How balanced is the dataset? Are there certain classes that dominate? Are there others that are under-represented?
Image & Label Sample
Below is a sample of the original images before they undergo any preprocessing. Overall, the image quality is good and the labels make intuitive sense. However, immediately you notice a few things we’ll want to adjust during preprocessing:
- Many of the signs are hard to recognize because the images are dark and have low contrast.
- There is little variation in the sign shape and viewing angle. Most of the pictures are taken with a straight-on view of the sign, which is good for the core data set. However, in real life, signs are viewed from different angles.
- The signs are void of any deformations or occlusions. Again, this is good because we need a clean set of training samples, but in real life, signs are sometimes damaged, vandalized, or only partially visible. Essentially, we want the model to recognize signs even when the shape is distorted, much like humans can. So, augmenting the training set with a variety of distortions is important.
As you can see below, the distribution is not uniform. The largest classes have 10x as many traffic sign images as the smallest classes. This is expected given that in real life certain signs appear more frequently than others. However, when training the model, I wanted a more uniform distribution so that each class has the same number of training examples and the model, therefore, has an equal number of opportunities to learn each sign.
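To put a number on the imbalance, one option is to count the labels and compare the largest class to the smallest. The sketch below uses a made-up toy label list, not the real data set, and `class_imbalance` is a hypothetical helper name.

```python
from collections import Counter


def class_imbalance(labels):
    """Return (counts per class, ratio of largest to smallest class)."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio


# Toy labels, made up for illustration only.
toy_labels = [0] * 20 + [1] * 200 + [2] * 50
counts, ratio = class_imbalance(toy_labels)

print(counts.most_common())  # classes sorted from most to least frequent
print(ratio)                 # 10.0
```

The same counts can be passed straight to a Matplotlib bar chart to reproduce the kind of histogram described here.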
Given the issues identified above, I decided to explore the following preprocessing operations (in addition to data normalization):
I used this Scikit histogram equalization function, which not only normalizes the images but also enhances local contrast details in regions that are darker or lighter than most of the image (link to source code). You can see from the image sample below this also inherently increases the brightness of the image.
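For intuition about what histogram equalization does, here is a simplified NumPy sketch of the global version of the technique. It is an assumption-laden stand-in, not the Scikit function referenced above: that function works on local regions of the image, while this one stretches the histogram of the whole image at once.

```python
import numpy as np


def equalize_histogram(channel):
    """Global histogram equalization for a 2-D uint8 array."""
    hist = np.bincount(channel.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]      # CDF value at the darkest pixel present
    denom = cdf[-1] - cdf_min
    if denom == 0:                 # flat image, nothing to stretch
        return channel.copy()
    # Lookup table mapping each old intensity to its stretched value.
    lut = np.round((cdf - cdf_min) / denom * 255).clip(0, 255).astype(np.uint8)
    return lut[channel]


# Toy "dark, low-contrast" image: all intensities squeezed into 40..59.
rng = np.random.default_rng(0)
dark = rng.integers(40, 60, size=(32, 32), dtype=np.uint8)
enhanced = equalize_histogram(dark)

print(dark.min(), dark.max())          # narrow range before
print(enhanced.min(), enhanced.max())  # full 0..255 range after
```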
- Increase the total number of images so that the model has more training examples to learn from.
- Create an equal distribution of images — i.e., the same number of images per class, so that the model has a sufficient number of training examples in each class. I initially tested models on sets of 4k images per class and found that models performed better with more images. I ultimately generated a set of 6k images per class for the final model.
- Apply affine transformations. Used to generate images with various sets of perturbations. Specifically: rotation, shift, shearing, and zoom. But, I decided not to apply horizontal/vertical flipping as this didn’t seem applicable to real-life use cases.
- Apply ZCA whitening to accentuate edges.
- Apply two types of color transformations:
- (1) Color channel shifts. This was done to create slight color variations to help prevent the model from overfitting on specific color shades. This intuitively seemed like a better strategy than grayscaling.
- (2) Grayscaling. This was performed separately _after_ all of the above transformations. Due to the high darkness and low contrast issues, applying grayscale before the other transformations didn’t make sense. It would only make the contrast issue worse. So, I decided to test the grayscale versions as a separate data set to see if it boosted performance (spoiler alert: it didn’t).
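Of the transformations listed above, ZCA whitening is probably the least intuitive, so here is a minimal NumPy sketch of the idea: decorrelate the pixel features while keeping the result as close as possible to the original image space. This is a generic textbook formulation under my own assumptions (the `eps` value and the function name are illustrative), not necessarily how the augmentation code in the notebook implements it.

```python
import numpy as np


def zca_whiten(images, eps=1e-5):
    """ZCA-whiten a batch of images shaped (n, height, width, channels)."""
    n = images.shape[0]
    flat = images.reshape(n, -1).astype(np.float64)
    flat = flat - flat.mean(axis=0)        # center each pixel feature
    cov = flat.T @ flat / n                # feature covariance matrix
    u, s, _ = np.linalg.svd(cov)
    # Rotate into the PCA basis, rescale to unit variance, rotate back.
    whitener = u @ np.diag(1.0 / np.sqrt(s + eps)) @ u.T
    return (flat @ whitener).reshape(images.shape)


# Small toy batch so the SVD stays cheap (real images would be 32x32x3).
rng = np.random.default_rng(0)
toy_batch = rng.random((500, 4, 4, 1))
whitened = zca_whiten(toy_batch)

# After whitening, the feature covariance is close to the identity matrix.
check = whitened.reshape(500, -1)
print(np.round(check.T @ check / 500, 2)[:3, :3])
```

On full 32x32x3 images the covariance matrix is 3072 x 3072, so the decomposition is noticeably slower than in this toy example.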
Below are snippets of code that take the already normalized, contrast-enhanced images and apply the other transformations listed above. The code outputs a new training set with 6k images per class, including the set of normalized training images (link to source code).
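The balancing step boils down to simple bookkeeping: for each class, work out how many augmented images are needed to reach the target. Here is a hedged sketch of that calculation with made-up class counts; `augmentation_plan` is a hypothetical helper, not a function from the notebook.

```python
from collections import Counter


def augmentation_plan(labels, target_per_class):
    """Map each class to the number of extra images needed to hit the target."""
    counts = Counter(labels)
    return {cls: max(target_per_class - n, 0) for cls, n in sorted(counts.items())}


# Toy class sizes, made up for illustration only.
toy_labels = [0] * 180 + [1] * 2010 + [2] * 6500
plan = augmentation_plan(toy_labels, target_per_class=6000)

print(plan)  # {0: 5820, 1: 3990, 2: 0}
```

With 43 classes and a target of 6k images each, a fully balanced training set works out to 43 × 6,000 = 258,000 images.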
Augmented Image Samples
Below is a sample of traffic sign images after the complete set of normalization, contrast enhancement, and augmentation steps listed above.
Here is a sample of images with **grayscaling** then applied. At first glance, it doesn’t appear that grayscaling improves the images in any meaningful way. So, my hypothesis was that the grayscaled versions would perform the same or worse than the augmented images (this turned out to be correct).
I tested a variety of models (> 25 different combinations). Ultimately, I settled on a relatively small and simple architecture that was easy to train and still delivered great performance. My final model consisted of the following layers:
Below is a snapshot of the model (here is a link to the source code). You can see that I use: (1) a ReLU activation on every layer, (2) max pooling on alternating convolutional layers with a 5x5 filter, and (3) dropout on the two fully connected layers with a 0.5 keep probability.
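To illustrate what a 0.5 keep probability means in practice, here is a small NumPy sketch of ReLU followed by "inverted" dropout, where the surviving activations are scaled up by `1 / keep_prob` so the expected value stays the same. This is a standalone illustration with toy values, not the TensorFlow code from the model.

```python
import numpy as np


def relu(x):
    """Zero out negative activations."""
    return np.maximum(x, 0.0)


def dropout(activations, keep_prob, rng):
    """Inverted dropout: drop units at random and rescale the survivors."""
    mask = rng.random(activations.shape) < keep_prob
    return np.where(mask, activations / keep_prob, 0.0)


rng = np.random.default_rng(0)
hidden = relu(np.array([[-1.0, 2.0, 0.5, 3.0]]))
dropped = dropout(hidden, keep_prob=0.5, rng=rng)

print(hidden)   # negative input clipped to 0
print(dropped)  # each entry is either 0 or twice the original value
```

Dropout is only applied during training; at evaluation time the keep probability is set to 1.0 so every unit contributes.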
Below are the training loop, loss, and regularization functions (link to source code). You can see that I use AdamOptimizer, which adapts the step size for each parameter using moving averages of the gradient (momentum) and of the squared gradient. This helps the model converge faster without my having to hand-tune a learning rate schedule. You’ll notice that I also use L2 regularization to help prevent overfitting.
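For readers curious about what Adam does under the hood, here is a minimal pure-Python sketch of its update rule on a single parameter. The toy objective and the learning rate of 0.01 are my own choices for illustration; the beta and epsilon values are the commonly cited defaults, and none of this is the training code from the notebook.

```python
import math


def adam_minimize(grad_fn, x0, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a 1-D function with Adam, given a function for its gradient."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # moving average of the gradient
        v = beta2 * v + (1 - beta2) * g * g    # moving average of its square
        m_hat = m / (1 - beta1 ** t)           # correct the bias toward zero
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_final = adam_minimize(lambda x: 2 * (x - 3.0), x0=0.0, steps=2000, lr=0.01)
print(x_final)  # close to the minimum at x = 3
```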
Here are the hyperparameters I used (link to source code). My goal was to get the model to converge in fewer than 50 epochs. Essentially, given time constraints, I didn’t want to spend more than two hours training the model. Everything else is pretty standard, although I did decrease my L2 decay rate (i.e., a lower penalty on weights) during the tuning process, which yielded a small lift in performance.
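To show how the L2 decay rate enters the loss, here is a small sketch that adds a weight penalty to a data loss. It assumes the common convention of half the sum of squared weights (the one TensorFlow's `tf.nn.l2_loss` follows); the toy weights and the cross-entropy value are made up, and the exact formulation in the notebook may differ.

```python
def l2_penalty(weights, decay):
    """decay * 0.5 * sum of squared weights (biases are typically left out)."""
    return decay * 0.5 * sum(w * w for layer in weights for w in layer)


# Toy values, for illustration only.
toy_weights = [[0.5, -1.0], [2.0]]
data_loss = 0.25  # made-up cross-entropy value

penalty = l2_penalty(toy_weights, decay=0.0003)
total_loss = data_loss + penalty

print(penalty)
print(total_loss)
```

A lower decay rate shrinks the penalty term, which lets the weights grow larger and gives the model more freedom to fit the training data.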
Here is the output when I construct the graph (link to source code). I use print statements to verify that the model structure matches my expectations. I find this very useful as it’s easy to get confused when you’re tweaking and testing lots of different models. Especially after 3am. 😵
Final Model Results:
- training accuracy: 100%
- validation accuracy: 99.4%
- test accuracy: 98.2%
Model Iteration & Tuning
Here I’ll try to summarize the approach I took to find a solution that exceeded the benchmark validation set accuracy of 0.93. Although some of the details got lost in the fog of war. I battled with these models for too many days. If you’re curious, you can view a fairly exhaustive list of the models I tested here.
The first steps were to get the most basic version of the LeNet CNN running and begin tuning it. I got 83% validation accuracy without any modifications to the model or preprocessing of the training data. Adding regularization and tuning the hyperparameters made the performance worse. So, I started to explore different types of architectures.
This is where I started making mistakes that cost me a lot of time (although I learned a lot in the process). In hindsight, I should have done two simple things at this point: (1) start applying some basic preprocessing to the data and testing the performance impact, and (2) keep iterating on the LeNet architecture by incrementally adding and deepening the layers.
How hard could it be, right?
DenseNets didn’t seem overly complex at the time, and I probably could have gotten them working if I’d just focused on this. However, in parallel, I tried to get Tensorboard working, which was in beta at the time. Trying to tackle both of these at once was a disaster. In short, creating DenseNets requires a lot of nested functions to create all of the various blocks of convolutional layers. Getting the Tensorboard namespaces to work, getting all of your variables to initialize properly, and getting all of the data to flow in and out of these blocks was a huge challenge. After a bunch of research and trial and error (and coffee), I ultimately abandoned this path. ¯\_(ツ)_/¯
I then tried to implement the (much simpler) inception framework discussed by Vincent during the lectures. After some trial and error, I got an inception network running. But, I couldn’t get it to perform better than 80% validation accuracy, so I abandoned this path as well. I believe this approach could have worked, but by this point, I wanted to get back to the basics. So, I decided to focus on data preprocessing and iterating on the original LeNet architecture (which I should have done from the beginning! Arg.)
After a day of sleep, yoga, and a few dozen oms to clear my head 🙏…I then got back to work.
I started by applying basic transformations to the data and testing simple adjustments to the LeNet architecture. Model performance started to improve, but I still had a variance problem. In the beginning, my models were consistently overfitting the training data, so my training accuracy was high but my validation accuracy was still low.
This is a summary of the tactics I deployed to improve performance.
Here are more details regarding the tactics above (in order of greatest impact on the model):
- Contrast Enhancement. Pound for pound, this tactic had the greatest impact on performance. It was easy to implement and my validation accuracy immediately jumped more than 7%. I only wish I’d done it sooner. As discussed in my initial exploration of the data, I predicted that the low contrast of many of the original images would make it difficult for the model to recognize the distinct characteristics of each sign. This is obvious even to the human eye! But for some reason, I didn’t implement this tactic until halfway through the project. Key lesson: design and test your pipeline around simple observations and intuitions BEFORE you pursue more complicated strategies.
- Augmentation v1 vs v2. The first iteration of my augmentation function boosted performance by 2% (which was great!). However, my range settings for the affine and color transformations were a little too aggressive. This made the training images overly distorted (this was obvious with the naked eye). Because of these distortions, the model kept overfitting (i.e., it achieved high training accuracy but wasn’t able to generalize to the validation set).
- In v2 of my augmentation function, I dialed down the range settings and got a 1% performance boost. Then I added ZCA whitening to improve edge detection and got another 1% lift. In my very last optimization, I then increased the number of images being produced by this function so that there were 6k images per class (instead of 4k). This tactic combined with longer training time yielded the final (and elusive!) 0.4% lift to bring the final validation accuracy to 99.4%. Then I slept.
- More layers and deeper layers. Surprisingly, and after many iterations, I learned that it doesn’t take a high number of layers or incredibly deep layers to achieve human-level performance. That said, some modest increases in the size of the model were critical to breaking the 95% accuracy plateau. You can see from the model diagram that I ended up with seven convolutional layers (five more than LeNet) and that my convolutional and fully connected layers are deeper than LeNet as well. Of course, to mitigate this extra learning power, I had to employ regularization tactics.
- Regularization. Both dropout and L2 regularization proved critical. Initially, I made the mistake of incorporating these too early or setting them too high, which caused the model to underfit. I then removed them altogether until I had a model that was starting to fit (even overfitting) and yielding high training accuracies. At that point, I added regularization back into the model and started to increase it whenever my model was overfitting (i.e., higher dropout and L2 decay values). After a few overcorrections, I ultimately landed on a dropout of 0.5 and decay of 0.0003.
- Bias initialization. Initially, I was initializing my biases at 0.01 (using `tf.constant`). Once I started initializing the biases at zero, my accuracy jumped more than 2%. This was a big surprise. Even after doing more research on the issue, I’m still not exactly sure why this small bias initialization negatively affected the model. My best guess is that even this small amount of bias was not self-correcting enough during backpropagation, and given that the data was normalized, that extra bias was causing additional overfitting in the model (link to source code).
- Grayscale. Just out of curiosity, I ran a test on a grayscaled version of the augmented image set. The grayscale set still performed well with a validation accuracy of 95.8%. But, this test turned out to be more trouble than it’s worth. The big problem was that there are a bunch of tools out there to help you convert RGB images to grayscale, and none of them (as far as I can tell) provide the correct shape. To feed grayscale images into the network, they need to be rank 4 `(batch_size, 32, 32, 1)`. So, you have to convert each RGB image from `(32, 32, 3)` to `(32, 32, 1)`. Seems simple, right? But all of the scripts I tested strip out the third dimension, yielding an image with shape `(32, 32)`. And, there wasn’t much help for this issue on StackOverflow, etc. After lots of troubleshooting, I finally discovered the underlying problem and used a simple matrix multiplication to apply the grayscale conversion while maintaining the right shape (link to source code).
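Here is one way the shape-preserving grayscale conversion can be sketched in NumPy. Multiplying by a `(3, 1)` weight matrix collapses the color channels while keeping the trailing dimension. The luma weights below are the widely used ITU-R BT.601 values and are an assumption on my part, since the exact weights in the linked source code may differ.

```python
import numpy as np

# ITU-R BT.601 luma weights as a (3, 1) column; assumed, not from the notebook.
RGB_TO_GRAY = np.array([[0.299], [0.587], [0.114]])


def to_grayscale(images):
    """Convert (batch, h, w, 3) RGB images to (batch, h, w, 1) grayscale."""
    # (..., 3) @ (3, 1) -> (..., 1), so the channel axis is kept.
    return images @ RGB_TO_GRAY


# Random toy batch standing in for real images.
batch = np.random.default_rng(0).random((5, 32, 32, 3))
gray = to_grayscale(batch)

print(gray.shape)  # (5, 32, 32, 1)
```

If a conversion tool has already dropped the channel axis, `gray[..., np.newaxis]` restores it.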
Testing the Model with New Images
I happened to be in Prague at the time, so I took advantage by testing my model on images of local traffic signs (the Czech Republic uses the same traffic signs as Germany and most of the EU, I think). For the test set, I gathered 30 new images: 11 of the images were pulled from the internet, and 19 were shot around the streets of Prague. Overall, I made the new image set quite challenging in order to learn about the strengths and weaknesses of the model.
Here is the complete set of new images and their corresponding originals.
Within the new image set, the ones below pose distinct challenges for the model. My hypothesis was that the model would get less than 50% of these correct while scoring above 80% on the other “normal” new images. In particular, some of the signs I found on the streets of Prague seem particularly challenging. How would the model react when it sees two standard signs combined into a single custom sign? Keep reading to find out!
1. Large Vehicles Prohibited. Like many signs that I encountered on the streets of Prague, a single traffic sign includes a combination of two or more signs/symbols.
2. No Trucks or Motorcycles. Again, what are normally two signs are incorporated into one.
3. Yield. Yet again, the image includes two signs (this one is from the internet).
4. No Entry. The bracket holding up this sign is damaged, so the sign is heavily tilted.
5. Turn Right. This sign is partially occluded by a very pink van.
6. 50 km/h. The viewing angle makes the image heavily sheared.
7. No Entry. This sign has graffiti on it. Those punks!
8. Ahead Only. This sign is only partially visible.
New Image Test Results
The overall accuracy dropped considerably to 77%, although the model performed pretty well on the new images of “normal” difficulty with 91% accuracy. However, this is still well below the 98.2% accuracy achieved on the original test set. This indicates just how quickly accuracy can drop off when a model encounters new patterns it hasn’t yet seen in the training set.
Top 5 Predictions
Below you can see the top 5 predictions and the corresponding softmax probabilities for a subset of the test images.
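For reference, here is a minimal pure-Python sketch of how the top 5 predictions can be derived from a model's raw outputs (logits). The logits are made-up toy values for a seven-class problem, and the helper names are illustrative, not taken from the notebook.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k=5):
    """Return the k most likely (class_index, probability) pairs."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy logits, made up for illustration only.
toy_logits = [2.0, 0.5, -1.0, 4.0, 0.0, 1.0, 3.0]
top5 = top_k(softmax(toy_logits))

for class_index, prob in top5:
    print(f"class {class_index}: {prob:.3f}")
```

A confident prediction shows one probability near 1.0, while a flat spread across the top 5 suggests the model is unsure.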
Precision & Recall
Next, we’ll measure the model accuracy using precision and recall.
Original Test Images
Listed below are the precision, recall, and F1 scores for the original set of test images.
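As a reminder of how these scores are defined, here is a small pure-Python sketch that computes precision, recall, and F1 for a single class. The labels and predictions are toy values invented for illustration, not results from the model.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correct hits
    fp = sum(t != label and p == label for t, p in pairs)  # false alarms
    fn = sum(t == label and p != label for t, p in pairs)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data, made up for illustration only.
y_true = [27, 27, 27, 15, 15, 1]
y_pred = [27, 15, 1, 15, 27, 1]

precision, recall, f1 = per_class_scores(y_true, y_pred, label=27)
print(precision, recall, f1)
```

Low precision means the model often predicts this class when it shouldn't, while low recall means it often misses real examples of the class.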
Here are the worst performing classes among the original test images.
Here are the worst performing classes for the new image set. Not surprisingly, the worst performing class from the original test set (label `27: Pedestrians`) is also one of the poorest performers in the new image set.
There are two things I want to call out here:
1. High errors for similar looking signs. If we look at the images from six of the worst performing classes between the two sets, we can see that they all look quite similar. This would help explain the high occurrence of false positives (low precision) and false negatives (low recall). This may also be a case where the transformations done during preprocessing overly distort the images, especially when they’re applied to low-resolution images. The additional loss in resolution can make it hard to distinguish some of the symbols from each other.
Given this, one future improvement to our pipeline would be to review how each transformation affects the various classes, and if needed, create a custom set of transformations to be applied on a class-by-class basis.
2. Too much augmentation is bad. I think the most interesting insight from the precision/recall data is the misclassification of the label `15: No Vehicles`. If we look at image samples from this class (below), it is arguably the simplest sign and should be one of the easiest to recognize. But upon further inspection, we can see that the contrast boosting function that boosted performance in other classes actually hurts us in this case. This is because any minor spots or shadows on the central white portion of the sign get exacerbated by the contrast enhancement function. These dark spots can then resemble symbols to the model. This is another example of how class-specific preprocessing tactics could improve the pipeline.