How hard is it to build a self-driving car with a budget of $60 in roughly 150 hours? Well, (spoiler alert) harder than we thought. In this post, we explain how we assembled and successfully trained a robot car using a deep learning framework.
The main idea was to replicate the paper End to End Learning for Self-Driving Cars in a simplified environment. We already had some hardware at our disposal (a limited Lego NXT robot and a Logitech camera), so we only bought a Raspberry Pi 3, a 16 GB SD card and a power bank. We combined all these parts to form a robot car with vision capability. The goal was simple and direct: train the robot to drive through a track made of paper.
Okay, it didn’t look like a new Tesla at all (can you see the fancy rubber bands that we used to make everything stick together?), but it turned out to be a good proof of concept. We are releasing the code on GitHub so that anyone can replicate the results.
“Shakey 2”, as we affectionately nicknamed it after Stanford’s robot Shakey, is a mobile robot that uses a differential-drive design, where each wheel is connected to a motor that can be independently controlled.
For our driving purposes, we synchronized the motors to achieve all our high-level commands — “up”, “right” and “left” (since we used the arrow keys to command the robot, by “up” we mean “forward”).
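To make the differential-drive idea concrete, here is a minimal sketch of how the three high-level commands could be translated into per-wheel motor power. The function name `wheel_powers`, the sign convention and the default power values are assumptions for illustration; they are not taken from the released code.

```python
# Hypothetical power levels, as a percentage of full motor power.
FORWARD_POWER = 20
TURN_POWER = 20


def wheel_powers(command, forward_power=FORWARD_POWER, turn_power=TURN_POWER):
    """Return (left_wheel, right_wheel) power for a high-level command."""
    if command == "up":
        # Same power on both wheels: the car moves straight ahead.
        return (forward_power, forward_power)
    if command == "left":
        # Wheels spin in opposite directions: the car rotates in place to the left.
        return (-turn_power, turn_power)
    if command == "right":
        # Mirror image of "left": the car rotates in place to the right.
        return (turn_power, -turn_power)
    raise ValueError(f"unknown command: {command!r}")


for command in ("up", "left", "right"):
    print(command, wheel_powers(command))
```

Keeping this mapping in one small function means the rest of the program only ever deals with the three command names.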
One major issue when dealing with embedded systems is the limited resources available; tons of great ideas never make it off paper because of memory and processor constraints. Fortunately, at least for driving our robot car, this wasn’t an issue.
The program that drives the NXT car was written in Python using the open source library nxt-python, and ran entirely on the Raspberry Pi. This approach relied on the Lego Communications Protocol (LCP), embedded in leJOS and the Lego firmware, to send and receive data from actuators and sensors without having any code actually running on the NXT intelligent brick (the “robot’s brain”).
Imitation as supervised learning
How do you program a robot to drive a car down a road without crashing? One answer is imitation: train a model to map inputs (images) to actions (car commands) in the same way an expert would do it. We addressed this imitation problem as a simple supervised learning problem (a different point of view worth checking is the one from the field of imitation learning).
From the machine learning perspective this problem is quite straightforward. We can see the driving task either as a classification problem or as a regression problem.
- As a classification problem we have a collection of images and commands; each image is labeled with the command that was issued when it was taken. We train a parameterized model to map each image to a distribution over the classes. We want this distribution to be close to the one that generated the dataset, hence we train the model using maximum likelihood estimation.
- In the case of regression we associate each image with a vector of real values (acceleration, steering wheel angle, etc.). Here our parameterized model tries to generate predictions as close as possible to the real ones, so we train the model by minimizing some error measurement between the model’s prediction and the ground truth (like the mean squared error).
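The two training objectives above can be sketched in a few lines of plain Python. For classification, maximum likelihood amounts to minimizing the negative log-likelihood of the correct label; for regression, the example uses the mean squared error. The function names and the toy scores are illustrative assumptions.

```python
import math

ACTIONS = ["up", "left", "right"]


def softmax(scores):
    """Turn raw scores into a probability distribution over the classes."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(probs, label_index):
    """Classification loss: small when the correct class gets high probability."""
    return -math.log(probs[label_index])


def mean_squared_error(prediction, target):
    """Regression loss: average squared gap between prediction and ground truth."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


# Toy example: made-up scores for one image whose label is "up".
probs = softmax([2.0, 0.5, -1.0])
print(negative_log_likelihood(probs, ACTIONS.index("up")))
```

Averaging the negative log-likelihood over the whole dataset gives the usual cross-entropy loss.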
As a first approach, we decided to frame the driving problem as a classification one. We abstract the robot’s control to only three categories: “up”, “left” and “right”. So, we search for a function f mapping 45x80x3 road images to a probability distribution over the actions.
The function f could be from any family of models. In this experiment we restrict ourselves to two families only: Deep Feedforward Networks (DFN) and Convolutional Neural Networks (CNN).
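As a rough illustration of what such a function f looks like for a DFN, the sketch below runs a forward pass from a flattened 45x80x3 image to three action probabilities. The ReLU activation, the random initialization and the tiny hidden layer of 8 units are assumptions chosen to keep the example fast; the post does not specify them.

```python
import math
import random


def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (illustrative initialization only)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(pixels, layers):
    """Map a flattened image to a probability distribution over the actions."""
    activations = pixels
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(activations, weights, biases))


rng = random.Random(0)
n_features = 45 * 80 * 3  # flattened color image
layers = [init_layer(n_features, 8, rng), init_layer(8, 3, rng)]
image = [rng.random() for _ in range(n_features)]  # stand-in for a real frame
probs = forward(image, layers)
print(dict(zip(["up", "left", "right"], probs)))
```

With untrained weights the output is meaningless; the point is only the shape of the computation: 10800 inputs in, 3 probabilities out.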
Since the data points are acquired while driving, the assumption of independent and identically distributed data (i.i.d.) is violated. We tried to solve this problem by collecting data in different road conditions (i.e., different lighting and different floor details), keeping the shape of the track the same — an oval track.
The data collection phase consisted of laps around the track. We controlled the car ourselves and, most of the time, tried to keep it in the center of the lane. We collected almost 4 hours of data (way less than Nvidia’s 72 hours of training data). On this kind of track we found ourselves going straight far more often than turning, so we ended up with an unbalanced dataset.
In order to fix this, we created new data points by flipping images with labels “left” and “right”.
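A minimal sketch of this kind of augmentation, assuming the flip is horizontal (a mirror image) and that the label is swapped along with it, so a mirrored “left” frame becomes a “right” example and vice versa. Images are represented as nested lists (rows of pixels) to keep the example dependency-free; the function names are hypothetical.

```python
FLIPPED_LABEL = {"left": "right", "right": "left"}


def flip_horizontal(image):
    """Mirror an image stored as rows of pixels (height x width x channels)."""
    return [list(reversed(row)) for row in image]


def augment_turns(dataset):
    """Return the dataset plus a mirrored copy of every turning example."""
    augmented = list(dataset)
    for image, label in dataset:
        if label in FLIPPED_LABEL:
            augmented.append((flip_horizontal(image), FLIPPED_LABEL[label]))
    return augmented


# Toy 2x2 single-channel "images", just to show the mechanics.
tiny = [[[1], [2]], [[3], [4]]]
data = [(tiny, "left"), (tiny, "up")]
print(len(augment_turns(data)))  # the "left" example gains a mirrored twin
```

Only the turning examples are duplicated, which raises their share of the dataset relative to “up”.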
One of the bottlenecks of this project is the “forward pass”, i.e. calculating the probability distribution from an image. All the training was done on a desktop computer way more powerful than the Raspberry Pi, but to control the car the forward pass has to run on the Raspberry Pi itself. So we decided to experiment with single-channel images (this reduces the number of features from 45x80x3 = 10800 to 45x80 = 3600). The transformation that gave us the best results was binarization.
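The feature arithmetic and the binarization step can be sketched as follows. The grayscale conversion (a plain channel average), the 0 to 255 pixel range and the threshold of 128 are assumptions for illustration; the post does not say which threshold or conversion was used.

```python
HEIGHT, WIDTH, CHANNELS = 45, 80, 3

features_color = HEIGHT * WIDTH * CHANNELS  # 10800
features_binary = HEIGHT * WIDTH            # 3600


def to_grayscale(image):
    """Average the color channels of each pixel."""
    return [[sum(pixel) / 3.0 for pixel in row] for row in image]


def binarize(image, threshold=128):
    """Map each pixel to 1 (bright) or 0 (dark) using a fixed threshold."""
    return [[1 if value >= threshold else 0 for value in row]
            for row in to_grayscale(image)]


# One row with a white pixel and a black pixel.
print(binarize([[(255, 255, 255), (0, 0, 0)]]))
print(features_color, features_binary)
```

Cutting the input to a third of its size shrinks the first layer of the network by the same factor, which is where most of the weights of a DFN live.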
Hence, we ended up with two types of models: the ones trained on the original dataset, and the ones trained on the binarized dataset. For anyone interested in the data, it’s also available on GitHub.
Deeper is better?
We trained several different network architectures. For the DFN models each architecture is described as a list; for example, “[774, 3]” stands for a network composed of one hidden layer of size 774 and one output layer of size 3. The table below shows all the different results:
The simplest model, the softmax classifier, already presents some interesting results: we noticed that it was easier to get good results for the categories “left” and “right” (with accuracy of more than 80% for each one), but we struggled with the accuracy for the label “up”. The car should be able to turn properly, but it should also know how to exploit any straight path (not only the ones in the middle of the lane) in order to move forward. Hence we prefer models with good accuracy on all the different categories (like the one with architecture [1333, 200, 3]).
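Per-category accuracy of this kind can be read off a confusion matrix. The sketch below builds one from true and predicted labels and computes, for each class, the fraction of its examples that were classified correctly (i.e. per-class recall). The labels in the example are made up, not taken from the experiments.

```python
ACTIONS = ["up", "left", "right"]


def confusion_matrix(true_labels, predicted_labels, classes=ACTIONS):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def per_class_accuracy(matrix, classes=ACTIONS):
    """Fraction of each true class that was predicted correctly."""
    result = {}
    for i, name in enumerate(classes):
        total = sum(matrix[i])
        result[name] = matrix[i][i] / total if total else float("nan")
    return result


# Made-up labels, only to show the mechanics.
truth = ["up", "up", "left", "right"]
guess = ["up", "left", "left", "right"]
print(per_class_accuracy(confusion_matrix(truth, guess)))
```

Looking at each class separately matters here because overall accuracy on an unbalanced dataset can hide a weak class.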
To describe the different CNN models we used a similar notation: “[(24, 5), 731, 3]” indicates one convolutional layer with 24 filters (kernel size 5x5), one hidden layer of size 731 and one output layer of size 3. We always add a max pooling layer (kernel size 2x2) after a convolutional layer.
One might think that a deeper architecture model with high accuracy, like the DFNs with two hidden layers, would be the best choice. That would be reasonable if we had enough computational power and did not require an almost real-time response from our system. So for our on-board processing requirement, shallow networks under 1000 units were a better choice.
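To get a feel for why layer sizes matter on the Raspberry Pi, the DFN list notation can be turned into a rough parameter count. The sketch assumes fully connected layers with one bias per unit and a flattened 45x80x3 color input; both are assumptions, and the count is only a proxy for the cost of the forward pass.

```python
def dfn_parameter_count(n_inputs, layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes uses the post's list notation, e.g. [276, 3].
    """
    total = 0
    previous = n_inputs
    for size in layer_sizes:
        total += previous * size + size  # weight matrix plus bias vector
        previous = size
    return total


N_COLOR = 45 * 80 * 3  # 10800 input features

print(dfn_parameter_count(N_COLOR, [3]))             # 32403 (softmax classifier)
print(dfn_parameter_count(N_COLOR, [276, 3]))        # 2981907
print(dfn_parameter_count(N_COLOR, [1333, 200, 3]))  # 14665136
```

Almost all of the parameters sit in the first layer, so the size of the first hidden layer (and the number of input features) dominates the cost.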
Using the accuracy on the test set we ended up selecting 4 DFN and 2 CNN models. The best model was a CNN with architecture [(36, 5), 3] trained on the original dataset. The confusion matrix for this model is shown below:
Well, did it work or not?
Yes, it did!
After selecting the models, we observed how each one of them would behave by using them to print the probability distribution on a new set of images — this was just a very simple form of simulation.
In the end, we chose two models to perform in the real world:
- a DFN with architecture [276, 3] using the original image as input.
- a CNN with architecture [(36, 5), 3] also using the original image as input.
Both models had a forward pass time of approximately 1.35 seconds. The models trained on the binarized dataset were faster (with a forward pass time of approximately 0.6 seconds), but their accuracy was no better than that of the models mentioned above. In the end, we decided to be slow and safe. And boy, oh boy, the robot car was slow!
Since we chose a differential-drive design, we had no trouble with the power applied to each wheel while turning “left” or “right”, because the car can turn on its own axis. For the “up” command, however, finding the right amount of power was critical to the success of this project.
We started with 20% of power for both wheels. At first it seemed like a good choice, but we had some problems with the “up” command: the robot predicted the correct turn action but performed it outside of the paper track because it was moving too fast.
To cope with the processing time of the “take image, predict action, execute action” loop, we chose to slow down to 10% of power in forward movements. That made it slow as a turtle, but it got the job done.
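The loop described above can be sketched with the three stages passed in as plain functions, which also makes it easy to measure how long the prediction step takes. The names `capture_image`, `predict_action` and `execute_action` are hypothetical placeholders; in a real setup they would wrap the camera, the trained model and the motor commands.

```python
import time


def drive(capture_image, predict_action, execute_action, steps):
    """Run the take image, predict action, execute action loop.

    Returns the time spent in the prediction step for each iteration.
    """
    latencies = []
    for _ in range(steps):
        image = capture_image()
        start = time.perf_counter()
        action = predict_action(image)
        latencies.append(time.perf_counter() - start)
        execute_action(action)
    return latencies


# Stand-ins for the camera, the model and the motors.
executed = []
latencies = drive(
    capture_image=lambda: "frame",
    predict_action=lambda image: "up",
    execute_action=executed.append,
    steps=3,
)
print(executed, [round(t, 6) for t in latencies])
```

Because the car keeps moving while the prediction is being computed, the distance travelled between two decisions grows with both the latency and the forward power, which is why lowering the power helps.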
After successfully driving in the training track we assembled a test track with a new shape (new types of curves and different lane sizes) to check if the model could generalize to tracks that it had not observed previously.
And it did!
Building a complete self-driving car is no joke (it takes more than a Raspberry Pi and a Lego robot, that’s for sure). Here we tried to show that the task of lane and road following can be achieved by anyone with some robotics and machine learning skills.
The use of the NXT Lego robot doesn’t make this project a low-cost one (one NXT kit costs $549.99). We chose this robot simply because it was available for us. If you don’t have it, there are a lot of different and cheaper options available (for example, the project DeepPicar uses a New Bright 1:24 scale RC car — it costs only $10). Our code can easily be adapted for any car platform, so we welcome anyone to use it and contribute to it :)
Well, the fun isn’t over. There are also a lot of things to pursue next. Here are some examples:
- Sometimes the data that the model sees in training is very different from the data that it sees while driving in the real world: a misclassification (for example, swapping the command “left” for “right”) can put the car in a different position on the track, exposing it to images never encountered before. We think that one way to mitigate this problem is to create a new dataset where we control the car normally but introduce some random disturbances while driving.
- With better hardware, it is possible to use robust CNN models together with some visualization techniques to have some understanding of the network’s inner workings.
- The Ackermann steering model is the standard in the robot car community. A natural next step is to combine our framework with this model, allowing the robot car to accelerate and steer at the same time. This would require a new and refined dataset that associates each image with a label composed of steering angle and motor power, as well as a review of the whole machine learning process.
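The first idea in the list, collecting data with random disturbances, could look something like the sketch below. This is one possible reading, not something we have tested: with a small probability the car executes a random command instead of the driver’s, while the dataset still records what the driver wanted. The function name and the disturbance rate are hypothetical.

```python
import random

ACTIONS = ["up", "left", "right"]


def disturbed_command(expert_command, disturbance_rate, rng):
    """Return (executed, recorded) commands for one driving step.

    The executed command is occasionally replaced by a random one, so the
    car drifts into unusual positions; the recorded label is always the
    command the human driver issued.
    """
    if rng.random() < disturbance_rate:
        executed = rng.choice(ACTIONS)
    else:
        executed = expert_command
    return executed, expert_command


rng = random.Random(42)
steps = [disturbed_command("up", 0.1, rng) for _ in range(20)]
print(sum(1 for executed, recorded in steps if executed != recorded))
```

The images taken after a disturbance show the car in off-center positions, paired with the driver’s corrective commands, which is exactly the kind of data the model otherwise never sees.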
The road to autonomous vehicles is not a short one, but we hope to give you the sense that it is way easier than it seems to take the first step.