Introduction to the CARLA simulator: training a neural network to control a car (Part 2)

Maciek Dziubiński
Acta Schola Automata Polonica
16 min read · Mar 15, 2019

Training neural network models on data gathered with two deterministic controllers and my non-deterministic self.

Before we start, the source code for this whole project is available here. If you have questions regarding the code, please create an Issue. And if you’d like to gain access to the data to reproduce the results, leave a comment, and I’ll try my best to put it out somewhere.

EDIT: in a more recent blog post (here) I decided to use a much more lightweight environment than CARLA to train a model that drives *faster* than the algorithm that was used to collect the training data.

Introduction

In the previous part of this series, I trained models on depth maps (rather than RGB) collected from the CARLA simulator [1]. (There’s a good reason for this and I’ll discuss it at the end of this blog post.) However, while the essence of Part 1 was: how to create your own race track in CARLA and get a neural network to control a car to go around it, the gist of Part 2 is: how the source of the training data influences the models’ performance on the race track.

This time around I used a different car, one that is faster, more responsive, and (subjectively) more fun to control: the Audi TT. The default car (a Ford Mustang) has lower acceleration and its tires rarely lose traction. I wanted the model to have a chance to learn from data in which slippage occurs so that it would, at least in principle, “know how to handle” such situations.

I’ve also added three new race tracks. I’ll go through them in more detail below, but what’s important for now is that I used first two race tracks (creatively named: “01” and “02”) for training, and the other two (“03” and “04” , you guessed it) for testing. The measure of a model’s performance was the distance (expressed in laps around the track) accomplished in 10.000 steps of CARLA’s simulation.

Data

I used three methods of collecting data for training the models:

  1. The Proportional-Derivative (PD) controller introduced in Part 1;
  2. Model Predictive Control (MPC), which we’ll discuss shortly;
  3. The Game Pad (GP), i.e. I drove the car myself and trained the model to mimic my driving behavior.

Here are examples of the three methods of gathering data, as recorded on race track “01”.

Now, what’s crucial is that these simulations were run in synchronous mode (the simulation waits for the model to make a prediction), which gave an unfair advantage to the MPC. I’ll go into more details of what an MPC is but suffice it to say that it’s computationally demanding. Notice that in the movie above the average FPS (Frames Per Second) for MPC is roughly 14. So it takes about 70ms to produce a prediction. That’s partly because I implemented the MPC in Python, and partly because it simply is more complex than, for example, the PD controller. Running MPC (and other models for that matter) in asynchronous mode would require more work and I’ll need to look into that, but for now that wasn’t my focus.

Also, to “augment” the data, I added uniform noise to the steering and throttle of all the controllers. The values used for training were the clean, unspoiled values returned by a controller; the noise was added only to the signal sent to CARLA.
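A minimal sketch of that scheme in Python; the noise amplitude and the function name are my assumptions for illustration, not the exact ones from the repo:

import random

NOISE = 0.1  # assumed noise amplitude

def split_controls(steer, throttle):
    """Keep the clean controls as the training label; send noisy ones to CARLA."""
    label = (steer, throttle)  # what the model learns to predict
    noisy = (steer + random.uniform(-NOISE, NOISE),
             throttle + random.uniform(-NOISE, NOISE))
    return label, noisy  # (training sample, actuation sent to the simulator)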

As a side note, the weather is only an aesthetic effect: it influenced neither the tire friction nor the depth data on which the neural network models were trained. I used the weather conditions as an indicator of which model was used to control the car. The convention was: sunny→Proportional-Derivative controller, cloudy→Model Predictive Control, rainy→Game Pad, sunny late afternoon→Neural Network.

Control

In this section, I discuss the four methods of controlling the car used in this experiment:

PD controller

This type of control was already introduced in Part 1 of this series, so I’m not going to spend much time on it here. In short, the steering angle at time t is chosen so as to counter the deviation from the pre-defined path — the so-called Cross-Track Error, CTE. The steering angle is composed of two terms, one proportional to the CTE, and the other proportional to the time derivative of the CTE.

This is the most simplistic version of the PD controller and it served as nothing more than a baseline. Only the steering was controlled; the throttle was brutally clipped, keeping the velocity at around some chosen value. Moreover, this type of controller doesn’t make use of the information available in the pre-defined path; it only looks at the local state and naively tries to correct it.
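For reference, here’s a minimal sketch of such a PD steering rule; the gains and the time step are assumptions for illustration, not the values used in the repo:

K_P = 0.5   # proportional gain (assumed)
K_D = 2.0   # derivative gain (assumed)
DT = 0.1    # simulation time step [s]

prev_cte = 0.0

def pd_steer(cte):
    """Steer against the cross-track error and its rate of change."""
    global prev_cte
    d_cte = (cte - prev_cte) / DT
    prev_cte = cte
    steer = -(K_P * cte + K_D * d_cte)
    return max(-1.0, min(1.0, steer))  # clip to CARLA's [-1, 1] steer range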

MPC

The Model Predictive Control is something different entirely. The idea is to predict how the system (in this case: the car) will behave in the next N steps, assuming a simplistic model of the world (in this case: a kinematic model of motion). The MPC returns actuators (steer and throttle) for the next N steps, but only the actuators predicted for the next time step are used for control.

The values of the actuators are determined by minimizing a cost function (I’m only including terms relevant for this discussion, the ellipsis denotes “other terms”):

J = Σₜ [ c₁·CTEₜ² + c₂·(vₜ − v₀)² + c₃·δₜ² + c₄·aₜ² + … ],   t = 1, …, N

where δ and a are the steer angle and throttle (actually, acceleration), the {cᵢ} are coefficients chosen by the user, and v₀ is the target speed with which we want the car to drive. The type and number of terms differ between implementations, but what’s important is that:

  • on the one hand we want to keep CTE small (follow the path), but on the other we want to go with speed v₀;
  • the cost function is composed of terms spanning N steps into the future.

The MPC minimizes the cost subject to a set of constraints dictated by the kinematic model, which makes this a nonlinear optimization problem. In the accompanying repo I use the SLSQP (Sequential Least SQuares Programming) method available in the scipy.optimize.minimize function.
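To make this concrete, here’s a stripped-down sketch of the idea; the kinematic bicycle model, the coefficients, and the way the CTE is computed (distance to the nearest reference waypoint) are all simplifications for illustration, not the repo’s exact implementation:

import numpy as np
from scipy.optimize import minimize

N = 10           # prediction horizon (steps)
DT = 0.1         # time step [s]
L_F = 1.5        # distance from the center of mass to the front axle [m]
V_TARGET = 20.0  # target speed v0 [m/s]

def rollout(state, actuators):
    """Roll the kinematic model forward over the horizon."""
    x, y, psi, v = state
    states = []
    for delta, a in actuators.reshape(N, 2):
        x += v * np.cos(psi) * DT
        y += v * np.sin(psi) * DT
        psi += v / L_F * delta * DT
        v += a * DT
        states.append((x, y, psi, v))
    return states

def cost(actuators, state, path):
    """Penalize deviation from the path and from the target speed."""
    total = 0.0
    for (x, y, psi, v), (px, py) in zip(rollout(state, actuators), path):
        cte = np.hypot(x - px, y - py)  # simplified cross-track error
        total += 1.0 * cte**2 + 0.2 * (v - V_TARGET)**2
    return total

def mpc_step(state, path):
    """Optimize all N (steer, throttle) pairs; use only the first one."""
    x0 = np.zeros(2 * N)              # initial guess: coast straight
    bounds = [(-1.0, 1.0)] * (2 * N)  # CARLA-style actuator limits
    res = minimize(cost, x0, args=(state, path), method='SLSQP', bounds=bounds)
    delta, a = res.x[:2]
    return delta, a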

Once we find an optimal set of actuators, {(δᵢ, aᵢ)}, only the first pair, (δ₁, a₁), is used to control the car. Now, because the cost includes conflicting terms (follow the path, but go as fast as possible), the trajectory generated by the MPC is often similar to that of an apt driver. The car approaches some of the corners wide, without much slowing down, some require the car to really slow down, and some (like chicanes, for example) are elegantly cut.

I recommend this series of videos for a good yet gentle introduction to MPC.

GP

To control the car, I used the Logitech F310 game pad and the pygame module to pass the signal from the pad on to CARLA. I should note that at first, I tried to go as fast as possible: cut corners, drift through some of the toughest turns, and reach 100 km/h on the longest straights. The video provided earlier shows exactly that style of driving. But despite my best efforts, I was unable to train a neural network to mimic this type of driving, or even to train the model so that it would successfully complete at least one lap on any of the race tracks.
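Reading the pad boils down to polling pygame’s joystick API. A minimal sketch; the axis indices are assumptions (they differ between pads and drivers):

import pygame

pygame.init()
pygame.joystick.init()
pad = pygame.joystick.Joystick(0)  # the first connected game pad
pad.init()

def read_pad():
    """Return (steer, throttle), both in [-1, 1], from the game pad."""
    pygame.event.pump()          # refresh the internal joystick state
    steer = pad.get_axis(0)      # left stick, horizontal axis (assumed)
    throttle = -pad.get_axis(3)  # right stick, vertical axis, inverted (assumed)
    return steer, throttle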

So I cheated. I drove the car in such a way that the model had a chance of learning a certain pattern of behavior. This is how it looked:

The pattern of behavior that I wanted to “teach” the model was to follow the track along the barriers. Still, fitting a neural network to data produced by a human is hard, and we’ll discuss this a bit more in the Discussion.

The neural network model

I used a single neural network architecture, which is a composition of two sub-models: an embedder, and a multi-task set of output layers. The embedder is a LeNet-like sub-model (similar to the one used in Part 1) that yields a feature extraction layer, or: an embedding layer:
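Something along these lines; the exact layer sizes below are assumptions for illustration, not the ones from the repo:

from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense

# The embedder: a small LeNet-like stack over a single-channel depth map.
inp_image = Input(shape=(60, 80, 1), name='depth_map')
x = Conv2D(16, (5, 5), activation='relu')(inp_image)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(32, (5, 5), activation='relu')(x)
x = MaxPooling2D((2, 2))(x)
x = Flatten()(x)
emb = Dense(128, activation='relu', name='embedding')(x)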

The second part of the model builds on top of the feature extraction layer. First come the steering angles: each has its own hidden layer, followed by a single prediction neuron. Next, the embedding layer is concatenated with the speed (assumed to be provided as input) and with the outputs for the steering angles. Here’s how it looks in code:
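Continuing the embedder sketch above (again, the hidden layer sizes and the number of predicted steps are assumptions for illustration):

from keras.layers import Input, Dense, concatenate
from keras.models import Model

NUM_STEPS = 11  # the current step plus 10 steps into the future
inp_speed = Input(shape=(1,), name='speed')

# Each steering output gets its own hidden layer plus a prediction neuron.
steer_outputs = []
for i in range(NUM_STEPS):
    h = Dense(32, activation='relu')(emb)
    steer_outputs.append(Dense(1, name='steer_%d' % i)(h))

# Throttle outputs see the embedding, the speed, and the steering outputs.
emb_plus = concatenate([emb, inp_speed] + steer_outputs)
throttle_outputs = []
for i in range(NUM_STEPS):
    h = Dense(32, activation='relu')(emb_plus)
    throttle_outputs.append(Dense(1, name='throttle_%d' % i)(h))

model = Model(inputs=[inp_image, inp_speed],
              outputs=steer_outputs + throttle_outputs)
model.compile(optimizer='adam', loss='mse')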

Why like this? Why not build the predictions for throttle and steering on top of an embedding layer already concatenated with the speed?

When I provided speed as input for both steering and throttle, the model tried to cut corners too sharply:

It seems as if the model learned that low speed = sharp turn. When I tried to tweak the architecture (reduce complexity, in particular) I ran into a different problem: the model didn’t turn the car strongly enough when the speed was high. (However, in the Discussion section I mention an article that uses an architecture in which speed is an input for steer predictions, and their results look pretty impressive!)

But speed is important for making predictions of the throttle. Thus, the architecture detailed above is my workaround for this problem. Also, I added the steering angle outputs to the embedding layer in hopes of providing extra data for the throttle, but frankly, I don’t think it helped much. Same as with steering, each throttle output got its own hidden layer, followed by a prediction neuron.

Admittedly, a dedicated hidden layer for each of the outputs significantly increases the complexity of the model, and I needed to keep a wary eye on overfitting. A more natural approach would be to use a single, common hidden layer emb for all steering outputs and a concatenate([emb, inp_speed] + steer_outputs) layer for all throttle outputs. Such a model performed fine, but the more complex model managed tight corners better at higher speeds.

I’ve tried to generate an informative visualization of the architecture, but this is what I got using keras:

And I’ve really tried to make it more informative by using other tools, but didn’t get much improvement. So, like in Part 1, I claim that when it comes to explaining the model the code does a better job than visualizations.

Race tracks

I’ve generated three more maps the same way as described in Part 1. I first created race track “02” and on it tested a model build on “01”. It failed miserably; this new track was too tangled up. When I trained on “02” and tested on “01” the model also failed. So I decided to train on (“01”, “02”), and created race track “03” for testing, and the model performed OK. Once I finally decided on the architecture, the training process, etc., I added race track “04” to the test set to avoid selection bias, and the final set-up was: train on (“01”, “02”), test on “03” and “04”.

Results

Before jumping into the actual results, two important pieces of information:

1. This is not a thorough study

What I mean by that is that I didn’t choose this particular architecture by systematically evaluating a set of candidates. For example, the multi-task predictions for 10 steps into the future: that’s something I think stabilizes the predictions and makes the ride smoother. But just before publishing this post, I tested how well a model without those predictions performed on race track “04”, and it actually did OK. It got worse results, but I can’t say for sure it wouldn’t achieve similar or even better results if I were to devote enough time to tweaking it.

2. I had two knobs for controlling the pace of the car

The correlation between predicted vs. true values of throttle for almost all of the models looked like this:

There is some correlation, but this is a mediocre performance. Oftentimes, the car wouldn’t even start moving because the throttle predicted for the initial state was too low. To deal with that, I introduced two coefficients, α and β, such that the throttle prediction p was transformed into (α·p + β) before passing it to the CARLA client. This allowed me to squeeze better performances out of all of the models.
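In code the correction is a one-liner; the values of α and β below are made up for the example, in practice I tuned them by hand per model:

import numpy as np

ALPHA, BETA = 1.2, 0.1  # hypothetical knob settings

def adjust_throttle(p):
    """Affine correction applied to the raw throttle prediction."""
    return float(np.clip(ALPHA * p + BETA, 0.0, 1.0))  # CARLA expects [0, 1]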

But it also introduced human error into the results. A systematic approach would be to run a grid search over a large number of combinations of α and β, and report the best result acquired in this fashion. However, since I only had my home PC to evaluate the models, I had to cut corners (excuse the pun).

The following table sums up the performances of the controllers and the neural network models on the four race tracks:

Table 1: Average distances traveled in one episode (10,000 steps of simulation), expressed as the number of laps on the respective race track. The data are averaged over 20 episodes, and the averages are supplemented with standard deviations of the results. Note that each race track has a different length, so it only makes sense to compare results within each race track (column). The fastest controller and the fastest neural network model are shown in bold.

I need to explain the ⟨NN(.)⟩ models. Remember, the neural network was trained to output the next (steer, throttle) pair, plus 10 more pairs into the future. I’ve mentioned that the more complex model performed better, and that I “needed to keep a wary eye on overfitting”. I figured I could reduce variance by introducing a low-cost ensemble: simply bookkeeping the predictions for time step t made in the previous 10 steps and then averaging them. That is: I kept the predictions for time step t made at moments t-10, t-9, t-8, …, t-1, and t, and then averaged them.
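A sketch of this bookkeeping; the array shapes are my assumptions, matching the architecture sketch above (the steer and throttle heads stacked into rows of (steer, throttle) pairs):

from collections import deque
import numpy as np

HORIZON = 10  # the model predicts the current step plus 10 future steps

# Each entry is one raw model output: an array of shape (HORIZON + 1, 2),
# rows being the (steer, throttle) pairs for steps t, t+1, ..., t+HORIZON
# as predicted at the moment the entry was appended.
history = deque(maxlen=HORIZON + 1)

def ensemble_prediction(new_prediction):
    """Average all stored predictions that refer to the current step."""
    history.append(np.asarray(new_prediction))
    votes = []
    for age, pred in enumerate(reversed(history)):
        # A prediction made `age` steps ago holds the current step in row `age`.
        votes.append(pred[age])
    return np.mean(votes, axis=0)  # the averaged (steer, throttle) pair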

In what follows, I’m focusing on those ⟨NN(.)⟩ models, since they seem to perform at least as well as their “pure” counterparts.

The best controller is the MPC, but that’s no surprise. What’s interesting is that the ⟨NN(MPC)⟩ model performs only slightly better than ⟨NN(GP)⟩, even though GP performs much worse than MPC. Let’s see how these two models control the car on race track “04”:

⟨NN(MPC)⟩’s performance on race track “04”

That’s smooooth! I really like the way the car slows down in critical moments, and rapidly accelerates after managing them. And remember, this is this model’s first time on this race track.

Now, the model trained on the game pad data:

⟨NN(GP)⟩’s performance on race track “04”

Notice how wobbly the car behaves, and also: how on occasion the model decides to slow down for no apparent reason. There’s something “wrong” with this model, and I comment on it in the Discussion.

And, for completeness’ sake, let’s see how the ⟨NN(PD)⟩ model does on the same track:

⟨NN(PD)⟩’s performance on race track “04”

There’s a suitable meme summarizing the performance of the ⟨NN(PD)⟩ model:

Discussion

It seems that the results for ⟨NN(MPC)⟩ are better than for NN(MPC), but in the case of ⟨NN(GP)⟩ vs. NN(GP) the gain is practically nonexistent. I conclude that the reason why ⟨NN(MPC)⟩ performs better than NN(MPC) is because NN(MPC) suffers from overfitting and the ensemble indeed helps reduce the variance of the model. On the other hand, because the ensemble didn’t help the NN(GP) model, it seems that overfitting is not this model’s main problem.

What is, then?

I’ve tried to reduce bias by crafting more and more complex additions to the architecture. There were many of them, too many to list here, but I should note: the reason why the final model is so darn complicated is a corollary of this search for bias reduction.

In the end, I failed to reduce the bias; besides, I don’t think that’s why NN(GP) is so wobbly and indecisive.

In his lecture (I highly recommend the whole course!), Sergey Levine gives two good reasons why training a model on data collected from a human expert is problematic:

  1. Non-Markovian behavior;
  2. Multimodal behavior.

The question is: why didn’t these problems prevent me from successfully fitting a model to the game pad data? There are two main reasons:
1. The test race tracks are not that different from the training ones.
2. I claim (I haven’t checked that, I would need to see how well a model trained on RGB performs) that the depth images help reduce the gap between training and test distributions (check out this lecture’s derivation for more details).
Still, I wonder if any of the architectural decisions (predicting future steps, in particular) helped alleviate at least the “multimodal behavior” problem.

Also, from a reinforcement learning perspective, in [2] we read (IL stands for “Imitation Learning”):

Overall, when using human demonstrations, online IL can be as bad as batch IL, simply due to inconsistencies introduced by human nature

Check out that article and also these materials provided by the good people at Georgia Tech — they are truly worth your while! And also, on the topic of this article, this is their architecture:

It looks like providing speed as input for steering predictions didn’t lead to the problems I’ve experienced. I’m guessing that if I spent more time tweaking the architecture, I would eventually get satisfactory results.

Model selection

Like in Part 1, I struggled with model selection. To illustrate my frustration, let me give you a puzzle.

Here are three scatter plots of the predicted vs. true steering angles; your task is to guess which of the models (NN(PD), NN(MPC), NN(GP)) each of them belongs to:

OK, I think we can all agree that NN(MPC) was the best model, NN(GP) was second, and NN(PD) was third. Based on that, it’s easy to guess that C is the NN(MPC) model, that’s the scatter plot with the highest correlation on the test set.

I found it a bit confusing that a model with scatter plot A was able to finish even a single lap. Let me explain: it seems that this model correctly predicts the sign of the steering angle (e.g. if the steer angle was less than zero the model generally returned a value less than zero), but the actual magnitude of the prediction looks a bit more like an output from a call to random.uniform(-1, 0) (actually, it’s not, the points are not distributed uniformly). It seems especially worrisome for critical situations, in which the car needs to perform a sharp turn, and from scatter plot A it would seem that in such situations the model can return a value close to zero.

Yet, scatter plot A belongs to the NN(GP) model, and B to the NN(PD) model. Even though its instantaneous predictions are flawed, the NN(GP) model is able to beat corner after corner, and finally the whole track. But more than that: the NN(GP) model in reality (and by “in reality” I mean “in simulation”) performs significantly better than a model with a higher correlation between true and predicted steer values, the NN(PD) model.

I should note that I selected the best model based on the MSE on the test set (by the way, the test set in this context is not race tracks “03” and “04”, but rather laps around race tracks “01” and “02” that were not used for training). Perhaps evaluating the models on whole corners (sequences of steering angles) would be a better strategy? I should look into offline policy evaluation strategies, because my current procedure has glaring flaws.

Future work and why use depth images instead of RGB

A high-level list of ideas I’m planning to explore:
1. Get the network to predict the speed, not throttle, and use a controller for throttle to achieve that speed (might not work if changes in speed require the controller to be able to adapt quickly).
2. Other / better controllers—here’s a good source of implementations; also POLO looks like an interesting avenue, but additionally: I have an idea (partly inspired by POLO) of supplementing the cost function in the MPC with “learnable” terms making the simple kinematic model a bit more realistic.
3. Find better strategies of evaluating the models offline.
4. Run CARLA in asynchronous mode.

And why use depth images rather than RGB? This is my miniature car:

I called it Karr (to pay homage to KARR) and it’s loosely based on the F1/10th specification. Those two sensors in the front are a ZED and an Intel RealSense D435i: two stereo cameras that provide depth images. The processing unit is the Jetson TX1, with its own GPU, capable of fairly fast inference using neural networks.

My plan is to transfer knowledge from the simulation to the real world and to use a model trained in CARLA to drive around a race track with Karr. Others have already tried this using RGB data (see for example [3] and [4]), but I have a sneaking suspicion that depth images may suffer from a narrower “simulation-to-reality gap”.

But also, depth images, if treated like point clouds, allow for a range of informative linear transformations for augmenting data (like translations and rotations), providing a better representation of the data to a neural network (making use of the bird’s eye view but better), and localizing the car on the track.

I don’t know yet what I’m going to write about in Part 3, but there’s plenty to choose from:
1. Karr and domain adaptation: fighting the simulation-to-reality gap.
2. A more detailed look into CARLA and Unreal Engine 4.
3. Intro to robotics and depth cameras.
If you have a pick I would love to hear from you in the comments below!

References

[1] Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., & Koltun, V. (2017). CARLA: An open urban driving simulator. arXiv preprint arXiv:1711.03938.
[2] Pan, Y., Cheng, C. A., Saigol, K., Lee, K., Yan, X., Theodorou, E., & Boots, B. (2017). Agile off-road autonomous driving using end-to-end deep imitation learning. arXiv preprint arXiv:1709.07174.
[3] Yang, L., Liang, X., Wang, T., & Xing, E. (2018). Real-to-Virtual Domain Unification for End-to-End Autonomous Driving. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 530–545).
[4] Bewley, A., Rigley, J., Liu, Y., Hawke, J., Shen, R., Lam, V. D., & Kendall, A. (2018). Learning to Drive from Simulation without Real World Labels. arXiv preprint arXiv:1812.03823.
