From semantic segmentation to semantic bird’s-eye view in the CARLA simulator

Maciek Dziubiński
Acta Schola Automata Polonica
11 min read · May 13, 2019


Building a neural network for reconstructing the top-down view from four side cameras.

Modo de volar (A way of flying) by Francisco Goya (source:

In this blog post I’m presenting a neural network architecture for transforming this type of input:

Four side cameras: FrontSS, LeftSS, RightSS, and RearSS (SS stands for Semantic Segmentation).

into this type of output:

Semantic bird’s-eye view.

A quick note here: the side cameras don’t cover the full 360° around the car, so the model has to infer the missing regions from those actually provided at input. Also, I got the semantic segmentation for free from the CARLA simulator: it isn’t the output of any model, it’s the ground truth.


To reiterate: given four side cameras with built-in semantic segmentation (we’ll discuss that in a moment), we’ll assemble the top-down view as seen from a camera above the car. From this perspective we can immediately see the car’s surroundings, which way the road goes, and the positions of other vehicles on the road. The inspiration for this project was the CARLA Autonomous Driving (AD) Challenge, in which one of the categories is based on input from four side cameras.

If you haven’t heard about the CARLA Autonomous Driving (AD) Challenge, that’s OK: it’s not a prerequisite for this blog post, and I’m going to introduce the important parts here anyway.

The objective of the CARLA AD Challenge is to complete a given route (above), given different types of input data that can be used by the algorithm driving the car. Below are the four categories, or Tracks, that specify the available input:

Let’s look at Track 2’s description:

Track 2 — RGB cameras: for brave ML teams that want to explore driving with cameras only.

Brave? ML? Yup, that’s us ;)

Now, since this is a simulator, I can gather virtually any type of additional data I can imagine, train a neural network capable of modelling that data, and then use the model’s predictions for controlling the car, for SLAM (Simultaneous Localization And Mapping), or for path planning. In my case, the additional data was the top-down view from a camera mounted 100 m vertically above the car, capable of providing ground truth semantic segmentation of the scene; I call it the semantic bird’s-eye view.


OK, so if in the CARLA Challenge the available cameras are RGB, without semantic segmentation, why did I utilize it in my model? There are three reasons:

  1. It was easier to get a straight answer to the fundamental question: “can it be done?”
  2. I figured that once I knew the answer to the above question, I’d be able to come up with a decent semantic segmentation model, especially for the type of synthetic images acquired from a simulated environment.
  3. I also figured that once I had a neural network capable of mapping four semantic segmentation cameras into a single top-down view in a simulator, then, given similar side cameras mounted on a real car and a decent model for semantic segmentation of real-world images, I would be able to re-create this semantic bird’s-eye view in the real world. Easy peasy, right? ;)

Although the last reason is admittedly a bit far-fetched, it’s actually the one I personally find the most compelling.

The architecture

My initial thought was to use a fully convolutional autoencoder for the four input images, combine their bottlenecks in some clever manner, and build a fully convolutional network for assembling the top-down view. But, if you think about it, fully convolutional models map a given part of the input onto a particular part of the output, sort of like a “1-to-1” mapping. For our purposes the architecture should allow any part of the input to affect any part of the output. Let’s call that an “any-to-any” mapping. (See the EDIT below for a retrospective comment.)

To model this “any-to-any” mapping we need, somewhere in our architecture, a layer that has this magical property of carrying information from any of its input neurons to any of its output neurons. If only such a layer existed…

Obviously, the dense layer works fine, and that’s what I used in the end. But before disclosing the architecture, I want to ask you this, dear Reader: have you come across an image-generating architecture that uses convolutional layers but no dense layers, and in which any part of the input can affect any part of the output? I was thinking about an attention mechanism combined with a convolutional layer, which would have this property, but this solution would suffer from the same drawbacks as the dense layer.

I should probably explain my seeming animosity towards the dense layer: the number of parameters in my final model is 8,234,179, of which 7,150,080 (about 87%) come from the dense layers! Thus, I have a hunch there is a more natural approach. It might still use dense layers, by all means, but in a cleverer way than I did. Maybe a factorization module is the way to go? Maybe there’s a recurrent approach? I’m open to your thoughts and suggestions in the comments section.

The model I used for inference has the following graphical representation:

The relevant parts are:

  • gray: the four inputs from: FrontSS, LeftSS, RightSS, and RearSS cameras (“SS” stands for Semantic Segmentation, of course);
  • orange: the encoder_submodel which is a fully convolutional encoder part of an autoencoder (all inputs share the same encoder model), more about the autoencoder below;
  • yellow: a stack of dense layers followed by a Conv2D, then Concatenate, and then Conv2D (one stack per input type);
  • orange: the decoder_submodel which is a fully convolutional decoder part of the same autoencoder from which we extracted the encoder_submodel (again, there is one decoder for all types of input);
  • cornflower blue: reconstruction, which is the final semantic bird’s-eye view.

The inference model was built from building blocks (in particular, the encoder/decoder parts of an autoencoder) of a more complicated model trained using several auxiliary losses. Without going into details: on the one hand, the model learned how to efficiently encode and decode images from the side cameras and the top-down camera using a common autoencoder, and on the other hand, how to reassemble the top-down view from the encoded side cameras’ images. For an implementation check out this notebook.

EDIT: as Bartosz Topolski pointed out, it’s possible to use a fully convolutional architecture that has this “any-to-any” property. You just need to make sure that the filter size of the convolutional layer after the autoencoder’s bottleneck is large enough (with respect to the bottleneck’s size). I’ve checked and he’s right: here is a notebook with a model that doesn’t use dense layers. This solution lacks some of the properties of a “standard” fully convolutional model (you can’t expect an input of a different shape to be processed correctly), but it goes to show that the current architecture can be improved. For a 3-class variant (details below) my model had a loss of ~0.062, and Bartek’s had ~0.042 on the test set! And on the 7-class variant mine had ~0.40 and Bartek’s had ~0.36.
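To make the “large enough” condition a bit more concrete, here is a small NumPy sketch (it is not taken from either notebook). It assumes a bottleneck of 12 × 18 cells, which is what a 96 × 144 input would give after downsampling by 8; the real bottleneck shape, the all-ones kernels, and the helper names are my assumptions for illustration. The sketch nudges one corner cell of the input and measures what fraction of the output cells of a single zero-padded convolution react.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Zero-padded 'same' 2D cross-correlation for odd-sized kernels."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def influenced_fraction(shape, kernel_shape):
    """Fraction of output cells that change when the corner input cell changes."""
    kernel = np.ones(kernel_shape)
    base = np.zeros(shape)
    bumped = base.copy()
    bumped[0, 0] = 1.0  # a corner is the worst case for reaching the far side
    changed = conv2d_same(bumped, kernel) != conv2d_same(base, kernel)
    return changed.mean()

bottleneck = (12, 18)  # assumed bottleneck shape (rows, columns)
small = influenced_fraction(bottleneck, (3, 3))
large = influenced_fraction(bottleneck, (2 * 12 - 1, 2 * 18 - 1))
print(small, large)
```

With a 3 × 3 kernel only the 4 cells around the corner react (4 out of 216), whereas a 23 × 35 kernel lets the corner cell reach every output cell, which is the “any-to-any” behaviour in a single convolution.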


A top-down view has a number of advantages:

  • it condenses relevant information from four images into one;
  • it can serve as a building block for localization/mapping algorithms;
  • it can be used for path planning (an example of which I present below);
  • it allows for training a higher-level reinforcement learning agent that chooses an optimal path rather than atomic actions (a slightly longer comment follows later).

I’m going to show results of a crude path planning procedure at the end of this post. I’ll probably devote a future blog post to the topic of possible applications of the method presented here.

Training and Results


I’ve trained the model on data gathered in the two default towns available in CARLA (Town01 and Town02), 40 episodes per town, each episode 1000 frames long. For validation and testing I used separate episodes but from the same towns. All results shown here were produced on the test episodes.

I gathered images at a resolution of 300 × 200 (the same as in the CARLA AD Challenge), but because training took a long time I decimated them by keeping every other row and column (150 × 100) and then cropped them so that both dimensions would be divisible by 8 (a requirement of the fully convolutional autoencoder); the final shape was 144 × 96. In this set-up the training took ~16h on my GTX 1080 Ti. Inference on the same machine takes 4.5 ms; note, however, that in a full setup (from RGB to semantic bird’s-eye view) we would also need some time to produce the semantic segmentation of the input images.
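The decimate-then-crop step can be sketched with plain NumPy slicing. This is a minimal illustration, not the project’s preprocessing code: the function name is made up, and cropping symmetrically around the center is my assumption (the post only says the images were cropped).

```python
import numpy as np

def decimate_and_crop(image, step=2, multiple=8):
    """Keep every `step`-th row/column, then crop so both sides divide by `multiple`."""
    small = image[::step, ::step]
    h, w = small.shape[:2]
    new_h, new_w = h - h % multiple, w - w % multiple
    # Assumption: crop evenly from both sides (center crop).
    top, left = (h - new_h) // 2, (w - new_w) // 2
    return small[top:top + new_h, left:left + new_w]

frame = np.zeros((200, 300), dtype=np.uint8)  # rows x columns, i.e. a 300 x 200 image
cropped = decimate_and_crop(frame)
print(cropped.shape)  # (96, 144)
```

Plain decimation (no interpolation) suits semantic segmentation maps, because it never blends two class labels into a meaningless in-between value.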

Semantic bird’s-eye view

Here’s a compilation of the input side cameras, the ground truth, and the predicted top-down view (the full results can be found here):

Semantic bird’s-eye view for 7 classes: (roads, road lines), (sidewalks), (buildings), (fences, pedestrians, poles, walls, other), (vehicles), (vegetation), (none).

The rightmost image was constructed by taking an argmax of the predicted class probabilities. This approach has an annoying drawback: the vehicles seem to appear out of nowhere, even though the predicted probability for their class is actually not zero before that. The model was slightly imbalanced and I could have tried calibrating it and/or, instead of using an argmax, defining a better decision-making procedure (in terms of a hand-crafted loss matrix like the one you build in decision theory). Also, in the following subsection I’m using a better technique for visualizing the predicted class probabilities. Long story short, the visuals could have been tweaked, but I didn’t want to spend too much time on it.
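The difference between an argmax and a loss-matrix decision can be sketched in a few lines of NumPy. This is not code from the project: the three toy classes, the probabilities, and the loss values below are invented purely for illustration.

```python
import numpy as np

def decide_argmax(probs):
    """Pick the most probable class per pixel."""
    return probs.argmax(axis=-1)

def decide_min_expected_loss(probs, loss):
    """Pick, per pixel, the decision with the lowest expected loss.

    loss[t, d] is the cost of deciding class d when the true class is t.
    """
    expected = probs @ loss  # (..., C) @ (C, C) -> expected loss of each decision
    return expected.argmin(axis=-1)

# One pixel, toy classes: 0 = road, 1 = vehicle, 2 = everything else.
probs = np.array([[[0.6, 0.3, 0.1]]])

# Hypothetical loss matrix: overlooking a vehicle costs 5, other mistakes cost 1.
loss = np.array([[0.0, 1.0, 1.0],
                 [5.0, 0.0, 5.0],
                 [1.0, 1.0, 0.0]])

print(decide_argmax(probs))                   # [[0]] -> road
print(decide_min_expected_loss(probs, loss))  # [[1]] -> vehicle
```

With a plain 0-1 loss matrix the two rules agree; making a missed vehicle more expensive lets a pixel with only a 30% vehicle probability be drawn as a vehicle.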

Instead, I’ve decided to spend some time on utilizing this predicted top-down view for path planning.

Crude path planning

Now, the truth is I only need three classes to be able to drive around and not collide with others: road, vehicles, everything else. Here’s what these results look like (the full video can be found here):

Semantic bird’s-eye view for 3 classes: (roads, road lines), (vehicles), (everything else).

Again, the predicted vehicles materialize very close to our car. But because there are now three classes, instead of taking an argmax I can treat the predicted class probabilities as the three channels of an RGB image, which shows more clearly that the model does indeed predict other vehicles.
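In code, the RGB trick amounts to scaling the probabilities to the 0–255 range. The sketch below is my own illustration; in particular, which class ends up in which color channel is an assumption, and the function names are made up.

```python
import numpy as np

def probs_to_rgb(probs):
    """Map per-pixel probabilities of 3 classes, shape (H, W, 3), to a uint8 RGB image."""
    if probs.shape[-1] != 3:
        raise ValueError("expected exactly three classes in the last axis")
    return np.rint(np.clip(probs, 0.0, 1.0) * 255).astype(np.uint8)

def probs_to_argmax_image(probs):
    """Hard decision: the winning class gets full intensity, the others none."""
    winners = probs.argmax(axis=-1)
    return np.eye(3, dtype=np.uint8)[winners] * 255

probs = np.array([[[0.8, 0.2, 0.0]]])  # one pixel: mostly class 0, some class 1
soft = probs_to_rgb(probs)
hard = probs_to_argmax_image(probs)
print(soft[0, 0], hard[0, 0])
```

In the soft image the 20% probability of the second class survives as a faint tint, while the argmax image discards it entirely.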

It’s best to see what I mean:

The white dots denote the waypoints of the path planning procedure. The path is found using a greedy algorithm that scans a range of possible waypoints, takes a sphere centered on a given waypoint, and calculates the “average road” for that sphere. Then, the waypoint with the highest “average road” is chosen, and the procedure is repeated. For the implementation see this script; the function is called find_waypoints.
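For readers who want the gist without opening the script, here is a minimal sketch of such a greedy search. It is not the find_waypoints function from the repository: the function names, the step length, the radius, the range of candidate directions, and the tie-breaking rule (prefer the smallest turn) are all my assumptions, and the “sphere” becomes a disc because the road map is a 2D image.

```python
import numpy as np

def disc_mean(road, center, radius):
    """Average road probability inside a disc centered on (row, col)."""
    rows, cols = np.ogrid[:road.shape[0], :road.shape[1]]
    mask = (rows - center[0]) ** 2 + (cols - center[1]) ** 2 <= radius ** 2
    return road[mask].mean()

def greedy_waypoints(road, start, n_steps=5, step=6.0, radius=3,
                     max_turn=np.pi / 3, n_candidates=13):
    """Repeatedly move to the candidate point with the highest "average road"."""
    pos = np.asarray(start, dtype=float)
    heading = 0.0  # 0 means "up" in the image (decreasing row index)
    # Try small turns first, so ties are resolved in favor of going straight.
    turns = sorted(max_turn * np.linspace(-1.0, 1.0, n_candidates), key=abs)
    waypoints = []
    for _ in range(n_steps):
        best = None
        for turn in turns:
            angle = heading + turn
            cand = pos + step * np.array([-np.cos(angle), np.sin(angle)])
            r, c = (int(v) for v in np.rint(cand))
            if not (0 <= r < road.shape[0] and 0 <= c < road.shape[1]):
                continue  # candidate falls outside the top-down view
            score = disc_mean(road, (r, c), radius)
            if best is None or score > best[0]:
                best = (score, cand, angle)
        if best is None:
            break
        _, pos, heading = best
        waypoints.append((int(round(pos[0])), int(round(pos[1]))))
    return waypoints

# Toy map: a straight vertical road, 8 pixels wide, with the car at the bottom.
road = np.zeros((48, 48))
road[:, 20:28] = 1.0
path = greedy_waypoints(road, start=(47, 24))
print(path)
```

On this toy map the waypoints simply march up the middle of the road. On a real prediction, `road` would be the predicted road-class probability map.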

I should emphasize: these dots were added after the data collection process, which is the reason why sometimes the waypoints “can’t make up their mind” and jump from one option to a completely different one. But if the car actually followed the path, such cases would not occur.

The path planning procedure generates waypoints that can then be followed using some standard control procedure, MPC for example (check out my previous blog post to learn more about MPC).


The results for the 7-class variant look a bit trippy, I must say. The primary reason is that the input doesn’t contain all the information required for reconstructing the top-down view. Heck, the side cameras don’t even provide a full 360° sweep around the car. But also: certain classes are rare and/or form thin shapes, and I’ve noticed that these types of shapes get lost in the encoding process. U-Net, for example, handles such detailed shapes better because it uses skip connections. I don’t quite know how these skip connections would fit into my current model, but it might be worth looking into.

The 3-class variant is less wobbly, in part because the location of objects is more “inferable” from the input, and in part because these three classes are admittedly easier to predict. By “inferable” I mean: besides vehicles, there’s very little that can happen in the narrow regions not covered by any of the side cameras, so the model can easily guess what’s there.

The data set is available here and the code here, if you’d like to experiment with the architecture and, ultimately, to come up with a better architecture and results. If you encounter problems with the code, please create an Issue in the repository.

Future work

I suspect that the 7-class model might yield better results if there were additional information about how tall the objects are. So I think it makes sense to use the depth map from the top-down camera as part of the output predicted by the model.

Instead of using a single top-down camera, I’m thinking of using several, suspended at different heights. This would result in a Matryoshka-like set of outputs and should allow for predicting the road and other vehicles at greater distances than currently.

I have a hunch that a generative model might do a better job of “guessing” the parts of the output that don’t have their corresponding parts in the input. This applies to the regions not covered by the side cameras as well as those situations in which a nearby vehicle blocks the view.

Because the crude path planning procedure looks promising, I feel tempted to try out a reinforcement learning agent that would choose a path, perhaps the optimal speed, and (potentially) the coefficients of a controller. Then, the path would be followed using the controller, as is typically done, but the coefficients of the controller would be dynamically changed by the agent. I’m hoping that when operating on a higher level, the agent would have a lower chance of overfitting to atomic decisions, and would instead learn more general laws governing the environment. Sorry if this paragraph is less clear; I’ll probably devote a future blog post to explaining myself a little better.

Thus far I’ve ignored pedestrians (I’ve excluded them from the simulation), and I must either find a better way of modelling rare classes (e.g. weight their contributions to the loss function, come up with a better architecture) or emphasize their importance by synthetically enlarging their presence in the top-down view. Actually, come to think of it, making every pedestrian morbidly obese seems to be the right way to go.
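As for weighting the contributions of rare classes to the loss, a class-weighted cross-entropy is one common way to do it. Below is a minimal NumPy sketch of the idea; it is not something I’ve tried in this project, and the function name, the normalization by the sum of weights, and the weight values are assumptions.

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights, eps=1e-7):
    """Cross-entropy in which every pixel is weighted by the weight of its true class.

    probs: (..., C) predicted class probabilities,
    labels: (...) integer class indices,
    class_weights: (C,) one weight per class.
    """
    class_weights = np.asarray(class_weights, dtype=float)
    probs = np.clip(probs, eps, 1.0)
    # Probability the model assigned to the true class of each pixel.
    picked = np.take_along_axis(probs, labels[..., None], axis=-1)[..., 0]
    weights = class_weights[labels]
    return float(np.sum(-weights * np.log(picked)) / np.sum(weights))

# Two pixels, two classes; class 1 plays the role of the rare class.
probs = np.array([[0.9, 0.1],
                  [0.6, 0.4]])
labels = np.array([0, 1])

uniform = weighted_cross_entropy(probs, labels, [1.0, 1.0])
boosted = weighted_cross_entropy(probs, labels, [1.0, 10.0])
print(uniform, boosted)
```

With the rare class weighted 10 times higher, the poorly predicted second pixel dominates the loss, so the optimizer has more incentive to get such pixels right.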


In this work the authors solve a similar problem but using homography:

[1] Deng, L., Yang, M., Li, H., Li, T., Hu, B., & Wang, C. (2018). Restricted deformable convolution based road scene semantic segmentation using surround view cameras. arXiv preprint arXiv:1801.00708
