Radar Data — Architectures and Ensembling

Shaul Solomon
Nov 25, 2020 · 14 min read

This is the 6th article in our MAFAT competition series, where we give an in-depth look at the different aspects of the challenge and our approach to it. Take a look at the posts covering the introduction, the dataset, augmentations, visualizing signal data, and streaming pipelines.

You’ve made it this far — Congratulations!

With the data preprocessed, augmented, and pipelined, we are ready to feed it into our Neural Network. But which architecture should we use?

Applying the Fourier Transform to the IQ matrix and inserting the Doppler burst matrix gave us a very useful spectrogram (the logic is identical for the scalograms). If we take the spectrogram/scalogram at face value as an image that reflects the movement of the unknown object, applying a CNN model would be the first reasonable architecture “family” to try. (The go-to strategy in Data Science projects is to start with the simpler models, the “low-hanging fruit”, and work your way up from there.)

For all of the subsequent models we use:
Loss : Binary Cross-Entropy
Optimizer : Adam
Metric: AUC
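In code, that shared setup looks roughly like this (a minimal PyTorch sketch; the model, batch shape, and learning rate here are placeholders, not our exact values):

```python
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score

# Placeholder model; any of the architectures below slots in here.
model = nn.Sequential(nn.Flatten(), nn.Linear(126 * 32, 1), nn.Sigmoid())

criterion = nn.BCELoss()                                  # Binary Cross-Entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on a random batch.
x = torch.randn(8, 1, 126, 32)                            # assumed spectrogram shape
y = torch.tensor([[0.], [1.], [0.], [1.], [0.], [1.], [0.], [1.]])
optimizer.zero_grad()
pred = model(x)
loss = criterion(pred, y)
loss.backward()
optimizer.step()

# AUC is computed on the raw probabilities, not thresholded labels.
auc = roc_auc_score(y.numpy().ravel(), pred.detach().numpy().ravel())
```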

The Base Model

The first basic model was actually provided by the creators of the competition themselves: a relatively simple CNN with two Convolutional layers followed by three Fully-Connected layers. (Written below in Keras/TF)

# Taken from the code given to us
def create_model(input_shape, init):
    """CNN model: two Conv2D layers followed by three Dense layers."""
    ...
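The snippet above is truncated; here is a hedged reconstruction of such a baseline (the layer sizes are our illustrative guesses, not the competition's exact values):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def create_baseline(input_shape=(126, 32, 1)):
    """Two Conv2D blocks followed by three Dense layers,
    mirroring the structure of the provided baseline (sizes are guesses)."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation='relu'),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation='relu'),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dense(32, activation='relu'),
        layers.Dense(1, activation='sigmoid'),  # probability output for BCE
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=[tf.keras.metrics.AUC()])
    return model
```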

While the model performed extremely well (.94 AUC) on the validation set, it didn’t do very well on the test data (~.74 AUC), which seemed to imply that it wasn’t learning the more generalized features. In such a case the two main approaches are to either increase the data size (we had already exhausted the straightforward solutions to that issue) or to build a more complex model.

Small AlexNet

While there are many pretrained models for image classification, we didn’t want to use a pretrained model (because our style of image differs from classic datasets like CIFAR-10), and we didn’t want to jump straight to the largest models (ResNet/SENet), both because of the principle mentioned above about slowly building up complexity and because our limited dataset could not support training too large an architecture.

Our first choice was the smallest established image-classification model, AlexNet, but even then we wanted a smaller version. So we took the exact same architecture but halved the number of neurons in each layer. (Written in PyTorch)

The only real change needed was to the final layer: instead of classifying 1,000 classes, it outputs a single score. Since we are also using BCE, the final activation is the sigmoid function, which bounds the score between [0,1].

Layer (type) Output Shape Param #
Conv2d-1 [-1, 32, 62, 15] 1,600
ReLU-2 [-1, 32, 62, 15] 0
MaxPool2d-3 [-1, 32, 31, 8] 0
Conv2d-4 [-1, 128, 31, 8] 102,528
ReLU-5 [-1, 128, 31, 8] 0
MaxPool2d-6 [-1, 128, 16, 4] 0
Conv2d-7 [-1, 256, 9, 3] 295,168
ReLU-8 [-1, 256, 9, 3] 0
MaxPool2d-9 [-1, 256, 5, 2] 0
Conv2d-10 [-1, 128, 4, 2] 295,040
ReLU-11 [-1, 128, 4, 2] 0
MaxPool2d-12 [-1, 128, 2, 1] 0
Conv2d-13 [-1, 128, 2, 2] 147,584
ReLU-14 [-1, 128, 2, 2] 0
MaxPool2d-15 [-1, 128, 1, 1] 0
AdaptiveAvgPool2d-16 [-1, 128, 6, 6] 0
Dropout-17 [-1, 4608] 0
Linear-18 [-1, 4096] 18,878,464
ReLU-19 [-1, 4096] 0
Dropout-20 [-1, 4096] 0
Linear-21 [-1, 4096] 16,781,312
ReLU-22 [-1, 4096] 0
Linear-23 [-1, 1] 4,097
Total params: 36,505,793
Trainable params: 36,505,793
Non-trainable params: 0
Input size (MB): 0.02
Forward/backward pass size (MB): 1.44
Params size (MB): 139.26
Estimated Total Size (MB): 140.71

While the val score was lower, the final test AUC increased to ~ 0.769 — SUCCESS.

AlexNet

Seeing the improvement, we decided to test the data on the regular AlexNet:

class alex_mdf_model(nn.Module):
    # Standard AlexNet with the classifier head swapped for a
    # single sigmoid output (full definition omitted).
    ...

Running it gave us 0.799 AUC.
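A hedged sketch of such an AlexNet-style classifier with a single sigmoid output (the layer widths, kernel sizes, and input shape here are illustrative, not our exact configuration):

```python
import torch
import torch.nn as nn

class AlexNetRadar(nn.Module):
    """AlexNet-style feature extractor + three FC layers,
    ending in one sigmoid-activated neuron for BCE."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 192, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(192, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 1),              # 1 output instead of 1000 classes
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        return torch.sigmoid(self.classifier(x))
```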

While we were making progress with the CNNs, they had an inherent limitation that we hoped a TCN model would address.

Temporal Convolutional Networks

Before we get into what Temporal Convolutional Networks are, it seems important to stress why there was a need to evolve beyond the classic CNN architecture.

Very simply put, while treating our spectrogram as an image was a good approximation for extracting crucial information, the spectrogram is also a reflection of time-series data, a structure that plain CNNs do not explicitly capture.

Classically, for time-series data we would use an RNN, but because this is both a form of image problem (the information is spatially temporal) and a time series, we would need a combined model that feeds the output of a CNN into an RNN.

We wanted to avoid heading there (at least initially) because RNNs alone are much harder to train than CNNs and require a lot more data, and on top of that we would need to engineer the CNN-to-RNN combination itself.

So instead, we wanted to use a TCN to mimic the kind of results we would get from the CNN + RNN model.

TCN Architecture

Lea et al. 2016

From the highest-level view, a TCN is a CNN model with a causal convolution layer (a 1-D conv layer) appended to the end that is intended to mimic an RNN by computing each neuron’s value based on the values of previous neurons.

In order to better capture the time-sensitive information, the TCN incorporates two techniques:

  1. The convolutions are causal, meaning they can only look back in time (no “leakage”). As can be seen in the image above, each convolution can only take in information from “older” previous states.
  2. They use Dilated Convolutional Layers. Dilation is a heuristic for capturing the necessary information with fewer parameters. In the TCN, we create a series of blocks, each with a larger dilation factor. Similar to the classic idea of stacking convolutions to widen the receptive field with fewer parameters (two stacked 3x3 conv filters cover a 5x5 receptive field with 18 parameters, versus 25 parameters for a single 5x5 filter), increasing the dilation lets us cover a much wider temporal field.
Multi-Scale Context Aggregation by Dilated Convolutions 2016
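To make the causality and dilation ideas concrete, here is a hedged sketch of a causal, dilated 1-D convolution in PyTorch: left-padding by (kernel_size - 1) * dilation ensures each output only sees past time steps (the channel counts and depth are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that only sees the current and past time steps."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation        # left-pad so no future leaks in
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                              # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))      # pad only on the left

# Stacking blocks with dilations 1, 2, 4, 8 roughly doubles
# the temporal receptive field at each level.
net = nn.Sequential(*[CausalConv1d(1, 3, 2 ** i) for i in range(4)])
out = net(torch.randn(1, 1, 32))                       # shape preserved: (1, 1, 32)
```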

However, we could not just take the out-of-the-box TCN model for a few main reasons:

  1. The original model was created for video segmentation, which means it output as many neurons as it was given as input.

To resolve this we applied a final 1-D conv layer on the final output to give us a single output neuron.

  2. While dilation is a good heuristic in general, in our specific case we wanted to increase dilation only along the time axis. This let us explore larger time scales with fewer parameters while ensuring that at each time step the model had access to all of the data, so it would not miss potentially important information.

Our code was rewritten based on the code from the locuslab GitHub repo (found here):

# Because we want the model to reflect time-series data we want to mask 
# any information past the current time-stamp.

While every neural network model has hyper-parameters that need tuning, for a TCN two are crucial: the kernel size and the number of layers.

For a TCN with n residual blocks of kernel size k (and with the dilation doubling at each block), the receptive field is 1 + 2(k - 1)(2^n - 1).

So to cover our full input sequence, it would be five layers with kernel_size 3 (receptive field 125), or four layers at kernel_size 5 (receptive field 121).

We chose four layers [16,32,32,64] with kernel_size 5.
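A quick sanity check of that choice (assuming, as in the standard TCN, two causal convolutions per residual block and a dilation that doubles each block):

```python
def tcn_receptive_field(n_blocks, kernel_size):
    # two causal convs per block, with dilations 1, 2, ..., 2**(n_blocks - 1)
    return 1 + 2 * (kernel_size - 1) * (2 ** n_blocks - 1)

print(tcn_receptive_field(5, 3))  # 125 time steps
print(tcn_receptive_field(4, 5))  # 121 time steps
```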

While it took much longer to train than the CNN models (it is much deeper) it did exceedingly well on the Train/Val data — with .9912 and .9566 AUC scores respectively.

However, on the Test data it got a score of .7874 — similar to the CNN model.

It seemed that for the time being, the model was complex enough and that further improvements to the score would have to come from other directions.

However as discussed in the Ensemble section — we ultimately used both the CNN and TCN together to improve our final score.

To read more about Temporal Convolutional Networks:
https://medium.com/@raushan2807/temporal-convolutional-networks-bfea16e6d7d2
https://medium.com/the-artificial-impostor/notes-understanding-tensorflow-part-3-7f6633fcc7c7

Ensembling methods

Close your eyes for a moment and imagine the scenario:
The competition deadline is only one day away, and we have four somewhat equally successful models to choose from — how do we pick which one to use?

Well we don’t — we want to take all of them.

One of the classic techniques that nearly all winners of Kaggle / data science competitions use is ensembling: combining the predictive powers of several models.

There are actually many different ways to combine models, divided into two main categories: bagging and boosting.

Taken from https://howtolearnmachinelearning.com/articles/boosting-in-machine-learning/


The basic intuition is that different models will be accurate in different areas of the data, so collecting their “opinions” and weighing them helps you get the best of all the options.

However, unlike a democracy, we don’t want all predictions to carry the same significance (weight); we would like to give more weight to the more accurate models. While the methodology for choosing the weights can vary, all these schemes do the same basic task: combining the outputs of several models.


Boosting takes that basic intuition and raises it a level. Instead of training each model in parallel, we train Model B to be sensitive to correctly labeling the examples where Model A was unsuccessful. Each new model is thus trained sequentially, not just on the data but on the needs of the previous models.
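We did not use boosting ourselves, but as a hedged illustration, scikit-learn's AdaBoost implements exactly this reweighting loop (the toy dataset here stands in for real features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Toy binary-classification data standing in for real per-segment features.
X, y = make_classification(n_samples=500, random_state=0)

# AdaBoost's default base learner is a depth-1 decision tree ("stump");
# each new stump is fit to a reweighted dataset that emphasizes the
# examples the previous stumps misclassified.
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

proba = clf.predict_proba(X)[:, 1]   # ensemble scores in [0, 1]
```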

For a more lengthy explanation on the various ensemble methods, check out this great article: https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205

The Final Stretch

While boosting had been used in combination with Neural Networks (Schwenk, Bengio), we already had the trained models and the clock was ticking.

We decided to try three types of bagging methods.

We first took the public test data as our validation, and wanted to see how the various methods would perform.


  1. Individual Metrics

We wanted to run each of the models independently to see what score they would get on the validation dataset as a baseline.
The models all hovered around .77 AUC Score on the public test set.

2. Arithmetic Mean

We took the arithmetic mean of each of the y_predictions. As each of the three models predicted between [0,1] our arithmetic mean was also between [0,1].

The Arithmetic Mean brought our val score to .80 AUC Score — a great improvement for so little work.

3. Weighted Mean

Taking the weighted mean based on their individual scores gives more weight to the better performing models. In our case, each of the models scored similarly, so the weighted-mean accuracy was almost identical to the arithmetic-mean accuracy.
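The two averaging schemes in NumPy (the prediction values and per-model AUC weights below are illustrative, not our real numbers):

```python
import numpy as np

# Per-model predicted probabilities for the same three samples (made-up values).
preds = np.array([
    [0.80, 0.30, 0.55],   # model A
    [0.70, 0.20, 0.65],   # model B
    [0.90, 0.40, 0.45],   # model C
])

# Arithmetic mean: every model gets an equal vote.
arithmetic = preds.mean(axis=0)

# Weighted mean: weight each model by its individual validation AUC (made-up).
aucs = np.array([0.77, 0.76, 0.78])
weights = aucs / aucs.sum()          # normalize so the output stays in [0, 1]
weighted = weights @ preds
```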

4. Logistic Regression

We wanted to see if there was a better linear relationship between the model scores than one based only on their individual AUCs, so we trained a Logistic Regression model (we took our val data and split it into two: val_train and val_test). Conveniently, Logistic Regression output is already bounded between [0,1].

Using Logistic Regression we raised our score to .83 AUC Score on the public test set!
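A hedged sketch of that stacking step with scikit-learn (synthetic predictions stand in for our real model outputs; the split mirrors our val_train/val_test approach):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins: three models' probabilities for 400 validation samples.
y_val = rng.integers(0, 2, 400)
model_preds = np.stack(
    [np.clip(y_val + rng.normal(0, 0.4, 400), 0, 1) for _ in range(3)],
    axis=1,
)                                                 # shape (400, 3)

# Split the validation set so the meta-model is scored on unseen data.
X_tr, X_te, y_tr, y_te = train_test_split(model_preds, y_val,
                                          test_size=0.5, random_state=0)

# The meta-model learns how to weigh the three base models' outputs.
meta = LogisticRegression().fit(X_tr, y_tr)
ensemble_proba = meta.predict_proba(X_te)[:, 1]   # already bounded in [0, 1]
```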

(Usually, when comparing methods against each other, you need to run them on the exact same dataset; for the Logistic Regression model we had to split the data into its own train/val, so the comparison is not perfectly consistent with the other models.

However, because we were able to submit two final predictions, we took the output of the Logistic Regression and the weighted mean independently as our two submissions.)

Key Takeaways


  1. Always start with a simple model and slowly increase complexity.
  2. General heuristics are a good place to start (looking at the spectrogram/scalogram as an image).
  3. Follow the wisdom of the crowds! Even combining simpler models with more complex ones will very likely produce much better results than any single model alone.
  4. Often enough, model complexity isn’t enough. Due to the time constraints we weren’t able to spend more time dealing with the label imbalance or the low-SNR examples, but if we had more time, that is where we would invest our effort.

A hearty Mazel Tov for getting through all six articles!
You are ready to get out there and experiment with your own Radar data.

We hope you found these articles helpful and if in your own exploration you found something nifty, please share.

Gradient Ascent

Learning and sharing on the path to Machine Learning mastery
