
The MAFAT Dataset — A Closer Look

Adam Cohn
Nov 17, 2020 · 5 min read

This is the 2nd article in our MAFAT Radar competition series, where we take an in-depth look at the different aspects of the challenge and our approach to it. If you want a recap, check out this post.

Let’s jump straight in.

The competition organizers give a clear explanation of the data they provide:

The dataset consists of signals recorded by ground doppler-pulse radars. Each radar “stares” at a fixed, wide area of interest. Whenever an animal or a human moves within the radar’s covered area, it is detected and tracked. The dataset contains records of those tracks. The tracks in the dataset are split into 32 time-unit segments. Each record in the dataset represents a single segment. A segment consists of a matrix with I/Q values and metadata. The matrix of each segment has a size of 32x128. The X-axis represents the pulse transmission time, also known as “slow-time”. The Y-axis represents the reception time of signals with respect to pulse transmission time divided into 128 equal sized bins, also known as “fast-time”. The Y-axis is usually referred to as “range” or “velocity”

The following datasets were provided:

  • 5 CSV files containing the metadata (Training set, Public Test set, and 3 Auxiliary set files),
  • 5 pickle files (Python’s serialized-object format) containing, for each segment, the I/Q matrix of slow/fast-time readings and a doppler reading that tracks the object’s center of mass.
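
To make the format concrete, here is a minimal sketch of loading one of these pickle files. The field names (`segment_id`, `iq_sweep_burst`, `doppler_burst`) follow the competition’s published schema but should be treated as assumptions here, and the data is random rather than a real recording:

```python
import io
import pickle

import numpy as np

rng = np.random.default_rng(0)

# A toy dict mimicking one of the provided pickle files: arrays aligned
# by segment index, one 32 x 128 complex I/Q matrix per segment.
fake_file = {
    "segment_id": np.arange(3),
    "iq_sweep_burst": rng.standard_normal((3, 32, 128))
    + 1j * rng.standard_normal((3, 32, 128)),
    "doppler_burst": rng.integers(0, 128, size=(3, 32)),
}

# Round-trip through pickle to mimic reading a real .pkl from disk.
buf = io.BytesIO()
pickle.dump(fake_file, buf)
buf.seek(0)
data = pickle.load(buf)

print(data["iq_sweep_burst"].shape)  # (3, 32, 128)
```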

The Auxiliary datasets consisted of:

  • An Auxiliary “Experiment” Dataset of human-only labeled recordings; these were captured in a controlled environment, which doesn’t necessarily reflect a “natural” recording.
  • An Auxiliary “Synthetic” Dataset with low SNR segments that were created by transforming the high SNR signals from the train set.
  • An Auxiliary “Background” Dataset — Segments that were recorded by a sensor in parallel to segments with tracks but at a different range. These segments contain the recorded “noise.” Each segment also contains a field mapping to the original High or Low SNR track id.

Braden Riggs & George Williams from GSI Technology — SPOILER ALERT: they were the winning team — wrote a very thorough post at the start of the competition, providing a great overview of the dataset and key insights into the challenges it posed. We’ll give a summary below, and for those who want to read the whole thing, it’s available here:

Data Description

The radar data had a few different important characteristics worth explaining:

Signal-to-Noise Ratio (SNR)

The SNR refers to the quality of the signal that produced the data, i.e. the degree to which the signal was generated by the movement of the target as opposed to some other internal or external noise-generating process, such as the weather or the inherent noise of the machine.

I/Q Matrix

An I/Q Matrix is an N x M matrix with complex values, in our case 32 x 128. The real and imaginary parts derive from the amplitude and phase components of the doppler radar reading. In short, even though the radar is picking up a very complicated wave, it can still be described using only the amplitude and phase of two sinusoidal signals in quadrature. For a good explanation, read this more lengthy description. Each row corresponds to a “slow-time” radar pulse, while each column is a point in the “fast-time” reading of the reflected signal, which corresponds to the distance from the radar.
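
The amplitude/phase decomposition above can be sketched in a few lines of NumPy. This is a toy with random data, not the competition’s actual preprocessing; the spectrogram step at the end is a standard radar-processing view (an FFT along slow-time converts pulse-to-pulse phase shifts into velocity bins), shown here only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# A toy 32 x 128 complex I/Q matrix standing in for one segment:
# rows = slow-time pulses, columns = fast-time (range) bins.
iq = rng.standard_normal((32, 128)) + 1j * rng.standard_normal((32, 128))

# The complex value in each cell encodes the amplitude and phase of the return.
amplitude = np.abs(iq)
phase = np.angle(iq)

# A common view of such data is a doppler spectrogram: an FFT along the
# slow-time axis, shifted so zero frequency sits in the middle.
spectrogram = np.fft.fftshift(np.abs(np.fft.fft(iq, axis=0)), axes=0)
print(amplitude.shape, spectrogram.shape)  # (32, 128) (32, 128)
```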

If you want to do a deep dive — MIT has a lecture series just for the courageous few:

Doppler burst

The doppler burst reading is a vector indicating the location of the “center of mass” for each slow-time radar pulse.
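
A minimal sketch of what such a vector looks like and one crude, hypothetical use of it (the per-pulse length of 32 and the feature itself are assumptions for illustration, not the competition’s approach):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy doppler-burst vector: one center-of-mass bin index per slow-time pulse.
doppler_burst = rng.integers(40, 60, size=32)

# An illustrative hand-crafted feature: how far the center of mass
# wanders over the course of the segment.
spread = int(doppler_burst.max() - doppler_burst.min())
print(doppler_burst.shape, spread)
```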

Segments vs Tracks

The data was originally recorded in tracks, which did not have a set time length. However, the tracks were split into 32 time-unit segments, and we needed to predict a classification from a single segment. While in the train data we were given the track id for each segment (and therefore could theoretically re-stitch the track together), in the test data we did not know the track ids and therefore couldn’t rely on a longer timeframe for prediction.
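
The re-stitching idea amounts to a simple group-by on the metadata. A sketch with toy data (the column names mirror the competition CSVs but are assumptions here):

```python
import pandas as pd

# Toy metadata in the shape of the training CSV.
meta = pd.DataFrame({
    "segment_id": [0, 1, 2, 3, 4],
    "track_id": [10, 10, 10, 11, 11],
    "target_type": ["human", "human", "human", "animal", "animal"],
})

# In the train set, segments sharing a track_id can be grouped back into
# their original track; the test metadata gives no track_id, so each
# segment must be classified on its own.
segments_per_track = meta.groupby("track_id")["segment_id"].apply(list)
print(segments_per_track.to_dict())  # {10: [0, 1, 2], 11: [3, 4]}
```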

Challenges

The small size of the training data

The training set consisted of only 6656 segments, while the test set had 106 segments. To put that into perspective, the CIFAR-10 dataset has 60,000 images, and the ImageNet dataset has over 14 million. In short, we’d need to generate a lot more data if we wanted to use any deep learning algorithm as a classifier.

Signal to Noise Ratio Imbalance

There was a 1.7:1 ratio of Low SNR to High SNR segments in the train set. Not only was the SNR inconsistent across segments; the overwhelming majority (~2/3) of them were extremely noisy.

In the test set, the Low SNR to High SNR ratio was much more balanced, closer to 1:1.
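
Checking such a ratio is a one-liner on the metadata. A sketch with toy counts chosen to reproduce the 1.7:1 figure (the `snr_type` column name and label spellings are assumptions):

```python
import pandas as pd

# Toy metadata column standing in for the training CSV's SNR labels.
train = pd.DataFrame({"snr_type": ["LowSNR"] * 17 + ["HighSNR"] * 10})

counts = train["snr_type"].value_counts()
ratio = counts["LowSNR"] / counts["HighSNR"]
print(f"{ratio:.1f}:1")  # 1.7:1
```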

Class Imbalance

The majority of segments/tracks in the training data were animals, which would inevitably bias any model towards predicting animals. Again, in the test set the ratio of labels was more balanced.

The Scoring Metric

To quote the official website:

Submissions are evaluated on the Area Under the Receiver Operating Characteristic Curve (ROC AUC) between the predicted probability and the observed target as calculated by roc_auc_score in scikit-learn (v 0.23.1).
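
A minimal example of the metric itself, with made-up labels and scores (assuming, per the challenge, 1 = human as the positive class); note that ROC AUC depends only on how the scores rank the two classes, not on their absolute values:

```python
from sklearn.metrics import roc_auc_score

# Toy ground-truth labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (negative, positive) pairs are ranked correctly -> AUC = 0.75.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```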

If you’re unfamiliar with ROC AUC — check out this article.

Conclusion

The next article in the series will outline how we dealt with the primary limitation we saw, namely the limited amount of training examples, by going deeper into the data augmentation techniques we utilized for this challenge. Stay tuned!

Gradient Ascent

Learning and sharing on the path to Machine Learning mastery

Adam Cohn

Written by

Adam Cohn

Love working at the intersection of Data, Business & Code. Fascinated by AI, Philosophy, Strategy & History. Fear is the mind-killer

Gradient Ascent

We’re a bunch of people who like doing Data Science projects and writing about them. It’s partially for self-promotion, but mostly because we’re pretty stoked about what we did and want to share it with y’all
