Building a Streaming Dataset in Pytorch

Adam Cohn
Nov 24, 2020 · 6 min read

This is the 5th article in our MAFAT Radar competition series, where we take an in-depth look at the different aspects of the challenge and our approach to it. If you want a recap, check out previous posts: the introduction, the dataset, augmentations, and visualizing signal data.

When dealing with a Supervised Machine Learning task one of the most important things you need is data — and lots of it. What does one do when faced with a small amount of data, especially for a task that requires Deep Neural Networks? And how does one create a fast and efficient data pipeline for generating more data that will enable training of Deep Neural Networks without spending hundreds of dollars on expensive cloud GPU units?

These were some of the issues that we faced during the MAFAT Radar Classification competition. My teammate hezi hershkovitz wrote a great post on the augmentations we implemented to generate more training data, as well as our first attempt at a data loader for generating them on the fly. You can read it here.

Issues to Address

Our first data loader worked, but it had several problems:

  • It didn’t take advantage of the fast vectorized operations available in Python via Numpy and Pandas.
  • The information needed for each shift was first written and stored as a dictionary and then accessed in the __getitem__ method using Python for-loops, leading to slow iterations and processing.
  • Generating a “shifted” segment from a track resulted in the same track being reconstructed each time a new segment was retrieved, which also slowed down the pipeline.
  • The pipeline hadn’t generalized the transformation enough to be able to handle either 2D or 3D inputs, which was necessary since we were experimenting with both scalograms and spectrograms, as you can read about here.
  • If we simply generated all the shifts and flips at the moment a batch was requested, each batch would be saturated with examples that are too similar to one another, preventing the models from generalizing well.

The core reason for these inefficiencies was that the pipeline treated individual segments as the basic unit of work, rather than operating on whole tracks.
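To make that concrete, here is a toy illustration (the array shapes and the dB transform are placeholders, not code from our pipeline): applying a transform segment by segment means a Python loop, while concatenating the track first lets Numpy do the identical work in one vectorized call.

import numpy as np

# Toy "track" made of 5 segments; the shapes are illustrative only.
segments = [np.random.randn(32, 128) + 1j * np.random.randn(32, 128) for _ in range(5)]

# Segment-as-unit: a Python loop that converts each segment to dB separately.
per_segment = [20 * np.log10(np.abs(s)) for s in segments]

# Track-as-unit: concatenate along the time axis once, then apply the same
# element-wise transform to the whole track in a single vectorized call.
track = np.concatenate(segments, axis=1)          # shape (32, 5 * 128)
whole_track = 20 * np.log10(np.abs(track))

# Same values either way, but the second form avoids Python-level iteration
# and scales much better once shifting and slicing are added on top.
assert np.allclose(np.concatenate(per_segment, axis=1), whole_track)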

Data Format Recap

The image above comes from hezi hershkovitz's aforementioned article and shows one complete track from the training dataset, formed by combining all of its segments. The red rectangles mark the separate segments that make up this track, and the white dots are the ‘doppler burst’, which represents the center of mass of the tracked object.

With the help of the ‘doppler burst’ white dots, we can quite easily see that the track is composed of adjoining segments, i.e. segment id 1942 is followed by 1943, then 1944, and so on.

The fact that the segments are next to each other allows us to use shifts in order to create “new” samples.
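To make the idea concrete, here is a simplified sketch of the windowing logic (the segment length, step size, and array shapes are placeholders; in our pipeline the step is controlled by the shift_segment key of the Config shown below):

import numpy as np

def shifted_segments(track, seg_len=32, step=2):
    # Slide a segment-sized window along the time axis of a concatenated
    # track, yielding one overlapping "shifted" segment per step.
    n_timesteps = track.shape[1]
    for start in range(0, n_timesteps - seg_len + 1, step):
        yield track[:, start:start + seg_len]

# Two adjacent 32-step segments concatenated into one 64-step track
track = np.concatenate([np.random.randn(128, 32), np.random.randn(128, 32)], axis=1)

# Two original segments become (64 - 32) / 2 + 1 = 17 overlapping samples
print(sum(1 for _ in shifted_segments(track)))    # 17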

However, since each track consists of a different number of segments, a varying number of augmentations would be generated from any given track, which prevented us from using the regular map-style Pytorch Dataset class. Instead, we relied on the Pytorch IterableDataset class to generate a stream of data from each track.

Pipeline Design

To enable that, we created:

  1. A Config class that would hold all of the necessary hyperparameters and environment variables for a particular experiment - this is really just a simple dictionary with predefined keys.
  2. A DataDict class that handles loading the original segments, validating each track, creating sub-tracks to prevent data leakage, and transforming the data into the right format (i.e. 2D or 3D) to prepare it for the augmentations.
  3. A StreamingDataset class, a subclass of the Pytorch IterableDataset, which handles the augmentations and streams segments to the models.
config = Config(file_path=PATH_DATA,
                num_tracks=3,
                valratio=6,
                get_shifts=True,
                output_data_type='spectrogram',
                get_horizontal_flip=True,
                get_vertical_flip=True,
                mother_wavelet='cgau1',
                wavelet_scale=3,
                batch_size=50,
                tracks_in_memory=25,
                include_doppler=True,
                shift_segment=2)

dataset = DataDict(config=config)
train_dataset = StreamingDataset(dataset.train_data, config, shuffle=True)
train_loader = DataLoader(train_dataset, batch_size=config['batch_size'])

DataDict Implementation

We managed to do this using a number of tricks and neat features in Numpy and Pandas, leaning heavily on boolean matrices for the validation and applying the scalogram/spectrogram transformations to the concatenated segments in a track. The code is too long to share here, but you can go to the repo and check out the DataDict create_track_objects method.
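To give a flavor of the boolean-matrix trick, here is a simplified illustration (the column names and values are made up; this is not the actual create_track_objects code). Segments that belong together should share their metadata and have consecutive ids, and comparing each row against its neighbour with vectorized Numpy operations finds every break point in one pass, with no Python loop over segments:

import numpy as np
import pandas as pd

# Illustrative segment metadata; the schema is a placeholder, not the
# exact competition format.
meta = pd.DataFrame({
    'track_id':    [0, 0, 0, 1, 1, 2],
    'segment_id':  [1942, 1943, 1944, 2001, 2003, 3100],
    'target_type': ['animal', 'animal', 'animal', 'human', 'human', 'human'],
})

# Boolean masks, each computed in a single vectorized pass over the columns
same_track  = meta['track_id'].values[1:] == meta['track_id'].values[:-1]
consecutive = np.diff(meta['segment_id'].values) == 1
same_target = meta['target_type'].values[1:] == meta['target_type'].values[:-1]

# A segment continues its track only where all conditions hold; every other
# position starts a new sub-track, which prevents leakage across gaps.
valid_continuation = same_track & consecutive & same_target
sub_track_id = np.concatenate([[0], np.cumsum(~valid_continuation)])
print(sub_track_id)   # [0 0 0 1 2 3] -- 2001 and 2003 are not adjacent, so track 1 splits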

Generating a Stream of Segments

Pytorch Iterable Dataset

Generating streaming datasets is exactly what the IterableDataset class is for. The difference between it and the classic map-style Dataset class in Pytorch is that instead of implementing a __getitem__ method that maps an index to an item in your dataset, you implement an __iter__ method; the DataLoader then keeps calling next() on that iterator until it has built a full batch.
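A minimal, heavily simplified sketch of the pattern (this toy class stands in for our StreamingDataset; the shapes and windowing values are placeholders):

import numpy as np
import torch
from torch.utils.data import IterableDataset, DataLoader

class ToySegmentStream(IterableDataset):
    # Walks over (label, track) pairs and yields one segment-sized window
    # at a time; the DataLoader keeps pulling until a batch is full.
    def __init__(self, tracks, seg_len=32, step=2):
        self.tracks = tracks
        self.seg_len = seg_len
        self.step = step

    def __iter__(self):
        for label, track in self.tracks:
            for start in range(0, track.shape[1] - self.seg_len + 1, self.step):
                segment = track[:, start:start + self.seg_len]
                yield torch.from_numpy(segment).float(), label

tracks = [(0, np.random.randn(128, 96)), (1, np.random.randn(128, 64))]
loader = DataLoader(ToySegmentStream(tracks), batch_size=8)
batch, labels = next(iter(loader))
print(batch.shape, labels.shape)    # torch.Size([8, 128, 32]) torch.Size([8])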

For a great tutorial on working with the IterableDataset class check out this article:

Creating Good Batches
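This is where the tracks_in_memory setting in the Config comes into play. If we streamed one track at a time, a batch of 50 would consist of 50 near-identical shifts of the same object, which is exactly the saturation problem described earlier. Keeping a handful of tracks open at once and drawing the next segment from a randomly chosen one keeps every batch mixed. The sketch below illustrates the idea only; it is not our exact implementation:

import random

def interleaved_stream(track_iterables, tracks_in_memory=25, seed=42):
    # Keep up to `tracks_in_memory` per-track generators open and pick the
    # next segment from a random one, so consecutive samples rarely come
    # from the same track. Illustrative sketch only.
    rng = random.Random(seed)
    track_iterables = list(track_iterables)
    open_tracks = [iter(t) for t in track_iterables[:tracks_in_memory]]
    pending = track_iterables[tracks_in_memory:]
    while open_tracks:
        chosen = rng.randrange(len(open_tracks))
        try:
            yield next(open_tracks[chosen])
        except StopIteration:
            # This track is exhausted: swap in the next unopened track, if any.
            if pending:
                open_tracks[chosen] = iter(pending.pop(0))
            else:
                open_tracks.pop(chosen)

Feeding a stream like this into the __iter__ of the IterableDataset means the DataLoader can batch sequentially and still get a good mix of tracks, shifts, and flips in every batch.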

Parallelization
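One caveat with IterableDataset is that when the DataLoader runs with num_workers > 0, every worker process gets its own copy of the dataset, so a naive __iter__ would emit each sample once per worker. The standard remedy from the Pytorch documentation is to call torch.utils.data.get_worker_info() inside __iter__ and let each worker handle only its own share of the tracks. A sketch of that pattern (the strided split is illustrative, not our exact code):

from torch.utils.data import IterableDataset, get_worker_info

class WorkerAwareStream(IterableDataset):
    # Splits the track list across DataLoader workers so that no segment is
    # produced more than once when num_workers > 0.
    def __init__(self, tracks):
        self.tracks = tracks

    def __iter__(self):
        worker = get_worker_info()
        if worker is None:
            my_tracks = self.tracks              # single-process loading
        else:
            my_tracks = self.tracks[worker.id::worker.num_workers]
        for track in my_tracks:
            yield from track                     # each track is an iterable of segments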

Conclusion

Working with streaming data in Pytorch was a great learning experience and a nice programming challenge. Sometimes changing the conceptual level of organization is necessary to unlock a more efficient way of working with your data.

True to the adage, 80% of our time was indeed spent on data cleaning and pipelines. However, instead of viewing the pipeline as a necessary evil to be dealt with before diving into the “fun” 20% of modeling, we should view pipelines and processing as an equally stimulating and crucial part of the project, and a practical one: the faster the pipeline, the more experiments you can run.
