Building a Streaming Dataset in PyTorch

Published in

Gradient Ascent

6 min read · Nov 24, 2020

This is the 5th article in our MAFAT Radar competition series, where we take an in-depth look at the different aspects of the challenge and our approach to it. If you want a recap, check out previous posts: the introduction, the dataset, augmentations, and visualizing signal data.

When dealing with a supervised machine learning task, one of the most important things you need is data, and lots of it. What do you do when faced with a small amount of data, especially for a task that requires deep neural networks? And how do you build a fast, efficient data pipeline that generates more data for training those networks without spending hundreds of dollars on expensive cloud GPU instances?

These were some of the issues that we faced during the MAFAT Radar Classification competition. My teammate Hezi Hershkovitz wrote a great post on the augmentations we implemented to generate more training data, as well as our first attempt at a data loader for generating them on the fly. You can read it here.

Issues to Address

However, the data pipeline suffered from a few issues, primarily with regard to speed and efficiency:

It didn’t take advantage of the fast vectorized operations available in Python via NumPy and pandas
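To make the vectorization point concrete, here is a minimal sketch comparing a pure-Python loop with its NumPy equivalent. The per-segment normalization, the function names, and the random `segments` array are all assumptions chosen for illustration; this is not the actual competition pipeline.

```python
import numpy as np


def normalize_loop(segments):
    """Normalize each row to zero mean and unit variance with Python loops."""
    out = np.empty_like(segments, dtype=np.float64)
    for i in range(segments.shape[0]):
        row = segments[i]
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        std = var ** 0.5
        for j in range(len(row)):
            out[i, j] = (row[j] - mean) / std
    return out


def normalize_vectorized(segments):
    """Same computation, expressed as whole-array NumPy operations."""
    mean = segments.mean(axis=1, keepdims=True)  # one mean per row
    std = segments.std(axis=1, keepdims=True)    # population std, matching the loop
    return (segments - mean) / std               # broadcasting handles every row at once


# Hypothetical stand-in data: 8 segments of 32 samples each.
rng = np.random.default_rng(0)
segments = rng.normal(size=(8, 32))

loop_result = normalize_loop(segments)
vec_result = normalize_vectorized(segments)
```

Both functions return the same values, but the vectorized version pushes the inner loops down into compiled NumPy code, which is typically where the speedup comes from once the arrays get large.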

Written by Adam Cohn