This is the 5th article in our MAFAT Radar competition series, where we take an in-depth look at the different aspects of the challenge and our approach to it. If you want a recap, check out previous posts: the introduction, the dataset, augmentations, and visualizing signal data.
When dealing with a Supervised Machine Learning task one of the most important things you need is data — and lots of it. What does one do when faced with a small amount of data, especially for a task that requires Deep Neural Networks? And how does one create a fast and efficient data pipeline for generating more data that will enable training of Deep Neural Networks without spending hundreds of dollars on expensive cloud GPU units?
These were some of the issues that we faced during the MAFAT Radar Classification competition. My teammate hezi hershkovitz wrote a great post on the augmentations we implemented to generate more training data, as well as our first attempt at a data loader for generating them on the fly. You can read it here.
Issues to Address
However, the data pipeline suffered from a few issues, primarily with regard to speed and efficiency:
- It didn’t take advantage of the fast vectorized operations available in Python via Numpy and Pandas.
- The information needed for each shift was first written and stored as a dictionary and then accessed in the `__getitem__` method using Python for-loops, leading to slow iterations and processing.
- Generating a “shifted” segment from a track resulted in the same track being reconstructed each time a new segment was retrieved, which also slowed down the pipeline.
- The transformations weren’t generalized enough to handle both 2D and 3D inputs, which was necessary since we were experimenting with both scalograms and spectrograms, as you can read about here.
- If we performed all of the shifts and flips at the moment a batch was assembled, the batch would be saturated with examples too similar to one another, preventing the models from generalizing well.
The core reason for these inefficiencies was that the pipeline treated individual segments, rather than whole tracks, as its basic unit of work.
Data Format Recap
To recap, the MAFAT data consisted of fixed-length segments of Doppler radar signals represented as 128x32 I/Q matrices. However, many segments in the dataset were part of the same track, i.e. a longer-duration radar signal, with anywhere from 1 to 43 segments in one track.
The image above comes from hezi hershkovitz’s aforementioned article and shows one complete track from the training dataset when combining all the segments. The red rectangles are the separate segments that are included in this track. The white dots are the ‘doppler burst’, which represents the center of mass of the tracked object.
With the help of the ‘doppler burst’ white dots, we can quite easily see that the track is composed of adjoining segments, i.e. segment id 1942 is followed by 1943, then 1944, and so on.
The fact that the segments are next to each other allows us to use shifts in order to create “new” samples.
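To make the shifting idea concrete, here is a minimal sketch (the function name and shift step are illustrative, not the competition code) of how a concatenated track can yield many “new” overlapping segments:

```python
import numpy as np

SEGMENT_LEN = 32  # each segment is a 128x32 I/Q matrix

def shifted_segments(track, shift):
    """Slice a concatenated track (128 x N*32) into overlapping
    32-column segments, stepping `shift` columns at a time."""
    n_cols = track.shape[1]
    starts = range(0, n_cols - SEGMENT_LEN + 1, shift)
    return np.stack([track[:, s:s + SEGMENT_LEN] for s in starts])

track = np.random.rand(128, 3 * SEGMENT_LEN)  # a 3-segment track
segments = shifted_segments(track, shift=2)
print(segments.shape)  # (33, 128, 32)
```

A 3-segment track already yields 33 overlapping segments at a shift of 2 columns, instead of the 3 original ones.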
However, since each track consists of a different number of segments, a varying number of augmentations would be generated from any given track, which prevented us from using the regular Pytorch Dataset class. Instead, we relied on the Pytorch IterableDataset class to generate a stream of data from each track.
The high-level goal of the three objects below was to create a stream of `_Segment` objects that was flexible enough to handle both tracks and segments while giving consistent semantics across the code.
To enable that, we created:
- A `Config` class that holds all of the necessary hyperparameters and environment variables for a particular experiment; this is really just a simple dictionary with predefined keys.
- A `DataDict` class that handles loading the original segments, validating each track, creating sub-tracks to prevent data leakage, and transforming the data into the right format (2D or 3D) to prepare it for the augmentations.
- A `StreamingDataset` class, a subclass of the Pytorch `IterableDataset`, which handles the augmentations and streams segments to the models.
```python
config = Config(file_path=PATH_DATA, shift_segment=2)
dataset = DataDict(config=config)
train_dataset = StreamingDataset(dataset.train_data, config, shuffle=True)
train_loader = DataLoader(train_dataset, batch_size=config['batch_size'])
```
Processing the segments into tracks and then back into segments in the DataDict presented a great opportunity for speeding up the code, especially if the data validation, re-splitting, and track creation could be vectorized.
We managed to do this using a handful of tricks and neat features in Numpy and Pandas, leaning heavily on smart use of boolean matrices for the validation, and applying the scalogram/spectrogram transformations to the concatenated segments in a track. The code is too long to share here, but you can go to the repo and check out the DataDict class.
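The full DataDict is too long to reproduce, but the boolean-matrix idea can be sketched. Assuming a metadata frame with one row per segment (the column names here are illustrative, not the competition’s exact schema), a single vectorized expression can flag where a track’s segments stop being consecutive:

```python
import pandas as pd

# Hypothetical per-segment metadata; column names are illustrative.
meta = pd.DataFrame({
    'track_id':   [0, 0, 0, 1, 1],
    'segment_id': [10, 11, 13, 20, 21],
})

# Vectorized validation: one boolean per row, True when the segment
# directly follows its predecessor within the same track.
consecutive = meta.groupby('track_id')['segment_id'].diff().fillna(1).eq(1)
print(consecutive.to_numpy())  # [ True  True False  True  True]
```

The `False` marks a gap (segment 13 does not follow 11), which is exactly where a track would be split into sub-tracks, with no Python for-loop in sight.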
Generating a Stream of Segments
Once the dataset was transformed into tracks, the next challenge was doing the splits and shifts in a faster way. Here Numpy provided all of the tools necessary to do fast, matrix-based operations and quickly generate a new set of segments from a track.
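As one illustration of what Numpy offers here (a sketch, not the competition code), `sliding_window_view` from `numpy.lib.stride_tricks` (available since NumPy 1.20) produces every shifted window of a track in a single call, as a view with no copying:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

track = np.arange(128 * 96).reshape(128, 96)  # a 3-segment track

# All 32-column windows at once, as a zero-copy view:
# windows.shape == (128, 65, 32)
windows = sliding_window_view(track, window_shape=32, axis=1)

# Move the window axis to the front and keep every 2nd shift.
segments = windows.transpose(1, 0, 2)[::2]
print(segments.shape)  # (33, 128, 32)
```

Because the windows are views into the original track, generating the shifted segments costs almost nothing until they are actually read.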
Pytorch Iterable Dataset
Once a track had been split back into segments, we needed a function that would augment one track at a time and feed the newly generated segments into a stream from which batches of segments from multiple tracks could be drawn. The last point was crucial to ensuring that each batch was a reasonable representation of the data distribution.
Generating streaming datasets is exactly what the IterableDataset class is for. The difference between it and the classic (map-style) Dataset class in Pytorch is that instead of implementing a `__getitem__` method that receives an index mapped to some item in your dataset, with an IterableDataset the DataLoader simply calls `next(iterable_dataset)` until it has built a full batch.
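A minimal sketch of the pattern (the class name and the toy track shapes are ours, not the competition code): `__iter__` yields segments one at a time, and the DataLoader assembles them into batches.

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class TrackStream(IterableDataset):
    """Toy IterableDataset: yields 32-column segments from each track
    instead of serving items by index."""
    def __init__(self, tracks):
        self.tracks = tracks  # list of (128, N*32) tensors

    def __iter__(self):
        for track in self.tracks:
            for start in range(0, track.shape[1] - 31, 32):
                yield track[:, start:start + 32]

tracks = [torch.rand(128, 64), torch.rand(128, 96)]  # 2 + 3 segments
loader = DataLoader(TrackStream(tracks), batch_size=4)
batch = next(iter(loader))
print(batch.shape)  # torch.Size([4, 128, 32])
```

Note that the DataLoader has no idea how many items are coming; it just keeps pulling from the iterator until the batch is full or the stream ends.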
For a great tutorial on working with the IterableDataset class check out this article:
How to Build a Streaming DataLoader with PyTorch
Learn how the new PyTorch 1.2 dataset class `torch.utils.data.IterableDataset` can be used to implement a parallel…
Creating Good Batches
Building off of that example, we created an implementation whose core `process_tracks_shuffle` function makes sure that each batch served by the DataLoader contains a good mixture of segments from multiple tracks. We did this with a `tracks_in_memory` hyperparameter that let us adjust how many tracks would be processed and held in working memory before a new stream was generated.
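The buffering idea can be sketched as a small generator (the function below is an illustration of the mechanism, not our `process_tracks_shuffle` itself): accumulate the segments of `tracks_in_memory` tracks, shuffle them together, then yield, so consecutive batches mix segments from several tracks.

```python
import random

def buffered_shuffle(tracks, tracks_in_memory=2, seed=0):
    """Shuffle segments across groups of `tracks_in_memory` tracks
    before yielding them to the stream."""
    rng = random.Random(seed)
    buffer, count = [], 0
    for segments in tracks:          # each item: one track's segments
        buffer.extend(segments)
        count += 1
        if count == tracks_in_memory:
            rng.shuffle(buffer)
            yield from buffer
            buffer, count = [], 0
    rng.shuffle(buffer)              # flush any remaining tracks
    yield from buffer

# Segments tagged by track id, for illustration:
tracks = [['a1', 'a2', 'a3'], ['b1', 'b2'], ['c1', 'c2']]
stream = list(buffered_shuffle(tracks, tracks_in_memory=2))
# The first five items mix 'a' and 'b' segments; 'c' segments follow.
```

A larger `tracks_in_memory` gives better mixing at the cost of more working memory, which is exactly the trade-off the hyperparameter exposes.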
One big potential speed-up that we didn’t utilize would have been to generate multiple streams by parallelizing the processing of tracks across multiple DataLoader workers. One thing to keep in mind, though, is that parallelization with the IterableDataset is not as straightforward as with the standard Dataset class: naively adding workers to an IterableDataset leads to each worker getting a full copy of the underlying data, and therefore to duplicate batches.
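The standard remedy, had we pursued it, is to shard the data inside `__iter__` using `torch.utils.data.get_worker_info()`. The class name and the round-robin sharding scheme below are a sketch, not our implementation:

```python
from torch.utils.data import IterableDataset, get_worker_info

class ShardedTrackStream(IterableDataset):
    """Shard tracks across DataLoader workers so each worker processes
    a disjoint subset instead of a full copy of the data."""
    def __init__(self, tracks):
        self.tracks = tracks

    def __iter__(self):
        info = get_worker_info()
        if info is None:                 # single-process loading
            tracks = self.tracks
        else:                            # worker i takes every num_workers-th track
            tracks = self.tracks[info.id::info.num_workers]
        for track in tracks:
            yield from track             # track = iterable of segments

tracks = [['a1', 'a2'], ['b1'], ['c1', 'c2']]  # toy "segments"
ds = ShardedTrackStream(tracks)
print(list(iter(ds)))  # single-process: ['a1', 'a2', 'b1', 'c1', 'c2']
```

With `num_workers=2`, worker 0 would see tracks 0 and 2 and worker 1 would see track 1, so no segment is served twice.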
As our understanding of the problem increased and our vision of what we wanted to attempt became clearer, our need for a more dynamic and streamlined pipeline became ever more pressing.
Learning to work with streaming data in Pytorch was a great learning experience and a nice programming challenge. Sometimes changing the conceptual level of organization is necessary to unlock a more efficient way of working with your data.
True to the adage, 80% of our time was indeed spent on data cleaning and pipelines. However, instead of viewing the pipeline as a necessary evil to be dealt with before diving into the “fun” 20% of modeling, we should view pipelines and processing as an equally stimulating and crucial part of the project, and a necessary one: the faster the pipeline, the more experiments you can run.