Stream Training Data To Your Models With Diffgram

Pablo Estrada
Published in Diffgram · 6 min read · Sep 7, 2021

The standard approach has been to manually export your data, then write a script to feed that data to your models for training.

Today we are changing that with the all-new:

Diffgram Streaming: Direct to Memory for PyTorch and TensorFlow

This is huge! But before we get ahead of ourselves, let’s provide some context.

The current state: manual exports only

While manual exports are still widely used and totally valid in some contexts, they have some problems:

An example of the usual flow for manually exporting training data.
  1. Decentralized information. Once you download an export, it is completely detached from any system, so you cannot track future changes to the same dataset.
  2. Problems sharing versions. If John generates an export of Dataset A and Paul later changes some of the files in the dataset, Paul has to notify John, otherwise John’s export file is silently outdated.
  3. High resource usage. With huge amounts of training data it can be almost impossible to hold everything in memory, so developers either have to manage memory carefully or pay for bigger machines.
  4. The need for transformation scripts to feed the data into your favorite AI framework like PyTorch or TensorFlow.

The list goes on!

Introducing Streaming Training Data — Direct To Memory

Load Training Data on Demand into PyTorch and TensorFlow

Here we show the difference between the old flow, with manual exports, and the new flow with Diffgram data streaming.

Comparison of the workflow before and after Diffgram

Optimize your datasets for ML and say goodbye to boilerplate code. This is the fastest way to get your data for all machine learning tasks including computer vision. A true Data 2.0 format.

With the newest version of the Diffgram SDK, we have updated our Directory object so that all the files in a dataset can be ingested directly, without generating an export file.

How do we do this?

We’ve made our datasets Python iterables that stream each item on demand to your local machine.

We’ve also implemented methods that transform the data into PyTorch or TensorFlow formats for easier ingestion by your models.

And more! Let’s skip to the examples!

Example Code — Stream ML Training Data

Access a dataset, access an element in it, and connect it to PyTorch or TensorFlow.

dataset = project.directory.get('my dir')

# Stream the first element
file1 = dataset[0]

# Loop through all files
for file in dataset:
    print(file)

# Display an image
from matplotlib import pyplot as plt
plt.imshow(dataset[0]['image'])
plt.show()

# Transform for usage with your favorite framework
pytorch_dataset = dataset.to_pytorch()
# OR
tf_dataset = dataset.to_tensorflow()

This gives you training data that is ready to be ingested by your TensorFlow or PyTorch models.
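For example, if to_tensorflow() returns a standard tf.data.Dataset (our assumption here; check the SDK docs for the exact return type), you can treat it like any other TensorFlow input pipeline:

import tensorflow as tf

# tf_dataset is the object returned by dataset.to_tensorflow() above.
# Assumption: it behaves like a standard tf.data.Dataset whose elements
# can be batched directly; adjust if the SDK returns something else.
train_ds = (
    tf_dataset
    .shuffle(buffer_size=256)    # shuffle a window of streamed items
    .batch(8)                    # small batches keep memory usage low
    .prefetch(tf.data.AUTOTUNE)  # overlap network fetches with training
)

# model.fit(train_ds, epochs=5)  # then feed it into any Keras model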

Colab Notebook Example

Colab Notebook — Full Diffgram Streaming Example

In this notebook you will see a full example of how to stream a big dataset (100k+ images) to your model for training, without having to load all the data onto your local machine. We will use the PyTorch Fast R-CNN network.
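The notebook is the reference, but the overall shape of the loop looks roughly like the sketch below. It uses torchvision's Faster R-CNN as a stand-in detection model and assumes to_pytorch() returns a torch.utils.data.Dataset whose items are (image tensor, target dict) pairs in the format torchvision's detection models expect; adjust to whatever the SDK actually returns.

import torch
import torchvision
from torch.utils.data import DataLoader

# Assumption: items are (image, target) pairs; detection models take
# variable-sized images, so we collate batches into plain tuples.
pytorch_dataset = dataset.to_pytorch()
loader = DataLoader(pytorch_dataset, batch_size=4,
                    collate_fn=lambda batch: tuple(zip(*batch)))

# num_classes=3 is a placeholder; set it to your label count + background.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

model.train()
for images, targets in loader:           # files are fetched on demand
    loss_dict = model(list(images), list(targets))
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()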

On Demand Access

Even better, data is only transferred to your RAM when needed, fully on demand.

So if you have a dataset with 100,000 or even 1M images, you won’t need the RAM to store all 100,000 images + annotations at once.

Diffgram’s SDK handles the complexity of feeding the model the data it asks for during the training loop. That’s right! That includes automatically re-using already-fetched examples to avoid network calls, plus more caching goodness.
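As a quick illustration (our sketch, not the official example), you can walk a huge dataset item by item and only ever hold a small summary in memory, because each file is pulled in only when its turn comes:

# Assumption: items expose their pixels the same way as dataset[0]['image']
# in the snippet above, and the image is a NumPy-style array.
# Each image is fetched lazily, used, and then discarded.
widths = []
for file in dataset:
    image = file['image']
    widths.append(image.shape[1])    # keep only a tiny per-file summary

print(sum(widths) / len(widths))     # e.g. average image width over 100k+ files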

We have put together an example that shows the basic usage of the new SDK features.

Works with Queries

For example, here you can get images with more than 3 cars and at least one pedestrian:

sliced_dataset = dataset.slice(
    'labels.cars > 3 and labels.pedestrian >= 1')
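Since the slice gives you back a dataset object (at least, that is how we read the SDK; double-check against the query docs below), it should plug straight into the same streaming and conversion calls shown earlier:

# Assumption: sliced datasets support the same interface as full ones.
for file in sliced_dataset:          # stream only the matching files
    print(file)

pytorch_subset = sliced_dataset.to_pytorch()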

Query docs

Works with many Nodes

Training on multiple nodes? Send slices of data to each machine without needing to access all of it first. You can load the data from anywhere and run queries at scale.
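One simple pattern (a sketch on our part, not an official multi-node API) is to give each node an interleaved share of the indices, relying on the fact that a file is only downloaded when it is actually accessed:

# Hypothetical round-robin split: node `rank` out of `world_size` nodes only
# touches (and therefore only downloads) its own share of the files.
# Assumption: the dataset supports len(); otherwise give each node its own
# query slice instead.
rank, world_size = 0, 4   # e.g. taken from your cluster or torch.distributed setup

for i in range(rank, len(dataset), world_size):
    file = dataset[i]     # fetched on demand, only on this node
    # ... feed `file` into this node's training loop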

Automatic Security benefits

Often big teams have different security models, with data in different contexts. By using Diffgram, you can set role-based access control once and have it propagate through to training. This means a single privacy and security model, instead of the data team first grabbing all the data and then ending up with multiple copies and duplicated data.

Advantages of the Streaming Approach to Training Data:

  1. Instantly start training without any export, and save time on transformation steps too.
  2. Centralize all the training data in a single place. No more 20 different JSONs of the same dataset: as long as you use the SDK, the latest training data is always available there.
  3. Reduce memory usage during training. You can still train on huge datasets without worrying about the machine’s resources: Diffgram keeps a cache of the data that has been fetched and discards old data once it has been given to the model during the training loop.
  4. Scale every aspect of your training. Stop using JSON, YAML, XML, or other file types to feed data to your model. Keep the training data in your cloud, where everyone on the team can access it.

That’s just the tip of the iceberg. There are so many more benefits!

Works with the rest of Diffgram

Diffgram Training Data Platform for Machine Learning

Want to query your data? Annotate it? Bring the full power of open source Diffgram to your team.

Expanding on the earlier example, you can ingest data into Diffgram with the import wizard, explore the data, create customizable automations, and of course use our best-in-class, human-centered workflows and annotation experiences.

This means you can go from ingesting, to annotating, to exploring a slice of your data, to training, instantly.

Get it Now

It’s easy to get started with open source Diffgram; you can install it in a few minutes:

  1. Install Open Source Diffgram Now or try it on diffgram.com
  2. Install the latest SDK or Read the docs
  3. See the Colab training data streaming example.

If you already have Diffgram installed, see the update guide and be sure to run pip install --upgrade diffgram to get the latest SDK.

We want your thoughts!

This is a very new approach, so we are really curious about what you think about this. Try it out and let us know!

  • Do you like this idea?
  • Do you want more AI frameworks to be supported?
  • Do you still prefer JSON exports? (We still have them though)

Let us know in the comments below!

Shoot us a message on Slack, or create an issue on GitHub with any questions you have.

We will keep improving this streaming data approach throughout 2021 and 2022.

Thanks for reading!

Diffgram.com Open Source

Side notes:
Thinking about dataset versioning? No need! Diffgram automatically versions data for you. Compare model results. Compare QA corrections vs original. You name it.

Disclaimers:

Note that when we say “instant,” there are still bandwidth and other compute limits. For example, depending on your hardware it may take longer to access the entire dataset. In either case there’s no context switching, so your system can get started right away.
