Debug The Humans: Querying your Training Datasets With Diffgram

Pablo Estrada
Published in Diffgram
3 min read · Jul 12, 2021

Early Preview of New Diffgram Query Engine and Data Visualization

Example of “Show me all images with at least 1 kite and 1 stop sign”.

When building ML training pipelines, you can quickly end up with thousands of images carrying hundreds of thousands, or even millions, of labels and annotations.

At that scale, it becomes really hard to pinpoint where your data has problems.

You might start seeing lower performance in your model out of the blue, and questions like the following might come up:

  • What if you missed some objects that you needed to annotate?
  • What if you want to analyze just the images from sensor X or Y?
  • What if you want to see a very specific case, say 4 cars and 5 pedestrians, where your model is showing degraded performance?

Today I want to show you a small preview of the new Diffgram Query Engine.

There’s a video walkthrough here, and you can try it yourself too.

This is the first version of a really powerful feature that will enable you to “Debug the humans”. With the new query engine you can easily query your entire dataset and view the files that match specific criteria.

“The Query Engine is our first step towards debugging and analyzing human training data. Debug both the humans and the machines”

For example, let’s use the public COCO dataset on Diffgram and run some queries to see what we get.

TRY IT HERE (click Dataset Explorer)

Let’s say we want to see all the images that have at least one human.

We would write something like:

labels.humans >= 1

Diffgram will go ahead and display all the files that match your criteria. You can make even more complex queries by using and/or statements.
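For instance, the “at least 1 kite and 1 stop sign” example from the screenshot above could be written roughly like this (the exact label names are an assumption and depend on how your label schema is set up):

labels.kite >= 1 and labels.stop_sign >= 1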

The query engine also supports querying file metadata that you attach when uploading the files.

You can read more about uploading files with the Upload Wizard here.

For example, you can get all the images that have a specific sensor_id by doing:

files.sensor_id = ax-845

The syntax is simple and to the point.
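Assuming metadata and label conditions can be mixed with the same and/or operators, a combined query would look something like:

files.sensor_id = ax-845 and labels.humans >= 1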

And if you don’t want to write the queries, we’ve built a query builder tool that can help you create your own queries and explore your dataset with ease.

That’s right, this is super accessible to people of all skill sets. I can simply follow the prompts of the query builder:

Just click any of the available options and Diffgram will start offering you options to keep expanding or narrowing down your data selection.

The Query Builder makes it super easy to build queries around your training data.

Analyze and Annotate on The Same Platform

Diffgram not only allows you to analyze and query huge datasets to get insights from your training data. It also provides a full, batteries-included annotation platform to add or remove the annotations you need, without any data exports or additional tools.

Just click the image you’re interested in and you will be taken to the annotation studio, where you can edit, create, and delete annotations.

Further, you can easily create Labeling Campaigns for bulk work based on the criteria.

We will soon add all these capabilities to our SDK and create easy ways to export your Diffgram datasets into PyTorch, so your data is ready to train in no time.

Let us know in the comments about any new features and ideas you would like for the platform.

For more info, visit our GitHub repo or diffgram.com

Thanks!
