Know Your Data: a new tool to explore datasets

People + AI Research @ Google
May 19, 2021
The Know Your Data tool

By Daniel Smilkov, Nikhil Thorat, Marie Pellat and Ludovic Peran, Google PAIR

We are excited to announce the beta release of Know Your Data, a new tool to help researchers and product teams better understand datasets, improve data quality and mitigate bias issues.

We hope this tool can help the ML community explore, discuss and improve datasets and, ultimately, the ML models trained on that data. Low data quality stems from a range of issues, from incorrect labels to imbalance across attributes. It contributes to machine learning bias and fairness issues, and it can lead to cascading failures.

While there are already fairness and interpretability tools available for ML analysis, including the ones in Google’s Responsible AI Toolkit, we found that data exploration remained tedious. Through our research, we discovered that visual discovery, mixed with aggregated statistics, could help ML builders get a deeper understanding of their datasets before training a model, or when debugging it. We created Know Your Data to make data assessment more accessible and efficient.

In this beta release, we’ve visualized over 70 image datasets to help answer the following questions:

  • Is the data corrupted? (e.g. broken images, garbled text, bad labels)
  • Is the data sensitive? (e.g. does it contain people or explicit content)
  • Does the data have gaps? (e.g. lack of daylight photos)
  • Is the dataset balanced across various attributes?
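
To make the last two questions concrete, here is a minimal sketch of a gap and balance check over per-image metadata. It is not how Know Your Data computes its statistics. The record fields (`label`, `time_of_day`), the helper names and the 25% threshold are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def attribute_balance(records, attribute):
    """Return each value's share of the records for one attribute."""
    counts = Counter(r.get(attribute, "<missing>") for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def underrepresented(shares, min_share=0.2):
    """List the values whose share falls below a chosen threshold."""
    return sorted(v for v, s in shares.items() if s < min_share)


# Hypothetical per-image metadata; the field names are made up.
records = [
    {"label": "dog", "time_of_day": "night"},
    {"label": "cat", "time_of_day": "night"},
    {"label": "dog", "time_of_day": "night"},
    {"label": "dog", "time_of_day": "day"},
    {"label": "cat", "time_of_day": "night"},
]

shares = attribute_balance(records, "time_of_day")
print(shares)
print(underrepresented(shares, min_share=0.25))
```

In this toy data, daylight photos make up a fifth of the records, so they fall under the 25% threshold and are flagged as a possible gap. What counts as "too few" depends on the dataset and the task, so the threshold is a judgment call rather than a rule.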

Know Your Data in action

One key feature of Know Your Data is that it lets users explore a dataset using information that wasn’t originally part of it. The tool annotates the existing data with additional signals. These come from machine learning models, such as Cloud Vision labels and Cloud Vision face detection, and from general image quality metrics (e.g. sharpness and brightness).
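
As a rough illustration of what image quality metrics like these can look like, the sketch below computes brightness as the mean pixel intensity and sharpness as the variance of a simple Laplacian filter, a common proxy for focus. These definitions are assumptions made for the example. The post does not say how Know Your Data computes its metrics, and the function names are hypothetical.

```python
def brightness(gray):
    """Mean pixel intensity of a grayscale image (rows of 0-255 values)."""
    pixels = [p for row in gray for p in row]
    return sum(pixels) / len(pixels)


def sharpness(gray):
    """Variance of a 4-neighbour Laplacian over interior pixels.

    Assumes the image is at least 3x3. Higher values mean more
    fine detail; a perfectly flat image scores 0.
    """
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


# Two tiny synthetic images: a flat gray patch and a checkerboard.
flat = [[100] * 4 for _ in range(4)]
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(4)] for y in range(4)]

print(brightness(flat), sharpness(flat))
print(brightness(checker), sharpness(checker))
```

Once every image carries scores like these, they can be filtered and aggregated like any other attribute, for example to surface unusually dark or blurry images.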

One of the main goals of Know Your Data is to keep humans in the loop, making it easier for people to assess datasets. Features such as interactive exploration with real-time filtering and statistics, and the image browser that shows individual data points, were created with this in mind.

Know Your Data continues to be under active development. Please join our mailing list to stay updated. You can find more information about how to use the tool on the website. We also welcome your feedback and feature requests via GitHub.

And lastly, we extend a huge thank you to all of our collaborators who helped us make this tool possible.

People + AI Research (PAIR) is a multidisciplinary team at Google that explores the human side of AI.