Dataset vs Ground-Truth Dataset

Wafaa Arbash
Universal Data Tool
3 min read · Aug 21, 2020

A dataset is a collection of data samples. The samples can be images, audio, text, matrices of numbers, or even rows of an Excel spreadsheet. When doing machine learning, you’ll usually need a ground-truth dataset. A ground-truth dataset is a regular dataset with annotations added to it. Annotations can be boxes drawn over images, written text describing each sample, a new column in a spreadsheet, or anything else the machine learning algorithm should learn to output.

A couple of quick examples:

  • If you’re predicting the risk of someone defaulting on a loan, your dataset may be a spreadsheet containing information about people. Your ground-truth dataset would include a column that indicates whether each person defaulted or not.
  • If you’re trying to identify animals in a picture, your dataset might be images of pets. Your ground-truth dataset would be images of pets with bounding boxes showing where the animals are in the image, as well as labels that indicate what animal is in the box.
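The loan example above can be sketched in a few lines of Python. This is only an illustration: the field names (`person_id`, `income`, `loan_amount`, `defaulted`) and all values are invented, not taken from a real dataset.

```python
# A plain dataset: each row describes a person, with no answer attached.
# All field names and values here are invented for illustration.
dataset = [
    {"person_id": 1, "income": 52000, "loan_amount": 10000},
    {"person_id": 2, "income": 31000, "loan_amount": 25000},
    {"person_id": 3, "income": 78000, "loan_amount": 5000},
]

# Annotations: what actually happened to each loan, keyed by person_id.
annotations = {1: False, 2: True, 3: False}

# The ground-truth dataset is the same rows plus the annotation column.
ground_truth_dataset = [
    {**row, "defaulted": annotations[row["person_id"]]} for row in dataset
]

for row in ground_truth_dataset:
    print(row)
```

The original `dataset` is left untouched, which makes the difference visible: the only thing the ground-truth version adds is the `defaulted` column the model should learn to predict.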

How do I get a dataset?

Many of the datasets companies use come from their own customer data, such as engagement or spending information that could help predict when to send a user a promotion. A number of datasets are also available online:

  • The Python library scikit-learn includes functions for loading toy datasets, which are useful for gaining familiarity with machine learning.
  • The U.S. Government releases a lot of data about Public Safety, Research, Education and more at data.gov.
  • Google recently introduced Dataset Search, which helps you find freely available datasets.
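As a quick illustration of the scikit-learn option, the sketch below loads the Iris toy dataset. It assumes scikit-learn is installed; the variable names `X` and `y` are just a common convention. Toy datasets like this one already ship with annotations, so they are ground-truth datasets out of the box.

```python
from sklearn.datasets import load_iris

# Load one of scikit-learn's bundled toy datasets.
iris = load_iris()

X = iris.data    # the samples: four flower measurements per row
y = iris.target  # the annotations: a species index for each row

print(X.shape)                   # number of samples and features
print(y.shape)                   # one label per sample
print(list(iris.target_names))   # species names the label indices refer to
```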

How do I get a ground-truth dataset?

There are many freely available tools for annotating a dataset to turn it into a ground-truth dataset, such as Universal Data Tool, Label Studio, and LabelImg.

You can browse screenshots and interactively search different machine learning tools using Compare Data Tools.

Companies will also annotate your data for you. The biggest challenge with annotating data externally is tracking the quality of your labels. We recommend taking a sample slice of the data returned by the annotation company, reviewing it yourself, and measuring the percentage of samples that are correctly annotated.
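One way to run that quality check is sketched below. It is a minimal example under stated assumptions: the helper `estimate_label_accuracy`, the sample IDs, and the labels are all hypothetical, and the "in-house review" is simulated rather than done by a person.

```python
import random


def estimate_label_accuracy(vendor_labels, reviewed_labels):
    """Fraction of reviewed samples where the vendor's label matches ours."""
    if not reviewed_labels:
        raise ValueError("review at least one sample")
    correct = sum(
        1
        for sample_id, label in reviewed_labels.items()
        if vendor_labels[sample_id] == label
    )
    return correct / len(reviewed_labels)


# Hypothetical labels returned by an annotation company.
vendor_labels = {f"img_{i}": "cat" if i % 2 == 0 else "dog" for i in range(100)}

# Draw a random slice to review in-house (seeded so the run is repeatable).
rng = random.Random(0)
audit_ids = rng.sample(sorted(vendor_labels), k=10)

# Simulated review: we agree with the vendor on every sample except one.
reviewed_labels = {sample_id: vendor_labels[sample_id] for sample_id in audit_ids}
reviewed_labels[audit_ids[0]] = "rabbit"

accuracy = estimate_label_accuracy(vendor_labels, reviewed_labels)
print(f"{accuracy:.0%} of the audited samples were labeled correctly")
```

Sampling at random matters here: reviewing only the first rows of a delivery can hide quality problems that appear later in the batch.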

Training Dataset vs Ground-truth Dataset

“Training dataset” and “ground-truth dataset” are sometimes used interchangeably, but they are slightly different. When building a machine learning model, you break your ground-truth dataset into two smaller datasets: the training dataset and the testing dataset. You then train your machine learning algorithm on the training dataset and evaluate how well it performs on the testing dataset, which it has not seen during training.
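The split itself can be sketched with the standard library alone. The function name `split_ground_truth`, the 80/20 ratio, and the sample file names are assumptions chosen for illustration; libraries such as scikit-learn offer their own helpers for this.

```python
import random


def split_ground_truth(samples, test_fraction=0.2, seed=0):
    """Shuffle annotated samples and split them into (training, testing) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(samples)  # copy so the caller's list is not reordered
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical ground-truth dataset: (sample, annotation) pairs.
ground_truth = [
    (f"image_{i}.jpg", "cat" if i % 2 == 0 else "dog") for i in range(10)
]

train_set, test_set = split_ground_truth(ground_truth)
print(len(train_set), "training samples,", len(test_set), "testing samples")
```

Shuffling before splitting keeps the testing dataset representative, and no sample ends up in both halves, so the test measures performance on data the model has not trained on.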

Be sure to follow us on Twitter, or join our Slack to hear more!
