Clean up your datasets with dftest

Atai Ambus
6 min read · Apr 2, 2022


Introduction

Not too long ago, I came across a public dataset by The Metropolitan Museum of Art in New York (“The Met”) containing data on about 227 thousand artworks across 54 fields.

Honestly, that’s just an invitation for some interesting data science. After some digging, though, I discovered it is very messy: from an entire column of empty values, and not a single row with all fields collected, to just about no field with a consistent format or even type. Ouch.

Datasets like this are not uncommon, and messy or low-quality data may occur any time data is collected manually, by many people, or over many years, as standards and collected fields change and people are prone to mistakes and inconsistencies (medical records are another example). Data cleaning is a well-known concept in data science.

Detecting issues and cleaning up your data can be a very tedious process though, and it grows more tedious the bigger your dataset is. Even after you’re done, it’s likely you’ll still have missed some issues. Then the process repeats as more data is collected or the data collection mechanism is changed.

In the field of Artificial Intelligence, as we move from model-centric practices to more data-centric ones, putting the amount and quality of data at the center of our work, the problem grows in both scope and importance.

In this article I will present dftest, a Python library and command-line tool that streamlines the process of data cleaning and helps make it more complete.

What is dftest?

dftest is a Python package inspired by the unit-testing package pytest.

Whereas traditional unit testing allows you to continuously check that your program is working as expected, dftest allows you to continuously check that your dataset contains valid data.

Like pytest, dftest allows you to write your tests on the fly as simple functions in your code; it includes a command-line tool that can scan files for these functions by naming convention, run them, and give you detailed statistics about the results, including details about testing coverage, individual columns and rows, and the density of errors across them.

dftest does not fix problematic datasets by itself, but it can be a big help in identifying problematic fields or rows and in testing which formats a field uses.

You can easily install dftest with pip:

pip install dftest

dftest requires a tabular dataset in CSV, TSV, or JSON format, and a Python file containing test functions. For this example we will use the aforementioned Metropolitan Museum OpenAccess database, which you can download here.

We will create our own Python file for our tests, which we’ll call tests.py and put alongside the dataset, though you can also use any existing Python file and simply add the test functions. Anything that isn’t a top-level function with a dftest prefix will simply be ignored.

Index and Boolean Tests

dftest tests can be any Python function that takes in a dataframe and outputs any of the following:

  • A boolean value indicating column validity
  • An iterable of hashable values (usually ints) indicating indexes of invalid rows
  • A boolean array of column length indicating each row’s validity

Tests that return a boolean are called Boolean Tests. Tests that return an iterable of hashables or a boolean array are called Index Tests.

As an example, the Object Begin Date and Object End Date columns in our dataset detail the approximate range of years in which the artwork was created. We may want to ensure that they are indeed valid integers representing past years.
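As a sketch, an Index Test for one of these columns might look like the following. The function name and the exact validity rule (a value that parses as a number and does not lie in the future) are my own illustration:

import pandas as pd

def dftest_object_begin_date_index(dataframe):
    # Index Test: return the indexes of invalid rows.
    # A value counts as valid if it parses as a number and is not
    # a future year (the article was written in 2022).
    years = pd.to_numeric(dataframe['Object Begin Date'], errors='coerce')
    invalid = years.isna() | (years > 2022)
    return dataframe.index[invalid]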

As a rule of thumb, using Index Tests will get you more extensive analytics, and using Boolean Tests will get you better speed.

We may not care much for extensive analytics anyway, and just want to ensure a large enough portion of the column is valid, so for the following section we will use Boolean Tests only.
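Here is a sketch of what those tests might look like; the 95% validity threshold and the parsing logic are my own choices for illustration:

import pandas as pd

def dftest_object_begin_date(dataframe):
    # Boolean Test: True if at least 95% of the column holds a
    # parseable, non-future year.
    years = pd.to_numeric(dataframe['Object Begin Date'], errors='coerce')
    valid = years.notna() & (years <= 2022)
    return valid.mean() >= 0.95

def dftest_object_end_date(dataframe):
    years = pd.to_numeric(dataframe['Object End Date'], errors='coerce')
    valid = years.notna() & (years <= 2022)
    return valid.mean() >= 0.95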

Results and statistics

You can run your tests on the database from terminal like so:

dftest --dataframe MetObjects.csv --files tests.py --graph validity

  • The --dataframe option is used to specify a dataset file (CSV, TSV, and JSON are permitted)
  • The --files option is used to specify a test file
  • The --graph option is used to specify the type of graph to display

Running the command with The Met’s dataset and a test file containing the Boolean Tests shown in the previous section should open the following graph:

The resulting graph simply shows that:

  • Object Begin Date and Object End Date had less than 5% invalid values (hence they are marked green)
  • No other column was tested (hence they are empty)
  • The dataframe as a whole has no failed tests (hence it is marked green)

Note that dftest automatically recognizes which columns were tested, by determining which of the dataframe’s columns are accessed as each test runs.

Generic and Concrete Tests

We have run into a small problem: each test can only give one result, so we had to write two separate tests, one for Object Begin Date and one for Object End Date. On a larger scale, this would be extremely inconvenient.

Now, dftest supports a simple solution: instead of performing a test on the dataframe as a whole, we can perform the test on some column of the dataframe. Thus, we can generalize the example from the previous section.

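Here is a sketch of that generalized test. Note the (column, dataframe) parameter order, which is discussed below; I am also assuming that dftest passes the column itself as a pandas Series:

import pandas as pd

def dftest_past_year(column, dataframe):
    # Generic test: dftest calls this once per column, passing the
    # column first and the full dataframe second.
    years = pd.to_numeric(column, errors='coerce')
    valid = years.notna() & (years <= 2022)
    return valid.mean() >= 0.95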

A test like this (that acts on “some column” in the dataframe) is called a generic test, and a test that acts on concrete columns is called a concrete test.

Important: the order of parameters is significant. Trying to add a function that takes (dataframe, column) instead of (column, dataframe) will cause errors and unexpected behaviour.

Specifying columns for generic tests

dftest detects generic tests automatically, and by default they run on all columns of the dataframe. We can configure exactly which columns the test runs on by using the dftest.options decorator, as sketched after the list below.

We can specify:

  • Which columns to test
  • Which columns not to test
  • Which dtypes to test
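
For example, here is a sketch that restricts our generic test to the two date columns. The dftest.options decorator itself is described above, but the keyword name I use (columns) is an assumption, so check the documentation for the exact spelling:

import pandas as pd
import dftest

# 'columns' is an assumed keyword for the list of columns to test.
@dftest.options(columns=['Object Begin Date', 'Object End Date'])
def dftest_past_year(column, dataframe):
    years = pd.to_numeric(column, errors='coerce')
    valid = years.notna() & (years <= 2022)
    return valid.mean() >= 0.95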

Pick the way you want to specify the columns and run the command again; you should see the same graph as before.

Ready-Made Tests

In addition to running user-made tests, dftest also has a tests submodule containing a variety of common tests. For example, to test for missing values, you can add such a test to your file with a single import statement.
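A sketch of what that import might look like; the exact name of the missing-values test inside dftest.tests is an assumption on my part:

# Importing a ready-made test into tests.py makes it a top-level,
# dftest-prefixed function, so the scanner will pick it up.
# The name 'dftest_missing_values' is assumed for illustration.
from dftest.tests import dftest_missing_values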

Now, running

dftest --dataframe MetObjects.csv --files tests.py --graph validity

will result in the following graph:

Detailing the rate of missing values in each column (all tests in dftest.tests are Index Tests)

The tests submodule also contains test “makers”, which can be used to create new tests easily; for example, we can use one to simplify our original test yet further.
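The maker-based version might look something like this; the maker’s name and signature are hypothetical, but the idea is that the maker returns a ready-made test, which we bind to a dftest-prefixed name so the scanner finds it:

from dftest import tests

# Hypothetical maker: builds a generic range-checking test that
# replaces our hand-written past-year test. The name and signature
# 'make_range_test(max_value=...)' are illustrative only.
dftest_past_year = tests.make_range_test(max_value=2022)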

Saving and Loading Results

You can save the results to a file and load them later using the --dump and --results options, as follows:

# Saving
dftest --dataframe MetObjects.csv --files tests.py --dump results.dump
# Loading
dftest --results results.dump --graph validity

Important: results are saved in the pickle format. It is possible to construct a malicious pickle file that will execute arbitrary code when loaded; only load files you trust.

Read More

This is just a small taste of the full power of dftest. Fuller demonstrations can be found here, including more options, explanations, techniques, and edge cases, along with a thorough overview of useful test functions and of features for styling graphs and dumping invalid lines.

Though currently undocumented, dftest also has an extensive Python API, which can be used to access its raw test results or add tests more dynamically.

And of course, the source code is free and public on GitHub.
