Clean up your datasets with dftest
--
Introduction
Not too long ago, I came across a public dataset by The Metropolitan Museum of Art in New York (“The Met”) containing data for about 227 thousand artworks across 54 fields.
Honestly, that’s just an invitation for some interesting data science. After some looking through, though, I discovered it is very messy: from an entire column of empty values and not a single row with all fields collected, to just about no field with a consistent format or even type. Ouch.
Datasets like this are not uncommon, and messy or low-quality data may occur any time data is collected manually by many people or over many years, as standards and collected fields change and people are more prone to mistakes and inconsistencies (another example is medical records). Data cleaning is a well-known concept in data science.
Detecting issues and cleaning up your data can be a very tedious process though, and it grows more tedious the bigger your dataset is. Even after you’re done, it’s likely you’ll still have missed some issues. Then the process repeats as more data is collected or the data collection mechanism is changed.
In the field of Artificial Intelligence, as we move from model-centric practices into more data-centric ones, putting the amount and quality of data at the centerpiece of our work, the problem grows both in scope and in importance.
In this article I will present dftest, a python library and command line tool to streamline the process of data cleaning and help make it more complete.
What is dftest?
dftest is a python package inspired by the unit testing package pytest. Whereas traditional unit testing allows you to continuously check that your program is working as expected, dftest allows you to continuously check that your dataset contains valid data.
Like pytest, dftest allows you to write your tests on the fly as simple functions in your code; it includes a command line tool that can scan files for these functions by name convention, run them, and give you detailed statistics about the results, including details about testing coverage, individual columns and rows, and the density of errors across them.
dftest does not fix problematic datasets by itself, but it can be a big help in identifying problematic fields or rows, and in testing which formats a field is using.
You can easily install dftest with pip:
pip install dftest
dftest requires a tabular dataset in CSV, TSV or JSON format, and a python file containing test functions. For this example we will use the aforementioned Metropolitan Museum OpenAccess database, which you can download here.
We will create our own python file for our tests, which we’ll call tests.py and put alongside the dataset, though you can also use any existing python file and simply add the test functions. Anything that isn’t a top-level function with a dftest prefix will simply be ignored.
Index and Boolean Tests
dftest tests can be any python function that takes in a dataframe and outputs any of the following:
- A boolean value indicating column validity
- An iterable of hashable values (usually ints) indicating indexes of invalid rows
- A boolean array of column length indicating each row’s validity
Tests that return a boolean value are called Boolean Tests. Tests that return an iterable of hashables or a boolean array are called Index Tests.
As an example, the Object Begin Date and Object End Date columns in our dataset detail the approximate range of years in which the artwork was created. We may want to ensure that they indeed contain valid integers representing past years.
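A sketch of what an Index Test for one of these columns could look like (the dftest name prefix follows the naming convention mentioned above; using the current year as the upper bound is just an illustrative reading of “past years”):

import datetime

import pandas as pd

def dftest_object_begin_date(df: pd.DataFrame) -> pd.Index:
    # Index Test: return the indexes of invalid rows
    years = pd.to_numeric(df['Object Begin Date'], errors='coerce')
    invalid = years.isna() | (years > datetime.date.today().year)
    # Returning the row-validity mask (~invalid) would also be a valid Index Test
    return df.index[invalid]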
As a rule of thumb, using Index Tests will get you more extensive analytics, and using Boolean Tests will get you better speed.
We may not care much for extensive analytics anyway, and just want to ensure that a large enough portion of each column is valid, so for the following section we will use Boolean Tests only:
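Something along these lines (a sketch; the 95% validity threshold is an arbitrary choice for illustration):

import datetime

import pandas as pd

def dftest_object_begin_date(df: pd.DataFrame) -> bool:
    # Boolean Test: a single bool for the whole column
    years = pd.to_numeric(df['Object Begin Date'], errors='coerce')
    return bool((years.notna() & (years <= datetime.date.today().year)).mean() >= 0.95)

def dftest_object_end_date(df: pd.DataFrame) -> bool:
    # Identical except for the column name - a duplication we will address shortly
    years = pd.to_numeric(df['Object End Date'], errors='coerce')
    return bool((years.notna() & (years <= datetime.date.today().year)).mean() >= 0.95)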
Results and statistics
You can run your tests on the database from the terminal like so:

dftest --dataframe MetObjects.csv --files tests.py --graph validity
- The --dataframe option is used to specify a dataset file (CSV, TSV and JSON are permitted)
- The --files option is used to specify a test file
- The --graph option is used to specify the type of graph to display.
Running the command with The Met’s dataset and a test file containing the Boolean Tests shown in the previous section should open a validity graph, which shows simply that:
- Object Begin Date and Object End Date had less than 5% invalid values (hence they are marked green)
- No other column was tested (hence they are empty)
- The dataframe as a whole has no failed tests (hence it is marked green)
Note that dftest will recognize which columns were tested automatically, by determining which of the dataframe’s columns are accessed as each test runs.
Generic and Concrete Tests
We have run into a small problem: each test can only give one result, so we had to write two separate tests, one for Object Begin Date and one for Object End Date. On a larger scale, this would be extremely inconvenient.
Now, dftest supports a simple solution: instead of performing tests on a dataframe, we can perform tests on some column in a dataframe. Thus, we can generalize the example from the previous section:
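Here is a sketch of what that might look like (the column parameter comes first; see the important note below):

import datetime

import pandas as pd

def dftest_valid_year(column: pd.Series, df: pd.DataFrame) -> bool:
    # Generic test: receives "some column" of the dataframe, plus the dataframe itself
    years = pd.to_numeric(column, errors='coerce')
    return bool((years.notna() & (years <= datetime.date.today().year)).mean() >= 0.95)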
A test like this (that acts on “some column” in the dataframe) is called a generic test, and a test that acts on concrete columns is called a concrete test.
Important: the order of parameters is significant. Trying to add a function that takes (dataframe, column) instead of (column, dataframe) will cause errors and unexpected behaviour.
Specifying columns for generic tests
dftest detects generic tests automatically, and by default they run on all columns of the dataframe. We can configure exactly which columns a test runs on by using the dftest.options decorator, as sketched after the following list.
We can specify:
- Which columns to test
- Which columns not to test
- Which dtypes to test
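For illustration, here is a sketch of the first option (the keyword name columns= is my assumption, not necessarily the real API; the exclusion and dtype variants would use analogous keywords):

import datetime

import dftest
import pandas as pd

@dftest.options(columns=['Object Begin Date', 'Object End Date'])  # keyword name is an assumption
def dftest_valid_year(column: pd.Series, df: pd.DataFrame) -> bool:
    years = pd.to_numeric(column, errors='coerce')
    return bool((years.notna() & (years <= datetime.date.today().year)).mean() >= 0.95)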
Pick the way you want to specify the columns and run the command again—you should see the same graph as before.
Ready-Made Tests
In addition to running user-made tests, dftest also has a tests submodule containing a variety of common tests. For example, to test for missing values you can add such a test to a file by simply adding the following import statement:
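Something like the following (the exact function name here is an assumption for illustration; the tests submodule defines the real names):

# The imported name is illustrative; importing a ready-made, dftest-prefixed test
# into tests.py makes it discoverable by the name-convention scan
from dftest.tests import dftest_missing_values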
Now, running

dftest --dataframe MetObjects.csv --files tests.py --graph validity

will result in a graph detailing the rate of missing values in each column (all tests in dftest.tests are Index Tests).
tests also contains test “makers”, which can be used to create new tests easily; for example, we can use one to simplify our original test even further:
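For instance, something in this spirit (the maker name and signature below are placeholders for illustration, not the actual dftest API):

import datetime

from dftest import tests

# Placeholder maker name: a maker returns a ready-to-use, dftest-prefixed test function
dftest_object_begin_date = tests.make_range_test(
    'Object Begin Date', max_value=datetime.date.today().year
)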
Saving and Loading Results
You can save the results to a file and load them later using the --dump and --results options as follows:
# Saving
dftest --dataframe MetObjects.csv --files tests.py --dump results.dump

# Loading
dftest --results results.dump --graph validity
Important: results are saved in the pickle format. It is possible to construct a malicious pickle file which will execute arbitrary code when loaded; only load files you trust.
Read More
This is just a small taste of the full power of dftest. Fuller demonstrations can be found here, including more options, explanations, techniques and edge cases, along with a thorough overview of useful tests functions, and features for styling graphs and dumping invalid lines.
Though currently undocumented, dftest also has an extensive python API which can be used to access its raw test results or add tests more dynamically.

And of course, the source code is free and public on GitHub.