Testing Pandas

Nico Gallinal
Hexacta Engineering
9 min read · Jan 7, 2019

Well, with such a title you may think we started hiring pandas as QA analysts.
I mean, who wouldn’t want a little panda sitting next to them? They are so cute! Unfortunately, that is not the case: the legal team advised us against hiring them.

This story is about the “flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more” — the pandas GitHub page.

I will not start talking about how great a library it is. It has been out in the wild (pun intended) for quite a while, and the community has embraced it for many data analysis tasks.

What I will do is talk about some ways of testing it. The intrepid reader may think of classical unit testing; that could be one approach, until one considers the magnitude of the data represented by dataframes.

It is simply impossible to create all the relevant scenarios by hand without the time it takes tending to infinity.

So we would like some machinery to create the data for us. And not only that: we would also like to create that data in a controlled fashion, to recreate the different testing scenarios.

We would need to define some invariant of the function being tested (a property of the function that must always hold) and assert that the output is in accordance with our expectations.

Let’s say we have that, and suppose we are working with dataframes of different sizes: maybe we have 90 columns (not a large number at all) and a varying number of rows. It would be desirable for this machinery to also provide a way of giving us a simple counterexample that violates the invariant (a technique known as shrinking), and a way of replicating that example.

Furthermore, this machinery could run each test many times, as the data is not the same every time, and that may help us find edge cases.

What I have just described above is known as property-based testing.
The first of its kind was QuickCheck for Haskell, a long time ago, and since then many ports have been developed for different languages. I’ve never had the chance to use the original, but I have used jsverify for JavaScript, cluckcheck for Scheme and FsCheck for C#.

All of them come in different flavors, some better than others, as is generally the case.
That held true until I met Hypothesis: “A Python library for creating unit tests which are simpler to write and more powerful when run, finding edge cases in your code you wouldn’t have thought to look for. It is stable, powerful and easy to add to any existing test suite”, and decided it is the best one I’ve used. Let me show you with a few examples why I think so, and why you should start testing your code with it as well.

Example 1: Level Beginner

Suppose we have the following function defined in the builder.py file.

def fix_new_boxes(raw_prog):
    return (
        raw_prog
        # using -1 as placeholder for mat_code in new boxes
        .assign(mat_code=lambda df:
            df.mat_code.fillna(NO_MAT_CODE).astype("int64")
        )
        .sort_values("prog_start")
    )

Type annotations are not yet ready for pandas, but we can infer that it receives a dataframe with at least two columns: mat_code and prog_start. What it does is fill the mat_codes that do not have a value with NO_MAT_CODE. Then it sorts the dataframe by prog_start, which, by the way, is a date.
So let’s write some tests for it.

Example 1 Test 1

import pandas as pd
from hypothesis import given
from hypothesis import strategies
from hypothesis.extra.pandas import column, data_frames

import builder

@given(
    data_frames(
        columns=[
            column(
                name='prog_start',
                elements=strategies.datetimes(
                    min_value=pd.Timestamp(2017, 1, 1),
                    max_value=pd.Timestamp(2019, 1, 1)
                ),
                unique=True
            ),
            column(
                name='mat_code',
                elements=strategies.just(float('nan'))
            )
        ]
    )
)
def test_fix_new_boxes_nan_replaced(raw_prog):
    prog = builder.fix_new_boxes(raw_prog)
    assert (prog.mat_code == builder.NO_MAT_CODE).all()
    assert prog.shape == raw_prog.shape

Hey Nico, haven’t you just said this was “beginner level”?
That seems like a lot of code! Don’t worry, let’s break it down little by little.

The “given” annotation accepts strategies and… wait, what is a strategy?
Fair enough: suppose you want to generate datetimes. There are many ways to do that, and each of them is called a strategy. Hypothesis provides one for this, called “datetimes”, which lets you define a min_value and a max_value, as you can see above.
Let’s look at a few examples from the terminal.
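If you can’t run the terminal yourself, here is roughly what drawing such examples looks like (the variable name prog_start is mine, and .example() is meant only for this kind of interactive exploration, not for use inside tests):

```python
import pandas as pd
from hypothesis import strategies

# A "datetimes" strategy bounded to the same window used in the test above.
prog_start = strategies.datetimes(
    min_value=pd.Timestamp(2017, 1, 1),
    max_value=pd.Timestamp(2019, 1, 1),
)

# Each call to .example() draws a fresh datetime within the given bounds.
for _ in range(3):
    value = prog_start.example()
    assert pd.Timestamp(2017, 1, 1) <= value <= pd.Timestamp(2019, 1, 1)
```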

Cool, huh? Every time it is asked for an example, it returns a datetime within the given bounds.

In the Test 1 example we can see that these strategies are composable, which allows us to create more complex ones.
In particular, the “data_frames” strategy is composed of the “datetimes” strategy we saw before.
Let’s go to the terminal and see some similar examples using “integers”.
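In case the terminal output doesn’t come through, a small sketch of the same idea (the column name here is illustrative):

```python
import pandas as pd
from hypothesis import strategies
from hypothesis.extra.pandas import column, data_frames

# "data_frames" composed with an "integers" column strategy; with the default
# index, an empty dataframe is a perfectly valid example.
small_frames = data_frames(columns=[
    column(name='mat_code', elements=strategies.integers(min_value=100))
])

df = small_frames.example()
assert list(df.columns) == ['mat_code']
assert (df.mat_code >= 100).all()  # vacuously true for an empty frame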

This is getting better: look how we compose strategies, and note that an empty dataframe is a valid example.

With these examples from the terminal we are able to completely understand what data we are generating for the test, except for the “just” strategy, which I haven’t mentioned before. It is very simple: it always returns the value passed to it.
There are other strategies provided by the library, such as “characters”, “booleans”, “lists”, etc., but we won’t tackle them in depth in this post.
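For a quick taste of those simpler strategies, a tiny REPL-style sketch:

```python
from hypothesis import strategies

# "just" always returns exactly the value it was given.
assert strategies.just(42).example() == 42

# "booleans", "characters" and "lists" draw arbitrary values of the right shape.
assert strategies.booleans().example() in (True, False)
assert len(strategies.characters().example()) == 1
assert isinstance(strategies.lists(strategies.integers()).example(), list)
```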

But let’s go back to the “given” annotation. As I was saying, it receives strategies and with them it generates the data that will be used to test the invariant.

There should be one invariant we are testing here. Can you think of it? While you think, let me show you a panda image so you get some inspiration.

Awww, it is waving at Hypothesis!!!

The invariant is: “no mat_codes are left as NaNs; they are replaced by NO_MAT_CODE”.
I also added an assertion about the shape of both dataframes to make sure the values are actually being replaced and not filtered out.

Example 1 Test 2

import pandas as pd
from hypothesis import given
from hypothesis import strategies
from hypothesis.extra.pandas import column, data_frames

import builder

@given(
    data_frames(columns=[
        column(name='prog_start',
               elements=strategies.datetimes(
                   min_value=pd.Timestamp(2017, 1, 1),
                   max_value=pd.Timestamp(2019, 1, 1)
               ), unique=True),
        column(name='mat_code',
               elements=strategies.one_of(
                   strategies.just(float('nan')),
                   strategies.integers(min_value=100))
               )
    ])
)
def test_fix_new_boxes(raw_prog):
    prog = builder.fix_new_boxes(raw_prog)
    assert prog.mat_code.notna().all()
    assert pd.Index(prog.prog_start).is_monotonic_increasing

Can you tell which invariant we are testing? Again, let me show you an image for inspiration.

Here one of them is telling the other that the question was trickier.

We are actually testing two invariants:
1) No NaN values should be present.
2) It must be sorted by prog_start in a monotonic increasing manner.
This test should be split into two, because we should test one invariant at a time so there is exactly one point of failure.

Have you understood how the “one_of” strategy works? If not, the following examples will make it clear.

It receives several strategies and, for each example, returns a value generated by one of them, picked at random.
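A sketch of what that looks like in practice, using the same sub-strategies as the test above (the variable name is mine):

```python
import math

from hypothesis import strategies

# "one_of" picks one of its sub-strategies at random for every draw, so each
# example here is either NaN or an integer greater than or equal to 100.
mat_code = strategies.one_of(
    strategies.just(float('nan')),
    strategies.integers(min_value=100),
)

for _ in range(5):
    value = mat_code.example()
    assert (isinstance(value, float) and math.isnan(value)) or value >= 100
```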

Example 2: Level Intermediate

Suppose we have the following function defined in the builder.py file.

def add_dayofweek_dummies(data):
    return (
        data
        .assign(dayofweek=lambda df:
            pd.Categorical(df.prog_start.dt.weekday_name,
                           categories=["Monday", "Tuesday", "Wednesday",
                                       "Thursday", "Friday", "Saturday",
                                       "Sunday"])
        )
        .assign(is_weekend=lambda df:
            df.dayofweek.isin(["Saturday", "Sunday"])
        )
        .pipe(pd.get_dummies, columns=["dayofweek"])
    )

First of all, what does this function do? It takes a dataframe with a “prog_start” column, adds a new column called “dayofweek” with the corresponding day name, adds another column called “is_weekend” which is True or False depending on the day name, and finally performs one-hot encoding on the “dayofweek” column.

How do we test this, Nico? Good question! Let’s see…

Example 2 Test 1

import pandas as pd
from hypothesis import given
from hypothesis import strategies
from hypothesis.extra.pandas import series, range_indexes

import builder

def assert_is_day(df, i, day):
    assert df.loc[i, day] == 1
    assert (df.loc[i, ~df.columns.isin(
        ['prog_start', 'is_weekend', day]
    )] == 0).all()

@given(
    series(
        strategies.datetimes(
            min_value=pd.Timestamp(2017, 1, 1),
            max_value=pd.Timestamp(2020, 1, 1)
        ),
        index=range_indexes(min_size=7),
        unique=True
    )
    .map(lambda s: s.to_frame('prog_start'))
)
def test_add_dayofweek_dummies_is_day(data):
    iso_to_day_asserts = {
        1: lambda df, i: assert_is_day(df, i, 'dayofweek_Monday'),
        2: lambda df, i: assert_is_day(df, i, 'dayofweek_Tuesday'),
        3: lambda df, i: assert_is_day(df, i, 'dayofweek_Wednesday'),
        4: lambda df, i: assert_is_day(df, i, 'dayofweek_Thursday'),
        5: lambda df, i: assert_is_day(df, i, 'dayofweek_Friday'),
        6: lambda df, i: assert_is_day(df, i, 'dayofweek_Saturday'),
        7: lambda df, i: assert_is_day(df, i, 'dayofweek_Sunday')
    }

    dayofweek_dummies = builder.add_dayofweek_dummies(data)
    for i in dayofweek_dummies.index:
        iso_to_day_asserts[
            dayofweek_dummies.loc[i, 'prog_start'].isoweekday()
        ](dayofweek_dummies, i)

Here we are testing that the day labels are correctly set and that the one-hot encoding was done right. I won’t go into much detail about how the test works, but I will explain the new stuff.

We have a “series” strategy, a “range_indexes” strategy and a “map” function.
The “series” strategy lets us create series whose elements come from a given strategy, in this case datetimes.
The “range_indexes” strategy lets us create indexes. We use it here because we don’t want series with fewer than seven elements.

Some examples of range_indexes strategy and series strategy.

And we have the “map” function, which lets us transform what was generated before it reaches the test, according to:

s.map(f).example() == f(s.example())

So here we let Hypothesis generate a bunch of series and then map them to dataframes. Isn’t it awesome?

Example 2 Test 2

Here we should test the invariant that the weekend days are correctly assigned; I’ll leave that as homework for the reader.

THIS LINE INTENTIONALLY LEFT BLANK

Example 3: Level Advanced

Suppose we have the following function defined in the builder.py file.

def add_top_programs_dummies(df):
    top_programs = (
        df[df.selected].prog_name.value_counts()
        [lambda s: s >= 50].index
    )
    return (
        df
        .assign(prog=lambda df:
            df.prog_name.where(lambda s: s.isin(top_programs), None)
        )
        .pipe(pd.get_dummies, columns=["prog"])
    )

This function receives a dataframe that contains information about different programs over a period of time, and whether they were selected.
It takes the ones that were selected at least fifty times, defines them as top programs and performs one-hot encoding over them. One may think this example is the same as the previous one, but it isn’t: the dataframes we generate need to be constructed in a more specific way.

For this, we will need the help of a new and more powerful strategy.
Meet the “composite”.

But first, one last inspiring image!

Pandas composition :)

Example 3 Test

import pandas as pd
from hypothesis import given
from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes

import builder

@st.composite
def prog_generator(draw, top_threshold):
    top1 = draw(data_frames(columns=[
        column(name='prog_name', elements=st.just("TOP1")),
        column(name='selected', elements=st.just(True))
    ], index=range_indexes(min_size=top_threshold)))
    top2 = draw(data_frames(columns=[
        column(name='prog_name', elements=st.just("TOP2")),
        column(name='selected', elements=st.just(True))
    ], index=range_indexes(min_size=top_threshold)))
    notop = draw(data_frames(columns=[
        column(name='prog_name', elements=st.text(
            alphabet=['a', 'b', 'c', 'd'], min_size=2)),
        column(name='selected', elements=st.just(True))
    ], index=range_indexes(max_size=top_threshold - 1)))
    return pd.concat([top1, top2, notop])

@given(prog_generator(top_threshold=50))
def test_get_prog_dummies_top_become_dummies(prog):
    dummies = builder.add_top_programs_dummies(prog)

    assert (
        dummies[dummies.prog_name == "TOP1"].prog_TOP1 == 1
    ).all()
    assert (
        dummies[dummies.prog_name == "TOP1"].prog_TOP2 == 0
    ).all()
    assert (
        dummies[dummies.prog_name == "TOP2"].prog_TOP1 == 0
    ).all()
    assert (
        dummies[dummies.prog_name == "TOP2"].prog_TOP2 == 1
    ).all()

    # no dummies for non-top progs
    assert dummies.shape[1] == 4

According to the documentation: “the composite decorator lets you combine other strategies in more or less arbitrary ways. It’s probably the main thing you’ll want to use for complicated custom strategies.” which is precisely what we want to do!!!

We need to create a single dataframe containing at least top_threshold occurrences of the TOP1 and TOP2 programs, plus many other programs with random names, each appearing no more than top_threshold minus one times.

And that is exactly what we do, thanks to the draw function.
The draw function is always passed as the first argument of the composite, and should be thought of as a function that returns one example of the strategy it was invoked with.
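To make the mechanics concrete, here is a stripped-down composite in the same spirit (the names and threshold are illustrative):

```python
import pandas as pd
from hypothesis import strategies
from hypothesis.extra.pandas import column, data_frames, range_indexes

# The decorated function becomes a strategy itself; every call to draw()
# pulls one example from the strategy it receives.
@strategies.composite
def two_part_frames(draw, min_top):
    top = draw(data_frames(
        columns=[column(name='prog_name',
                        elements=strategies.just('TOP1'))],
        index=range_indexes(min_size=min_top),
    ))
    rest = draw(data_frames(
        columns=[column(name='prog_name',
                        elements=strategies.text(alphabet='abcd', min_size=2))],
        index=range_indexes(max_size=min_top - 1),
    ))
    return pd.concat([top, rest], ignore_index=True)

df = two_part_frames(min_top=3).example()
assert (df.prog_name == 'TOP1').sum() >= 3
```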

There is much more to Hypothesis out there, but this is all for now.
I hope you have enjoyed reading this article as much as I have enjoyed writing it, and hopefully the examples were clear enough to take you right away to the Hypothesis site to dive deeper and start using it.

Thanks for reading and stay tuned!
