Verifying Correctness Between RAPIDS cuDF and Pandas Workflows

Published in RAPIDS AI · 3 min read · Oct 17, 2019

RAPIDS Tips and Tricks

RAPIDS accelerates Python analytics workflows up to 100x on GPUs with minimal code changes. Before shipping new code to production, workflow changes often need to pass checks, like unit and integration tests, that verify correctness and make sure there are no regressions. To save time, RAPIDS users frequently do quick and dirty correctness checks while porting their workflow to run on GPUs. Sometimes these tests suggest something may have been lost in translation.

In this post, we’ll walk through a quick checklist that we’ve found helps us find discrepancies. The items are listed in order of how frequently each one has been the root cause of a test failure. These small checks can often save you hours of slicing and dicing your dataframes looking for subtle discrepancies that might not actually exist.

Dtype Checking

Are the dtypes the same for every column? You can check quickly with df1.dtypes.equals(df2.dtypes). This won’t catch everything, because some differences are more subtle. For example, in pandas, datetimes can be stored in object columns either as strings or as actual timestamp objects. Seeing object as the dtype isn’t enough, since object is an overloaded term that covers strings, nested types, and more. If you’re working with timestamp or datetime data, check the type of the actual value in the first row to see the granular data type.
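As a minimal sketch of why the dtype check alone can mislead, the example below builds two small pandas dataframes (the names as_strings and as_stamps are made up for illustration) whose "when" columns are both object dtype but hold different kinds of values. A cuDF result would typically be brought to the host first, for example with to_pandas(), before a comparison like this.

```python
import pandas as pd

# Hypothetical data: the same dates stored two different ways,
# both inside object-dtype columns.
as_strings = pd.DataFrame(
    {"when": pd.Series(["2019-10-17", "2019-10-18"], dtype=object)}
)
as_stamps = pd.DataFrame(
    {"when": pd.Series(
        [pd.Timestamp("2019-10-17"), pd.Timestamp("2019-10-18")],
        dtype=object,
    )}
)

# The column-level dtype check passes, because both columns report object.
dtypes_look_equal = as_strings.dtypes.equals(as_stamps.dtypes)

# Inspecting the first value of each column reveals the real difference.
first_types = (
    type(as_strings["when"].iloc[0]),
    type(as_stamps["when"].iloc[0]),
)

print(dtypes_look_equal)
print(first_types)
```

Here the dtype comparison reports a match even though one column holds Python strings and the other holds timestamp objects, which is the case the first-row check is meant to catch.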

Index Checking

Do the indices match? Out-of-order or differing indices can fail dataframe equality checks. This is particularly relevant when comparing results from a parallel algorithm (such as those used in RAPIDS cuDF) to results from a sequential algorithm (such as those used in pandas). As with dtype checking, you can do a quick check with df1.index.equals(df2.index).
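The sketch below shows one way this can play out, using two pandas dataframes with made-up data that contain the same rows in a different order. Sorting by index before comparing is an assumption about what "correct" means for your workflow; it only makes sense when row order is not part of the expected result.

```python
import pandas as pd

# Hypothetical results: same rows, but the second frame came back
# in a different order (as a parallel algorithm might produce).
df_cpu = pd.DataFrame({"x": [10, 20, 30]}, index=[0, 1, 2])
df_gpu = pd.DataFrame({"x": [30, 10, 20]}, index=[2, 0, 1])

# Index.equals is order-sensitive, so this reports a mismatch.
same_index = df_cpu.index.equals(df_gpu.index)

# Comparing the sorted labels tells us whether only the order differs.
same_labels = df_cpu.index.sort_values().equals(df_gpu.index.sort_values())

# If order is not meaningful, align both frames before comparing values.
equal_after_sort = df_cpu.sort_index().equals(df_gpu.sort_index())

print(same_index, same_labels, equal_after_sort)
```

If same_labels is true while same_index is false, the discrepancy is in row ordering rather than in the data itself.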

Column Order

Are the columns out of order, or do you have different columns? If the columns are out of order, the underlying arrays will be different, which will result in failing equality tests. Checks like df1[df2.columns].equals(df2) and df2[df1.columns].equals(df1) can help triage this.
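A small illustration of that triage, using two made-up pandas dataframes that hold identical data with the columns swapped. The Index.difference calls are an extra step not mentioned above; they are included because selecting df1[df2.columns] raises a KeyError when df2 has a column that df1 lacks.

```python
import pandas as pd

# Hypothetical frames: identical data, columns in a different order.
df1 = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
df2 = pd.DataFrame({"b": [3.0, 4.0], "a": [1, 2]})

# First confirm both frames have the same set of columns.
only_in_df1 = list(df1.columns.difference(df2.columns))
only_in_df2 = list(df2.columns.difference(df1.columns))

# A direct comparison fails because column order differs.
raw_equal = df1.equals(df2)

# Reordering one frame to match the other isolates the ordering issue.
reordered_equal = df1[df2.columns].equals(df2)

print(only_in_df1, only_in_df2, raw_equal, reordered_equal)
```

When both difference lists are empty and only the reordered comparison passes, column order is the sole cause of the failure.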

Floating Point Representation

Does the floating point representation differ? Whether it’s just precision differences or actual floating point arithmetic differences, small discrepancies in floating point values can cause test failures. If your dataframe passes the three correctness checks listed above, and the values generally look correct on a quick visual inspection, consider using something like np.testing.assert_array_almost_equal on the underlying data. This function (and others like it) lets you set the decimal argument to something small, like three or four.
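As a rough sketch, the example below compares two made-up pandas dataframes whose values differ only by floating point rounding, using NumPy's np.testing.assert_array_almost_equal with its decimal argument. The choice of decimal=4 is illustrative; the right tolerance depends on your data and workflow.

```python
import numpy as np
import pandas as pd

# Hypothetical results that differ only by floating point rounding:
# 0.1 + 0.2 is not exactly 0.3 in binary floating point.
df_cpu = pd.DataFrame({"x": [0.1 + 0.2, 1.0]})
df_gpu = pd.DataFrame({"x": [0.3, 1.0]})

# An exact comparison fails on the tiny rounding difference.
exact_equal = df_cpu.equals(df_gpu)

# Compare the underlying arrays to four decimal places instead.
# This raises AssertionError if any element differs beyond that tolerance.
np.testing.assert_array_almost_equal(
    df_cpu.to_numpy(), df_gpu.to_numpy(), decimal=4
)
approx_equal = True  # reached only if the assertion above passed

print(exact_equal, approx_equal)
```

Because the function raises on failure rather than returning a boolean, it drops straight into a unit test.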

Checklist

Are the dtypes the same for every column?
Do the indices match?
Are the columns out of order, or do you have different columns?
Does the floating point representation differ?
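If it helps, the checklist can be bundled into one function. The sketch below is a hypothetical helper (quick_compare is not part of pandas, cuDF, or RAPIDS) that runs the checks in order on two pandas dataframes and only compares numeric values when the structural checks pass.

```python
import numpy as np
import pandas as pd

def quick_compare(df1, df2, decimal=4):
    """Hypothetical helper: run the checklist and report each result."""
    report = {
        "dtypes_match": bool(df1.dtypes.equals(df2.dtypes)),
        "index_match": bool(df1.index.equals(df2.index)),
        "columns_match": list(df1.columns) == list(df2.columns),
    }
    if all(report.values()):
        # Structure agrees, so compare numeric columns approximately.
        num1 = df1.select_dtypes(include="number").to_numpy()
        num2 = df2.select_dtypes(include="number").to_numpy()
        try:
            np.testing.assert_array_almost_equal(num1, num2, decimal=decimal)
            report["values_close"] = True
        except AssertionError:
            report["values_close"] = False
    else:
        # Value comparison is skipped until the structure is fixed.
        report["values_close"] = None
    return report

# Made-up data for demonstration.
a = pd.DataFrame({"x": [0.1 + 0.2], "y": [1]})
b = pd.DataFrame({"x": [0.3], "y": [1]})
print(quick_compare(a, b))
print(quick_compare(a, b[["y", "x"]]))
```

Returning a dictionary rather than raising on the first failure makes it easier to see which item on the checklist to look at first.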

Summary

In our experience, users can generally run through the checklist in a few minutes. Doing so can save hours of digging in the weeds that may turn out to be unnecessary. Want to get started with RAPIDS and access speedups of up to 100x? Check out the Getting Started webpage, with links to help you download pre-built Docker containers or install directly via Conda.