Nullable Integers

How to include missing values in your pandas integer columns

Eric Ness
When I Work Data
2 min readJun 4, 2019

--

Photo by 童 彤 on Unsplash

An Amazing Release

An amazing thing happened in pandas 0.24.0; the most amazing thing that has happened to pandas in years. Finally it’s possible to have missing values in your int64 columns! Before this release, any missing values would cause the column’s data type to change to float64. This issue has been hanging around in a dark alley for years ready to give new pandas users a swift shake.

Check It Out

Here’s an example of an int64 column turning into float64. There are a set of customers and a set of orders. However, not every customer has an order. When the data sets are merged together there is a row for customer 2 that doesn’t have a quantity.

In orders the data type of quantity is int64. Below in customer_orders it changes to float64. Also, notice that the quantity values have an extra “.0” added at the end to indicate that the column is now a float64.

This is unexpected and confusing. The reason that pandas changes the int64 to float64 is because of the missing value in the column. Since pandas can’t be sure that the missing value isn’t a decimal number, it automatically converts the column’s data type. While this approach makes data manipulations safer, it is a surprise the first time you see it and it can cause errors for seemingly no reason.

An Easy Solution

pandas 0.24.0 added a wonderful work-around to the problem. Now there is an IntegerArray data type that allows for null values. Huzzah! Let’s take a look at the new data type in action:

The only difference with this code is that it sets the data type of quantity as Int64 using astype. In the merged DataFrame, quantity remains an Int64 even though there are missing values. In addition, notice the lack of “.0”s after the quantity values. This simple fix resolves the issue!

Recommendation

The new IntegerArray type solves the problem of int64 columns mysteriously changing to float64 whenever there are missing values. I recommend explicitly declaring an integer fields as type Int64 whenever you need an integer data field. This will prevent any unexpected surprises in later data processing steps.

--

--

Eric Ness
When I Work Data

Principal Machine Learning Engineer at C.H.Robinson, a Fortune 250 supply chain company.