Nullable Integers
How to include missing values in your pandas integer columns
An Amazing Release
An amazing thing happened in pandas
0.24.0
; the most amazing thing that has happened to pandas
in years. Finally it’s possible to have missing values in your int64
columns! Before this release, any missing values would cause the column’s data type to change to float64
. This issue has been hanging around in a dark alley for years ready to give new pandas
users a swift shake.
Check It Out
Here’s an example of an int64
column turning into float64
. There are a set of customers
and a set of orders
. However, not every customer has an order. When the data sets are merged together there is a row for customer 2 that doesn’t have a quantity
.
In orders
the data type of quantity
is int64
. Below in customer_orders
it changes to float64
. Also, notice that the quantity
values have an extra “.0” added at the end to indicate that the column is now a float64
.
This is unexpected and confusing. The reason that pandas
changes the int64
to float64
is because of the missing value in the column. Since pandas
can’t be sure that the missing value isn’t a decimal number, it automatically converts the column’s data type. While this approach makes data manipulations safer, it is a surprise the first time you see it and it can cause errors for seemingly no reason.
An Easy Solution
pandas
0.24.0
added a wonderful work-around to the problem. Now there is an IntegerArray
data type that allows for null values. Huzzah! Let’s take a look at the new data type in action:
The only difference with this code is that it sets the data type of quantity
as Int64
using astype
. In the merged DataFrame
, quantity
remains an Int64
even though there are missing values. In addition, notice the lack of “.0”s after the quantity
values. This simple fix resolves the issue!
Recommendation
The new IntegerArray
type solves the problem of int64
columns mysteriously changing to float64
whenever there are missing values. I recommend explicitly declaring an integer fields as type Int64
whenever you need an integer data field. This will prevent any unexpected surprises in later data processing steps.