Optimizing the size of a pandas dataframe for low memory environment

In a previous post I went through loading a CSV into a dataframe in chunks so we could filter out the rows we don't need. However, in some cases, despite filtering, the resulting dataframe is still bigger than the available memory of your environment. One way to address this is to specify the data types of your dataframe columns more efficiently than the automatic detection done by pandas.

Numerical columns:

By default, pandas stores numeric columns as int64 or float64 (depending on your environment, int32 or float32 may also appear). If you know the minimum and maximum values of a column, you can use a subtype which consumes less memory. You can also use an unsigned subtype if there are no negative values.

Here are the different subtypes you can use:

int8 / uint8 : consumes 1 byte of memory, range between -128/127 or 0/255

bool : consumes 1 byte, true or false

float16 / int16 / uint16 : consumes 2 bytes of memory, range between -32768/32767 or 0/65535

float32 / int32 / uint32 : consumes 4 bytes of memory, range between -2147483648/2147483647 or 0/4294967295

float64 / int64 / uint64 : consumes 8 bytes of memory

If one of your columns has values between 1 and 10, for example, you can reduce the size of that column from 8 bytes per row to 1 byte, which is more than 85% memory saving on that column!
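To make that saving concrete, here is a minimal sketch with a hypothetical column of small integers (the column name and data are made up for illustration), downcast to the smallest unsigned subtype that fits:

```python
import pandas as pd

# Hypothetical column holding values between 1 and 10, stored as int64 by default
df = pd.DataFrame({"rating": [1, 3, 5, 7, 10] * 1000})

before = df["rating"].memory_usage(index=False, deep=True)

# Downcast to the smallest unsigned integer subtype that fits the values
df["rating"] = pd.to_numeric(df["rating"], downcast="unsigned")

after = df["rating"].memory_usage(index=False, deep=True)

print(df["rating"].dtype)  # uint8
print(before // after)     # 8
```

Going from int64 (8 bytes per value) to uint8 (1 byte per value) shrinks the column by a factor of 8.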

Categorical columns

Pandas stores string columns with the object dtype. One of the reasons this storage is not optimal is that it creates a list of pointers to the memory address of each value of your column. For columns with low cardinality (the number of unique values is lower than 50% of the total number of values), this can be optimized by forcing pandas to use a virtual mapping table where each unique value is mapped to an integer instead of a pointer. This is done using the category datatype.
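A minimal sketch of that idea, using a hypothetical low-cardinality string column (the column name and data are made up for illustration): we check the cardinality ratio, convert to category, and compare memory usage.

```python
import pandas as pd

# Hypothetical string column: 3 unique values across 3000 rows
df = pd.DataFrame({"color": ["red", "green", "blue"] * 1000})

# Rule of thumb: convert when unique values are well below 50% of the row count
ratio = df["color"].nunique() / len(df)
print(ratio)  # 0.001

before = df["color"].memory_usage(index=False, deep=True)
df["color"] = df["color"].astype("category")
after = df["color"].memory_usage(index=False, deep=True)

print(after < before)  # True: integer codes + a small lookup table beat string pointers
```

Each row now stores a small integer code instead of a full Python string object.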

Changing the datatype on an existing dataframe:

For numerical columns, this is done by downcasting the column type:

# downcasting a float column
df['col1'] = pd.to_numeric(df['col1'], downcast='float')
# downcasting an integer column
df['col2'] = pd.to_numeric(df['col2'], downcast='unsigned')

For categorical columns, you simply declare the category type. But first, here is a useful snippet that shows the unique and total counts of each object column so you can decide which ones should be converted:

df_description = df.select_dtypes(include=['object']).copy()
df_description.describe()

And now for the columns we want to optimize :

df['col3'] = df['col3'].astype('category')

Setting the right datatypes when loading a csv:

First, let's load a subset of our CSV to get the column names and types. We then apply the optimizations seen above to this subset, so we can build a dict mapping column names to the correct datatypes, which we will use when loading the full CSV.

df = pd.read_csv('./dataset.csv', nrows=20)
# Now, with the knowledge of your data, you should declare the optimized datatype for the right columns:
...
df['col1'] = pd.to_numeric(df['col1'], downcast='float')
df['col2'] = pd.to_numeric(df['col2'], downcast='unsigned')
df['col3'] = df['col3'].astype('category')
...
# create the dict of index names and optimized datatypes
dtypes = df.dtypes
colnames = dtypes.index
types = [i.name for i in dtypes.values]
column_types = dict(zip(colnames, types))

And finally we use read_csv, passing the previous dict to tell pandas to load the data the way we want:

df_optimized = pd.read_csv('./dataset.csv',dtype=column_types)
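The whole flow can be sketched end to end with an in-memory CSV standing in for `./dataset.csv` (the column names and values here are hypothetical, chosen to match the snippets above):

```python
import io
import pandas as pd

# Hypothetical CSV standing in for ./dataset.csv
csv_data = "col1,col2,col3\n1.5,3,red\n2.5,7,blue\n"

# dtype mapping as built from the optimized sample dataframe
column_types = {"col1": "float32", "col2": "uint8", "col3": "category"}

# pandas allocates the optimized dtypes directly while parsing
df_optimized = pd.read_csv(io.StringIO(csv_data), dtype=column_types)
print(df_optimized.dtypes)
```

Passing `dtype` to read_csv means the dataframe is built with the small types from the start, instead of being loaded at full size and downcast afterwards.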

In most cases your dtypes can be downcasted or categorized, and this results in huge memory savings. I reached 90% memory savings on some of my working datasets.

You can combine this with the previous post: https://medium.com/@vincentteyssier/filtering-csv-files-bigger-than-memory-to-a-pandas-dataframe-3ab51ff993fd and you will probably be able to load a 6 GB dataset in only 1 GB of memory :)

Enjoy!