faster IO with feather

tl;dr: I’m using a file format called feather to speed up my analysis, both the time spent writing code and the runtime.

Someone mentioned speed on our catch-up call yesterday, and it got me questioning the way we have written things so far. Today I found myself, yet again, writing the code to read a CSV and turn a column into Python datetime objects.

It is in fact the same file I am always reading in: grid frequency data. This loading code has been written many times, and I couldn’t even hazard a guess at how many times it has been run; it is in the tens of millions at least.

Surely there are better ways than reading in 31,557,600 CSV rows of frequency data, turning them into a pandas dataframe, and converting the datetime column? There are. The solution I’ve gone for is called feather, which was co-created by the creator of pandas. It has two advantages:

1. Metadata: when you save a dataframe to a “.feather” file, it writes each column’s data type as metadata. This means the dates are stored as real timestamps, which is more efficient than storing them as strings.

2. Columns: feather is built upon the new-ish Apache project called Arrow, which stores data in a more computer-friendly way: column by column rather than row by row.

So I ran a test on some files (1-second frequency data for one year, split into 12 files).

The big number is that using feather would have been 75 times faster!

As you can see, the data are not only quicker to read in; storing dates as actual date objects rather than strings also makes things quicker. I like this solution because it keeps the flexibility of a file over a database while improving speed a lot.

This would have taken a few days off our computation time for the market analysis that we conducted recently… I should be more positive: this will take days off simulation time in future market analyses!

Example code:

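import pandas as pd
# Read the CSV, replacing its header row with our own column names
df = pd.read_csv(csv_path, names=["datetime", "frequency"], header=0)
# Parse the timestamp strings into datetime objects
df["datetime"] = pd.to_datetime(df["datetime"], format="%Y-%m-%d %H:%M:%S")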

vs

import feather
# Column types (including the datetimes) come from the file, so there is no parsing step
df = feather.read_dataframe(feather_path)