Fast Data Store for Pandas Time-Series Data using PyStore

Ran Aroussi
8 min read · Jun 5, 2018


As algorithmic traders, we need a lot of data to test and optimize our strategy ideas. Over time, the amount of data adds up and the search for a reliable, efficient, and easy-to-use storage solution begins.

When I first confronted this issue, my immediate go-to solution was to use a database like MySQL or PostgreSQL. But as a Python developer and a heavy Pandas user, I was hoping for a more “Pythonic” solution: one that would let me store Pandas dataframes fast and fetch them even faster, all while keeping the data itself as portable as possible.

The closest solution I found was the very impressive Arctic library by Man AHL, but it required me to run a MongoDB server, and the data wasn’t very easily portable.

I finally ended up with a solution that may raise some eyebrows, but I feel that it works far better in the context of small to mid-sized trading operations.

The solution I chose was not using a database at all. 😲

You see, databases are designed to do a wide variety of things with data, and many of those things require checks and safety mechanisms that, while necessary for general-purpose workloads, actually hurt the performance of the one thing you are trying to do.

On the other hand, market time-series data is unique, both in how it is used and how it is stored. The user drobertson explained it best in this Stack Overflow comment:

Market time series data is stored in a completely different way. In fact, I would say it is prepared rather than stored. Each data item only needs to be written once and after that never needs to be modified or changed. Data items can be written sequentially, there is no need to insert anything in the middle. It needs no ACID functionality at all. They have little to no references out to any other data. The time series is effectively its own thing.

So… the solution was to use a flat-file database, using a binary file format where I can store Pandas dataframes, and that supports compression to save on disk space and make portability easier. Luckily, the Parquet file format seemed to fit the bill just right :)
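If you haven’t worked with Parquet directly, here is a minimal sketch of what the format gives you, using plain Pandas with the Fastparquet engine (the dataframe and file name are made up for illustration; PyStore handles all of this for you behind the scenes):

import pandas as pd

# Illustrative only: write a dataframe as a Snappy-compressed Parquet file
# and read it back. PyStore does the equivalent of this under the hood.
df = pd.DataFrame({'Close': [170.0, 171.3, 168.8]},
                  index=pd.date_range('2018-03-20', periods=3, name='Date'))
df.to_parquet('example.parquet', engine='fastparquet', compression='snappy')
roundtrip = pd.read_parquet('example.parquet', engine='fastparquet')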

The next thing was to write a tool that will allow me to read and write such files in a “pythonic” way.

Enter PyStore…

I created PyStore to help me store and retrieve Pandas dataframes, and while it can store any Pandas object, it was designed with storing time-series data in mind.

It’s built on top of Pandas, Numpy, and Dask and stores the data in the Parquet file format (via Fastparquet) in a hierarchical directory structure. Files are compressed using Snappy, a fast and efficient compression/decompression library from Google.

The end result?

A powerful, pythonic datastore for Pandas dataframes that can easily query millions of rows at sub-second speed.

PyStore provides namespaced collections of data. These collections allow bucketing data by source, user, or some other metric (for example frequency: End-Of-Day; Minute Bars; etc.). Each collection (or namespace) maps to a directory containing partitioned Parquet files for each item (e.g. symbol).

Getting started with PyStore

UPDATE: I’ve created a Jupyter notebook to make following this guide easier.

Let’s get started with PyStore by first installing it using `pip`:

$ pip install PyStore

INSTALLATION NOTE: If installation fails, you probably need to install Snappy (the compression/decompression library) first. You can install the Snappy C library with one of the following commands:

  • APT: sudo apt-get install libsnappy-dev
  • RPM: sudo yum install libsnappy-devel
  • Brew: brew install snappy

* macOS users need to run this after installing Snappy’s C library:

CPPFLAGS="-I/usr/local/include -L/usr/local/lib" pip install python-snappy

* Windows users should check out Snappy for Windows and this Stack Overflow post for help on installing Snappy and python-snappy.

Next, let’s get some market data to work with. We’ll use Quandl’s API to download 37+ years’ worth of historical data for Apple’s stock.

import quandl

aapl = quandl.get('WIKI/AAPL', authtoken='your token here')
aapl.head()

The output should look something like this:

Date        Open   High   Low    …  Adj. Close  Adj. Volume
1980-12-12  28.75  28.87  28.75  …    0.422706    117258400
1980-12-15  27.38  27.38  27.25  …    0.400652     43971200
1980-12-16  25.37  25.37  25.25  …    0.371246     26432000
1980-12-17  25.87  26.00  25.87  …    0.380362     21610400
1980-12-18  26.63  26.75  26.63  …    0.391536     18362400

Next, let’s explore the PyStore library…

We’ll start by importing it into our code.

import pystore

Next, we’ll tell PyStore where to save our datastore. This step is optional; unless specified, the default path is set to ~/.pystore.

# Set storage path
pystore.set_path('/usr/share/pystore')

We can get a list of the datastores found at this location. Since we’re just getting started, all we’ll get is an empty list.

# List stores
pystore.list_stores()
# returns: []

Now it’s time to create our first datastore.

# Connect to the datastore (create it if it doesn't exist)
store = pystore.store('mydatastore')

Now when we call pystore.list_stores() we’ll get a list with our new datastore listed.

pystore.list_stores()
# returns: ['mydatastore']

Before we can save our AAPL time-series data, we need to create a Collection.

As mentioned earlier, each collection (or namespace) maps to a directory containing partitioned Parquet files for each item (e.g. symbol).

# Access a collection (create it if it doesn't exist)
collection = store.collection('NASDAQ.EOD')
# List all collections in the datastore
store.list_collections()
# returns ['NASDAQ.EOD']
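Under the hood, collections and items are just directories and files on disk. Once a few items have been written, you can peek at the layout with something like this (the path assumes the storage location we set earlier; the exact file names are an implementation detail of PyStore/Fastparquet):

import os

# Walk the datastore directory to see the collection/item layout
for root, dirs, files in os.walk('/usr/share/pystore/mydatastore'):
    print(root, files)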

Now we’re ready to store our data. For demo purposes, we won’t be storing the last row, which will be appended later. We’ll also attach some metadata indicating the data source.

# Store the data in the collection under AAPL
%time collection.write('AAPL', aapl[:-1], metadata={'source': 'Quandl'})

On my laptop (2015 MacBook Pro), saving 37+ years' worth of historical data took less than 16 milliseconds. Not too shabby.

CPU times: user 14.5 ms, sys: 2.75 ms, total: 17.2 ms
Wall time: 15.8 ms

To see all available items (symbols) in the collection, we’ll use:

# List all items in the collection
collection.list_items()
# returns: ['AAPL']

Now let’s see how long it takes to read the data:

# Reading the item’s data
%time item = collection.item('AAPL')
CPU times: user 4.38 ms, sys: 1.11 ms, total: 5.49 ms
Wall time: 4.5 ms

4.5 milliseconds. NICE!

Let’s take a closer look at the returned item object, specifically the data property, which returns a Dask dataframe, and the metadata property, which returns, well, the metadata.

# Load AAPL Dask dataframe (see dask.pydata.org)
data = item.data

(TL;DR: A Dask DataFrame is a large parallel dataframe composed of many smaller Pandas dataframes, split along the index. These Pandas dataframes may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster. One Dask dataframe operation triggers many operations on the constituent Pandas dataframes.)
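Because item.data is a regular Dask dataframe, you can run lazy, out-of-core computations on it and only call compute() when you actually need a result. A small sketch (column names follow the Quandl WIKI/AAPL dataset downloaded above):

# Lazy Dask operations on the item's data
data = item.data
print(data.npartitions)                # number of underlying Pandas partitions
avg_close = data['Adj. Close'].mean()  # builds a task graph; nothing runs yet
print(avg_close.compute())             # triggers the actual computation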

# Load AAPL metadata
item.metadata
# returns:
# {'source': 'Quandl', '_updated': '2018-06-05 16:04:56.203825'}

To get the entire data as a Pandas dataframe, we call:

# load data as Pandas dataframe
df = item.to_pandas()
df.tail()

We should get something like this:

Date        Open    High    Low     …  Adj. Close  Adj. Volume
2018-03-20  175.24  176.80  174.94  …     175.240     19314039
2018-03-21  175.04  175.09  171.26  …     171.270     35247358
2018-03-22  170.00  172.68  168.60  …     168.845     41051076
2018-03-23  168.39  169.92  164.94  …     164.940     40248954
2018-03-26  168.07  173.10  166.44  …     172.770     36272617

To check how fast we can read the data as a Pandas dataframe, we can run this code:

%time collection.item('AAPL').to_pandas()
CPU times: user 7.28 ms, sys: 1.26 ms, total: 8.54 ms
Wall time: 7.57 ms

It took 7.57 milliseconds to read the data and convert it back into Pandas dataframe format on a 2015 MacBook Pro. Not too bad at all…

Let’s append the last day (row) to our item:

# Append a row to AAPL
collection.append('AAPL', aapl[-1:])
collection.item('AAPL').to_pandas().tail()

We should see our new row added:

Date        Open    High    Low     …  Adj. Close  Adj. Volume
2018-03-20  175.24  176.80  174.94  …     175.240     19314039
2018-03-21  175.04  175.09  171.26  …     171.270     35247358
2018-03-22  170.00  172.68  168.60  …     168.845     41051076
2018-03-23  168.39  169.92  164.94  …     164.940     40248954
2018-03-26  168.07  173.10  166.44  …     172.770     36272617
2018-03-27  173.68  175.15  166.92  …     168.340     38962839

That’s the basic stuff that should allow you to get started using PyStore in no time.

But there’s more…

After a while, you’ll have many items stored, and you may want to look some of them up by metadata. To do this, simply add your metadata key to the list_items method:

# Query available symbols based on metadata
collection.list_items(source='Quandl')
# returns: ['AAPL']

Snapshot functionality

When working with data, there will be times when you’ll accidentally mess up the data, making it unusable. For that reason, PyStore allows you to create snapshots — a point-in-time, named reference for all current items in a collection.

Creating a snapshot is done using the create_snapshot method:

# Snapshot a collection
collection.create_snapshot('snapshot_name')
# List available snapshots
collection.list_snapshots()
# returns ['snapshot_name']

To see how snapshots work, let’s change our original AAPL to only include the `Close` and `Volume` columns.

# Change the current dataset
collection.write('AAPL', aapl[['Close', 'Volume']],
                 metadata={'source': 'Quandl'},
                 overwrite=True)
# Load the "new" item
collection.item('AAPL').to_pandas().tail()

As we can see, we’ve lost all data except for the `Close` and `Volume` columns.

Date        Close    Volume
2018-03-21  171.270  35247358
2018-03-22  168.845  41051076
2018-03-23  164.940  40248954
2018-03-26  172.770  36272617
2018-03-27  168.340  38962839

However, we can load our snapshot data at any time using:

# Get a version of a symbol given a snapshot name
snap_df = collection.item('AAPL', snapshot='snapshot_name')
snap_df.to_pandas().tail()

And, Voilà, we can see our data again.

Date        Open    High    Low     …  Adj. Close  Adj. Volume
2018-03-20  175.24  176.80  174.94  …     175.240     19314039
2018-03-21  175.04  175.09  171.26  …     171.270     35247358
2018-03-22  170.00  172.68  168.60  …     168.845     41051076
2018-03-23  168.39  169.92  164.94  …     164.940     40248954
2018-03-26  168.07  173.10  166.44  …     172.770     36272617
2018-03-27  173.68  175.15  166.92  …     168.340     38962839

We can, of course, restore our data from the snapshot:

collection.write('AAPL', snap_df.to_pandas(),
                 metadata={'source': 'Quandl'},
                 overwrite=True)

Lastly, we’ll delete the snapshot:

# Delete a collection snapshot
collection.delete_snapshot('snapshot_name')
# To delete all snapshots, use:
# collection.delete_snapshots()

To delete the item and then the entire collection, run:

# Delete the item from the current version
collection.delete_item('AAPL')
# Delete the collection
store.delete_collection('NASDAQ.EOD')

That concludes this quick tutorial of PyStore.

I hope you’ll find it as useful as I do.

Speed test

In the above example, we’ve used EOD data (9,400 rows). As you recall, the write time was 15.8 ms and the read time (to Pandas) was 7.57 ms.

Let’s try writing and reading 5 years’ worth of minute-level data for the VIX Futures continuous contract (2,319,867 rows).

# Test write speed
%time collection.write('VIX', vixdf)
CPU times: user 491 ms, sys: 148 ms, total: 639 ms
Wall time: 419 ms

Writing, obviously, takes longer here, as we’re dealing with 246 times more data. But, still, 419 milliseconds is not bad.

Let’s see how long it takes us to read that amount of data.

# Test read speed
%time collection.item('VIX')
CPU times: user 4.19 ms, sys: 2.24 ms, total: 6.42 ms
Wall time: 5.32 ms

It took just 5.32 milliseconds to read 2,319,867 records on my laptop.

But the big question is: how long will it take to read the data *and* convert it to a Pandas dataframe?

Let’s see:

# Test read speed
%time collection.item('VIX').to_pandas()
CPU times: user 170 ms, sys: 79 ms, total: 249 ms
Wall time: 179 ms

So… it takes PyStore 179 milliseconds to read and parse 2,319,867 records. Now, this is impressive!

Remember I mentioned portability earlier? Well, moving your datastore couldn’t be easier. Simply copy your datastore directory (the default is ~/.pystore/<DATASTORE>) to an external drive to move it to a new machine.
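On the new machine, you just point PyStore at the copied directory and read your items as usual (paths here are illustrative):

import pystore

# Point PyStore at the copied datastore and read an item back
pystore.set_path('/path/to/copied/pystore')
store = pystore.store('mydatastore')
df = store.collection('NASDAQ.EOD').item('AAPL').to_pandas()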

For peeking inside your Parquet files, I recommend Chip, a lightweight Parquet viewer.

PyStore is now available via pip or directly from its GitHub page.

It currently works with a local filesystem, and I plan on adding support for Amazon S3, Google Cloud Storage, and Hadoop Distributed File System in the future.

Please let me know what you think!
