Use Pandas DataFrames with SQL Support as a Network Database, Thanks to DuckDB

Thoren Lederer
4 min read · Apr 13, 2024


If you are a data scientist who needs to create or modify large datasets and then share them as SQL-ready data with other services, ChipmunkDB is your solution.

Why most databases suck for time-series data

Let's say you want to download a large dataset of historical stock data, enrich it with some indicators, and then query only slices of it to display in a frontend.

Here are the minimum requirements:

  1. You need a database with very fast reads and writes that can save 50 MB to 500 MB in seconds
  2. Adding a new column across all time series should take seconds, without iterating over all rows
  3. Querying just 3 columns in a time range must be possible via SQL
  4. It should be a network database

I have tried countless databases to solve the problem above, but none of them meets all of the listed requirements.

DuckDB is a good solution

One day I stumbled upon DuckDB (https://duckdb.org/) and it was astounding. You can easily convert a pandas DataFrame into a SQL database, and it can be saved to and loaded from the local disk at breathtaking speed.
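For example, with the duckdb Python package you can query a DataFrame that is in scope by its variable name and persist it to a local file. Here is a minimal sketch; the table and file names are illustrative:

import duckdb
import pandas as pd

# a small DataFrame standing in for real price data
df = pd.DataFrame({"symbol": ["AAPL", "NVDA", "TSLA"],
                   "close": [172.5, 881.9, 171.0]})

# DuckDB can query a DataFrame in scope directly by its variable name
print(duckdb.sql("SELECT symbol, close FROM df WHERE close > 200").df())

# persisting to a local database file is just as easy
con = duckdb.connect("prices.duckdb")
con.execute("CREATE OR REPLACE TABLE prices AS SELECT * FROM df")
con.close()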

The only issue with DuckDB is that it's not especially strong with time series, and it's not a network solution.

That's why I started ChipmunkDB.

The name ChipmunkDB came from the fact that chipmunks are very fast while transporting heavy food inside their cheeks.

Here is an example of why ChipmunkDB is so strong for time-series data

Let's download some data into a DataFrame from Yahoo Finance using yfinance.

import yfinance as yf

# download a small dataset of 5-minute bars for six tickers
prices_df = yf.download(tickers=["QQQ", "NVDA", "AMD", "AAPL", "TSLA", "MSFT"],
                        period='60d', interval='5m')
# stack the ticker level into the index: (date, symbol) rows, OHLCV columns
prices_df = prices_df.stack()
prices_df.index.names = ['date', 'symbol']

# now we have a DataFrame with 28037 rows and 6 columns

Now we want to save this data very quickly to a database, or better, “attach” it to an already existing time series.

import time
from chipmunkdb.ChipmunkDb import ChipmunkDb
chipmunkDb = ChipmunkDb("localhost")

# let's measure the time it takes to save the DataFrame to chipmunkdb
start = time.time()
chipmunkDb.save_as_pandas(prices_df, "stock_prices")
print("Time to save: ", time.time()-start, " seconds")

Time to save: 190 milliseconds

So incredible! It takes only 190 ms to save the data. If you run it again, it takes only 60 ms, because it appends only the rows that don't already exist.
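To make use of that append behavior, you could fetch only the most recent data and save it into the same collection. This is a sketch that reuses only the calls shown above:

# fetch only the most recent day of 5-minute bars
latest_df = yf.download(tickers=["QQQ", "NVDA", "AMD", "AAPL", "TSLA", "MSFT"],
                        period='1d', interval='5m')
latest_df = latest_df.stack()
latest_df.index.names = ['date', 'symbol']

# rows already present in "stock_prices" are skipped on save
chipmunkDb.save_as_pandas(latest_df, "stock_prices")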

Now let’s read back the full DataFrame to a new pandas DataFrame.

# let's now read the DataFrame back from chipmunkdb
import time
from chipmunkdb.ChipmunkDb import ChipmunkDb
chipmunkDb = ChipmunkDb("localhost")

# let's read it back and measure the time
start = time.time()
df = chipmunkDb.collection_as_pandas("stock_prices")
print("Time to load: ", time.time()-start, " seconds")

Time to load: 73 milliseconds

Loading the complete DataFrame takes just 73 ms. Now you have the full DataFrame available and can append a new column.

# let's add a derived column
df["Volume_Quote"] = df["Volume"] * df["Close"]

# let's measure the time it takes to save the DataFrame to chipmunkdb again
start = time.time()
chipmunkDb.save_as_pandas(df, "stock_prices")
print("Time to save: ", time.time()-start, " seconds")

We have added a new column “Volume_Quote” with 28037 rows and saved it back to ChipmunkDB.

Time to save: 72 milliseconds

Now the amazing DuckDB support kicks in

But that's just one of ChipmunkDB's many amazing features. Now let's query only a slice of the DataFrame we created above.

import time
from chipmunkdb.ChipmunkDb import ChipmunkDb
chipmunkDb = ChipmunkDb("localhost")

# let's query only the latest close for the symbols "NVDA" and "AAPL"
start = time.time()
results = chipmunkDb.query("SELECT index_symbol, LAST(Close) FROM stock_prices WHERE index_symbol IN ('AAPL', 'NVDA') GROUP BY index_symbol")
print("Time to query: ", time.time()-start, " seconds")

Time to query: 25 milliseconds

With DuckDB support, you can now run any SQL query on your DataFrames inside ChipmunkDB.
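For example, requirement 3 from the list above (a few columns in a time range) could look like the sketch below. Note that index_date is my assumption for how the 'date' index level is exposed, by analogy with index_symbol in the query above:

# select three columns for one symbol within a time range
# (index_date is assumed by analogy with index_symbol above)
results = chipmunkDb.query(
    "SELECT index_date, Close, Volume FROM stock_prices "
    "WHERE index_symbol = 'TSLA' "
    "AND index_date BETWEEN '2024-03-01' AND '2024-03-08'"
)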

Client libraries are currently available for NodeJS and Python

I have built two client libraries.

The NodeJS client library focuses more on querying data out of ChipmunkDB, because pandas DataFrames are not available in NodeJS.

The Python client library supports the full set of pandas and querying features.

The Python client also supports SQLAlchemy over the network. You can use it, for example, in Superset or other BI systems to query data from your pandas DataFrames. (I will write a story about that in the near future.)
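A rough sketch of what that could look like; the chipmunkdb:// URL scheme, host, and port here are my assumptions, not documented values:

from sqlalchemy import create_engine
import pandas as pd

# hypothetical connection URL; scheme, host, and port are assumptions
engine = create_engine("chipmunkdb://localhost:8091")

# any tool that speaks SQLAlchemy can now query the collections via SQL
df = pd.read_sql("SELECT * FROM stock_prices LIMIT 10", engine)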

Run it in Docker, containerd, or Kubernetes

I have also built container images you can run inside your environment to use ChipmunkDB as your database storage.

Here you can find the helm chart:

Here you can find the docker container:

Roadmap for ChipmunkDB

I know it's a long way to go, but here are some of the features I have planned:

  • Update the client libraries
  • Support JOINs by adding schemas for collections
  • Add webhook support for receiving updates on collections
  • Add authentication support
  • Log-engine support to find changes in collections over time
  • Support for pgAdmin
