Use Pandas DataFrames with SQL Support as a Network Database, Thanks to DuckDB

Thoren Lederer
4 min read · Apr 13, 2024


If you are a data scientist who needs to create or modify large datasets and then share them as SQL-ready data with other services, ChipmunkDB is your solution.

Why most databases suck for time-series data

Let's say you want to download a large dataset of historical stock data, enrich it with some indicators, and then query only slices of it to display in a frontend.

Here are the minimum requirements:

  1. You need a database with very fast reads and writes that can save 50 MB to 500 MB in seconds
  2. Adding a new column across all time series should take seconds, without iterating over all rows
  3. Querying just 3 columns in a time range must be possible via SQL
  4. It should be a network database

I have tried countless databases to solve the problem above, but none of them meets all of the listed requirements.

DuckDB is a good solution

One day I stumbled upon DuckDB (https://duckdb.org/) and it was astounding. You can easily convert a pandas DataFrame into a SQL database, and it can be saved to and loaded from the local disk at breathtaking speed.
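For example, with the duckdb Python package you can query a DataFrame that is in scope by its variable name and persist it to a local file. Here is a minimal sketch; the table and file names are illustrative:

import duckdb
import pandas as pd

# a small DataFrame standing in for real price data
df = pd.DataFrame({"symbol": ["AAPL", "NVDA", "TSLA"],
                   "close": [172.5, 881.9, 171.0]})

# DuckDB can query a DataFrame in scope directly by its variable name
print(duckdb.sql("SELECT symbol, close FROM df WHERE close > 200").df())

# persisting to a local database file is just as easy
con = duckdb.connect("prices.duckdb")
con.execute("CREATE OR REPLACE TABLE prices AS SELECT * FROM df")
con.close()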

The only issue with DuckDB is that it's not especially strong with time series, and it's not a network solution.

That's why I started ChipmunkDB.

The name ChipmunkDB came from the fact that chipmunks are very fast while transporting heavy food inside their cheeks.

Here is an example of why ChipmunkDB is so strong for time-series data

Let's download some data into a DataFrame from Yahoo Finance using yfinance.

import yfinance as yf

# download a small dataset of 5-minute bars for six tickers
prices_df = yf.download(tickers=["QQQ", "NVDA", "AMD", "AAPL", "TSLA", "MSFT"],
                        period='60d', interval='5m')
# stack the ticker level into the index: (date, symbol) rows, OHLCV columns
prices_df = prices_df.stack()
prices_df.index.names = ['date', 'symbol']

# now we have a DataFrame with 28037 rows and 6 columns

Now we want to save this data very quickly to a database, or better, “attach” it to an already existing time series.

import time
from chipmunkdb.ChipmunkDb import ChipmunkDb
chipmunkDb = ChipmunkDb("localhost")

# let's measure the time it takes to save the DataFrame to chipmunkdb
start = time.time()
chipmunkDb.save_as_pandas(prices_df, "stock_prices")
print("Time to save: ", time.time()-start, " seconds")

Time to save: 190 milliseconds

So incredible! It takes only 190 ms to save the data. If you run it again, it takes only 60 ms, because it appends only the rows that don't already exist.
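To make use of that append behavior, you could fetch only the most recent data and save it into the same collection. This is a sketch that reuses only the calls shown above:

# fetch only the most recent day of 5-minute bars
latest_df = yf.download(tickers=["QQQ", "NVDA", "AMD", "AAPL", "TSLA", "MSFT"],
                        period='1d', interval='5m')
latest_df = latest_df.stack()
latest_df.index.names = ['date', 'symbol']

# rows already present in "stock_prices" are skipped on save
chipmunkDb.save_as_pandas(latest_df, "stock_prices")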

Now let’s read back the full DataFrame to a new pandas DataFrame.

# let's now read the DataFrame back from chipmunkdb
import time
from chipmunkdb.ChipmunkDb import ChipmunkDb
chipmunkDb = ChipmunkDb("localhost")

# let's read it back and measure the time
start = time.time()
df = chipmunkDb.collection_as_pandas("stock_prices")
print("Time to load: ", time.time()-start, " seconds")

Time to load: 73 milliseconds

Loading the complete DataFrame takes just 73 ms. Now you have the full DataFrame available and can append a new column.

# let's add a derived column
df["Volume_Quote"] = df["Volume"] * df["Close"]

# let's measure the time it takes to save the DataFrame to chipmunkdb again
start = time.time()
chipmunkDb.save_as_pandas(df, "stock_prices")
print("Time to save: ", time.time()-start, " seconds")

We have added a new column “Volume_Quote” with 28037 rows and saved it back to ChipmunkDB.

Time to save: 72 milliseconds

Now the amazing DuckDB support kicks in

But that's just one of ChipmunkDB's many amazing features. Now let's query only a slice of the DataFrame we created above.

import time
from chipmunkdb.ChipmunkDb import ChipmunkDb
chipmunkDb = ChipmunkDb("localhost")

# let's query only the latest close for the symbols "NVDA" and "AAPL"
start = time.time()
results = chipmunkDb.query("SELECT index_symbol, LAST(Close) FROM stock_prices WHERE index_symbol IN ('AAPL', 'NVDA') GROUP BY index_symbol")
print("Time to query: ", time.time()-start, " seconds")

Time to query: 25 milliseconds

With DuckDB support, you can now run any SQL query on your DataFrames inside ChipmunkDB.
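For example, requirement 3 from the list above (a few columns in a time range) could look like the sketch below. Note that index_date is my assumption for how the 'date' index level is exposed, by analogy with index_symbol in the query above:

# select three columns for one symbol within a time range
# (index_date is assumed by analogy with index_symbol above)
results = chipmunkDb.query(
    "SELECT index_date, Close, Volume FROM stock_prices "
    "WHERE index_symbol = 'TSLA' "
    "AND index_date BETWEEN '2024-03-01' AND '2024-03-08'"
)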

Client libraries are currently available for NodeJS and Python

I have built two client libraries.

The NodeJS client library focuses more on querying data out of ChipmunkDB, because pandas DataFrames are not available in NodeJS.

The Python client library supports the full set of pandas and querying features.

The Python client also supports SQLAlchemy over the network. You can use it, for example, in Superset or other BI systems to query data from your pandas DataFrames. (I will write a story about that in the near future.)
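A rough sketch of what that could look like; the chipmunkdb:// URL scheme, host, and port here are my assumptions, not documented values:

from sqlalchemy import create_engine
import pandas as pd

# hypothetical connection URL; scheme, host, and port are assumptions
engine = create_engine("chipmunkdb://localhost:8091")

# any tool that speaks SQLAlchemy can now query the collections via SQL
df = pd.read_sql("SELECT * FROM stock_prices LIMIT 10", engine)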

Run it in Docker, containerd, or Kubernetes

I have also built container images you can run inside your environment to use ChipmunkDB as your database storage.

Here you can find the helm chart:

Here you can find the docker container:

Roadmap for ChipmunkDB

I know it's a long way to go, but here are some of the features I have planned:

  • Update the client libraries
  • Support JOINs by adding schemas for collections
  • Add webhook support for receiving updates on collections
  • Add authentication support
  • Log-engine support to find changes in collections over time
  • Support for pgAdmin
