Build a free “Duck Lake” on OCI!

Babak Tourani
6 min read · Jan 15, 2023


There’s a small city in Canada called Duck Lake, but the pictures of Duck Lake in Utah are more beautiful, so there you go!

Disclaimer: This post is meant to help me jump on the bandwagon of DuckDB and Data Lakes. No, I have no shame! There, I said it!

In this post, I’ll try to set the scene for building a “Duck Lake”: a “data lake” created using DuckDB and OCI’s Object Storage. You could probably swap OCI’s offering for any other cloud vendor’s object storage, but if you’re looking for “free” and “value for money,” you’d be making a mistake!

Why DuckDB?

I’m not going to patronise you by copy-pasting paragraphs of text that say what DuckDB is. If you’re in the data business and didn’t know about it until now, you should take that as a warning about the prospects of your career! Not because of the hype around DuckDB, but because you have failed to pay attention to your surroundings in an industry that is constantly changing.

Ducks are not chickens! Everybody knows that!

But I dare say this: DuckDB is not for chickens!

It’s not for those data operations that feel so insecure about their own merit that they constantly need to show off the total size of their data [at rest] and rely on its perceived size to justify their own existence!

DuckDB is for those precise operations that understand how much data they need to query, and can therefore work out the right amount of compute resources to allocate for processing. It’s also for those folks conscientious about the energy footprint of crunching data at a time when California is flooded and Davos has no snow to offer the billionaires at the WEF.

I can’t say it better than Hannes Mühleisen (full interview in the embedded link below), co-creator of DuckDB and co-founder of DuckDB Labs:

If we have a planet to protect, you cannot really justify using 32 boxes when you don’t need to…. You should never use a 32-node Spark cluster, let alone a 1,000-node Spark cluster!

DuckDB can be a low-cost/high-value tool for EDA or for use inside data pipelines, especially if your data isn’t coupled to the critical path of a near-real-time monitoring, alerting, or decision-making process.

On the other hand, if the only way for you to generate value from your data is to query large amounts of it, or you have to share large amounts of data for concurrent (ad-hoc or not) queries, DuckDB isn’t for you.
At least, not yet.

Why OCI?

Larry Ellison won’t be winning any popularity contests anytime soon, but you’ll just shoot yourself in the foot if you don’t take up his offer of free cloud resources because of your dogmatic stances against him or Oracle!

I use OCI for my side projects and experiments. OCI gives you a ton of free resources that not only put other cloud vendors to shame but are also [at least at this point] meant to stay #AlwaysFree.

If you decide to become a paying customer, you can use the other free resources in the portfolio without paying anything, as well; for example, unlike any of the “Big 3” cloud vendors, OCI gives you NAT Gateways for free (you’d still pay for data transfer, of course), and you can have a free Kubernetes cluster if you use your #AlwaysFree quota for the worker nodes. Nobody else would give you that!

In this specific case, I’ll be relying on OCI’s Object Storage, which gives you 10 GB of #AlwaysFree space: twice what anyone else gives you (except IBM Cloud with 25 GB, but it won’t give you any decent compute capacity for free). Also, Object Storage comes with an Amazon S3 Compatibility API, which means you can even use the famous boto3 library to access the objects.
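To illustrate that compatibility, here’s a minimal boto3 sketch for listing a bucket’s contents; the endpoint, region, bucket name, and credentials below are hypothetical placeholders you’d replace with your own:

import boto3

# All values below are hypothetical placeholders
s3 = boto3.client(
    "s3",
    region_name="uk-london-1",
    endpoint_url="https://<tenancy-namespace>.compat.objectstorage.uk-london-1.oraclecloud.com",
    aws_access_key_id="<access-key-id>",
    aws_secret_access_key="<secret-access-key>",
)

# List the objects in a bucket, exactly as you would on Amazon S3
for obj in s3.list_objects_v2(Bucket="nyc-tlc-data").get("Contents", []):
    print(obj["Key"], obj["Size"])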

Put that alongside the two free databases in the quota, and it’s perfectly feasible to set up a tiered, queryable storage system.

My Use Case

I’m planning to offload data from my side project into Object Storage in Parquet format and create a standard outlet for accessing the data. My side project uses Streamlit for presentation, so it’s very easy to embed DuckDB in it (I won’t showcase that here in this post), populate the DB using the data in Object Storage, and drive ad-hoc queries using this in-process analytical DB.

DuckDB can query Parquet files natively, and I’ll be using DuckDB’s httpfs extension, which makes it possible to query remote objects, too.
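To show what “natively” means, here’s a minimal sketch that queries a local Parquet file in place, with no import or copy step; `trips.parquet` is a hypothetical file name:

import duckdb

# Open an in-memory database and query the Parquet file directly
con = duckdb.connect()
row_count = con.execute(
    "SELECT count(*) FROM read_parquet('trips.parquet')"
).fetchone()[0]
print(row_count)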

All I have to do is place my data in an OCI Object Storage bucket.

Connection Details

Querying Parquet files on OCI using DuckDB is similar to the process for querying objects on Google Cloud Storage. You need three elements:

  • s3_endpoint
  • s3_access_key_id
  • s3_secret_access_key

S3 endpoints for OCI are defined as:

<OCI-Tenancy-Namespace>.compat.objectstorage.<OCI-Region>.oraclecloud.com

The OCI tenancy namespace can be retrieved from the OCI console

To find the OCI tenancy namespace, click “Profile” (top right), then “Tenancy”. You’ll find the namespace under “Object storage settings”
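Putting those pieces together, the endpoint string could be assembled like this (the namespace value is a hypothetical placeholder):

# Hypothetical values: replace with your own tenancy namespace and region
OCI_NAMESPACE = "mynamespace"
OCI_REGION = "uk-london-1"

s3_endpoint = f"{OCI_NAMESPACE}.compat.objectstorage.{OCI_REGION}.oraclecloud.com"
print(s3_endpoint)  # mynamespace.compat.objectstorage.uk-london-1.oraclecloud.com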

The S3 access key ID and secret access key can be generated/extracted using the method described by Shadab Mohammad (see step 1).

Sample Code

I’ve used a piece of sample data from the NYC TLC for yellow taxi trip records in January 2022, created a bucket in OCI Object Storage called `nyc-tlc-data`, and uploaded the file `yellow_tripdata_2022-01.parquet` to it.

import duckdb

# Create an in-memory database
cursor = duckdb.connect()

# Install and load the httpfs extension
# for reading remote data
cursor.execute("INSTALL httpfs;")
cursor.execute("LOAD httpfs;")

# Define Connection to OCI Object Storage
cursor.execute("SET s3_endpoint='lrvgzjrvugg2.compat.objectstorage.uk-london-1.oraclecloud.com'")
cursor.execute("SET s3_access_key_id='2b346...dd30f'")
cursor.execute("SET s3_secret_access_key='xbbjP...x3i0='")
cursor.execute("SET s3_region='uk-london-1'")
cursor.execute("SET s3_url_style='path'")
cursor.execute("SET s3_use_ssl=true")

# Load the remote Parquet data into a table.
cursor.execute("""
CREATE TABLE yellow_trips_202201 AS
SELECT * FROM read_parquet('s3://nyc-tlc-data/yellow_tripdata_2022-01.parquet')
""")

# Query the table
duck_df = cursor.execute("""
SELECT
passenger_count
, count(*) as no_of_trips
FROM yellow_trips_202201
WHERE passenger_count >= 1
GROUP BY
passenger_count
ORDER BY
passenger_count
""").df()

print(duck_df.head())

Results:

   passenger_count  no_of_trips
0              1.0      1794055
1              2.0       343026
2              3.0        84570
3              4.0        35321
4              5.0        51338

Notes

  • The DuckDB database is created in memory. To persist the data in a local file, the connection to the DuckDB engine should be set up using `cursor = duckdb.connect('<filename>')` (see the sketch after this list).
  • You don’t have to create a table when you read the Parquet data; the same calculation could have been run on the `read_parquet(…)` call directly, as well. However, creating a separate table makes it possible to use the same data in multiple other queries without re-reading the Parquet file.
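Here’s a minimal sketch combining both notes: a file-backed database and a query that hits `read_parquet(…)` directly, with no intermediate table. The database file name is hypothetical, and the httpfs/S3 settings from the sample above are assumed to be applied first:

import duckdb

# Persist to a local file instead of memory
# ('duck_lake.db' is a hypothetical file name)
cursor = duckdb.connect("duck_lake.db")

# Re-apply the INSTALL/LOAD httpfs and SET s3_* statements
# from the sample above before running this query.

# Query read_parquet() directly, without an intermediate table
duck_df = cursor.execute("""
SELECT
passenger_count
, count(*) as no_of_trips
FROM read_parquet('s3://nyc-tlc-data/yellow_tripdata_2022-01.parquet')
GROUP BY
passenger_count
ORDER BY
passenger_count
""").df()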
