S3 Express One Zone and Embedded Databases

Sirsh Amarteifio
Apr 6, 2024


embedded databases on s3 express

This morning I was wondering about using S3 Express with LanceDB and DuckDB. This is documented here and there, but it's not completely obvious how to go about it. However, the possible speed improvements make it worth a few minutes of tinkering. If you want to try it out, this is what I did (after upgrading all my Python libs: boto3, duckdb, lance, lancedb).

In the AWS console there is a new “Directory buckets” tab where you can create an S3 Express bucket, and that much is easy enough. When you create it, take note of the availability zone.
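If you prefer to create the bucket from code rather than the console, a boto3 call along these lines should do it. This is a sketch rather than what I ran, and the bucket name prefix and availability zone ID are placeholders for your own.

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# directory bucket names must end with --<az-id>--x-s3 (use1-az6 in my case)
s3.create_bucket(
    Bucket="my-express-bucket--use1-az6--x-s3",
    CreateBucketConfiguration={
        "Location": {"Type": "AvailabilityZone", "Name": "use1-az6"},
        "Bucket": {"Type": "Directory", "DataRedundancy": "SingleAvailabilityZone"},
    },
)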

DuckDB documents how to set things up, and I added the following to the existing constructor for my DuckDB client. Note the availability zone in the endpoint of the last block.

self._cursor = duckdb.connect(**options)
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY")
if AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY:
    creds = f"""
    SET s3_region='us-east-1';
    SET s3_access_key_id='{AWS_ACCESS_KEY_ID}';
    SET s3_secret_access_key='{AWS_SECRET_ACCESS_KEY}';"""

    # install and load the httpfs extension, then apply the credentials
    self._cursor.execute(
        f"""
        INSTALL httpfs;
        LOAD httpfs;
        {creds}
        """
    )

    # the endpoint embeds the availability zone - check it in the S3 console (az6 here)
    self._cursor.execute(
        f"""
        CREATE SECRET (
            TYPE S3,
            REGION 'us-east-1',
            KEY_ID '{AWS_ACCESS_KEY_ID}',
            SECRET '{AWS_SECRET_ACCESS_KEY}',
            ENDPOINT 's3express-use1-az6.us-east-1.amazonaws.com'
        );
        """
    )
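A quick way to confirm the secret resolves is to run a query against a file on the directory bucket. The parquet path below is a hypothetical placeholder; any object you have already written there will do.

# hypothetical test file on the directory bucket, just to confirm the endpoint and secret work
self._cursor.execute(
    "SELECT count(*) FROM read_parquet('s3://DIR_BUCKET_PREFIX--use1-az6--x-s3/some/test.parquet')"
).fetchall()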

Lance also documents how to configure access. The default storage options have S3 Express turned off, but you can enable it. Here I read a sample CSV, write it as a Lance dataset on the directory bucket, and this works.

import pandas as pd
import lance

data = pd.read_csv('~/Downloads/test_sample.csv')
ds = lance.write_dataset(
    data,
    "s3://DIR_BUCKET_PREFIX--use1-az6--x-s3/db.lance",
    storage_options={
        "region": "us-east-1",
        "s3_express": "true",
    },
)
ds
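Reading the dataset back is an easy way to confirm the write landed. A minimal sketch, assuming the same path and storage options as above:

# open the dataset again from the directory bucket and pull a few rows back
ds = lance.dataset(
    "s3://DIR_BUCKET_PREFIX--use1-az6--x-s3/db.lance",
    storage_options={"region": "us-east-1", "s3_express": "true"},
)
print(ds.count_rows())
print(ds.to_table().to_pandas().head())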

I was digging around the LanceDB docs (this sits on top of Lance, I guess). My code uses LanceDB for some vector search use cases, and I could not immediately see how to get this to work with directory buckets. Luckily, Lance, which is used under the hood, allows the above storage option to be set as an environment variable…

import os
os.environ['aws_s3_express'] = 'true'

With this I can now create LanceDB tables on S3 Express.

import lancedb
import pandas as pd
data = pd.read_csv('~/Downloads/test_sample.csv')

c = lancedb.connect('s3://DIR_BUCKET_PREFIX--use1-az6--x-s3')
c.create_table('db1', data)

And I can see the tables in my S3 directory bucket.
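To double check from code rather than the console, opening the table back up on the same connection works too. A small sketch, assuming the table name created above:

# open the newly created table and read a few rows back to confirm it is there
tbl = c.open_table('db1')
print(tbl.to_pandas().head())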

This is all a case of easy if you know how, but as of this morning I did not know how. And as of the time of writing I am still not sure how to change my code that uses s3fs to work with these buckets seamlessly, but that's a problem for another day, or one I may just sidestep.
