Profiling DuckDB with AWS Lambda

Gary Li
Feb 27, 2023


Mainstream OLAP data analysis engines, such as Redshift and Spark, use a distributed architecture to spread computing tasks across worker nodes. They rely on large clusters of machines to improve data processing throughput. However, if we only need to analyze and process a few gigabytes of data, setting up a big data cluster is overkill: the data fits in the memory of a single host, yet there have been few suitable tools for this case. DuckDB was born for exactly this scenario. It completes data analysis tasks on a single host, in a single process.

DuckDB is inspired by SQLite and is positioned very similarly: DuckDB aims to be to OLAP what SQLite is to OLTP. Indeed, DuckDB's early slogan was "DuckDB — The SQLite for Analytics". It remains to be seen whether DuckDB can achieve the same popularity as SQLite. As the project documentation states:

“There are many Database Management Systems (DBMS). However, there is no one-size-fits-all database system. All of these require different trade-offs to better suit specific use cases. DuckDB is no exception.”

DuckDB has rich OLAP functionality and powerful single-machine performance, including:

  1. Full SQL syntax support.
  2. A columnar, vectorized query execution engine.
  3. A rich set of data analysis functions.
  4. A portable single-file storage format, allowing compute and storage to be separated.
  5. ACID transaction support.
  6. A query engine implemented in C++.
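To get a feel for what this looks like in practice, here is a minimal sketch of querying a local Parquet file with DuckDB from Python. The file name trips.parquet is a placeholder, not one of the files used later in this article.

import duckdb

# Open an in-memory database; no server or cluster is required.
con = duckdb.connect(database=':memory:')

# Query a Parquet file directly with plain SQL.
# 'trips.parquet' is a placeholder path.
rows = con.execute(
    "SELECT passenger_count, count(*) AS trips "
    "FROM read_parquet('trips.parquet') "
    "GROUP BY passenger_count ORDER BY passenger_count"
).fetchall()
print(rows)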

After learning about the capabilities of DuckDB, I attempted to deploy it on AWS Lambda. The process was smoother than expected. This article summarizes the entire deployment process and demonstrates how to build a lightweight OLAP engine using the flexibility of serverless and the data processing capabilities of DuckDB. Finally, the performance and limits of DuckDB on Lambda will be tested using NYC Taxi trip data from 2020 and 2021.

Create a Lambda Layer:

Lambda Layers provide a convenient way to package libraries and other dependencies that can be used with Lambda functions. Using layers can reduce the size of the deployed archive and speed up the deployment of the code.

A layer is a .zip file archive that can contain code or data. Layers can include libraries, custom runtimes, data, or configuration files. Layers facilitate code sharing and separation of responsibilities so that you can iterate faster on writing business logic. Because DuckDB depends on native code libraries, it must be compiled and built on a Linux development machine to ensure binary compatibility with Amazon Linux. Therefore, we need to create the Lambda Layer on Amazon Linux.

For detailed instructions, please refer to: https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html

Step 1: Install and create a compressed package.

pip install duckdb -t python/
zip -r duckdb.zip python
aws s3 cp duckdb.zip s3://<s3_bucket>

Step 2: Publish the Lambda Layer.

aws lambda publish-layer-version --layer-name duckdb \
    --content S3Bucket=<s3_bucket>,S3Key=duckdb.zip \
    --compatible-runtimes python3.7

Create Lambda:

You can create the Lambda function either in the console or with the command line. The function is configured with 10 GB of memory, and you need to attach the Lambda Layer created in the previous step. The whole process is straightforward, so the specific steps are not elaborated here.
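If you prefer the command line, the following is a rough boto3 sketch of what creating the function could look like; the function name, role ARN, layer ARN, and code location are placeholders rather than values from this article.

import boto3

lam = boto3.client('lambda')

# All ARNs and bucket names below are placeholders; replace with your own.
response = lam.create_function(
    FunctionName='duckdb-query',
    Runtime='python3.7',
    Role='arn:aws:iam::<account_id>:role/<lambda_execution_role>',
    Handler='lambda_function.lambda_handler',
    Code={'S3Bucket': '<s3_bucket>', 'S3Key': 'lambda_function.zip'},
    MemorySize=10240,   # 10 GB of memory, as configured in this article
    Timeout=900,        # the maximum Lambda timeout
    Layers=['arn:aws:lambda:<region>:<account_id>:layer:duckdb:1'],
)
print(response['FunctionArn'])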

Write the function code: The Lambda function supports three input parameters:

  1. A list of Parquet files.
  2. The query statement to execute.
  3. The output path for the result.

This article focuses on the core processing capability of DuckDB; readers can consider wrapping the function in an HTTP interface using API Gateway for easier use. The function code mainly implements the following logic:

  1. Install and load the httpfs extension.
  2. Initialize the AWS credentials.
  3. Map the input files to the view input_table.
  4. Execute the query statement and store the result in an output table.
  5. Save the output table to the designated path.

The Python source code is as follows:

import json
import os

def lambda_handler(event, context):
    import duckdb
    env = os.environ
    # start an in-memory database
    con = duckdb.connect(database=':memory:')
    home_directory = "/tmp/duckdb/"
    if not os.path.exists(home_directory):
        os.mkdir(home_directory)
    # 1. install and load the httpfs extension
    con.execute(f"SET home_directory='{home_directory}';INSTALL httpfs;LOAD httpfs;")
    # 2. initialize AWS credentials
    con.execute(f"SET s3_region='{env['AWS_REGION']}';")
    con.execute(f"SET s3_access_key_id='{env['AWS_ACCESS_KEY_ID']}';")
    con.execute(f"SET s3_secret_access_key='{env['AWS_SECRET_ACCESS_KEY']}';")
    con.execute(f"SET s3_session_token='{env['AWS_SESSION_TOKEN']}';")
    # 3. map the input files to the view input_table
    tbs = ','.join(["'" + file + "'" for file in event['input_files']])
    sql = "CREATE VIEW input_table AS SELECT * FROM read_parquet([" + tbs + "]);"
    print(sql)
    con.execute(sql)
    # 4. store the query result in an output table
    query = event['query']
    con.execute(f"CREATE TABLE output_table AS {query};")
    output_file = event['output_file']
    # 5. copy the output table to the S3 bucket
    con.execute(f"COPY output_table TO '{output_file}' (FORMAT PARQUET);")
    return {
        'statusCode': 200,
        'body': json.dumps('Query Executed!')
    }

Data Preparation:

I obtained the test data for yellow taxi trips from 2020 to 2021 from the AWS Marketplace data set, “New York City Taxi and Limousine Commission (TLC) Trip Record Data”. It is recommended to copy the data to the same region as your Lambda function to reduce data transfer time.

File names: yellow_tripdata_2020-01.parquet through yellow_tripdata_2021-12.parquet
Total record count: 55,228,574
Number of files: 24
File size: approximately 900 MB

Sample data:

{
  "VendorID": 1,
  "tpep_pickup_datetime": "2020-01-01T00:28:15.000Z",
  "tpep_dropoff_datetime": "2020-01-01T00:33:03.000Z",
  "passenger_count": 1,
  "trip_distance": 1.2,
  "RatecodeID": 1,
  "store_and_fwd_flag": "N",
  "PULocationID": 238,
  "DOLocationID": 239,
  "payment_type": 1,
  "fare_amount": 6,
  "extra": 3,
  "mta_tax": 0.5,
  "tip_amount": 1.47,
  "tolls_amount": 0,
  "improvement_surcharge": 0.3,
  "total_amount": 11.27,
  "congestion_surcharge": 2.5
}
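As a quick sanity check, the record count above can be verified directly with DuckDB before involving Lambda at all. The bucket and prefix below are placeholders, and the S3 credentials must be set the same way as in the Lambda code above.

import duckdb

con = duckdb.connect(database=':memory:')
con.execute("INSTALL httpfs; LOAD httpfs;")
# Set s3_region, s3_access_key_id, s3_secret_access_key (and s3_session_token
# if needed) here, exactly as in the Lambda function above.

# '<s3_bucket>/nyc-taxi/' is a placeholder prefix for the copied files.
count = con.execute(
    "SELECT count(*) FROM read_parquet("
    "'s3://<s3_bucket>/nyc-taxi/yellow_tripdata_*.parquet')"
).fetchone()[0]
print(count)  # expected to match the 55,228,574 records reported above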


Function testing:

Creating test input:
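The original screenshot of the test event is not reproduced here, but based on the function code above, a test invocation could look roughly like the sketch below. The function name, bucket, and file list are placeholders.

import json
import boto3

lam = boto3.client('lambda')

# Hypothetical test event; the bucket and file names are placeholders.
event = {
    "input_files": [
        "s3://<s3_bucket>/nyc-taxi/yellow_tripdata_2020-01.parquet",
        "s3://<s3_bucket>/nyc-taxi/yellow_tripdata_2020-02.parquet"
    ],
    "query": "select * from input_table where passenger_count > 2",
    "output_file": "s3://<s3_bucket>/output/result.parquet"
}

response = lam.invoke(
    FunctionName='duckdb-query',   # placeholder name for the function created above
    Payload=json.dumps(event).encode('utf-8')
)
print(response['Payload'].read())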

Lambda execution output:

Duration: 1533.82 ms
Billed Duration: 1534 ms
Memory Size: 10240 MB
Max Memory Used: 155 MB

Data output:

Performance Testing:

In the performance testing experiment, we prepared 5 query statements and gradually increased the number of input files to observe how the execution time changed.

Query 1:

select * from input_table

No data filtering; this merges the input files into one large data file and writes it to S3.

Query 2:

select * from input_table where passenger_count >2

Keeps about 17% of the rows and writes them to S3.

Query 3:

select * from input_table where passenger_count >3

Keeps about 11% of the rows and writes them to S3.

Query 4:

select strftime(tpep_pickup_datetime, '%d/%m/%Y') pickup_date ,
count(*) trip_count, sum(passenger_count) total_passenger,
sum(total_amount) amount
from input_table
group by strftime(tpep_pickup_datetime, '%d/%m/%Y')

Calculates daily passenger counts and fare totals across the 24 input files and writes the summary data to S3; the output is 17.1 KB.

Test results:

Summary:

For Query 4, the performance of DuckDB on Lambda is satisfactory. For roughly 900 MB of data, about 55 million records across 24 Parquet files, it completes the daily summary statistics within 5 seconds.

From Queries 2 and 3, we can see the limitations of DuckDB. Because it processes data in a single process, query duration grows linearly with data volume, whereas other OLAP databases can spread the workload across a distributed architecture.

In Query 1, when the output reaches about 25 million rows, with a data file size of roughly 700 MB, the Lambda function reports an error and cannot complete. Since DuckDB writes its result as a single Parquet file, the size of the output significantly affects whether the job succeeds.

Future work could continue probing these limits. DuckDB is still a relatively new engine, and there are no large-scale production deployment cases in the industry yet, so readers should test DuckDB against their own scenarios. Even so, its positioning and performance have inspired many practitioners, and we welcome everyone to exchange ideas and explore together.
