Process Hundreds of GB of Data with DuckDB in the Cloud

Scale out your workload to a big VM in the cloud.

Patrick Hoefler
Coiled
Aug 7, 2023 · 4 min read


Code snippet of using the coiled.function decorator to run a query with DuckDB on a large VM in the cloud

DuckDB is a great tool for running efficient queries on large datasets. When you want your compute close to your data in the cloud, or simply need more RAM, Coiled makes it easy to run your Python function on a cloud VM. In this post we’ll use Coiled Functions to process the 150 GB Uber-Lyft dataset on a single machine with DuckDB.

Query Parquet data with DuckDB

We start by writing the SQL queries that we want to run against the data, first locally.

def load_data(conn):
    # Load data into memory so that subsequent queries are fast
    conn.execute(
        '''
        CREATE TABLE test AS
        SELECT * FROM read_parquet("s3://coiled-datasets/uber-lyft-tlc/*")
        '''
    )


def compute_percentage_of_tipped_rides(conn):
    # Run the actual query
    return conn.execute(
        '''
        SELECT hvfhs_license_num, sum(tipped) / count(tipped)
        FROM (
            SELECT *, CASE WHEN tips > 0.0 THEN 1 ELSE 0 END AS tipped
            FROM test
        ) GROUP BY hvfhs_license_num
        '''
    ).fetchall()


def create_conn():
    import duckdb

    return duckdb.connect()


def query_results():
    conn = create_conn()
    load_data(conn)
    return compute_percentage_of_tipped_rides(conn)

These queries aren’t particularly fancy; they are meant to illustrate how we can process these files. Executed as is, they would pull all of the data onto our local machine. The whole dataset won’t fit into memory on most workstations, and even if it did, loading it over the network would take a long time. Let’s look at how Coiled can make this work.
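If you want a cheap local sanity check first, you can query the Parquet metadata without materializing the table. The sketch below is exactly that, a sketch: it assumes DuckDB’s httpfs extension is available and that the public bucket allows anonymous reads.

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL httpfs")
conn.execute("LOAD httpfs")
conn.execute("SET s3_region='us-east-2'")

# count(*) is answered mostly from Parquet metadata, so it is far cheaper
# than the CREATE TABLE above, which would stream the full 150 GB.
n_rows = conn.execute(
    "SELECT count(*) FROM read_parquet('s3://coiled-datasets/uber-lyft-tlc/*')"
).fetchone()[0]
print(f"{n_rows:,} rows")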

Query S3 data in the cloud with DuckDB + Coiled serverless functions

Coiled Functions come into the equation because we need access to machines that have enough resources and are also close to our data. Coiled can connect to your AWS or GCP account and use the resources available there. We will go through the steps necessary to execute these queries on a VM that sits in the same region as our data and has enough memory available.

We’ll have to adapt our create_conn function to load our AWS credentials via load_aws_credentials from the DuckDB AWS extension.

def create_conn():
    import duckdb

    conn = duckdb.connect()
    conn.execute("CALL load_aws_credentials()")
    return conn
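If the credentials call does not pick up the right region for you, you can additionally pin it through DuckDB’s httpfs setting. This is an optional addition, not part of the original example; the region value simply matches where our data lives:

# Optional: set the S3 region explicitly via DuckDB's httpfs setting
conn.execute("SET s3_region='us-east-2'")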

The next step is adding the @coiled.function decorator to the function that executes our queries. The decorator tells Coiled to spin up a large VM on AWS, run the query there, and return the result back to our local machine.

import coiled


@coiled.function(
    vm_type="m6i.16xlarge",  # 256 GB of RAM
    region="us-east-2",      # region of our data
    keepalive="5 minutes",   # keep the VM alive to run multiple queries if necessary
)
def query_results():
    conn = create_conn()
    load_data(conn)
    return compute_percentage_of_tipped_rides(conn)

Let’s execute our queries and pull the results back to our local machine:

result = query_results()

print(result)
[
    ('HV0005', 0.1912300216459857),
    ('HV0003', 0.1498555901186066),
    ('HV0004', 0.09294857737045926),
    ('HV0002', 0.08440046492889111),
]
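The raw tuples are easy to turn into readable percentages locally, for example:

for license_num, fraction in result:
    print(f"{license_num}: {fraction:.1%} of rides were tipped")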

The data is now all in memory on our VM in the cloud:

Plot from the Coiled dashboard: memory usage climbs to roughly 150 GB as we load the dataset on the big VM in the cloud.

There is no need to adjust the other functions. Coiled will run our query on a VM in the cloud with enough resources and close to our data.

Let’s take a brief look at the arguments to coiled.function():

  • vm_type: the type of AWS EC2 instance to start. We need an instance with enough memory to hold our data; an m6i.16xlarge has 256 GB of RAM, which comfortably fits the roughly 150 GB dataset.
  • region: the AWS region the VM is started in. Our data is also stored in "us-east-2", so we avoid cross-region traffic.
  • keepalive: keeps the VM alive after the script finishes so that we can run multiple queries against the data in memory.

coiled.function() will now start a VM in AWS with the specified EC2 instance type. The VM is normally up and running within 1-2 minutes. Coiled scans our local environment and replicates the same dependencies on that machine, so we don't have to specify an explicit Python environment. Inputs to our function are serialized and sent to the VM as well, and Coiled returns the results back to our local machine.
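Because arguments are serialized and shipped along, the decorated function can also be parameterized. Below is a hypothetical variation, not part of the original example, that computes the tip percentage for a single license number taken from the output above; it relies on DuckDB's prepared-statement parameters:

@coiled.function(vm_type="m6i.16xlarge", region="us-east-2")
def tip_percentage_for(license_num):
    conn = create_conn()
    load_data(conn)
    # "?" is filled in from the parameter list passed to execute()
    return conn.execute(
        '''
        SELECT sum(CASE WHEN tips > 0.0 THEN 1 ELSE 0 END) / count(*)
        FROM test
        WHERE hvfhs_license_num = ?
        ''',
        [license_num],
    ).fetchone()[0]


print(tip_percentage_for("HV0003"))  # roughly 0.15, matching the result above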

Coiled would normally shut the VM down as soon as the Python interpreter finishes, mostly to reduce costs. We specified keepalive="5 minutes" to keep the VM alive for a few minutes after our interpreter exits. This ensures that new local runs can connect to the same VM, avoiding the boot time of up to 2 minutes; we call this a warm start.
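A rough way to see the warm start in action is to time two consecutive runs (the timings printed are whatever you measure; no numbers are implied here):

import time

start = time.time()
query_results()  # cold start: boots the VM (~1-2 minutes) before running the query
print(f"first run: {time.time() - start:.0f}s")

start = time.time()
query_results()  # within the keepalive window: reuses the already-running VM
print(f"second run: {time.time() - start:.0f}s")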

Conclusion

You can use Coiled serverless functions to run queries on a machine with as much memory as you need. This gives you access to computational resources that sit close to your data and makes data processing in the cloud straightforward.

Want to run this example yourself? Get started with Coiled for free at coiled.io/start. This example runs comfortably within the free tier.

You can also check out the docs or take a look at how to use Coiled serverless functions to run the same query using Polars.


Patrick Hoefler
Coiled

Senior Software Engineer at Coiled, pandas core developer, studying toward an MSc in Software Engineering at the University of Oxford