Coiled

Coiled helps you use Python on the cloud easily and efficiently.

Process Hundreds of GB of Data in the Cloud with Polars

Coiled
4 min readNov 20, 2023

Code snippet of using the coiled.function decorator to run a query with Polars on a large VM in the cloud.

Query S3 data locally with Polars

import polars as pl

def load_data():
    # Define the S3 path and storage options. scan_parquet builds a lazy
    # query, so no data is read until .collect() is called.
    s3_path = 's3://coiled-datasets/uber-lyft-tlc/*'
    storage_options = {"aws_region": "us-east-2"}
    return pl.scan_parquet(s3_path, storage_options=storage_options)

def compute_percentage_of_tipped_rides(lazy_df):
    # Compute the share of rides with a tip for each license number
    # (a fraction between 0 and 1)
    result = (
        lazy_df.with_columns((pl.col("tips") > 0).alias("tipped"))
        .group_by("hvfhs_license_num")
        .agg([
            (pl.sum("tipped") / pl.count("tipped")).alias("percentage_tipped")
        ])
    )
    # collect() executes the lazy query and materializes the result
    return result.collect()

def query_results():
    # Load and process the dataset to compute the results
    lazy_df = load_data()
    return compute_percentage_of_tipped_rides(lazy_df)
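
To make the aggregation concrete, here is a minimal plain-Python sketch of the same per-license calculation on a handful of made-up rows. The helper name `fraction_tipped_by_license` and the sample rows are hypothetical and only for illustration. Only the column names `hvfhs_license_num` and `tips` come from the query above. The sketch assumes every row has a `tips` value, and like the Polars expression it returns a fraction between 0 and 1 rather than a percentage out of 100.

```python
from collections import defaultdict

def fraction_tipped_by_license(rows):
    """Return {license_num: fraction of rides with tips > 0}."""
    tipped = defaultdict(int)  # rides with a tip, per license
    total = defaultdict(int)   # all rides, per license
    for row in rows:
        key = row["hvfhs_license_num"]
        total[key] += 1
        if row["tips"] > 0:
            tipped[key] += 1
    return {key: tipped[key] / total[key] for key in total}

# Hypothetical sample rows, not taken from the real dataset
sample_rows = [
    {"hvfhs_license_num": "HV0003", "tips": 2.5},
    {"hvfhs_license_num": "HV0003", "tips": 0.0},
    {"hvfhs_license_num": "HV0003", "tips": 0.0},
    {"hvfhs_license_num": "HV0003", "tips": 1.0},
    {"hvfhs_license_num": "HV0005", "tips": 0.0},
    {"hvfhs_license_num": "HV0005", "tips": 3.0},
]

print(fraction_tipped_by_license(sample_rows))
```

On the real dataset this loop would be far too slow and would not fit in memory on a laptop, which is why the work is pushed down to Polars.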

Query S3 data in the cloud with Coiled serverless functions

import coiled

@coiled.function(
    vm_type="m6i.4xlarge",  # VM with 64 GiB RAM
    region="us-east-2",  # AWS region to match the data location
    keepalive="5 minutes",  # Keep VM alive for potential subsequent queries
)
def query_results():
    # Load and process the dataset to compute the desired metric
    lazy_df = load_data()
    return compute_percentage_of_tipped_rides(lazy_df)

# Calling the decorated function runs it on the cloud VM and returns the result locally
result = query_results()
print(result)
[
('HV0002', 0.08440046492889111),
('HV0003', 0.1498555901186066),
('HV0004', 0.09294857737045926),
('HV0005', 0.1912300216459857),
]

What just happened

[Figure: plot from the Coiled dashboard. Caption: VM memory utilization increasing as the computation runs.]

Conclusion
