Serverless Spatial Analysis with DuckDB and AWS Lambda — Part 1 “Making it work”.

Nabil Servais
7 min read · Feb 17, 2023


And it’s not expensive.

Geospatial analysis is not limited to maps; it shows up every day in many areas of our lives (TODO: examples with illustrations). However, its specificities make storage and computation around spatial functions more complicated: how do you represent a point or a polygon? How do you test whether a point falls inside a polygon, or compute the distance between a point and a polygon? How do you account for the fact that the map is flat while the Earth is an irregular, potato-shaped spheroid?
All the answers to these questions can be found in JTS and its derivatives (such as GEOS). One of those derivatives also made it possible to bring spatial data into PostgreSQL, via the PostGIS extension.
Even OLAP databases now include spatial features (BigQuery, for example, or its competitor Redshift, which is based on PostgreSQL), and very recently DuckDB joined them (building on PostGIS).
An interesting feature of DuckDB is that it is truly serverless, in the same way SQLite is: no server and no infrastructure are needed to use it. In addition, it can now decouple storage from compute, like BigQuery.
DuckDB can query CSV or Parquet files stored locally or in an S3 bucket.
In a context where keeping a PostgreSQL instance running has a real cost, in money and in energy, we will see how to use DuckDB and its Geo extension in a serverless setup.

After an excellent experience with Cloud Run, I felt it was time to explore AWS Lambda.

Disclaimer: my employer Mangrove (an AWS partner) is not involved in this project, and it is entirely possible to replicate this architecture and the work done with Cloud Run (GCP), Azure Functions (Azure) or any other FaaS provider.

The initial data for this project is the French land-value transaction records ("Demandes de Valeurs Foncières"), available as open data. They have been adapted and reformatted by CEREMA so that mapping tools can use them (many thanks to them and to everyone who made this data usable).

The architecture is quite simple: an object store, an AWS Lambda function and an HTTP interface, where a more classical setup would be more complicated and more expensive.

Simplified architecture diagram

Here is what that looks like on AWS:

How can a cloud architecture be this simple?
First of all, DuckDB can be considered the SQLite of OLAP databases: like SQLite, it can be embedded in-process (including fully in memory) or used as a standalone tool. I really encourage you to read the documentation and to look at the current state of DuckDB.
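As a quick illustration, here is a minimal sketch of embedded usage with the duckdb Node.js package (the same package used in the Lambda later on):

// Minimal embedded usage: an in-memory DuckDB database, no server process,
// nothing on disk unless you ask for it.
const duckdb = require('duckdb');

const db = new duckdb.Database(':memory:');
db.all('SELECT 42 AS answer;', (err, rows) => {
  if (err) throw err;
  console.log(rows); // [ { answer: 42 } ]
});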

DuckDB, like most other OLAP databases, can query Parquet or CSV files (in addition to its internal storage format, but I'll test that another time).
This also has an impact on cost: an RDS database to perform this kind of operation would cost $189, not to mention the ecological impact (the database server needs to be on constantly).

DuckDB’s Geo extension is largely based on PostGIS, which allows many of the functions available in PostGIS to be reused.

The extension compiles quite easily (just like DuckDB itself); however, it is still under active development, so compile bugs are not out of the question: do not hesitate to give feedback.
Before building the application, the first thing to check is whether the spatial analysis itself works.
First step: converting the gpkg files to Parquet. As mentioned before, DuckDB only reads Parquet, CSV and JSON files (SQLite is supported through an extension). Fortunately, GDAL, the Swiss army knife of map files, natively supports the Parquet format.
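A minimal sketch of that conversion step, shelling out to GDAL's ogr2ogr from Node.js (the file names are assumptions, and it needs a GDAL build recent enough to include the Parquet driver):

// Convert the GeoPackage to Parquet by invoking ogr2ogr.
const { execFileSync } = require('node:child_process');

execFileSync('ogr2ogr', [
  '-f', 'Parquet',       // output driver
  'mutations.parquet',   // destination file (name assumed)
  'mutations.gpkg',      // source GeoPackage (name assumed)
]);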

Now that the file is ready, you can explore the data on your own machine without bringing out the heavy artillery.
DuckDB has the good idea of signing its extensions; however, Geo is not yet an official extension, so the signature checks have to be disabled.
Queries can then be run directly against the Parquet file.
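For example, a minimal sketch with the duckdb Node.js package (the extension path and the file name are assumptions):

const duckdb = require('duckdb');

// Unsigned extensions must be allowed before loading the locally built Geo extension.
const db = new duckdb.Database(':memory:', { allow_unsigned_extensions: 'true' });

db.run("LOAD './geo.duckdb_extension';", (err) => {
  if (err) throw err;
  // Query the Parquet file in place, no import needed.
  db.all("SELECT count(*) AS nb_mutations FROM 'mutations.parquet';", (err, rows) => {
    if (err) throw err;
    console.log(rows);
  });
});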

Or in memory, after importing it.
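Something like this, reusing the db handle from the previous snippet (the table name is an assumption):

// Import the Parquet file into an in-memory table, then query the table.
db.run("CREATE TABLE mutations AS SELECT * FROM 'mutations.parquet';", (err) => {
  if (err) throw err;
  db.all('SELECT count(*) AS nb_mutations FROM mutations;', (err, rows) => {
    if (err) throw err;
    console.log(rows);
  });
});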

Computing the average price of properties, along with other indicators, as a function of the distance to a point is quite easy (it could also be done against a polygon).
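Here is a sketch of the kind of query involved, using the same geo functions and column names as the Lambda code further down (the coordinates and the 1000 cut-off are placeholders, and the distance is assumed to come back in metres):

// Average sale price per property type within roughly 1 km of a point.
const point = JSON.stringify({ type: 'Point', coordinates: [2.3522, 48.8566] });

db.all(
  `SELECT libtypbien,
          count(*) AS nb_biens,
          ROUND(AVG(valeurfonc), 2) AS prix_moyen
   FROM 'mutations.parquet'
   WHERE ST_DISTANCE(?, ST_GEOGFROMWKB(geom)) < 1000
   GROUP BY libtypbien;`,
  point,
  (err, rows) => {
    if (err) throw err;
    console.log(rows);
  }
);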
All that remains is to upload the file to S3 and we are ready.
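Once the file is on S3, DuckDB can read it in place through the httpfs extension. A minimal sketch (the region, bucket and file name are placeholders; the credential settings are only needed if the bucket is not public, and here they are assumed to be available as environment variables):

// Load httpfs, point DuckDB at the bucket, then query the Parquet file over S3.
const onErr = (err) => { if (err) throw err; };

db.run("LOAD './httpfs.duckdb_extension';", onErr);
db.run("SET s3_region='eu-west-1';", onErr);
db.run(`SET s3_access_key_id='${process.env.AWS_ACCESS_KEY_ID}';`, onErr);
db.run(`SET s3_secret_access_key='${process.env.AWS_SECRET_ACCESS_KEY}';`, onErr);

db.all(
  `SELECT count(*) AS nb_mutations FROM 's3://${process.env.S3_BUCKET}/immobiliers/mutations_d75.parquet';`,
  (err, rows) => { if (err) throw err; console.log(rows); }
);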
As simple as the analysis part was, the compilation for AWS Lambda will be much less so.
First try — Python layer (spoiler: it was a failure)
AWS Lambda may be a serverless service, but it still needs an OS to run on. And when that OS is based on a slightly dated Linux distribution, packaging the application becomes more complex.
For reasons that escape me, building the layer proved impossible: the function could never find its libraries. I think I'll come back to this in more detail (and in English this time).
Second try — What about Go or Rust?
I could have gone with Go or Rust, two excellent and efficient languages, but I chose to play it safe and pick a language officially supported by the DuckDB team.
Now I have an application able to execute DuckDB queries in an AWS Lambda. I also need to compile the extension for Amazon Linux 2; since the dependencies are already available there, the compilation goes through (if any AWS Lambda developers are reading this, please upgrade to Amazon Linux 2022).
Shortly after my success, a user pointed out the existence of this repo, which would have saved me several hours of work (fortunately the MacBook M1 is really powerful).

The article from the same author also goes deeper into how to deploy DuckDB on AWS Lambda.

Remember to strip the extension binaries before packaging them with the function, copy everything into the function's directory, and you are ready.

For the request handling, nothing very magical: just JavaScript code, written quickly and badly to prove that it works.

First, we instantiate our libraries.

const AWSXRay = require('aws-xray-sdk-core'); // AWS X-Ray SDK (tracing)
const duckdb = require('duckdb');
const collect = require('@turf/boolean-intersects'); // Turf helper, not used in this first version
// Allow unsigned extensions so the locally compiled Geo extension can be loaded.
const db = new duckdb.Database(':memory:', { allow_unsigned_extensions: 'true' });

We configure DuckDB.

const bucket = process.env.S3_BUCKET;

// Basic tuning: cap the number of threads and point the temp directory
// at /tmp, the only writable path inside a Lambda.
db.run(`SET threads TO 4;`, function (err, res) { if (err) { throw err; } console.log(res); });
db.run(`PRAGMA temp_directory='/tmp/'`, function (err, res) { if (err) { throw err; } console.log(res); });

// Load the extensions packaged alongside the function code.
const conn = new duckdb.Connection(db);
conn.run(`LOAD './geo.duckdb_extension';`, function (err, res) { if (err) { throw err; } console.log(res); });
conn.run(`LOAD './parquet.duckdb_extension';`, function (err, res) { if (err) { throw err; } console.log(res); });
conn.run(`LOAD './httpfs.duckdb_extension';`, function (err, res) { if (err) { throw err; } console.log(res); });

To "speed up" the search, the departments are loaded into a table that acts as an index, but I will come back to this in part II ("Making it fast").

conn.run(
  "CREATE TABLE index_dep AS SELECT * EXCLUDE geometry, ST_GEOGFROMWKB(geometry) AS geom FROM 's3://" + bucket + "/immobiliers/departements.parquet'",
  function (err, res) { if (err) { throw err; } console.log(res); }
);

And finally, we assemble our query in JavaScript (I promise, if this post gets more than 10 likes, I'll start using TypeScript).

let response;

exports.handler = async (event, context) => {
  try {
    let r = [];
    const body = JSON.parse(event.body);
    console.log("EVENT: \n" + JSON.stringify(event));
    console.log(body);

    // The request carries a GeoJSON FeatureCollection and a search distance.
    const fc = body.feature;
    const distance = body.distance;
    console.log(distance);

    // 1. Find the department containing the submitted point.
    let query_dep = "SELECT code FROM index_dep WHERE ST_CONTAINS(geom, ?)";
    const stream_dep = conn.stream(query_dep, JSON.stringify(fc.features[0].geometry));
    for await (const row of stream_dep) {
      console.log(row);
      r.push(row);
    }
    if (r.length == 0) return { statusCode: 200, body: JSON.stringify({ results: 0 }) };
    const num_dep = r[0].code;
    console.log(num_dep);
    r = [];

    // 2. Aggregate the sales within the requested distance of the point,
    //    reading the department's Parquet file straight from S3.
    let query_bien = `WITH biens as (
        SELECT libtypbien, anneemut,
               ST_PERIMETER(ST_GEOGFROMWKB(geom), true) AS surface,
               valeurfonc,
               ST_DISTANCE(?, ST_GEOGFROMWKB(geom)) as dist,
               ? as distance,
        FROM 's3://` + bucket + `/immobiliers/mutations_d` + num_dep + `.parquet'
        WHERE dist < distance)
      SELECT libtypbien AS "type de bien",
             anneemut AS "année de vente",
             count(1) as "nombre de biens",
             ROUND(SUM(surface), 2) AS "surface totale",
             ROUND(AVG(valeurfonc), 2) AS "valeur fonciere moyenne",
             ROUND(MAX(valeurfonc), 2) AS "valeur fonciere max",
             ROUND(MIN(valeurfonc), 2) AS "valeur fonciere min",
             MIN(dist), MAX(dist),
      FROM biens GROUP BY all ORDER BY "nombre de biens";`;
    console.log(fc.features);
    const stream = conn.stream(query_bien, JSON.stringify(fc.features[0].geometry), distance);
    for await (const row of stream) {
      console.log(row);
      r.push(row);
    }

    response = {
      statusCode: 200,
      body: JSON.stringify({ results: r }),
    };
  } catch (err) {
    console.log(err);
    return err;
  }
  return response;
};

Now that I have an API to query my data, I need a user interface. Fortunately there are good map providers outside of Google Maps, and the choice fell on MapTiler, which supports MapLibre, an open-source web map renderer.

AND THERE IT IS: you can click on the map and get an analysis of property prices within a given distance (I think I'll add polygon selection soon enough).

https://gnopdata-public.s3.eu-west-1.amazonaws.com/index.html
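On the client side, the wiring is roughly as follows (a sketch, not the actual page: the style URL, API key, endpoint URL and default distance are placeholders; the request body mirrors what the Lambda handler above expects):

// Minimal MapLibre GL JS client: a MapTiler basemap, and on click the point
// plus a distance are POSTed to the Lambda endpoint.
const map = new maplibregl.Map({
  container: 'map',
  style: 'https://api.maptiler.com/maps/streets/style.json?key=YOUR_MAPTILER_KEY',
  center: [2.35, 48.85],
  zoom: 11,
});

map.on('click', async (e) => {
  const body = {
    distance: 1000, // search distance around the clicked point (assumed metres)
    feature: {
      type: 'FeatureCollection',
      features: [{
        type: 'Feature',
        properties: {},
        geometry: { type: 'Point', coordinates: [e.lngLat.lng, e.lngLat.lat] },
      }],
    },
  };
  const res = await fetch('https://YOUR_LAMBDA_URL', {
    method: 'POST',
    body: JSON.stringify(body),
  });
  console.log(await res.json()); // aggregated price statistics
});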

Feel free to use it and give feedback.
