Unlocking the power of DuckDB for geospatial analysis

Ageospatial
5 min readJan 17, 2024

--

Managing large amounts of spatial data is a challenge when performing geospatial analysis. Over the past decade or two, the main way of managing spatial data was by using spatial databases such as ESRI Geodatabase and most notably PostGIS which was conceived in 2001 as an extension of PostgreSQL. Developed by Refractions Research, PostGIS has transformed the way geospatial data was handled and is now commonly cited as the most popular spatial database worldwide. However, 20 years later, DuckDB was introduced to revolutionize and simplify the way we manage and analyze our big data, aiming to provide an advanced, efficient solution for handling large-scale datasets.

This article aims to shed light on DuckDB’s capabilities, particularly through its Spatial Extension, by providing an in-depth look with relevant examples. We also want to show how it could synergize well with datasets queried using our Geo+ GPT, which is currently available on the GPT store. This serves as a proof of concept for the standalone web platform currently in development and planned for Spring 2024, as we want to equip users who want to play around with large datasets with the most powerful tools. Administrative boundaries in the following example have been downloaded using the Geo+ GPT but due to limitations set by OpenAI’s platform, we had to download the building footprints directly from Microsoft Building Footprints.

For those seeking a deeper dive, the spatial extension documentation on DuckDB’s official website, Dr. Qiusheng Wu’s tutorial series, and Chris Holmes’ article are excellent resources.

The article was written to be accessible even for beginners but fundamental knowledge of spatial SQL is a plus.

What is DuckDB?

DuckDB: A new approach to geospatial analysis

DuckDB distinguishes itself as a free, open-source, in-process, embedded, relational, and OLAP (Online Analytical Processing) Database Management System (DBMS). Among its array of extensions, DuckDB introduces a Spatial extension, mirroring the capabilities of PostGIS but considerably improving the speed at which data is queried with its unique columnar approach and has the added benefit of not requiring a server to run. Its columnar data storage approach, which stores data in columns rather than rows, is particularly effective for analytical queries, resulting in faster processing and reduced disk space usage.

https://www.heavy.ai/technical-glossary/columnar-database

Using DuckDB for geospatial analysis

The Spatial Extension for Duckdb leverages libraries such as GDAL to enable geospatial processing. Its ability to handle geodata and GIS formats parallels the functionalities of PostGIS but with the added advantage of utilizing DuckDB’s unique method of processing data.

Case study: Analyzing building data in Singapore

In this section, we’ll explore a practical example of DuckDB’s utility in geospatial data processing. As mentioned before, we will be using two different datasets:

  • The footprint of buildings in Malaysia & Singapore as polygons named buildings. We have 7.28 million polygons in this dataset.
  • Administrative boundaries of Singapore as a multi-polygon named aoi (area of interest).

The objective is to efficiently export all buildings within the aoi to a GeoJSON file using DuckDB. This will be achieved through a two-step SQL query process, leveraging DuckDB’s spatial analysis capabilities and the Python API for execution and benchmarking. SQL queries are used for data Extraction and Export.

1. Selection Query:

SELECT buildings.* FROM buildings, aoi WHERE ST_Within(buildings.geom, aoi.geom)

This SQL command selects all buildings that fall within the specified aoi and stores them in a dataframe named Sbuild (Selected Building). The time required for DuckDB to process this query was 5.35 seconds.

2. Export Query:

COPY Sbuild TO 'build.geojson' WITH (FORMAT GDAL, DRIVER 'GeoJSON')

Here, the selected dataframe of buildings within Singapore is exported to a GeoJSON file. The time needed for this query was 2.38 seconds.

As you can see, the speed at which the two queries were processed, especially considering the size of the dataset, showcases DuckDB’s capability to efficiently handle and export large spatial datasets.

The power of DuckDB for geospatial LLMs

As we come closer to releasing our standalone geospatial generative AI platform, finding new and innovative ways to handle and process large datasets has been one of our core priorities. Through our testing, we have come to understand and recognize the potential of DuckDB. Its integration into our web GIS project aligns with our commitment to deliver a user-friendly interface and empower users to unlock the full potential of geospatial technology without the need for extensive technical expertise.

Learn More and Get Involved

For more details about Geo+ and its features, visit our website or watch our demo video. You can also try Geo+ (requires ChatGPT 4). We welcome your feedback and support, and you can reach out to us through our contact form or email (contact@ageospatial.com). You can also follow us on Linkedin to be sure to obtain every update on our product!

Attributions

DuckDB: Raasveldt, M., & Muehleisen, H. (2024). DuckDB. Retrieved from https://github.com/duckdb/duckdb
ORCIDs:
Mark Raasveldt: https://orcid.org/0000-0001-5005-6844
Hannes Muehleisen: https://orcid.org/0000-0001-8552-0029

Building footprints: https://github.com/microsoft/IdMyPhBuildingFootprints. This data is licensed by Microsoft under the Open Data Commons Open Database License (ODbL).

And all our sources for Geo+: https://www.ageospatial.com/attributions

--

--

Ageospatial

Your Generative-AI compass for Geospatial data analysis.