Quacking Queries in the Azure Cloud with DuckDB

Niels Claeys
Published in datamindedbe · 7 min read · Jan 10, 2024

I have been using DuckDB for over a year, but most of my work has focused on AWS, which is why in this post I want to dive explicitly into the integration with Azure. If you want to understand why and when you should use DuckDB, I recommend my previous blog post.

In this post, I will show how to use 2 different Azure extensions when using Python and dbt. Afterward, I will compare the performance of both extensions when reading from Azure blob storage.

In my job as a data engineer, I typically write data pipelines that transform data residing in a data lake. Integration with blob storage is therefore one of the most important aspects when evaluating a data processing tool like DuckDB. There are 2 ways to interact with Azure blob storage using DuckDB: the fsspec interface or the native duckdb-azure extension.

Duck data input

In order to compare the different extensions, I ran several queries on the TPC-DS data. This is the same dataset I have used in previous blog posts measuring the performance of DuckDB/Spark/… In this case, I generated 100 GB of TPC-DS input data (using scale factor 100) and stored it on Azure blob storage. You can take a look at the steps described in my benchmark repository if you are interested in generating the data yourself.
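If you only want to experiment, DuckDB's own tpcds extension can generate a smaller version of this dataset locally. This is just a sketch and not necessarily how the benchmark repository generates the data:

import duckdb

con = duckdb.connect()
# Generate the TPC-DS tables at scale factor 1 (the benchmark above uses sf=100)
# and export the inventory table to a Parquet file.
con.execute("INSTALL tpcds")
con.execute("LOAD tpcds")
con.execute("CALL dsdgen(sf=1)")
con.execute("COPY inventory TO 'inventory.parquet' (FORMAT PARQUET)")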

Fsspec unplucked

Fsspec is an open-source specification for interacting with filesystems from Python. It integrates out-of-the-box with a wide variety of systems, and there are also many external providers that implement the specification. For example, the 3 major cloud providers are covered by the gcsfs, adlfs, and s3fs implementations. A full list of external implementations can be found here.

Since DuckDB 0.7.0, you can thus integrate with any filesystem that has an implementation of the spec. The Azure implementation is called adlfs, and it supports both generation 1 and generation 2 of Azure Data Lake Storage. Because fsspec is implemented in Python, it can only be used from the Python client of DuckDB. This is not a big issue when writing data pipelines, as these are primarily written in Python anyway.

Using the Python client

First, you need to install the fsspec, adlfs, and duckdb dependencies using pip or another package installer. Now, you are ready to start querying the inventory data in the TPC-DS dataset:

import duckdb
from adlfs import AzureBlobFileSystem

connection = duckdb.connect()
# Register the adlfs filesystem so that DuckDB can resolve abfs:// paths.
# A storage account key is used here; see below for other authentication options.
connection.register_filesystem(AzureBlobFileSystem(
    account_name='<storage-account-name>',
    account_key='<storage-account-key>',
))

print("count the number of records in the inventory")
query = connection.sql('''
SELECT count(*) FROM read_parquet('abfs://inventory/*.parquet')
''')
print(query.fetchall())

To read data from Azure Data Lake Storage generation 2 (or regular blob storage), you need to use the abfs:// or az:// prefix, whereas if you are still using generation 1, you have to use the adl:// prefix. In this example, I use a storage account key to authenticate, which I do not recommend for production pipelines but is fine for the purpose of this blog post. The fsspec library supports all Azure authentication mechanisms, as it uses the Azure Python SDK under the hood; for more details, look here.
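The fsspec approach also makes key-less authentication straightforward. As a minimal sketch (assuming credentials are available in your environment via az login, environment variables, or a managed identity), adlfs falls back to DefaultAzureCredential when you pass anon=False and no explicit key:

import duckdb
from adlfs import AzureBlobFileSystem

# With anon=False and no account key, adlfs falls back to DefaultAzureCredential,
# which picks up az login, environment variables, or a managed identity.
fs = AzureBlobFileSystem(account_name='<storage-account-name>', anon=False)

connection = duckdb.connect()
connection.register_filesystem(fs)
connection.sql("SELECT count(*) FROM read_parquet('abfs://inventory/*.parquet')").fetchall()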

Using dbt-duckdb

I am a big fan of using DuckDB with dbt, as it allows you to write plain SQL instead of Python. Using the fsspec interface in this setup is very similar to what was described in the previous section. Let's start by installing the necessary libraries: fsspec, adlfs, and dbt-duckdb, which is the DuckDB adapter for dbt. After installing the Python packages, you need to create a profile in your profiles.yml file:

default:
  outputs:
    fsspec:
      type: duckdb
      path: /tmp/dbt.duckdb
      schema: dbt
      extensions:
        - parquet
      filesystems:
        - fs: abfs
          anon: false
          account_name: "<storage-account-name>"
          account_key: "{{ env_var('DBT_ENV_SECRET_STORAGE_ACCOUNT_KEY') }}"

For more information on the supported properties, take a look at the dbt-duckdb docs. When you prefix an environment variable with DBT_ENV_SECRET_, dbt makes sure the secret is not shown in the logs nor in any error messages; more details can be found here.

The next step is to specify your sources in the sources.yml, which in this case is the inventory data on Azure blob storage:

version: 2
sources:
  - name: external_source
    meta:
      external_location: "abfs://{name}/*.parquet"
    tables:
      - name: inventory

Now you have everything ready to create your dbt model:

{{ config(materialized='external', location='abfs://inventory/inventory_count.parquet')}}

WITH inventory AS (
select * from {{ source('external_source', 'inventory') }}
)

SELECT count(*) FROM inventory

We again query the inventory data of the TPC-DS dataset, but now we also write the output of our model to Azure blob storage. This is controlled by the external materialization feature of dbt-duckdb, which also lets you specify the output location. For more information about other materialization options, you can read the dbt docs.

Dabbled in duckdb-azure

The duckdb-azure extension was released with DuckDB 0.9.0 in September 2023 and is still flagged as experimental. The goal of the extension is to offer a native way to interact with Azure blob storage. The extension can be used from any DuckDB client, not only the Python one.

Using the Python client

In this case, we only need to install DuckDB using pip or another package installer. Executing our previous SQL query can be done as follows:

import duckdb

connection = duckdb.connect()
# Install and load the native Azure extension, then authenticate with a connection string.
connection.execute('''INSTALL azure''')
connection.execute('''LOAD azure''')
connection.execute('''SET azure_storage_connection_string='<storage-connection-string>';''')

print("count the number of records in the inventory")
query = connection.sql('''SELECT count(*) FROM read_parquet('azure://inventory/*.parquet')''')
print(query.fetchall())

As you can see in the example, the blob storage prefix in this case should be azure://.

If you want to run this code on Ubuntu, you will most likely get an error mentioning an issue with the SSL CA certificate. This issue is discussed here, and the current workaround is to execute the following statements as root:

mkdir -p /etc/pki/tls/certs
ln -s /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt

At the moment, the extension only supports authentication using storage account keys, but the code on GitHub has already added support for the Azure credential chain. This probably means that all authentication mechanisms will be supported in the next release.

Using dbt-duckdb

In order to use the duckdb-azure extension with dbt-duckdb, we need to change 3 things compared to the fsspec example.

  • We need to change the profiles.yml to use the Azure extension instead of the filesystem mechanism:
default:
  outputs:
    native:
      type: duckdb
      path: /tmp/dbt.duckdb
      schema: dbt
      extensions:
        - azure
        - parquet
      settings:
        azure_storage_connection_string: "{{ env_var('DBT_ENV_SECRET_STORAGE_CONNECTION_STRING') }}"
  • The second change is to alter the storage prefix to azure://.
  • Finally, remove the external materialization, as the Azure extension does not yet support writing to Azure blob storage.

Quack-Time Analysis

In order to measure the performance of both approaches, I set up some experiments using 3 different scenarios:

  • Query 1 small file (±10 KB)
  • Query 1 large file (±100 MB)
  • Query 1,000 small files (±10 KB each)

I use the same query for every experiment:

SELECT * FROM read_parquet('azure://dir/*.parquet') ORDER BY col LIMIT 10

The query focuses on the interaction between DuckDB and Azure blob storage, as most of the effort is spent reading the data.
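As an illustrative sketch (not the actual benchmark harness; the directory and column names are placeholders, just like in the query above), a single run with the native extension can be timed like this:

import time
import duckdb

connection = duckdb.connect()
connection.execute("INSTALL azure")
connection.execute("LOAD azure")
connection.execute("SET azure_storage_connection_string='<storage-connection-string>';")

# Time one run of the benchmark query; for the fsspec variant, register the
# adlfs filesystem instead and use the abfs:// prefix.
start = time.perf_counter()
connection.sql("SELECT * FROM read_parquet('azure://dir/*.parquet') ORDER BY col LIMIT 10").fetchall()
print(f"query took {time.perf_counter() - start:.2f} s")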

Based on these results 2 conclusions can be drawn:

  • The native Azure extension performs better. This can probably be attributed to the lower overhead, as there is no need to convert between Python and C++ objects.
  • There seems to be a performance issue in fsspec when reading many small files, as its performance degrades faster than that of the native extension.

Poor performance of fsspec on many files

The poor performance when using many small files cannot be explained by conversions between Python and C++ objects, as we do not see a similar performance hit when processing 1 large file.

As a next step, I ran some additional experiments with 10, 100, 1,000, and 10,000 identical files in a given directory. In each scenario, fsspec took double the time that the native Azure extension required.

To understand what might cause this, I enabled debug logging when using fsspec, as it exposes the raw HTTP requests performed by the Azure Python SDK. In the logs, I saw the same pattern over and over again:

...
Request URL: 'https://11.parquet'
Request method: 'HEAD'
Request headers:
No body was attached to the request
Request URL: 'https://11.parquet'
Request method: 'GET'
Request headers:
No body was attached to the request
...
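These request/response lines are emitted by the Azure Python SDK's HTTP logging. As a sketch, attaching a handler to the "azure" logger is enough to surface them when running the fsspec-based code:

import logging
import sys

# The Azure SDK logs every HTTP request (URL, method, headers) on the "azure" logger;
# attach a stdout handler and lower the level to make these lines visible.
azure_logger = logging.getLogger("azure")
azure_logger.setLevel(logging.DEBUG)
azure_logger.addHandler(logging.StreamHandler(stream=sys.stdout))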

This shows that after finding all the files that need to be downloaded, it checks for every file whether it exists before starting the download. This is totally unnecessary and results in a roughly 20 ms performance hit per download. If you need to process 1,000 files, this adds up to roughly 20 seconds and thus explains the performance penalty.

In order to validate this theory, I ended up opening Wireshark to measure the network traffic between my IP and the IP of the Azure blob storage service. The content of the requests cannot be inspected, as it is encrypted, but we can see the number of requests: 79 vs 155. This clearly shows that the fsspec library makes double the number of requests compared to the native Azure extension. A huge shout-out to Jan Vanbuel for collaborating with me during this investigation.

Conclusion

In this blog post, I showed how you can use DuckDB to query data residing on Azure blob storage. I compared the usage of the fsspec interface with the native Azure extension of DuckDB. Given the current state of both extensions, you should use the fsspec interface for production use cases at the moment.

Keep an eye out for future releases of the native Azure extension, as it shows great potential to address its current shortcomings compared to fsspec.

I hope this helps you get started with DuckDB on Azure. Do you have any other topics you’d like me to cover with respect to DuckDB and/or Azure? Feel free to drop a comment below this post.
