DBT+Athena+Iceberg: How to use bucketing
The differing syntax of the Apache Iceberg bucket partition transform.
Intro
When I experimented with bucketing on a DBT + AWS Athena + Apache Iceberg table, I noticed that the “bucket” partition transform did not work with CTAS; the query simply failed.
After I reached out to AWS Support, I found out why it did not work.
This post tells that story.
Iceberg does not support bucketing ( not the way Hive does )
In short, Iceberg does not support Hive-style bucketing.
When I tried to use bucketing with an Iceberg table, I started with these DBT-Athena configs in the model:
- bucketed_by
- bucket_count
{{
    config(
        table_type='iceberg',
        materialized='incremental',
        incremental_strategy='append',
        s3_data_naming='schema_table_unique',
        on_schema_change='append_new_columns',
        bucketed_by=['event_type'],
        bucket_count=10,
    )
}}
The DBT model built successfully, but the resulting table had nothing related to the bucketing config.
Based on the DBT-Athena documentation, these bucketing configs are ignored when the table_type is iceberg.
Yep. I did not read the DBT-Athena documentation carefully and missed it.
-- ----
-- The captured DDL from AWS Athena
-- ----
CREATE TABLE ft_staging_analytics.zexperiments_iceberg_with_bucketing (
obj_id string,
event_type string,
event_timestamp timestamp)
LOCATION 's3://<bucket-name>/ft_staging_analytics/zexperiments_iceberg_with_bucketing/deb6f0cf-1c0a-4567-93b4-4ee8426a690a'
TBLPROPERTIES (
'table_type'='iceberg',
'format'='PARQUET',
'write_compression'='ZSTD'
);
Mimic bucketing in Iceberg with the bucket partition transform
Bucketing can be mimicked in an Apache Iceberg table with the “bucket” partition transform.
-- DBT-Athena config
-- partitioned_by is used to specify the list of partition columns.
-- In this example, the `bucket` transformer is used.
{{
config(
table_type='iceberg',
materialized='incremental',
incremental_strategy='append',
s3_data_naming='schema_table_unique',
on_schema_change='append_new_columns',
partitioned_by=['bucket(10, event_type)'],
)
}}
But the DBT model build failed: Iceberg + bucket partitioning
The DBT model build failed with the error below.
For further investigation, I fetched the failed query from AWS Athena’s recent query list. ( It can also be fetched from DBT’s compiled query or DBT’s log. )
The query looked OK to me, but AWS Athena did not like the submitted CTAS query.
INVALID_TABLE_PROPERTY: Unable to parse partitioning value: Invalid partition field declaration: bucket(10, event_type)
-- Captured the failed query from AWS Athena's recent query list
-- The <bucket_name> is a placeholder I substituted to hide my info.
create table "ft_staging_analytics"."zexperiments_iceberg_with_bucketing"
with (
table_type='iceberg',
is_external=false,
location='s3://<bucket_name>/ft_staging_analytics/zexperiments_iceberg_with_bucketing/766ab2f4-41be-4f6c-847f-27dc0542ffb7',
partitioning=ARRAY['bucket(10, event_type)'],
format='parquet'
)
as
SELECT
obj_id,
event_type,
CAST(event_timestamp AS TIMESTAMP(6)) AS event_timestamp
FROM source_table
Re-tried with CREATE TABLE syntax: AWS Athena + Iceberg + bucket partition transform
I got curious whether there was a syntax issue in my use of the bucket partition transform. Therefore, I built a “CREATE TABLE” statement to test it.
The “CREATE TABLE” statement worked with bucket(10, event_type), and I was able to confirm the table was created with the partitioning I defined.
OK. So something was wrong with the syntax of the bucket partition transform in the CTAS query.
-- ----
-- CREATE TABLE statement I used.
-- ----
CREATE TABLE ft_staging_analytics.zexperiments_iceberg_with_bucketing (
obj_id string,
event_type string,
event_timestamp TIMESTAMP
)
PARTITIONED BY (bucket(10, event_type))
LOCATION 's3://<bucket_name>/ft_staging_analytics/zexperiments_iceberg_with_bucketing/766ab2f4-41be-4f6c-847f-27dc0542ffb7'
TBLPROPERTIES ( 'table_type' = 'ICEBERG' );
-- ----
-- The captured DDL statement
-- ----
CREATE TABLE default.zexperiments_iceberg_with_bucketing (
obj_id string,
event_type string,
event_timestamp timestamp)
PARTITIONED BY (bucket(10, `event_type`))
LOCATION 's3://forethought-athena-gatsby/athena_default_db/zexperiments_iceberg_with_bucketing'
TBLPROPERTIES (
'table_type'='iceberg'
);
Reached out to AWS Support and got the reason why it did not work
As a last resort, I reached out to AWS Support with all the results I had tried. ( even including the working version with “CREATE TABLE” )
In short, in AWS Athena, CTAS uses Trino syntax ( AWS Athena is built on top of Trino ), but CREATE TABLE uses Hive syntax.
- CREATE TABLE: Hive syntax
- CREATE TABLE … AS SELECT ( aka CTAS ): Trino syntax
To make the bucket transform work in CTAS, the argument order has to be bucket(column_name, bucket_count).
- In CREATE TABLE: bucket(bucket_count, column_name)
- In CTAS: bucket(column_name, bucket_count)
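Applying that explanation to the earlier failed query, a CTAS with the Trino-style argument order should be accepted. This is a sketch, assuming the argument order was the only problem ( the table, location, and source_table names are carried over from the earlier examples ):

```sql
-- CTAS uses Trino syntax: bucket(column_name, bucket_count)
create table "ft_staging_analytics"."zexperiments_iceberg_with_bucketing"
with (
    table_type='iceberg',
    is_external=false,
    location='s3://<bucket_name>/ft_staging_analytics/zexperiments_iceberg_with_bucketing/',
    partitioning=ARRAY['bucket(event_type, 10)'],
    format='parquet'
)
as
SELECT
    obj_id,
    event_type,
    CAST(event_timestamp AS TIMESTAMP(6)) AS event_timestamp
FROM source_table
```

Note how it is the exact same query as the failed one except for the flipped arguments in the partitioning entry.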
Funnily enough, the differing syntax is documented, but in an unclear way
I had actually seen the differing syntax of the bucket partition transform on one of the AWS Athena documentation pages before I reached out to AWS Support.
I thought it was a typo or mis-documentation. ( That has happened several times. )
After the feedback from AWS Support, I finally understood why the AWS Athena documentation describes the bucket partition transform inconsistently.
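On the DBT side, the takeaway is to flip the argument order in partitioned_by, since DBT-Athena builds Iceberg models via CTAS. This is a sketch of the corrected model config, assuming DBT-Athena passes the partitioned_by entries straight through to the CTAS partitioning property:

```sql
{{
    config(
        table_type='iceberg',
        materialized='incremental',
        incremental_strategy='append',
        s3_data_naming='schema_table_unique',
        on_schema_change='append_new_columns',
        partitioned_by=['bucket(event_type, 10)'],
    )
}}
```

It is the same config as the failed one, except the partition field declaration now uses the Trino-style bucket(column_name, bucket_count) order.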