DBT+Athena+Iceberg: How to use bucketing

Life-is-short--so--enjoy-it
4 min readFeb 25, 2024

--

The different syntax in Apache Iceberg bucket partition transformer.

DBT+Athena+Iceberg: How to use bucketing

Intro

When I experimented with the bucketing in the DBT + AWS Athena + Apache Iceberg Table, I noticed that the “bucket” partition transformer function did not work with CTAS. It failed.

After I reached out to AWS Support, I found out why it did not work.

This post is about the story.

Iceberg does not support bucketing ( not the way Hive does )

In short, Iceberg does not support Hive-style bucketing.

When I tried to use “bucketing” with the Iceberg Table, I tried these DBT-Athena configs in the model.

  • bucketed_by
  • bucket_count
{{
config(
table_type='iceberg',
materialized='incremental',
incremental_strategy='append',
s3_data_naming='schema_table_unique',
on_schema_change='append_new_columns',
bucketed_by=['event_type'],
bucket_count=10,
)
}}

The DBT model was built successfully, but the built model did not have anything related to the bucketing config.

Based on the DBT-Athena documentation, these bucketing configs are ignored if the table_type is Iceberg.

Yep. I did not read the DBT-Athena documentation carefully and missed it.

DBT runs successfully with
# ====
# The captured DDL from AWS Athena
# ====

CREATE TABLE ft_staging_analytics.zexperiments_iceberg_with_bucketing (
obj_id string,
event_type string,
event_timestamp timestamp)
LOCATION 's3://<bucket-name>/ft_staging_analytics/zexperiments_iceberg_with_bucketing/deb6f0cf-1c0a-4567-93b4-4ee8426a690a'
TBLPROPERTIES (
'table_type'='iceberg',
'format'='PARQUET',
'write_compression'='ZSTD'
);
src: https://github.com/dbt-athena/dbt-athena
AWS Athena + Iceberg: bucketed_by and bucket_count are not supported. src: https://docs.aws.amazon.com/athena/latest/ug/create-table-as.html

Mimic bucketing in Iceberg with bucket partition transformer

The bucketing feature can be mimicked in Apache Iceberg Table with the “bucket” partition transformer.

-- DBT-Athena config
-- partitioned_by is used to specify the list of partition columns.
-- In this example, the `bucket` transformer is used.
{{
config(
table_type='iceberg',
materialized='incremental',
incremental_strategy='append',
s3_data_naming='schema_table_unique',
on_schema_change='append_new_columns',
partitioned_by=['bucket(10, event_type)'],
)
}}
Use Partitioning with bucket ransform in Iceberg Table. | src: https://docs.aws.amazon.com/athena/latest/ug/create-table-as.html
In Apache Iceberg + AWS Athena, using partitioning property to specify partition columns. | src: https://docs.aws.amazon.com/athena/latest/ug/create-table-as.html
Create Apache Iceberg Table in AWS Athena — src: https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html

But, they failed to build the DBT model — Iceberg + Bucket Partitioning

The DBT model build failed with the error below.

For further investigation, I got the failed query from AWS Athena’s recent query list. ( It can be fetched from DBT’s complied query as well or DBT’s log )

The query looked ok to me, but AWS Athena did not like the submitted CTAS query.

NVALID_TABLE_PROPERTY: Unable to parse partitioning value: Invalid partition field declaration: bucket(10, event_type)
DBT-Athena: Failed to build a model using Iceberg with bucket partition transformer.
-- Captured the failed query from AWS Athena's recent query list
-- The `bucket_name` is one I replaced to hide my info.
create table "ft_staging_analytics"."zexperiments_iceberg_with_bucketing"
with (
table_type='iceberg',
is_external=false,
location='s3://<bucket_name>/ft_staging_analytics/zexperiments_iceberg_with_bucketing/766ab2f4-41be-4f6c-847f-27dc0542ffb7',
partitioning=ARRAY['bucket(10, event_type)'],
format='parquet'
)
as

SELECT
obj_id,
event_type,
CAST(event_timestamp AS TIMESTAMP(6)) AS event_timestamp
FROM source_table

Re-tried with CREATE TABLE syntax: AWS Athena + Iceberg + bucket partition transformer

I got curious if there was any syntax issue in using the bucket partition transformer. Therefore, I built a “CREATE TABLE” statement to test the bucket partition transformer.

The “CREATE TABLE” statement worked with the bucket(10, event_type)and I was able to confirm the created table was created with the partitioned I defined.

Ok. Something was wrong with the syntax in the bucket partition transformer.

-- ----
-- CREATE TABLE statement I used.
-- ----
CREATE TABLE ft_staging_analytics.zexperiments_iceberg_with_bucketing (
obj_id string,
event_type string,
event_timestamp TIMESTAMP
)
PARTITIONED BY (bucket(10, event_type))
LOCATION 's3://<bucket_name>/ft_staging_analytics/zexperiments_iceberg_with_bucketing/766ab2f4-41be-4f6c-847f-27dc0542ffb7'
TBLPROPERTIES ( 'table_type' = 'ICEBERG' );
-- ----
-- The captured DDL statement
-- ----
CREATE TABLE default.zexperiments_iceberg_with_bucketing (
obj_id string,
event_type string,
event_timestamp timestamp)
PARTITIONED BY (bucket(10, `event_type`))
LOCATION 's3://forethought-athena-gatsby/athena_default_db/zexperiments_iceberg_with_bucketing'
TBLPROPERTIES (
'table_type'='iceberg'
);
CREATE TABLE in Iceberg: https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-creating-tables.html

Reached out to AWS Support, and got a reason why it did not work

As the last option, I reached out to AWS Support to get help with the results I tried. ( even including the working version with “CREATE TABLE” )

In short, in AWS Athena, CTAS uses Trino syntax ( AWS Athena is built on top of Trino ), but CREATE TABLE uses Hive syntax.

  • CREATE TABLE: Hive syntax
  • CREATE TABLE … SELECT AS ( aka. CTAS): Trino syntax

To make the bucket work in CTAS, the order of argument for the bucket has to be bucket(column_name, bucket_count)

  • In CREATE TABLE: bucket(bucket_count, column_name)
  • In CTAS: bucket(column_name, bucket_count)

Well, it was a little funny, but the different syntax is documented in an unclear way

I saw the different syntax in the bucket partition transformer in one of AWS Athena documentation pages like the screenshot below before I reached out to AWS Support.

I thought it was a typo or mis-documented. ( it happened several times. )

After I got the feedback from AWS Support, I was able to understand why the AWS Athena documentation has inconsistent documentation in the bucket partition transformer.

src: https://docs.aws.amazon.com/athena/latest/ug/create-table-as.html

--

--

Life-is-short--so--enjoy-it

Gatsby Lee | Data Engineer | City Farmer | Philosopher | Lexus GX460 Owner | Overlander