DBT+Athena+Iceberg: How to use bucketing
The differing syntax of the Apache Iceberg bucket partition transform.
Intro
When I experimented with bucketing on a DBT + AWS Athena + Apache Iceberg table, I noticed that the “bucket” partition transform did not work with CTAS; the query simply failed.
After I reached out to AWS Support, I found out why it did not work.
This post tells that story.
Iceberg does not support bucketing ( not the way Hive does )
In short, Iceberg does not support Hive-style bucketing.
When I tried to use bucketing with an Iceberg table, I started with these DBT-Athena configs in the model:
- bucketed_by
- bucket_count
{{
    config(
        table_type='iceberg',
        materialized='incremental',
        incremental_strategy='append',
        s3_data_naming='schema_table_unique',
        on_schema_change='append_new_columns',
        bucketed_by=['event_type'],
        bucket_count=10,
    )
}}
The DBT model built successfully, but the resulting table had nothing related to the bucketing config.
Based on the DBT-Athena documentation, these bucketing configs are ignored when the table_type is iceberg.
Yep. I did not read the DBT-Athena documentation carefully and missed it.
-- ----
-- The captured DDL from AWS Athena
-- ----
CREATE TABLE ft_staging_analytics.zexperiments_iceberg_with_bucketing (
obj_id string,
event_type string,
event_timestamp timestamp)
LOCATION 's3://<bucket-name>/ft_staging_analytics/zexperiments_iceberg_with_bucketing/deb6f0cf-1c0a-4567-93b4-4ee8426a690a'
TBLPROPERTIES (
'table_type'='iceberg',
'format'='PARQUET',
'write_compression'='ZSTD'
);
Mimic bucketing in Iceberg with the bucket partition transform
Bucketing can be mimicked in an Apache Iceberg table with the “bucket” partition transform.
-- DBT-Athena config
-- partitioned_by is used to specify the list of partition columns.
-- In this example, the `bucket` transformer is used.
{{
config(
table_type='iceberg',
materialized='incremental',
incremental_strategy='append',
s3_data_naming='schema_table_unique',
on_schema_change='append_new_columns',
partitioned_by=['bucket(10, event_type)'],
)
}}
But the DBT model build failed: Iceberg + bucket partitioning
The DBT model build failed with the error below.
For further investigation, I fetched the failed query from AWS Athena’s recent query list. ( It can also be fetched from DBT’s compiled query or DBT’s log. )
The query looked OK to me, but AWS Athena did not like the submitted CTAS query.
INVALID_TABLE_PROPERTY: Unable to parse partitioning value: Invalid partition field declaration: bucket(10, event_type)
-- Captured the failed query from AWS Athena's recent query list
-- The <bucket_name> is a placeholder I substituted to hide my info.
create table "ft_staging_analytics"."zexperiments_iceberg_with_bucketing"
with (
table_type='iceberg',
is_external=false,
location='s3://<bucket_name>/ft_staging_analytics/zexperiments_iceberg_with_bucketing/766ab2f4-41be-4f6c-847f-27dc0542ffb7',
partitioning=ARRAY['bucket(10, event_type)'],
format='parquet'
)
as
SELECT
obj_id,
event_type,
CAST(event_timestamp AS TIMESTAMP(6)) AS event_timestamp
FROM source_table
Re-tried with CREATE TABLE syntax: AWS Athena + Iceberg + bucket partition transform
I got curious whether there was a syntax issue in my use of the bucket partition transform. Therefore, I built a “CREATE TABLE” statement to test it.
The “CREATE TABLE” statement worked with bucket(10, event_type), and I was able to confirm the table was created with the partitioning I defined.
OK. So something was wrong with the syntax of the bucket partition transform in the CTAS query.
-- ----
-- CREATE TABLE statement I used.
-- ----
CREATE TABLE ft_staging_analytics.zexperiments_iceberg_with_bucketing (
obj_id string,
event_type string,
event_timestamp TIMESTAMP
)
PARTITIONED BY (bucket(10, event_type))
LOCATION 's3://<bucket_name>/ft_staging_analytics/zexperiments_iceberg_with_bucketing/766ab2f4-41be-4f6c-847f-27dc0542ffb7'
TBLPROPERTIES ( 'table_type' = 'ICEBERG' );
-- ----
-- The captured DDL statement
-- ----
CREATE TABLE default.zexperiments_iceberg_with_bucketing (
obj_id string,
event_type string,
event_timestamp timestamp)
PARTITIONED BY (bucket(10, `event_type`))
LOCATION 's3://forethought-athena-gatsby/athena_default_db/zexperiments_iceberg_with_bucketing'
TBLPROPERTIES (
'table_type'='iceberg'
);
Reached out to AWS Support and got the reason why it did not work
As a last resort, I reached out to AWS Support with all the results I had tried. ( even including the working version with “CREATE TABLE” )
In short, in AWS Athena, CTAS uses Trino syntax ( AWS Athena is built on top of Trino ), but CREATE TABLE uses Hive syntax.
- CREATE TABLE: Hive syntax
- CREATE TABLE … AS SELECT ( aka CTAS ): Trino syntax
To make the bucket transform work in CTAS, the argument order has to be bucket(column_name, bucket_count).
- In CREATE TABLE: bucket(bucket_count, column_name)
- In CTAS: bucket(column_name, bucket_count)
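Applying that explanation to the earlier failed query, a CTAS with the Trino-style argument order should be accepted. This is a sketch, assuming the argument order was the only problem ( the table, location, and source_table names are carried over from the earlier examples ):

```sql
-- CTAS uses Trino syntax: bucket(column_name, bucket_count)
create table "ft_staging_analytics"."zexperiments_iceberg_with_bucketing"
with (
    table_type='iceberg',
    is_external=false,
    location='s3://<bucket_name>/ft_staging_analytics/zexperiments_iceberg_with_bucketing/',
    partitioning=ARRAY['bucket(event_type, 10)'],
    format='parquet'
)
as
SELECT
    obj_id,
    event_type,
    CAST(event_timestamp AS TIMESTAMP(6)) AS event_timestamp
FROM source_table
```

Note how it is the exact same query as the failed one except for the flipped arguments in the partitioning entry.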
Funnily enough, the differing syntax is documented, but in an unclear way
I had actually seen the differing syntax of the bucket partition transform on one of the AWS Athena documentation pages before I reached out to AWS Support.
I thought it was a typo or mis-documentation. ( That has happened several times. )
After the feedback from AWS Support, I finally understood why the AWS Athena documentation describes the bucket partition transform inconsistently.
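On the DBT side, the takeaway is to flip the argument order in partitioned_by, since DBT-Athena builds Iceberg models via CTAS. This is a sketch of the corrected model config, assuming DBT-Athena passes the partitioned_by entries straight through to the CTAS partitioning property:

```sql
{{
    config(
        table_type='iceberg',
        materialized='incremental',
        incremental_strategy='append',
        s3_data_naming='schema_table_unique',
        on_schema_change='append_new_columns',
        partitioned_by=['bucket(event_type, 10)'],
    )
}}
```

It is the same config as the failed one, except the partition field declaration now uses the Trino-style bucket(column_name, bucket_count) order.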