Querying external semi-structured data from AWS S3 with dbt_external_tables package and AWS Redshift Spectrum

Léo Carlos Michel Neto
Indicium Engineering
12 min read · Nov 4, 2022


This article will teach you how to build a data warehouse with dbt using Redshift Spectrum to query data directly from S3, saving tons of storage, optimizing compute processing and thus reducing costs.

By using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables. Redshift Spectrum queries employ massive parallelism to run very fast against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remains in Amazon S3.

Multiple clusters can concurrently query the same dataset in Amazon S3 without the need to make copies of the data for each cluster. If you are building a data warehouse, you can definitely benefit from this by integrating Spectrum into your models.

In this article we are going to show you how we have successfully merged Redshift Spectrum into our dbt models for storage and processing optimization by querying our data directly on Amazon S3 with external schemas and tables.

What you will need to make it happen

  1. Redshift Cluster and SQL Client configured
  2. S3 data and cluster on the same region (how to move your data here: link)
  3. Data Catalog in Athena or Glue.

We have used Glue in our project, so that is what you are going to find in this guideline. You can see how to upgrade to Glue here: Integration with AWS Glue or read more about Spectrum with Athena here: What is Amazon Athena? and Create an external schema and an external table.

Even though storing and querying your data on S3 through this method is normally cheaper than doing so on Redshift, there are still expenses you should consider:

Besides the S3 rates for storing objects in your S3 buckets, you will be charged for all requests made against them by Redshift Spectrum. With Redshift Spectrum, you are billed per terabyte of data scanned, with a 10 MB minimum per query.

For example, if you scan 10 GB of data, you will be charged $0.05. If you scan 1 TB of data, you will pay $5.00. In addition, you will also be charged for the Redshift cluster and the Glue Data Catalog. With the Glue Data Catalog, you can store up to a million objects free of charge. If you store more than a million objects, you will be charged $1.00 per 100,000 objects over a million, per month. An object in the AWS Glue Data Catalog is a table, table version, partition, or database.

To better understand our costs, we used the AWS Pricing Calculator, and we recommend you do the same.

Step-by-step guide

To build this infrastructure we will go through five steps, from granting access to querying your data on S3, each detailed in the next sections.

The steps to set up Redshift Spectrum and run your models on dbt are the following:

  1. Create an IAM role with access to the S3 bucket
  2. Associate the role policy with your cluster
  3. Create the Data Catalog on Glue and the external schema
  4. Create the external tables with dbt
  5. Run your models on top of S3 with dbt

Step 1 — Create an IAM role

Redshift needs authorization to access the Glue Data Catalog and the data on S3. Therefore, you will need an IAM role, identified by its Amazon Resource Name (ARN), attached to your Redshift cluster.

To create an IAM role for Amazon Redshift:

  1. Open the IAM console.
  2. In the navigation pane, choose Roles.
  3. Choose Create role.
  4. Choose AWS service as the trusted entity, and then choose Redshift as the use case.
  5. Under Use case for other AWS services, choose Redshift — Customizable and then choose Next.
  6. The Add permissions policy page appears. Choose ‘AmazonS3ReadOnlyAccess’ and ‘AWSGlueConsoleFullAccess’. Choose Next.

More information here: Create an IAM role for Amazon Redshift
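If you prefer scripting this step, a rough AWS CLI equivalent of the console flow above looks like the following (the role name and file path are placeholders; the trust policy simply lets Redshift assume the role):

# trust-policy.json
# {
#   "Version": "2012-10-17",
#   "Statement": [
#     { "Effect": "Allow",
#       "Principal": { "Service": "redshift.amazonaws.com" },
#       "Action": "sts:AssumeRole" }
#   ]
# }

aws iam create-role \
  --role-name MySpectrumRole \
  --assume-role-policy-document file://trust-policy.json

aws iam attach-role-policy \
  --role-name MySpectrumRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

aws iam attach-role-policy \
  --role-name MySpectrumRole \
  --policy-arn arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess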

Step 2 — Associate the role policy with your cluster

After creating an IAM role that authorizes Amazon Redshift to access the external Data Catalog and Amazon S3, you must associate that role with your Amazon Redshift cluster.

To associate an IAM role with a cluster:

  1. Sign in to the AWS Management Console and open the Amazon Redshift console at https://console.aws.amazon.com/redshift/.
  2. On the navigation menu, choose “Clusters”, then choose the name of the cluster you want to update.
  3. For “Actions”, choose “Manage IAM roles”. The IAM roles page appears.
  4. Either choose “Enter ARN” and enter the ARN of an IAM role, or choose an IAM role from the list. Then choose “Add IAM role” to add it to the list of attached IAM roles.
  5. Choose “Done” to associate the IAM role with the cluster. The cluster is modified to complete the change.

More information here: Associate the IAM role with your cluster
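This association can also be scripted with the AWS CLI (the cluster identifier and role ARN below are placeholders):

aws redshift modify-cluster-iam-roles \
  --cluster-identifier my-redshift-cluster \
  --add-iam-roles arn:aws:iam::123456789012:role/MySpectrumRole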

Step 3 — Create a data catalog and an external schema

When you choose to query your data on S3, it is important to organize it in a way that Redshift will understand. This is why Redshift Spectrum uses a data catalog: it is where all the metadata related to the external database and schemas is stored.

The Athena Data Catalog is the default service used by Redshift Spectrum, but you can change it to Glue or EMR, for example. In this guideline we will be using Glue, so we strongly recommend upgrading to it. Here is a guide on how to do it: Upgrading to the AWS Glue Data Catalog step-by-step.

Because external tables need to live in an external schema, you will need to create one first. You can add the CREATE EXTERNAL DATABASE IF NOT EXISTS clause at the end of the CREATE EXTERNAL SCHEMA statement to create the Glue Data Catalog database at the same time.

You can use your database manager for this step — we have used DBeaver. The following statement will create an external schema called EXTERNAL_SCHEMA_NAME and a data catalog database called GLUE_CATALOG_NAME. For more information on this statement, including all possible options, check out this link: CREATE EXTERNAL SCHEMA — Amazon Redshift.

create external schema EXTERNAL_SCHEMA_NAME from data catalog
database 'GLUE_CATALOG_NAME'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
create external database if not exists

Note: The owner of this schema is the issuer of the CREATE EXTERNAL SCHEMA command. To transfer ownership of an external schema, use ALTER SCHEMA to change the owner. To grant access to the schema to other users or user groups, use the GRANT command. You can’t use the GRANT or REVOKE commands for permissions on an external table; instead, grant or revoke the permissions on the external schema.
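To double-check that the schema exists and which options (including the IAM role) it was created with, you can query the SVV_EXTERNAL_SCHEMAS system view — a quick sanity check, not required for the setup:

select schemaname, databasename, esoptions
from svv_external_schemas;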

Step 4 — Create the external tables with dbt

Now that we’ve set our external schema and the IAM role associated with it, it’s time to create the external tables using the dbt package dbt_external_tables.

This package, developed by the dbt Labs team, allows us to CREATE external tables, REFRESH partitions, and DROP and ALTER external tables within Amazon Redshift, using the metadata provided in the .yml source definitions.

For a better understanding, let’s see exactly which DDL statements dbt_external_tables must compile and run so that Redshift Spectrum can stage and refresh our external sources:

create external table "DATABASE_NAME"."EXTERNAL_SCHEMA_NAME"."EXTERNAL_TABLE_NAME" (
    "_airbyte_ab_id" varchar,
    "_airbyte_emitted_at" int,
    "_airbyte_data" super )
partitioned by ("PARTITION_NAME" date)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties ('case.insensitive' = 'FALSE')
location 's3://OBJECTS_TO_BE_PARTITIONED_PATH';

alter table "DATABASE_NAME"."EXTERNAL_SCHEMA_NAME"."EXTERNAL_TABLE_NAME" add if not exists
partition (PARTITION_NAME='2022-09-06')
location 's3://OBJECTS_TO_BE_PARTITIONED_PATH/2022-09-06/'
partition (PARTITION_NAME='2022-09-07')
location 's3://OBJECTS_TO_BE_PARTITIONED_PATH/2022-09-07/'
partition (PARTITION_NAME='2022-09-08')
location 's3://OBJECTS_TO_BE_PARTITIONED_PATH/2022-09-08/';

In this example, we are using Airbyte to extract semi-structured data from an API and load it into S3 buckets. Objects are stored in S3 as .jsonl files. As will be explained in Step 5, we will use PartiQL to unnest the raw data and create dbt models.

To do so, we use the OpenX JSON SerDe library (from AWS Athena) to turn each raw object in S3 into a tabular object with three columns, storing the one that contains the data as a SUPER column:

  • _airbyte_ab_id → Hash primary key generated by airbyte. Unique for each API object response.
  • _airbyte_emitted_at → Epoch timestamp generated by airbyte.
  • _airbyte_data → Raw data from API. This column needs to be stored as super.

This configuration depends on the file format your objects are stored in, and it is set right after the row format clause. If your objects are stored as .csv files, for example, use the LazySimpleSerDe library instead of OpenX JSON.

The data is extracted and loaded daily. In other words, inside OBJECTS_TO_BE_PARTITIONED_PATH there are multiple folders, each named after the day the extraction and load ran and each containing the data for that specific day.

Therefore, the partition described above is declared right after the partitioned by clause, along with its data type. This is also how Redshift Spectrum adds new data to the external table: by running alter table ... add partition pointing to the S3 folder that corresponds to the desired partition, all of the data inside that folder becomes queryable through the external table.

To illustrate this process, we created partitions for data that was extracted and loaded between 2022-09-06 and 2022-09-08.

Note: There are many ways to partition your data. Read the “Partitioning Redshift Spectrum external tables” section in the AWS Redshift docs.

Replace all uppercase words with your own configuration and S3 object paths, and then run it all in the Redshift Query Editor or any database tool of your choice. Note that a table named EXTERNAL_TABLE_NAME is created in your EXTERNAL_SCHEMA_NAME schema.
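You can also confirm that the table and its partitions were registered by querying the SVV_EXTERNAL_TABLES and SVV_EXTERNAL_PARTITIONS system views (the lowercase names below assume Redshift’s default identifier folding):

select schemaname, tablename, location
from svv_external_tables
where schemaname = 'external_schema_name';

select *
from svv_external_partitions
where tablename = 'external_table_name';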

Now that you have fully understood how Redshift Spectrum reads cataloged data in AWS Glue from S3 buckets, let’s implement this process in dbt, using the dbt_external_tables package.

First, add the following in your packages.yml and then run dbt deps to install the package.
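A minimal packages.yml might look like this (the pinned version is only an example):

packages:
  - package: dbt-labs/dbt_external_tables
    version: 0.8.2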

*Note: Keep package compatibility in mind. Different versions of dbt-core and/or dbt packages may result in a package conflict.

Then, create a .yml that will receive all of the configurations shown in the previous DDL statement.
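The names, paths and dates below are placeholders, but a source definition along these lines reproduces the DDL shown above:

version: 2

sources:
  - name: EXTERNAL_SCHEMA_NAME
    schema: EXTERNAL_SCHEMA_NAME
    tables:
      - name: EXTERNAL_TABLE_NAME
        external:
          location: "s3://OBJECTS_TO_BE_PARTITIONED_PATH"
          row_format: >
            serde 'org.openx.data.jsonserde.JsonSerDe'
            with serdeproperties ('case.insensitive' = 'FALSE')
          partitions:
            - name: PARTITION_NAME
              data_type: date
              vals:
                macro: dbt.dates_in_range
                args:
                  start_date_str: "2022-09-06"
                  end_date_str: "2022-09-08"
                  in_fmt: "%Y-%m-%d"
                  out_fmt: "%Y-%m-%d"
              path_macro: dbt_external_tables.value_only
        columns:
          - name: _airbyte_ab_id
            data_type: varchar
          - name: _airbyte_emitted_at
            data_type: int
          - name: _airbyte_data
            data_type: super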

We are using two macros to generate the add partition clauses called by the alter table statement. The first one, dbt.dates_in_range, generates a list of dates according to the arguments passed: start_date_str, end_date_str, in_fmt and out_fmt.

The second one, dbt_external_tables.value_only, writes the full S3 path of each partition in the list generated by dbt.dates_in_range. Depending on how your S3 folders are named, you may want to change how dbt writes the list of add partition clauses; check the available options in the dbt_external_tables path.sql file.

Now that your .yml file is properly configured, run the following in the terminal:

dbt run-operation stage_external_sources

Note that a table named EXTERNAL_TABLE_NAME is created in your EXTERNAL_SCHEMA_NAME schema.

*Hint: To check whether dbt_external_tables is writing the DDL statements as you expect, look at the STL_DDLTEXT table within Redshift.

Working in a production environment

According to the Redshift Spectrum docs:

If you use the AWS Glue catalog, you can add up to 100 partitions using a single ALTER TABLE statement.

Fortunately, the dbt_external_tables package already handles this AWS Glue limitation by generating a new ALTER TABLE statement whenever the 100-partition limit is exceeded.

That being said, there are multiple ways to set start_date_str and/or end_date_str so that new partitions are added automatically on each dbt run-operation stage_external_sources execution:

1. Define them using the dbt Jinja env_var() function to pull environment variables from the system into your .yml file.
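For instance, inside the partition block of the source .yml shown earlier (the variable names are arbitrary):

vals:
  macro: dbt.dates_in_range
  args:
    start_date_str: "{{ env_var('START_DATE_PARTITION') }}"
    end_date_str: "{{ env_var('END_DATE_PARTITION') }}"
    in_fmt: "%Y-%m-%d"
    out_fmt: "%Y-%m-%d"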

2. Define them using a Python script that always selects the 3 most recent partitions.
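A rough sketch of such a script, which exports the variables consumed by env_var() above and then triggers dbt (the variable names and the subprocess call are assumptions; adapt them to your orchestration):

import os
import subprocess
from datetime import date, timedelta

# keep only the 3 most recent daily partitions
today = date.today()
start = today - timedelta(days=2)

os.environ["START_DATE_PARTITION"] = start.strftime("%Y-%m-%d")
os.environ["END_DATE_PARTITION"] = today.strftime("%Y-%m-%d")

# stage/refresh the external tables with the new partition range
subprocess.run(["dbt", "run-operation", "stage_external_sources"], check=True)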

This way, on each dbt run-operation stage_external_sources, the external tables will pick up new data from S3 according to your partition definition. After the macro finishes, you can reference your external table in a dbt model using the source() macro.
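For example, a downstream model can simply select from it:

select *
from {{ source('EXTERNAL_SCHEMA_NAME', 'EXTERNAL_TABLE_NAME') }}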

Now that we have properly configured the .yml source file, it’s time to transform data!

Step 5 — Run your models with dbt and PartiQL syntax

Amazon Redshift uses PartiQL language to offer SQL-compatible access to relational, semistructured, and nested data, making it easy to efficiently query data, regardless of where or in what format it is stored.

In short, PartiQL enables us to navigate into arrays and structures using bracket and dot notation, respectively, giving us the power to create models without piles of nested json_extract() functions.
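For instance, on a SUPER column the two notations look like this (the paths are purely illustrative):

select
    _airbyte_data.venue.name as venue_name, -- dot notation into a struct
    _airbyte_data.dates[0] as first_date    -- bracket notation into an array element
from EXTERNAL_SCHEMA_NAME.EXTERNAL_TABLE_NAME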

Here is how PartiQL can be used in practice. First of all, we want to unnest this type of API response (if you configure your external table as described above, this data will be inside the _airbyte_data column).
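The exact payload does not matter here; a simplified illustration of its shape, based on the venue attributes used below, could be:

{
    "id": 1001,
    "name": "Some event",
    "venue": {
        "id": 15,
        "name": "Example Arena",
        "city": "Florianópolis",
        "state": "SC"
    }
}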

In order to fully unnest this data and create a tabular object from it, a single query is enough.
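Here is a sketch of such a query; the column names are illustrative, and it assumes _airbyte_data has already been wrapped with array(), as explained in the hint below:

with source_data as (
    -- in practice this wrapping lives in a sources-layer model (see "The sources layer" below)
    select array(_airbyte_data) as data_array
    from {{ source('EXTERNAL_SCHEMA_NAME', 'EXTERNAL_TABLE_NAME') }}
)

select
    event_row.id::int as event_id,
    event_row.name::varchar as event_name,
    venue_column.id::int as venue_id,
    venue_column.name::varchar as venue_name,
    venue_column.city::varchar as venue_city,
    venue_column.state::varchar as venue_state
from source_data,
     source_data.data_array as event_row,
     event_row.venue as venue_column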

*Hint: Applying the array() function to the _airbyte_data column, before running this query, allows you to iterate over both structs and arrays in the same query and to define intermediate struct/array paths in the from clause, as was done here with venue_column.

As you can see, PartiQL has a very simple, almost trivial syntax. However, the usual logic of unnesting semi-structured data still applies, so you have to be careful not to generate unexpected duplicates.

For example, if the venue struct had two entries of venue id, venue name, venue city and venue state, the query above would return one row for each venue entry, duplicating all of the other column values.

If you have any further questions about PartiQL syntax, this tutorial from Redshift can help you with a few other examples of querying this type of data.

Last but not least, we are going to bring this tool into our dbt project. First of all, let’s specify our transformation layers, to get a high-level view of what we are about to do:

sources

  • Stage external tables that read raw data from S3.
  • Apply array() to the raw SUPER column.
  • Materialize using the dbt incremental strategy.

stagings

  • Unnest data.
  • Create new primary keys, join tables if needed, and apply general data transformations.
  • Materialize primarily as views, or as tables using the dbt incremental strategy when necessary.

marts

  • Available to end-users.
  • Join tables according to star-schema modeling (for example).
  • Formalize business rules.
  • Materialize as table.

Since using external tables to stage raw data from S3 and PartiQL to unnest it mostly affects the sources and stagings layers, we will not cover the marts layer in this article. However, if any questions arise while developing your models, feel free to DM us or send us an email.

The sources layer

As mentioned before, applying the array() function to the raw data column allows you to iterate over both structs and arrays in the same query. Therefore, it comes in handy to apply this function in our source model.

Even though the main purpose of this structure (and also of this article 😀) is to reduce the need to rescale a Redshift cluster due to storage constraints, materializing models in this layer as views can significantly increase the time needed to materialize a downstream model that reads from a sources-layer model as a table (as happens in marts models).

With that being said, we highly recommend materializing the models in this layer with the dbt incremental strategy, so that only new data from the external table gets materialized on each dbt run.
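A minimal sketch of such a sources-layer model (say, models/sources/source_example_model.sql; the unique key and incremental filter are assumptions about this dataset):

{{
    config(
        materialized='incremental',
        unique_key='_airbyte_ab_id'
    )
}}

select
    _airbyte_ab_id,
    _airbyte_emitted_at,
    array(_airbyte_data) as data_array  -- wrap the SUPER column so it can be iterated in the from clause
from {{ source('EXTERNAL_SCHEMA_NAME', 'EXTERNAL_TABLE_NAME') }}

{% if is_incremental() %}
    -- only pull rows newer than what has already been materialized
    where _airbyte_emitted_at > (select max(_airbyte_emitted_at) from {{ this }})
{% endif %}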

The stagings layer

This layer will contain most of the data transformations needed to turn the data incremented in the sources layer into efficient and trustworthy tabular models. Creating new primary keys, joining tables and applying Redshift optimizations are a few of them.

The materialization strategy will depend on the complexity of the unnesting process in PartiQL, specifically on how many columns will be created. The recommendation here is to test it! Materialize the model both as a view and using the incremental strategy, then evaluate whether the running time is acceptable for your pipeline needs.

But again, we are prioritizing Redshift storage, so always consider materializing models as views first. In our example, we generate only 11 columns and 200,000 rows, so we will materialize the example dbt model as a view.

Final words

Now you should have a clearer idea of how to use external tables to access raw data from Amazon S3 with Amazon Redshift Spectrum using the dbt_external_tables package and also how to create tabular models from semi-structured data using the PartiQL syntax.

Hope you liked this post! Comment if you have any questions, suggestions or even corrections to the text!

Authors

Léo Carlos Michel Neto and Isadora de Borba Busch are Brazilian Analytics Engineers working at Indicium, helping companies turn data into powerful insights and data-driven decisions.
