Doing More With Less: Using DBT to load data from AWS S3 to Snowflake via External Tables

chynna calip · SlateCo · Aug 9, 2021 · 3 min read

This guide walks you through loading data from AWS S3 into Snowflake using external tables and dbt, with no additional tooling.

Step 1: Create an external stage in Snowflake

This external stage will reference the files in your Amazon S3 bucket; for our example, all files are CSV. More information on stages is available in the Snowflake documentation.

External stages are schema-specific, so choose the database and schema this stage will be connected to. Use the SQL below to create the stage in Snowflake, adjusting the file format for your file type.


USE SCHEMA DATABASE_NAME.SCHEMA_NAME;

CREATE OR REPLACE STAGE [NAME_OF_STAGE]
  url = 's3://[INSERT_PATH_HERE]'
  credentials = (
    aws_key_id = '###'
    aws_secret_key = '###'
  )
  file_format = (type = csv); -- optionally add more file format configurations here (for example: field_delimiter)

Your data should now be visible in your stage; here's some SQL to check. Make sure to USE the schema your stage is in.

1. List all the stages under the schema: SHOW STAGES;
2. List the files in a given stage: LIST @[NAME_OF_STAGE];
3. Show the filename, row number, and column values for the files in a given stage ($1 is the first column, etc.):
SELECT
  metadata$filename,
  metadata$file_row_number,
  t.$1,
  t.$2,
  t.$3
FROM @[NAME_OF_STAGE] t;

4. Show the filename and row count for each file in a given stage:

SELECT
  METADATA$FILENAME AS FILE_NAME_PATH,
  MAX(METADATA$FILE_ROW_NUMBER) AS NUM_OF_ROWS
FROM @[NAME_OF_STAGE]
GROUP BY METADATA$FILENAME;

Step 2: Let's dbt

1. Add the dbt_external_tables package to your packages.yml:

packages:
  - package: dbt-labs/dbt_external_tables
    version: [input most recent version]

2. Create a YAML file with a descriptive name, e.g. ext_stage_source.yml. dbt will reference this file when you run the dbt_external_tables package.

This file is what the next step uses to create an external table from your external stage. You can include more configurations; see the dbt_external_tables package documentation for the full reference.
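Here's a minimal sketch of what that file might look like. The source name, table name, and column names below are hypothetical placeholders; location points at the stage you created in Step 1.

version: 2

sources:
  - name: ext_stage_source              # hypothetical source name
    database: DATABASE_NAME
    schema: SCHEMA_NAME
    tables:
      - name: my_external_table         # hypothetical table name
        external:
          location: "@[NAME_OF_STAGE]"  # the stage from Step 1
          file_format: "(type = csv)"
          auto_refresh: false
        columns:                        # mapped from $1, $2, ... in file order
          - name: first_column
            data_type: varchar
          - name: second_column
            data_type: varchar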

3. Run the dbt stage_external_sources macro to create external tables from the files in your stage, either as a standard refresh or a full refresh. (You can run this manually or in recurring jobs.)

# standard refresh
dbt run-operation stage_external_sources
# full refresh (create or replace)
dbt run-operation stage_external_sources --vars "ext_full_refresh: true"
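If you schedule this in a recurring job, one common pattern (an assumption here, not a requirement of the package) is to refresh the external tables before building and testing the models that read from them:

# example job: refresh external tables, then build and test models
dbt run-operation stage_external_sources
dbt run
dbt test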

4. Include your created external tables in your sources YAML file.

sources:
  - name: [INPUT_SOURCE_NAME_HERE]
    database: [NAME_OF_DATABASE_OF_EXT_TABLE_HERE]
    schema: [NAME_OF_SCHEMA_OF_EXT_TABLE_HERE]
    tables:
      - name: [NAME_OF_FIRST_EXTERNAL_TABLE_HERE]
      - name: [NAME_OF_SECOND_EXTERNAL_TABLE_HERE]

5. Now you can use these external tables in your model references. I recommend creating a source table for each external table: queries against a large external table can take a lot of time and processing, and loading into an intermediate table reduces some of that friction.
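As a minimal sketch, a staging model along these lines (using the hypothetical source and table names from the earlier sketch) materializes the external table as a regular table:

-- models/staging/stg_my_external_table.sql
-- materialize the external table as a regular table so downstream
-- queries don't scan S3 on every run
{{ config(materialized='table') }}

select *
from {{ source('ext_stage_source', 'my_external_table') }}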


Work With Us

SLATE is an engineering company focused on modern approaches to data architecture and software development, delivering custom solutions that help organizations gain the insights that drive effective decision making. If you have a problem that you'd like us to help you with, you can contact us via our website!
