Loading billions of Braze Events into Snowflake in a cost effective and scalable way

Kirankumar Desireddy
Published in Zip Technology
May 30, 2023

At Zip, we have been migrating to a new modern data platform. The migration journey from Athena to Snowflake was progressing well until we hit an obstacle while loading Braze Currents data. For those unfamiliar with Braze Currents, it is a vital component of our Customer Engagement Platform: it helps us understand the impact of our engagement strategy and improve how we communicate with our customers. Braze Currents captures a diverse range of real-time event data, as illustrated below, and in our case exports it to an S3 bucket for further analytics and data science workloads.

Braze Currents Events

Zip’s data platform and the challenges we faced

Zip’s data platform consists of a combination of paid SaaS and Open Source tooling including Snowflake, dbt, Fivetran, Airbyte, Airflow, Kafka and Snowplow. Our infrastructure is hosted on AWS.

The challenge we encountered went beyond technical difficulties: we needed a solution that could scale in performance without a significant rise in cost. The existing framework we have in place for loading external data leverages Fivetran, which allows us to quickly onboard new datasets using pre-built data models. Initially, we considered using Fivetran’s connectors for Braze Currents ingestion, but realised this would be cost-prohibitive at the volumes we were looking at. We explored several alternatives before settling on our current setup, which we selected for its simplicity, scalability and cost-effectiveness. This approach is outlined in detail below.

Braze Currents <> Snowflake Solution Architecture

Below are the steps in detail:

Step 1 — Braze configuration

Braze Currents can export data to several cloud storage destinations, such as Amazon S3, Azure Blob Storage and Google Cloud Storage. Since our infrastructure is hosted on AWS, we stream data from Braze Currents to an AWS S3 bucket continuously.

After setting up an AWS IAM role with permission to write objects to the S3 bucket, the following configuration needs to be set up in the Braze portal to continuously stream events to AWS S3 cloud storage.

Braze Currents Integration Setup in Portal

Below is a sample of the Braze Currents event data exported to the AWS S3 bucket. The data is exported in Avro format, a flexible data format that natively supports schema evolution. We currently have 37 event types streaming from Braze Currents to S3.

Folders created by Braze’s export to S3
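Inside the bucket, each event type lands under its own prefix, further partitioned by hour. As a rough sketch (the bucket name and integration prefix below are placeholders, not our actual values):

s3://<braze-currents-bucket>/<integration prefix>/
    event_type=users.behaviors.app.FirstSession/
        date=2023-05-30-12/
            <file>.avro
    event_type=users.messages.email.Open/
        date=2023-05-30-12/
            <file>.avro

Each external stage we create later points at a single event_type=… prefix, and the date=… folder names (one per hour) are what the external table partition expression shown later parses out of metadata$filename.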

Step 2 — Snowflake External Tables

Now that the data is syncing to the AWS S3 bucket, we create Snowflake external tables on top of these data files so we can query them in Snowflake. An external table in Snowflake allows you to query data stored in an external stage as if the data were inside a table in Snowflake.

Steps involved in exposing and using an event type in Snowflake:

  1. Create a Snowflake storage integration with the AWS S3 bucket where the Braze Currents events are synced.
  2. Create an external table in Snowflake over an external stage that points at the bucket location.
  3. Create a materialised view/table in dbt on top of these external tables and flatten the JSON fields so that queries on Braze Currents data are performant.

Example: App First Session events (users.behaviors.app.FirstSession)

To create the storage integration in Snowflake:

create storage integration BRAZE_DATA_S3_INT
  type = external_stage
  storage_provider = s3
  storage_aws_role_arn = 'aws arn role which has read access to s3 bucket'
  enabled = true
  storage_allowed_locations = ('s3 bucket data integration path');
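
After the integration is created, Snowflake provisions an IAM user and external ID that need to be added to the trust relationship of the AWS role referenced above. These can be retrieved with:

desc storage integration BRAZE_DATA_S3_INT;

The STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID properties in the output are what go into the role’s trust policy.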

To create the external stage in Snowflake:

CREATE OR REPLACE STAGE PROD.S3_INTEGRATION.STAGE_BRAZE_USERS_BEHAVIORS_APP_FIRSTSESSION   
URL='s3 bucket integration path/event_type=users.behaviors.app.FirstSession/'
STORAGE_INTEGRATION = BRAZE_DATA_S3_INT
FILE_FORMAT = (type=avro);
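
At this point a quick sanity check is to list the files visible through the stage; if the storage integration and role are wired up correctly, this returns the Avro files under the event_type=users.behaviors.app.FirstSession/ prefix:

list @PROD.S3_INTEGRATION.STAGE_BRAZE_USERS_BEHAVIORS_APP_FIRSTSESSION;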

To create the external table in Snowflake using dbt, define it as a source in a .yml file:

sources:
  - name: braze
    database: PROD
    schema: S3_INTEGRATION
    tables:
      - name: BRAZE_USERS_BEHAVIORS_APP_FIRSTSESSION
        external:
          location: '@PROD.S3_INTEGRATION.STAGE_BRAZE_USERS_BEHAVIORS_APP_FIRSTSESSION'
          file_format: "( type = avro )" # or a named file format
          auto_refresh: false # depends on your S3 setup
          partitions:
            - name: date_part
              data_type: timestamp
              expression: "to_timestamp_ntz(SPLIT_PART(SPLIT_PART(metadata$filename, '/', 4),'=',2) , 'yyyy-mm-dd-hh24')"
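
The stage_external_sources operation used in the next step comes from the dbt-external-tables package, so it needs to be declared in packages.yml and installed with dbt deps (the version range below is indicative only; pin whatever suits your dbt version):

packages:
  - package: dbt-labs/dbt_external_tables
    version: [">=0.8.0", "<1.0.0"]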

To create and refresh the Snowflake external table from dbt:

dbt run-operation stage_external_sources --args 'select: braze.BRAZE_USERS_BEHAVIORS_APP_FIRSTSESSION'
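
Once refreshed, the external table can be queried directly; each row exposes the raw Avro record as a variant column named value, alongside the date_part partition column. For example:

select metadata$filename, value, date_part
from PROD.S3_INTEGRATION.BRAZE_USERS_BEHAVIORS_APP_FIRSTSESSION
limit 10;

This is a handy way to inspect which fields are available before writing the flattening model below.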

To create the final materialised incremental table on top of the external table and refresh it using dbt:

{{
    config(
        materialized='incremental',
        tags=['braze']
    )
}}

-- Pull only the most recent partition of the external table on each run
with braze_unpack as (
    select * from {{ source('braze', 'BRAZE_USERS_BEHAVIORS_APP_FIRSTSESSION') }}
    where date_part = (select max(date_part) from {{ source('braze', 'BRAZE_USERS_BEHAVIORS_APP_FIRSTSESSION') }})
)

-- Flatten the Avro payload (the variant "value" column) into typed columns
select
    a.value:app_id::string as app_id,
    a.value:external_user_id::string as external_user_id,
    a.value:session_id::string as session_id,
    a.value:platform::string as platform,
    a.value:id::string as id,
    a.value:device_model::string as device_model,
    a.value:sdk_version::string as sdk_version,
    a.value:timezone::string as timezone,
    a.value:country::string as country,
    a.value:os_version::string as os_version,
    a.value:device_id::string as device_id,
    a.value:language::string as language,
    a.value:user_id::string as user_id,
    file_path,
    date_part
from braze_unpack a

You then use dbt’s run command to produce your dataset:

dbt run --select tag:braze

The external table refresh and materialised table refresh above can be scheduled multiple times a day to keep loading new Braze Currents data into Snowflake and support downstream analytics and data science use cases.
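
As a simple illustration of a downstream query (the table name here is illustrative; substitute the name of your dbt model), daily first-session counts per platform can be pulled straight from the materialised table:

select
    date_trunc('day', date_part) as session_date,
    platform,
    count(distinct session_id) as first_sessions
from ANALYTICS.BRAZE.USERS_BEHAVIORS_APP_FIRSTSESSION
group by 1, 2
order by 1, 2;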

Conclusion

In conclusion, the above approach has worked out well for us. We’ve been running it in production for 10 months and, for the most part, it’s been smooth sailing. While there is some overhead in pulling data in like this, the performance and cost benefits we’ve realised have been beyond our expectations.

It’s worth calling out some caveats though. The approach requires more work than using an existing ingestion framework like Airbyte or Fivetran. It’s certainly a useful pattern to have in the toolkit, though, and to apply where it makes sense to keep cost and performance in check.
