Loading billions of Braze Events into Snowflake in a cost effective and scalable way

Kirankumar Desireddy
Published in Zip Technology
May 30, 2023

At Zip, we have been migrating to a new modern data platform. The migration journey from Athena to Snowflake was progressing well until we hit an obstacle while loading Braze Currents data. For those unfamiliar with Braze Currents, it is a vital component of our Customer Engagement Platform: it helps us understand the impact of our engagement strategy and improve how we communicate with our customers. Braze Currents captures a diverse range of real-time event data, as illustrated below, and in our case exports it to an S3 bucket for further analytics and data science workloads.

Braze Currents Events

Zip’s data platform and the challenges we faced

Zip’s data platform consists of a combination of paid SaaS and Open Source tooling including Snowflake, dbt, Fivetran, Airbyte, Airflow, Kafka and Snowplow. Our infrastructure is hosted on AWS.

The challenge we encountered went beyond technical difficulties: we needed a solution that could scale in performance without a significant rise in cost. The existing framework we have in place for loading external data leverages Fivetran, which allows us to quickly onboard new datasets using pre-built data models. Initially, we considered using Fivetran’s connectors for Braze Currents ingestion, but realised this would be cost-prohibitive at the volumes we were looking at. We explored several alternatives before settling on our current setup, which we selected for its simplicity, scalability and cost-effectiveness. This approach is outlined in detail below.

Braze Currents <> Snowflake Solution Architecture

Below are the steps in detail:

Step 1 — Braze configuration

Braze Currents can export data to several cloud storage destinations, such as Amazon S3, Azure Blob Storage and Google Cloud Storage. Since our infrastructure is hosted on AWS, we stream data from Braze Currents to an AWS S3 bucket continuously.

After setting up an AWS IAM role with permission to write objects to the S3 bucket, the following configuration needs to be set up in the Braze portal to continuously stream events to AWS S3 cloud storage.

Braze Currents Integration Setup in Portal

Below is a sample of the Braze Currents event data exported to the AWS S3 bucket. The data is exported in Avro format, a flexible data format that natively supports schema evolution. We currently have 37 event types streaming from Braze Currents to S3.

Folders created by Braze’s export to S3
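Inside the bucket, each event type lands under its own prefix, further partitioned by hour. As a rough sketch (the bucket name and integration prefix below are placeholders, not our actual values):

s3://<braze-currents-bucket>/<integration prefix>/
    event_type=users.behaviors.app.FirstSession/
        date=2023-05-30-12/
            <file>.avro
    event_type=users.messages.email.Open/
        date=2023-05-30-12/
            <file>.avro

Each external stage we create later points at a single event_type=… prefix, and the date=… folder names (one per hour) are what the external table partition expression shown later parses out of metadata$filename.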

Step 2 — Snowflake External Tables

Now that the data is syncing to the AWS S3 bucket, we create Snowflake external tables on top of these data files so we can query them in Snowflake. An external table in Snowflake allows you to query data stored in an external stage as if the data were inside a table in Snowflake.

Steps involved in exposing and using an event type in Snowflake:

  1. Create a Snowflake storage integration with the AWS S3 bucket where the Braze Currents events are synced.
  2. Create an external table in Snowflake over an external stage that points at the bucket location.
  3. Create a materialised view/table in dbt on top of these external tables and flatten the JSON fields so that queries on Braze Currents data are performant.

Example: App First Session events (users.behaviors.app.FirstSession)

To create the storage integration in Snowflake:

create storage integration BRAZE_DATA_S3_INT
  type = external_stage
  storage_provider = s3
  storage_aws_role_arn = 'aws arn role which has read access to s3 bucket'
  enabled = true
  storage_allowed_locations = ('s3 bucket data integration path');
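
After the integration is created, Snowflake provisions an IAM user and external ID that need to be added to the trust relationship of the AWS role referenced above. These can be retrieved with:

desc storage integration BRAZE_DATA_S3_INT;

The STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID properties in the output are what go into the role’s trust policy.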

To create the external stage in Snowflake:

CREATE OR REPLACE STAGE PROD.S3_INTEGRATION.STAGE_BRAZE_USERS_BEHAVIORS_APP_FIRSTSESSION   
URL='s3 bucket integration path/event_type=users.behaviors.app.FirstSession/'
STORAGE_INTEGRATION = BRAZE_DATA_S3_INT
FILE_FORMAT = (type=avro);
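
At this point a quick sanity check is to list the files visible through the stage; if the storage integration and role are wired up correctly, this returns the Avro files under the event_type=users.behaviors.app.FirstSession/ prefix:

list @PROD.S3_INTEGRATION.STAGE_BRAZE_USERS_BEHAVIORS_APP_FIRSTSESSION;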

To create the external table in Snowflake using dbt, define it as a source in a .yml file:

sources:
  - name: braze
    database: PROD
    schema: S3_INTEGRATION
    tables:
      - name: BRAZE_USERS_BEHAVIORS_APP_FIRSTSESSION
        external:
          location: '@PROD.S3_INTEGRATION.STAGE_BRAZE_USERS_BEHAVIORS_APP_FIRSTSESSION'
          file_format: "( type = avro )" # or a named file format
          auto_refresh: false # depends on your S3 setup
          partitions:
            - name: date_part
              data_type: timestamp
              expression: "to_timestamp_ntz(SPLIT_PART(SPLIT_PART(metadata$filename, '/', 4),'=',2) , 'yyyy-mm-dd-hh24')"
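
The stage_external_sources operation used in the next step comes from the dbt-external-tables package, so it needs to be declared in packages.yml and installed with dbt deps (the version range below is indicative only; pin whatever suits your dbt version):

packages:
  - package: dbt-labs/dbt_external_tables
    version: [">=0.8.0", "<1.0.0"]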

To create and refresh the Snowflake external table from dbt:

dbt run-operation stage_external_sources --args 'select: braze.BRAZE_USERS_BEHAVIORS_APP_FIRSTSESSION'
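
Once refreshed, the external table can be queried directly; each row exposes the raw Avro record as a variant column named value, alongside the date_part partition column. For example:

select metadata$filename, value, date_part
from PROD.S3_INTEGRATION.BRAZE_USERS_BEHAVIORS_APP_FIRSTSESSION
limit 10;

This is a handy way to inspect which fields are available before writing the flattening model below.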

To create the final materialised incremental table on top of the external table and refresh it using dbt:

{{
    config(
        materialized='incremental',
        tags=['braze']
    )
}}

-- Pull only the most recent partition of the external table on each run
with braze_unpack as (
    select * from {{ source('braze', 'BRAZE_USERS_BEHAVIORS_APP_FIRSTSESSION') }}
    where date_part = (select max(date_part) from {{ source('braze', 'BRAZE_USERS_BEHAVIORS_APP_FIRSTSESSION') }})
)

-- Flatten the Avro payload (the variant "value" column) into typed columns
select
    a.value:app_id::string as app_id,
    a.value:external_user_id::string as external_user_id,
    a.value:session_id::string as session_id,
    a.value:platform::string as platform,
    a.value:id::string as id,
    a.value:device_model::string as device_model,
    a.value:sdk_version::string as sdk_version,
    a.value:timezone::string as timezone,
    a.value:country::string as country,
    a.value:os_version::string as os_version,
    a.value:device_id::string as device_id,
    a.value:language::string as language,
    a.value:user_id::string as user_id,
    file_path,
    date_part
from braze_unpack a

You then use dbt’s run command to produce your dataset:

dbt run --select tag:braze

The external table refresh and materialised table refresh above can be scheduled multiple times a day to keep loading new Braze Currents data into Snowflake and support downstream analytics and data science use cases.
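
As a simple illustration of a downstream query (the table name here is illustrative; substitute the name of your dbt model), daily first-session counts per platform can be pulled straight from the materialised table:

select
    date_trunc('day', date_part) as session_date,
    platform,
    count(distinct session_id) as first_sessions
from ANALYTICS.BRAZE.USERS_BEHAVIORS_APP_FIRSTSESSION
group by 1, 2
order by 1, 2;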

Conclusion

In conclusion, the above approach has worked out well for us. We’ve been running it in production for 10 months and, for the most part, it’s been smooth sailing. While there is some overhead in pulling data in like this, the performance and cost benefits we’ve realised have been beyond our expectations.

It’s worth calling out some caveats though. The approach requires more work than using an existing ingestion framework like Airbyte or Fivetran. It’s certainly a useful pattern to have in the toolkit, though, and to apply where it makes sense to keep cost and performance in check.
