3 Ways Data Engineers at Snowflake Leverage Reverse ETL

As a Data Engineer at Snowflake, it is my job to unify and operationalize data across our application stack to support the business. To accomplish this, our team leverages many tools and principles, one of which is reverse ETL.

Snowflake unlocks near real-time insights by centralizing an organization’s data within a single platform. Capabilities like Secure Data Sharing, External Functions, and Tasks enable data engineers to manage and maintain data that is foundational to our organization. Reverse ETL allows us to activate this data across our application stack, putting actionable data in the hands of our business partners and even automating previously manual efforts.

The Snowflake Data Cloud is the center of our team’s application and data stack, allowing us to have a shared backend and understanding across our organization.

How we use reverse ETL

At Snowflake, we typically get one or two requests for a reverse ETL job each week. These requests generally fall into three categories: data enrichment, automation, and versioning.

Data Enrichment

Data enrichment is a common use case for Snowflake as a product. Data engineers are often tasked with combining relevant data from across their applications to create a single source of truth.

At Snowflake it is no different, and we also make sure to push that data back into any application where it is useful. Not all users have access to our Snowflake instance or want to go back and forth between BI tools and their primary apps to find important information, so it is important to feed that data back into the host applications where everyone can reach it. This gives end users greater visibility and creates a shared backend between our applications.

A great example of how we use reverse ETL for data enrichment is how we manage address data in our sales organization. Address data needs to be accurate for a variety of use cases from billing to account ownership.

To improve address accuracy, our team uses a third-party application. This data is ingested into Snowflake and modeled with our account data to determine the correct address. Every time this pipeline runs, we push the result back into Salesforce so it is in the hands of the sales team, ensuring they have the most accurate data.
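
To make this concrete, here is a minimal sketch of what the final push step could look like. It assumes an illustrative modeled table (CLEANED_ACCOUNT_ADDRESSES) in Snowflake and uses the snowflake-connector-python and simple-salesforce libraries; the credentials, table, and field names below are placeholders, not our production pipeline.

# Hypothetical sketch: push modeled address data from Snowflake back into Salesforce.
# Table, column, and credential values are illustrative placeholders.
import snowflake.connector
from simple_salesforce import Salesforce

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***", warehouse="etl_wh"
)
rows = conn.cursor().execute(
    "select salesforce_account_id, billing_street, billing_city, billing_country "
    "from analytics.cleaned_account_addresses"
).fetchall()

salesforce = Salesforce(username="***", password="***", security_token="***")
# Bulk-update Account records with the corrected address fields.
salesforce.bulk.Account.update([
    {
        "Id": account_id,
        "BillingStreet": street,
        "BillingCity": city,
        "BillingCountry": country,
    }
    for account_id, street, city, country in rows
])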

Automation

One of our most powerful applications of this infrastructure is the ability to automate changes and workflows between different applications. By tracking changes to our applications in Snowflake, we can automate updates in downstream applications that previously had to be made manually.

This reduces the time the operations teams spend on maintenance and manual effort, while ensuring accuracy and faster data delivery.

For example, today our financial planning team manages territory assignments of both users and accounts in a tool called Adaptive. This is a critical piece of our business, as it determines both access to data and compensation for our sales org.

Anytime an assignment is created, deleted, or modified, it is processed by Snowflake and then sent back to Salesforce to keep both systems in sync. This saves the planning team hours of manual changes in Salesforce and lets us make changes frequently without fear of error.
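
At its core, a sync like this comes down to comparing the assignments modeled in Snowflake against the current state in Salesforce and deriving the creates, updates, and deletes to apply. A minimal sketch of that comparison, using illustrative dictionaries keyed by assignment ID, might look like this:

# Hypothetical sketch: derive creates, updates, and deletes between two states.
# snowflake_state and salesforce_state are illustrative dicts keyed by assignment id.
def diff_assignments(snowflake_state: dict, salesforce_state: dict):
    creates = {k: v for k, v in snowflake_state.items() if k not in salesforce_state}
    deletes = [k for k in salesforce_state if k not in snowflake_state]
    updates = {
        k: v
        for k, v in snowflake_state.items()
        if k in salesforce_state and salesforce_state[k] != v
    }
    return creates, updates, deletes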

Versioning

Finally, we leverage this infrastructure to version our application data. By using Snowflake as a common backend, we are able to store every version of the data sent between applications. This makes us resilient to process or application failures, because we can always roll an application back to a previous data state.
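
As a rough illustration of the idea, imagine every outbound batch being written to a history table before it is pushed, so a bad push can be reverted by re-sending the previous batch. The table and column names below are hypothetical, not our actual schema.

# Hypothetical sketch: persist each outbound payload so application state can be
# rolled back to a previous version. Table and column names are illustrative.
import json
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="etl_user", password="***")
cur = conn.cursor()

def record_batch(job_name, records):
    # Store the payload we are about to send to the target application.
    cur.execute(
        "insert into ops.reverse_etl_history (job_name, sent_at, payload) "
        "select %s, current_timestamp(), parse_json(%s)",
        (job_name, json.dumps(records)),
    )

def previous_batch(job_name):
    # Fetch the second-most-recent payload for a job: the state to roll back to.
    cur.execute(
        "select payload from ops.reverse_etl_history "
        "where job_name = %s order by sent_at desc limit 2",
        (job_name,),
    )
    rows = cur.fetchall()
    return json.loads(rows[-1][0]) if len(rows) == 2 else []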

Reverse ETL Infrastructure at Snowflake

Internally we have already built a number of reverse ETL connectors for analysts and data engineers to use. These connectors cover applications such as Salesforce, Workday, Anaplan, and ServiceNow.

These connectors have been built in one of two ways. The first is Python-based Airflow operators. The second approach is to use External Functions, which allow you to keep your reverse ETL work native to Snowflake.

Python Operator Implementation

Our team leverages Airflow heavily for our data pipelines and data modeling work. Therefore, when we started building this infrastructure, we thought it made sense to use an Airflow-based approach.

This would ensure that any data analyst, engineer or scientist would already be able to interact with the work in a common way. This resulted in our team building an operator that would take the following arguments and operate on a host application.

  • Query — Represents data to update
  • Object / API — Location to put data
  • Job Type — Create, update, delete
  • Batch Size — Number of records to attempt in a batch
  • Fields — Maps query data to fields in source

Once the end user has configured these fields, an Airflow task authenticates with both systems, pulls the data from each, and compares the two to determine whether data should be pushed.

Finally, it pushes records into the application and logs the changes with their success or failure.
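
As a rough illustration of how a job like this might be configured, the snippet below instantiates a hypothetical operator with the arguments listed above. ReverseEtlOperator, its import path, and the field values are illustrative placeholders, not the exact interface of our internal operator.

# Hypothetical sketch: configuring a reverse ETL task in an Airflow DAG.
# ReverseEtlOperator and its import path are illustrative placeholders.
from datetime import datetime
from airflow import DAG
from plugins.reverse_etl import ReverseEtlOperator  # hypothetical internal module

with DAG(
    dag_id="push_account_addresses",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    push_addresses = ReverseEtlOperator(
        task_id="push_addresses_to_salesforce",
        query="select * from analytics.cleaned_account_addresses",  # data to update
        object_api="Account",   # location to put data in the host application
        job_type="update",      # create, update, or delete
        batch_size=200,         # number of records to attempt in a batch
        fields={                # maps query columns to fields in the host object
            "SALESFORCE_ACCOUNT_ID": "Id",
            "BILLING_STREET": "BillingStreet",
        },
    )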

External Function Implementation

When we started building our reverse ETL infrastructure in 2018, Airflow and Python were our best options. But today, with Snowflake’s External Functions feature, we can make API calls natively from within Snowflake.

This is now our team’s preferred approach, as it allows us to do all of our work, from modeling to ingestion to reverse ETL, on the Snowflake platform.

My colleague previously wrote an article about how our team uses external functions and how to build them. Using PUT requests and API URLs, an end user can apply these practices to write data out of Snowflake into an application layer.

Example using external functions

Use case: update administrator details of business applications in ServiceNow with data from Snowhouse, using external functions.

Architecture:

Fig: Flow of a ServiceNow REST API call using an external function

The above architecture shows how the external function works for ServiceNow REST API calls. First, administrator data for business applications comes from Snowhouse, a Snowflake account that we use internally and that has identical functionality to what our customers use. This data is passed to an external function, which forwards the request to the AWS remote service that makes the actual API call.

Snowflake then reads the external function definition and the corresponding API integration, and returns the API response as a JSON object.

External function creation:

CREATE OR REPLACE SECURE EXTERNAL FUNCTION servicenow_api_call_put(url string, auth_headers string, data string)
RETURNS VARIANT
RETURNS NULL ON NULL INPUT
VOLATILE
MAX_BATCH_ROWS = 1
API_INTEGRATION = dev_it_data_ingest
HEADERS = (
    'api-url' = '{0}'
    'api-method' = 'put'
    'headers' = '{1}'
    'secrets' = 'servicenow_qa_secret'
    'api-post-data' = '{2}'
    'api-post-data-string' = 'true'
)
AS 'https://*****.execute-api.us-west-2.amazonaws.com/default/it_data_ingest_lambda';

The above external function takes arguments such as the ServiceNow REST API URL, the headers, and the data JSON that we want to update in ServiceNow. API_INTEGRATION is an object that contains the security details for connecting to the AWS Lambda remote service that has already been created to make API requests. HEADERS contains the API details, including secrets. The secret key is stored in AWS Secrets Manager and contains a username and password.

Invoking the external function:

select servicenow_api_call_put(
    'https://{{instance}}.service-now.com/api/now/table/{{table_name}}/{{sys_id}}',
    '{"Authorization": "Basic {{servicenow_qa_secret}}"}',
    '{"u_administrator_s": "01d95751dbf3ab003406356239961917", "u_admin_count": "1"}'
) as data;

We have to pass the following parameters to the external function:

  • ServiceNow REST API URL, which contains the "instance" name and the "sys_id" (system ID) of the business application.
  • Header — contains secret key “servicenow_qa_secret”.
  • Data JSON — fields that we have to update in Servicenow.

This query returns the API response as a JSON object; the data JSON passed in is what updates the corresponding record in ServiceNow.

Reverse ETL is a very powerful tool that we leverage on a regular basis. Now with Snowflake, we can build these pipelines directly on the platform and distribute them to any user in Snowflake. We’d love to hear how your team is using reverse ETL. Let us know in the comments.

Special thanks to Ganesh Gadakar, Shradha Adsule, Satya Kota, Kristen Werner, and the many other people who have helped us build and maintain this program!
