Reverse ETL — A Primer
Data infrastructure has gone through an incredible evolution over the past three years. We have moved from Extract, Transform, Load (ETL) to ELT, where raw data is copied from the source system loaded into a data warehouse or data lake and then transformed. Now teams are adopting yet another new approach, called “reverse ETL,” the process of moving data from a data warehouse into third party systems to make data operational. The emergence of reverse ETL solutions is a useful component of the stack to get better leverage out of data.
As teams rearchitect from ETL to ELT, the data warehouse becomes the single source of truth for data, including customer data that can be spread across different systems. Solutions that have enabled this new architecture include Fivetran and Airbyte for EL, DBT for T, and Snowflake and Redshift for the data warehouse. Traditionally data stored in data warehouses were used for analytical workloads and business intelligence applications like Looker and Superset. Data teams are now recognizing that this data can be further utilized for operational analytics, which “drives action by automatically delivering real-time data to the exact place it’ll be most useful, no matter where that is in your organization.”
Sending data in real time to SaaS systems can be helpful for making sure there is a consistent view of the customer across all systems. There are many use cases for reverse ETL. Pushing data to Salesforce from the data warehouse means you can have an up-to-date list of high lifetime value customers or customers that spend more than $500/month. Mirroring product usage data in Intercom can help improve customer interactions by supporting personalized messages that include product metrics. Syncing internal customer data into Zendesk can save time when responding to support requests or automatically prioritize messages when they come in. The list goes on and on.
Until recently, teams wrote their own API connectors from the data warehouse to SaaS products to pipe the data into operational systems like Salesforce, Marketo, HubSpot, etc. It can be hard to write these connectors because endpoints may be brittle and most APIs are not built to handle real-time data transfer so teams must setup batching, retries, and checkpointing to avoid rate limits. Mapping fields from the data warehouse to SaaS product can take time. Our research suggested it can take a few days to a week to write some of these connectors. From there, it can be challenging to maintain the connectors over time because API specs change. Because of these challenges, reverse ETL solutions emerged.
Reverse ETL solutions are the flipside of EL tools like Fivetran. They offer out of the box connectors to numerous systems, so teams no longer need to write and maintain them. Before teams might have only written a few connectors for their top SaaS products because of the engineering burden to address the long tail. With reverse ETL tools, they can now push data into more systems, getting better use out of their data. Moreover, reverse ETL tools provide a visual interface to choose which query output columns are used to populate standard and custom SaaS fields. They allow you to continuously sync or define what triggers the syncing between the two systems. For example, after a dbt job is run it can trigger the sync. Reverse ETL solutions log and monitor sync status and progress and notify teams if they need attention. Solutions can create customer segmentation, audiences, and lead scoring through a visual analysis interface or dbt model outputs that can be pushed downstream.
Both data and GTM teams benefit from reverse ETL. Data teams now only have to maintain a single data pipeline compared to multiple. They no longer have to write scripts and have visibility and control over syncs. Sales, marketing, growth, and analytics teams can analyze and act upon the same, consistent, and reliable data. Data consistency helps create a continuity across the business since functional teams are working off the same data even if using different SaaS products. Fresh data in SaaS products help improve decision-making and accelerates it.
There are now a handful of startups building reverse ETL products including Hightouch, Census, Grouparoo (open source), Polytomic, Rudderstack, and Seekwell. We’ve heard businesses also use Workato to enable reverse ETL with differential sync.
Some readers may be asking, “Can’t I do this with Segment Personas?” Personas is bundled with core Segment and is not self-service. Segment Personas’ SQL importer can import data from the warehouse, including from non-Segment sources and results from dbt transformations. A user can add custom fields (string, number, boolean) to Segment Persona users and accounts then send those fields to destinations through Segment. For example, the SQL import can add a custom field like “number of Zendesk tickets.” However, the data must conform to Segment’s data model which means it must be applied to a property on a user or account. Reverse ETL solutions do not have this limitation since they do not have users and accounts. Simply, they pipe the data straight to the destination.
Personas also doesn’t support sending other object types to destinations (e.g. Salesforce/CRM custom objects, metadata on Zendesk tickets or Stripe charges). Personas uses streaming APIs but batch loads so API rate limits can be an issue. If you are in a regulated industry it is important to note Personas stores a copy of all customer data. This does not happen with reverse ETL solutions.
Reverse ETL solutions are becoming a core piece of the data stack. We are excited to watch as the ecosystem evolves. If you or someone you know is working on a reverse ETL startup or adjacent offering, it would be great to hear from you. Comment below or email me at email@example.com to let us know.