Data Pipelining — A Primer
Data pipelining is the act of moving data from one location to another and is one of the most critical operations in data infrastructure. Data pipelines are a superset of Extract, Transform, Load (ETL) and modern ELT. Additional data pipeline technologies include reverse ETL tools and integration and workflow automation vendors. While data pipelining isn’t new, as more companies have become data-driven, moved to the cloud, and adopted cloud data warehouses, we’ve seen the emergence of numerous new data pipelining solutions.
Data pipelines help teams get more leverage out of their data for business decisions and product experiences. With the proliferation of SaaS and cloud data stores, data is fragmented across business systems. Data silos limit data’s utility and make it hard to form a complete picture for business decisions. By using data pipelines to 1) consolidate data across disparate resources in a data warehouse or 2) push data between operational systems, teams can make better decisions and move more quickly.
Data pipelines are often referenced in the context of ETL or the more modern ELT. ETL extracts data from one system, transforms the data, and loads the data into a database or data warehouse, typically in batches. ELT systems extract and load data into the database first, and transformations then occur in the database. Data pipelines are a superset of ETL and ELT because data pipelines don’t have to transform data as part of the pipeline, can move data in real time (streaming) instead of in batches, and can load data into any number of targets, such as a data lake or a SaaS solution.
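To make the ETL versus ELT distinction concrete, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for a data warehouse, and the source rows, table names, and function names are all hypothetical, chosen only for illustration. The point is where the transformation happens: in pipeline code before the load (ETL), or as SQL inside the database after the load (ELT).

```python
import sqlite3

# Hypothetical source rows, standing in for records pulled from a SaaS API.
SOURCE_ROWS = [
    {"id": 1, "email": " Ada@Example.com ", "amount_cents": 1250},
    {"id": 2, "email": "grace@example.com", "amount_cents": 990},
]


def extract():
    """Extract step: return a copy of the raw source rows."""
    return list(SOURCE_ROWS)


def run_etl(conn):
    """ETL: transform in pipeline code, then load only the cleaned rows."""
    conn.execute(
        "CREATE TABLE orders_etl (id INTEGER, email TEXT, amount_usd REAL)"
    )
    cleaned = [
        (r["id"], r["email"].strip().lower(), r["amount_cents"] / 100)
        for r in extract()
    ]
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)


def run_elt(conn):
    """ELT: load raw rows first, then transform with SQL inside the database."""
    conn.execute(
        "CREATE TABLE orders_raw (id INTEGER, email TEXT, amount_cents INTEGER)"
    )
    raw = [(r["id"], r["email"], r["amount_cents"]) for r in extract()]
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", raw)
    conn.execute(
        "CREATE TABLE orders_elt AS "
        "SELECT id, lower(trim(email)) AS email, "
        "amount_cents / 100.0 AS amount_usd "
        "FROM orders_raw"
    )


def fetch(conn, table):
    """Read back the transformed rows from one of the tables created above."""
    query = f"SELECT id, email, amount_usd FROM {table} ORDER BY id"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    run_etl(conn)
    run_elt(conn)
    print(fetch(conn, "orders_etl"))
    print(fetch(conn, "orders_elt"))
```

Both paths end with the same cleaned rows. The ELT path also keeps the raw table in the database, which is one reason the pattern pairs well with cloud data warehouses: transformations can be rewritten and rerun later without extracting the data again.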
In the ETL and ELT context, data pipelines help move data to data warehouses so it can be used for analytics/BI and leveraged by business applications like modern CRMs. Integration and workflow automation vendors and reverse ETL data pipelines can move data into SaaS tools for operational analytics. When data pipelines move data into production databases, this information can be fed into product experiences.
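The reverse ETL direction can be sketched in a few lines. This is an illustrative example only: the in-memory SQLite database stands in for a data warehouse, and `FakeCrmClient`, `upsert_contact`, and the `customer_metrics` table are hypothetical names, not any vendor's API. A real sync would also need batching, rate limiting, and error handling.

```python
import sqlite3


class FakeCrmClient:
    """Hypothetical stand-in for a SaaS API client; stores upserts in memory."""

    def __init__(self):
        self.contacts = {}

    def upsert_contact(self, email, fields):
        # Create the contact if it is missing, then merge in the new fields.
        self.contacts.setdefault(email, {}).update(fields)


def sync_to_crm(conn, crm):
    """Read a modeled table from the warehouse and push each row to the CRM."""
    rows = conn.execute(
        "SELECT email, lifetime_value FROM customer_metrics"
    ).fetchall()
    for email, lifetime_value in rows:
        crm.upsert_contact(email, {"lifetime_value": lifetime_value})
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer_metrics (email TEXT, lifetime_value REAL)")
    conn.execute("INSERT INTO customer_metrics VALUES ('ada@example.com', 420.0)")
    crm = FakeCrmClient()
    print(sync_to_crm(conn, crm), crm.contacts)
```

Using an upsert keyed on a stable identifier (here, email) keeps repeated syncs from creating duplicate records in the destination tool.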
Informatica, Talend, and Matillion are well-established offerings in the ETL space; however, with the emergence of cloud data warehouses we’ve seen numerous new players like Fivetran. Fivetran gained traction because of its straightforward implementation and ease of use. In turn, it was better able to attract mid-market customers than the enterprise-focused incumbents. As data pipelines have become a necessary component of modern data infrastructure, there is demand from startups and SMBs for solutions that meet their needs cost-effectively. New, often open-source vendors like Airbyte, Rudderstack, and Meltano are making headway. They lower the barrier to entry for adopting data pipelining and expand as users grow.
From our research we found data pipeline buyers care the most about ease of use, number of integrations, cost, and ability to customize. A vendor’s breadth of connectors is particularly important and reflects a power law. When a vendor doesn’t have a connector, teams write homegrown data pipeline scripts or use a second data pipeline vendor to fill the gap. As more traditional industries like real estate, construction, agriculture, and manufacturing become data-driven, they particularly care about connectors for their vertical-oriented SaaS products. We found that the majority of data pipeline users connect to fewer than 25 cloud apps but expect that number to nearly double in the next 2–5 years. One reason open-source solutions are gaining ground is that they allow community members to write their own connectors and contribute them back to the platform. A data pipeline vendor’s ability to easily add connectors is crucial to its success.
Today many data pipelining solutions are hosted and closed source. However, as data privacy and security become more top of mind, we hear customer interest in solutions that can be run on premises for security reasons. There are also asks to identify PII and PHI data moving in the pipeline. Some buyers also express a preference for open source for customization and control. While Singer taps were introduced in 2017, new open-source vendors like Airbyte are gaining ground because Singer isn’t actively managed and controlled, leading to outdated and nonstandard taps. Meltano is trying to help improve Singer tap standardization and management.
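As a rough illustration of what identifying PII in a pipeline can look like, the sketch below scans the string fields of a record against two simple regular expressions. The patterns, the `PII_PATTERNS` table, and the `find_pii` function are assumptions made for this example; real PII and PHI detection is considerably harder and usually combines dictionaries, column metadata, and statistical checks.

```python
import re

# Illustrative patterns only; they will miss many real-world formats
# and can produce false positives.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(record):
    """Return {field_name: [pattern names that matched]} for string fields."""
    findings = {}
    for field, value in record.items():
        if not isinstance(value, str):
            continue  # this sketch only inspects string values
        hits = sorted(
            name for name, pattern in PII_PATTERNS.items() if pattern.search(value)
        )
        if hits:
            findings[field] = hits
    return findings


if __name__ == "__main__":
    record = {"note": "contact ada@example.com", "ssn": "123-45-6789", "qty": 3}
    print(find_pii(record))
```

A pipeline could use findings like these to mask, drop, or route flagged fields before they reach the destination.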
Below we group data pipeline technology into a few categories: ETL/ELT, reverse ETL, and integration and workflow automation (sometimes known as iPaaS). ETL/ELT users are typically data engineers moving data from SaaS applications to databases. Reverse ETL helps operational teams move data from data warehouses to SaaS systems for operational analytics. Integration and workflow automation platforms typically serve business users through a GUI and can move data from systems to databases or SaaS tools.
Data pipelines are a crucial component of the modern data stack and can be adopted in many forms, including ETL/ELT, reverse ETL, and integration and workflow automation. Open-source offerings are starting to gain ground with security- and cost-conscious adopters. If you or someone you know is working on a data pipeline startup or adjacent offering, it would be great to hear from you. Comment below or email me at email@example.com to let us know.