ETL / ELT Market Teardown in 2021

The Case for a Next-Generation of Data Integrators

Monica Mishra
Primary Venture Partners
7 min read · Feb 8, 2021


“If data is the new oil, then Fivetran is the pipes that get it from the source to the refinery” — Martin Casado

Introduction

The data management stack has evolved rapidly over the past five years, fundamentally disrupting established data norms within the enterprise. This flurry of innovation has had a cascading effect and produced new brand names such as Snowflake and Looker. One of the last spaces to fully respond to the disruption, and thus one of the most exciting spaces in data management for entrepreneurs and investors alike, is the next generation of data integrators.

Figure A: Basic ETL architecture

Data integrators are also known as ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) tools; the two differ in whether data is transformed before or after it lands in the destination. See Figure A for the basic architecture.

These invisible data pipes are critical to unlocking the full potential of emerging data management solutions. The right ETL/ELT tool abstracts away the complexity associated with data handling, allowing for higher-quality insights trusted by the entire organization. These tools can democratize access to enterprise data and save valuable engineering resources.
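To make the extract-transform-load flow concrete, here is a minimal sketch of an ETL job in Python. The source records, field names, and SQLite destination are all hypothetical stand-ins; a real pipeline would pull from a source API or production database and load into a warehouse.

```python
import sqlite3

# Hypothetical "extract" step: in practice this would call a source API
# or query a production database.
def extract():
    return [
        {"order_id": 1, "amount_cents": 1250, "currency": "USD"},
        {"order_id": 2, "amount_cents": 980, "currency": "USD"},
    ]

# "Transform" step: normalize units before loading (cents -> dollars).
def transform(rows):
    return [(r["order_id"], r["amount_cents"] / 100) for r in rows]

# "Load" step: write to the destination store (in-memory SQLite here,
# standing in for a warehouse).
def load(rows, conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount_usd REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT SUM(amount_usd) FROM orders").fetchone()[0])
```

In an ELT tool, the transform step would instead run inside the destination warehouse (typically as SQL) after the raw records are loaded.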

But incumbent ETL tools are riddled with issues and have been unable to adequately respond to the shifting technical landscape. The time is now for new entrants. In this article, I briefly cover the changes across the data management stack, dive into new user needs and issues with incumbents, introduce the ETL market map, and then close with some success factors for the next generation of data integrators.

Founders, if I’ve misplaced you or a friend on any of the maps below, please reach out to correct me. This goes for everyone, send me your feedback, corrections, and questions — I’d love to hear from you.

Changes Across the Data Management Stack

Before diving into the shifts in this space, let’s first get in sync on the definitions of some related yet distinct data terms:

Data Infrastructure: Architectural layer supporting data orchestration, management, consumption, and sharing across an organization

Data Management: Subset of Data Infrastructure → Process of managing, manipulating, and visualizing data across an organization

Data Pipelines: Subset of Data Management → Set of processes that move data from one system to another

ETL Pipelines: Subset of Data Pipelines → A Data Pipeline that always transforms the data and loads it into a database, an operational data store, a data mart, a data lake, or a data warehouse

With that in mind, we turn to the modern data management stack. In 2019, the space was worth ~$90B. Data integration was worth ~$10B but was growing at the fastest rate among all the data management components.

Figure B: The modern data management stack

Each of the data management components is experiencing independent change, and those changes amplify one another in a cascading way. Low-cost sensors and innovation in IoT have increased data proliferation on the Capture side. Cloud compute and storage costs are approaching zero, while compute capabilities are only increasing, thus powering rapid evolution in the Storage, Synthesis, and Visualization components. Increased legislation like GDPR or CCPA has mandated that companies maintain robust governance solutions, thus fueling activity in that space as well. Underscoring all these changes is a consistent enterprise trend from on-prem to hybrid or multi-cloud architectures.

Figure C: Changes across the modern data management stack

New User Needs

Users have responded to this innovation in data management by changing their expectations for both the ubiquity and utility of enterprise data. In other words, users now expect more from more data. These are the relevant new user needs:

Data-driven insights for operations → Users want customized learnings from enterprise data that can directly improve business execution

Generate real-time insights → Real-time insights from data streams are essential to unlocking certain use cases such as fraud detection, medical IoT, HFT, etc.

Visibility and control over data → Legislation such as GDPR has made it mission-critical for companies to maintain full visibility and control of their organizational data pathways

Ability to utilize all data → In the modern enterprise, data is stored in multiple cloud providers, data silos, and internal sources. Users are increasingly demanding access to all of that data to generate insights

Democratize access to data → Business analysts are increasingly being asked to set up the pipelines as well as interpret the results from the pipelines. There is a growing appetite for modern data pipeline tooling for less technical buyers or “citizen integrators”

Existing Issues with Incumbents

All of these changes across the data management subcomponents have also put pressure on existing data integrators, and the following issues have surfaced:

End users don’t trust results → ETL pipeline jobs are often daisy-chained, which makes it difficult to trace a result’s origin, verify its accuracy, or control for the mathematical consequences of inconsistent transformations. When end figures (say, monthly revenue numbers) don’t match others’ manually calculated figures, trust in data is broken

Expensive to set up and maintain → ETL pipelines are usually custom jobs that require an expensive data engineering team to connect the source to the database. If a non-technical team is working on a project that requires a new pipeline, 80–90% of their time is spent manually wiring it up

Scales poorly → Mapping data from its source to enterprise data storage is rarely a one-to-one relationship. Once any underlying data schema changes, the ETL pipeline job breaks. The more connectors and jobs there are, the more frequently this happens
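The schema-drift failure mode can be shown in a few lines. Suppose an upstream team renames a field (the field names here are invented for illustration); a pipeline with a hard-coded mapping breaks immediately:

```python
# Brittle transform with a hard-coded field mapping (illustrative names).
def transform(rows):
    return [(r["user_id"], r["signup_date"]) for r in rows]

# Yesterday's extract: the mapping works.
ok_rows = [{"user_id": 1, "signup_date": "2021-02-01"}]
print(transform(ok_rows))  # [(1, '2021-02-01')]

# Today, upstream renames signup_date -> created_at; the job now fails.
drifted_rows = [{"user_id": 1, "created_at": "2021-02-01"}]
try:
    transform(drifted_rows)
except KeyError as e:
    print(f"pipeline broken: missing field {e}")
```

Multiply this by dozens of connectors, each tracking an independently evolving source schema, and the maintenance burden compounds, which is exactly the scaling problem described above.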

ETL Market Map

Speaking of incumbent ETLs, here’s a market map of the data integrator space. If I’ve missed or misplaced anyone, please let me know. There’s space across the new-gen categories, especially in the open-source category. The market has space for both horizontal entrants, which serve a variety of use cases, as well as verticalized players, which serve a niche.

Figure D: Data integrator market map

The line separating new-gen vs incumbent players is debatable. Here’s how I thought about the difference.

Figure E: Some differences between incumbent and new-gen ETLs

Success Factors

Given these fundamental shifts across the data management stack, which have triggered these new user needs and surfaced existing issues with incumbent ETLs, I’ve identified some success factors for the next generation of data integrators.

Not all good data integrators will have these factors, and not all data integrators that have these factors will be good. That said, think carefully about any missing factors. Some matter more than others, given the context of the business and the tool. These factors should serve as a framework to assess emerging tools in the space, but it’s only part of the overall assessment.

Figure F: Some success factors for the next-generation of data integrators

Conclusion

Congrats on making it here! That’s a lot of information packed into a relatively small amount of space. The main takeaway is that ETLs are the last frontier to effectively respond to the data management revolution. They are critical infrastructure for a data-rich future and constitute an ever-growing market. It is a great time to be founding or investing in a next-generation data integrator tool.

If you are building something in this space, have feedback, or would like to collaborate on this work, please reach out! I’d love to hear from you.

This article is part of Primary Venture Partners’ market teardown series by their MBA associate cohort. See here for the full series and video links.

Sources & Acknowledgments

Sources:
James Mayfield, Cofounder, Transform, 2020: “An island of truth: practical data advice from Facebook and Airbnb”
Jon Ma, Cofounder, Public Comps, 2019: “Data Tooling Marketmap — 2019”
Martin Casado, a16z blog, 2019: “Fivetran”
Priyanka Somrah, Work-Bench, 2020: “Work-Bench Snapshot: The Last Mile of the ETL Framework”
Qubole Blog, updated 2020: “Top Apache Spark Use Cases”
Stitch Data Blog: “ETL Database”
Yue Wang, Tile, 2019: “How We Build Data Pipelines at Tile”

Acknowledgments:
John Wu, FirstMark Capital
Tyler Zon, Great Hill Partners
Itamar Hemo, Rivery
Vardit Jonasz, Rivery
Howard Xu, TCV
Eric Anderson, Scale VP
And the entire team at Primary — thank you.
