Who’s Who in the Modern Data Stack Ecosystem (Fall 2021)

Jordan Volz
10 min read · Oct 29, 2021


(This article originally appeared on the Continual blog)

In our previous article, The Future of the Modern Data Stack, we examined the motivations behind the modern data stack, surveyed its current state, and looked optimistically into the future to see where it is headed. If you’re new to the modern data stack, we highly recommend giving that article a read. A question we often get from new adopters of the modern data stack is “What tech should we be looking into?” It’s a great question: there are many different components to the modern data stack, and as its popularity grows, many companies aspire to re-brand and jump on the bandwagon. We thought a roadmap to the modern data stack would be a great resource for anyone just getting acquainted with the ecosystem.

A Brief History of the Modern Data Stack

The modern data stack is a collection of cloud-native tools that are centered around a cloud data warehouse and together comprise a data platform. The benefits of adopting a modern data stack are many:

  1. Ease of Use: SaaS technologies free your team from installing and maintaining technology. Everything is built for the data warehouse, which minimizes integration pains and the siloed data platforms that require lots of effort shifting data around.
  2. Wide Adoption: The modern data stack is constructed with the intention of upskilling data workers and removing the barriers between workflows; anyone can be a data engineer, data analyst, or machine learning engineer with the right tooling. SQL is the lingua franca that creates a common foundation for working with data across disciplines. A common theme we’ve noticed among modern data stack adopters is that people no longer focus on just one discipline and instead work as hybrid “data engineer/data analyst/data scientist” practitioners.
  3. Automation: Tools that don’t focus on automation place a huge technology burden on users when it comes time to operationalize data workflows. We often refer to these strewn-together systems as “pipeline jungles”, where, over time, it can become almost impossible to detangle the complex web of logic. Automation needs to be a core feature of data tools.
  4. Cost: Say goodbye to predatory vendors with high entry fees. In the cloud, you pay for what you use, and nothing more. A side effect of tools having wide adoption and a focus on automation means your data workers can get more done, in less time, with fewer resources. This has benefits in terms of the cost to staff up a data team as well.

The modern data stack is really the resurgence of the data warehouse as the primary data store for data workloads. After several decades of dominance, the data warehouse began to fall out of fashion in the era of “big data” as data lakes briefly rose to prominence. Data lakes ultimately proved too complex and costly for most organizations, and the fast adoption of cloud infrastructure in the 2010s gave the data warehouse a great opportunity to make a comeback, this time built for the cloud and incorporating many technical aspects of the big data movement. From there, it was inevitable that an ecosystem of tools would assemble to re-envision data workflows for the cloud era.

Qualifying for the Modern Data Stack

What makes a product part of the modern data stack? In our previous article, we laid down some guiding principles, which we’ll also use here. Specifically, to be part of the “modern data stack”, a technology must be:

  1. A Managed Service: If we’ve learned one thing from the onset of cloud technologies it is that investing in technologies that require installation and maintenance carries a huge burden for the customer. They also tend to be more expensive (How does that make sense?). If you’re not in an industry that requires on-premise or private cloud installations, it’s becoming more ludicrous with each passing year to consider adopting non-SaaS technology. The modern data stack focuses on SaaS technologies so your teams can focus on the data, not the technology.
  2. Cloud Data Warehouse-Centric: By focusing on the cloud data warehouse, we maximize adoption and the range of user profiles who can leverage each tool. This also minimizes integration pains, and users are assured that things actually work because their warehouse is not one of five hundred data source integration options. Unsurprisingly, by keeping the data stack simple, things work a lot better.
  3. Operationally Focused: There are many tools that are great for development or prototyping but then quickly fall apart once it’s time to productionalize the work product. The modern data stack is opposed to this ideology and tools must be developed with operationalization in mind. Getting from development to production should be a simple process, not something that requires rigging together pipelines and API calls.

Highlights: Modern Data Stack Ecosystem — Fall 2021 Edition

As the modern data stack continues to grow and evolve, many new technologies and vendors are entering the conversation. Below is our take on the current main functional areas of the modern data stack and the main vendors in each category; we’ll dive into each one in a little more detail.

Cloud Data Warehouse

Main Tools to Consider: Snowflake, Google BigQuery, AWS Redshift, Databricks SQL.

This is where it all starts! You can’t get started in the modern data stack without a data warehouse to store your data. Snowflake is currently the leader in this area, but every cloud vendor has its own offering and BigQuery and Redshift are commonly used as the foundation for the modern data stack. Databricks can be a disrupter here, as its SQL offering has the potential to lure in larger enterprises who are looking to simplify their Hadoop-era data pipelines but not abandon Apache Spark entirely. One thing is certain: the future lives in the cloud and speaks SQL.

Data Integration

Main Tools to Consider: Fivetran, Airbyte, Stitch.

A data warehouse is only as good as the data that it holds, and it’s only valuable if you can actually get useful data into it. It’s essential for every modern data stack to have a data integration tool, and there are several to choose from. Fivetran and Stitch have been around the longest and have the most traction in terms of helping customers get data into their cloud data warehouse, but Airbyte is a new technology that is open-sourced and is quickly picking up a dedicated fanbase. One advantage of these offerings over legacy ETL tooling is that they put a lot of engineering effort into understanding the underlying source systems’ APIs, making importing data as easy as a few clicks of a button. Given the complexity of some sources, like Salesforce, it’s impressive that you can go from zero to production in under a day with these tools. Nobody should be writing these integration pipelines themselves anymore.

Event Tracking

Main Tools to Consider: Segment, Snowplow, Rudderstack.

Another aspect of data integration is event tracking, or the “customer data platform”. These focus primarily on ingesting events pertaining to customer behavior and additionally offer functionality for transforming your data and loading it into your cloud data warehouse, or directly into destinations like Salesforce, Hubspot, or Marketo. Although there is some crossover with the pure data integration tools above, certain use cases are better solved with an event tracker, and it is not uncommon to see customers happily leveraging both. Segment is the most established vendor in this field, but Snowplow is an open-source alternative with its fair share of supporters, and Rudderstack is a newer entrant that has been gaining a lot of steam since Segment was acquired by Twilio.
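To give a feel for what these tools capture, here is a minimal sketch of a tracking event payload in Python; the field names are illustrative, not any specific vendor’s schema:

```python
import json
from datetime import datetime, timezone

def make_event(user_id: str, event: str, properties: dict) -> dict:
    """Build a minimal tracking payload; the field names here are
    illustrative, not any vendor's actual event schema."""
    return {
        "user_id": user_id,
        "event": event,
        "properties": properties,
        # Event trackers stamp events so they can be ordered and deduplicated
        # downstream in the warehouse.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

payload = make_event("user-42", "checkout_completed", {"cart_value": 99.5})
print(json.dumps(payload, indent=2))
```

An event tracker’s real value is everything around this payload: client SDKs that emit it reliably, batching and retry logic, and loaders that land the events in your warehouse or forward them to destinations.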

Transformation

Main Tools to Consider: dbt.

When it comes to data transformation on the modern data stack, there’s really only one tool in town: dbt. dbt has a huge, thriving community, is used by thousands of companies, and is conveniently open-sourced. There was a short blip when Dataform looked like it might challenge dbt’s reign, but following its acquisition by Google, it’s pretty hard to find companies selecting Dataform over dbt. We have yet to talk to a BigQuery customer who is not using dbt. What about your ETL vendor of yore? Let’s be honest: only dbt has the modern developer workflow and data warehouse-centric design that meet the criteria for being part of the modern data stack.
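To make the pattern concrete, here is a minimal sketch of the in-warehouse transformation workflow that dbt automates, using Python’s sqlite3 as a stand-in for a cloud warehouse; the table and model names are made up for illustration:

```python
import sqlite3

# sqlite stands in for a cloud data warehouse; table names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (id INTEGER, status TEXT, amount REAL);
    INSERT INTO raw_orders VALUES (1, 'completed', 20.0),
                                  (2, 'returned', 15.0),
                                  (3, 'completed', 30.0);
""")

# A dbt "model" is essentially a SELECT statement that the tool materializes
# as a view or table inside the warehouse, handling dependency ordering,
# scheduling, testing, and documentation around it.
conn.execute("""
    CREATE VIEW fct_completed_orders AS
    SELECT id, amount FROM raw_orders WHERE status = 'completed'
""")

total = conn.execute(
    "SELECT SUM(amount) FROM fct_completed_orders"
).fetchone()[0]
print(total)  # 50.0
```

In real dbt projects the model lives in a SQL file with Jinja templating, and the tool builds a dependency graph across hundreds of such models; the key idea is simply that transformation happens inside the warehouse, in SQL.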

Artificial Intelligence

Main Tools to Consider: Continual.

AI is a new entry to the modern data stack. We think it is the next logical step for companies taking the journey down the modern data stack: they already have well-curated datasets, great processes for ingesting new datasets, and easy ways to connect insights back to the business. The next piece of the puzzle is a tool that enables your team to transform into machine learning engineers and start tackling AI use cases. Continual is the first AI/ML platform co-designed with the modern data stack. It has tight dbt integration and allows users of all profiles to come into the data warehouse and start operationalizing AI in days, not months. We believe we’re the perfect complement for any company that’s looking to get extra value out of the data they are already collecting in their data warehouse. To date, we believe we are the only AI tool that actually lives up to the tenets of the modern data stack, although we’d love some company! Complex MLOps platforms for experts only, and point-and-click AI tools without an operational focus, need not apply.

Analytics

Main Tools to Consider: Looker, Mode, Tableau, ThoughtSpot, Preset.

The data analytics and BI market has always been one of the most hotly contested categories in the data ecosystem, and it’s no different in the modern data stack. Although Tableau has a large market share overall, Looker and Mode were positioned as cloud-native early on and have entrenched themselves deeply into the modern data stack. Tableau’s close proximity to Salesforce is actually a bonus for many customers, so they are still widely used. Preset represents the open-source tool of choice in the community — now available as a cloud-managed service — and ThoughtSpot has an interesting viewpoint around search-enabled BI that shouldn’t be ignored.

Reverse ETL

Main Tools to Consider: Census, Hightouch, Rudderstack.

Reverse ETL is the flip side of the Data Integration category: tools that make it easy to get data out of your data warehouse and back into the applications your business uses. Census and Hightouch both have a lot of momentum and strong offerings. Competition is pushing them to move fast, and more companies are experiencing the benefits by the day. For event tracking use cases, you may also want to keep the entire workflow contained within whichever event tracking vendor is used above, but these point-to-point integrations can miss out on many of the benefits of a data warehouse-centric design.
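The core reverse ETL loop can be sketched in a few lines of Python. The `CrmClient` below is a hypothetical stand-in for a real SaaS SDK, and sqlite3 stands in for the warehouse; the table and field names are made up for illustration:

```python
import sqlite3

class CrmClient:
    """Hypothetical stand-in for a SaaS API client (e.g. a CRM SDK);
    not a real library."""
    def __init__(self):
        self.upserted = []

    def upsert_contact(self, email: str, fields: dict) -> None:
        # A real client would issue an authenticated API call here.
        self.upserted.append((email, fields))

# sqlite stands in for the warehouse holding a modeled "audience" table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE high_value_users (email TEXT, lifetime_value REAL);
    INSERT INTO high_value_users VALUES ('a@example.com', 1200.0),
                                        ('b@example.com', 800.0);
""")

crm = CrmClient()
# The essential reverse ETL loop: query the warehouse, push each row
# into the business application.
for email, ltv in conn.execute(
    "SELECT email, lifetime_value FROM high_value_users"
):
    crm.upsert_contact(email, {"lifetime_value": ltv})

print(len(crm.upserted))  # 2
```

What the vendors add on top of this loop is the hard part: incremental syncs, rate limiting, field mapping UIs, and connectors for dozens of destination APIs.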

Governance

Main Tools to Consider (Catalog): Alation, Atlan, Stemma, Acryl Data.

Main Tools to Consider (Observability): Monte Carlo, BigEye, Datafold, Metaplane.

Data governance is key to any data organization, and it is an essential evolution the modern data stack needs to undergo in order to fully mature and make itself attractive to large enterprises. We’re breaking this down into two main categories: data cataloging, i.e. understanding what data exists in the data warehouse and the relations therein, and data observability, which allows you to actively monitor the data in the warehouse. Both are crucial technologies to deploy as your data practice grows and becomes more complex.

In the cataloging category, Alation is an older catalog with a lot of market share that remains relevant for the modern data stack crowd, as it has always had a large focus on data warehousing. There are also many new startups offering excellent options for modern data stack practitioners: Atlan is an impressive catalog tool that also includes lineage and data quality functionality, and Stemma and Acryl Data are both excellent options built on top of the open-source tools Amundsen (Lyft) and DataHub (LinkedIn), respectively. The data observability category is perhaps more cluttered than data cataloging, with even less signal, but our early evaluation of the field has us excited about Monte Carlo, BigEye, Datafold, and Metaplane. We would evaluate all of them before making a difficult decision.
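As a sketch of the kind of check a data observability tool runs on a schedule, here is a simple null-rate monitor in Python, with sqlite3 standing in for the warehouse; the table, column, and threshold are all illustrative:

```python
import sqlite3

# sqlite stands in for the warehouse; names and thresholds are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    INSERT INTO orders VALUES (1, 10), (2, NULL), (3, 12), (4, 13);
""")

def null_rate(conn, table: str, column: str) -> float:
    """Fraction of rows where the given column is NULL."""
    total, nulls = conn.execute(
        f"SELECT COUNT(*), SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) "
        f"FROM {table}"
    ).fetchone()
    return nulls / total

# An observability tool runs checks like this on a schedule and alerts
# when a threshold is breached.
rate = null_rate(conn, "orders", "customer_id")
assert rate <= 0.5, f"null rate too high: {rate:.0%}"
print(f"{rate:.0%}")  # 25%
```

Freshness checks (how recently a table was updated), volume checks (row counts versus history), and schema-change detection follow the same pattern: a scheduled query plus an alerting threshold.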

We’re Keeping an Eye On

The modern data stack is still growing and evolving rapidly. We’ll plan to update this ecosystem periodically as we notice new trends that have matured enough for inclusion as well as to update vendors who are breaking through as having a significant share of the market. As a teaser, here are some areas that we are keeping a close eye on:

Metrics: A metrics layer for the modern data stack is getting a lot of buzz lately. We think it’s a great idea but it is still in its infancy.

Product Analytics: This fills a similar space as the customer data platforms. Built for product teams, product analytics can supercharge your understanding of your business's products, who uses them, and how they are used. It’s not yet mainstream in the modern data stack, but it’s easy to see how this could become a staple in a lot of stacks.

Notebooks: Although notebooks are super popular with data scientists, they haven’t really broken into the modern data stack in a convincing way. In a SQL-centered world, do we need notebooks? Several companies are working on this premise, and it’s not hard to envision the modern data stack opening itself up to additional languages while still staying centered on the data warehouse.

Real-time/streaming: To date, the core of the modern data stack has focused on batch applications. We think this will look entirely different in a few years: handling real-time/streaming use cases on the modern data stack will be not only possible but common. Several companies are working to pave the way for that future now.

Application Serving & Data Sharing: As we covered in our original article, we think both of these areas are ripe for innovation, whether from existing vendors or as new offerings.


Jordan Volz

Jordan primarily writes about AI, ML, and technology. Sometimes with a humorous slant. Opinions here are his own.