Why DBT will one day be bigger than Spark

Kris Peeters
Dec 18, 2020 · 7 min read

The world of data is moving and shaking again. Ever since Hadoop came around, people have been offloading workloads from their data warehouses to the new and shiny data lakes. And it didn’t take long before Spark, which was open sourced in 2010, became the standard processing engine on data lakes.

Now we see a reverse trend, back to the data warehouse. And with that trend, DBT has risen to become almost the de facto standard for doing transformations on modern cloud-native data warehouses. Using DBT, people are discovering that they can build their data pipelines faster, with fewer engineers and with less maintenance.

https://getdbt.com

I predict this trend will only continue and one day, DBT will be bigger than Spark in terms of number of users, number of jobs, and importance in the data landscape. Three arguments:

  • DBT is being adopted faster today than Spark ever was, at least at the clients we see.
  • DBT can target a broader audience. If you know SQL, you can get started with DBT. With Spark, you need a Scala or a Python background, and you must not be intimidated by distributed computing.
  • The data market is bigger now. More companies want to do interesting stuff with data, and if you start today, DBT offers a much smoother entry point.

But why is that? Why will DBT build up so much adoption? And why now? Let’s start with the second question, because timing is everything.

Why now?

There are a few reasons why now is a great time for a tool like DBT.

Spark filled a void that isn’t really there anymore

The entire premise behind Spark was the RDD. This is how the paper on RDDs started: “We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.” There was a clear need to do huge in-memory calculations over multiple machines. Single machines were limited in RAM and the only viable option to do cluster-scale compute was Hadoop, which was based on MapReduce. MapReduce was notoriously heavy on disk I/O and didn’t really capture the value of all the RAM that was lying around.

https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

Spark filled that void perfectly. All of a sudden, a lot of your big data processing could be done much more efficiently. And Spark had a very nice functional approach to defining your operations, so your syntax was sweet and short, at least compared to MapReduce code.

But honestly, is that void still there? These days, we all shop our infrastructure together in the cloud. And shopping can be a lot of fun, even for “standard” instances:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/memory-optimized-instances.html

If that’s not enough memory, you can now easily distribute your workloads over multiple machines using services such as AWS Batch or Kubernetes. You don’t need to have a distributed application. If you write many small applications, you can easily scale using standard Python code.
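To make the “many small applications” idea concrete, here is a minimal sketch in plain Python. The partition layout and the function names `run_small_job` and `run_all` are hypothetical, and the thread pool is only a local stand-in for a scheduler like AWS Batch or Kubernetes, where each job would run in its own container.

```python
from concurrent.futures import ThreadPoolExecutor


def run_small_job(partition):
    """One small, self-contained job: aggregate a single partition of records."""
    day, amounts = partition
    return day, sum(amounts)


def run_all(partitions, max_workers=4):
    """Run every partition as an independent job and collect the results.

    The thread pool is only a local stand-in for a scheduler such as
    AWS Batch or Kubernetes, where each job would be its own container.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_small_job, partitions))


# Hypothetical input: one partition per day, made up for illustration.
partitions = [
    ("2020-12-01", [10, 20, 30]),
    ("2020-12-02", [5, 5]),
    ("2020-12-03", [100]),
]
daily_totals = run_all(partitions)
```

Because no job needs to know about any other job, scaling out is a scheduling problem rather than a programming problem. None of the code has to be written as a distributed application.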

In most of our client engagements, 95% of our jobs are relatively small Python jobs, and maybe for 5% we can use the heavy lifting that Spark offers. Which is great. But Spark is not front and center anymore. It’s just another tool in our toolbox. “Ah, this one can’t run single node? Better use Spark here”. We have built Datafy so that we don’t have to worry about any of that anymore. We just tell Datafy to run a job, how many nodes it needs, and how much memory. And then Datafy does some Kubernetes magic behind the scenes for us. It doesn’t care if I schedule 100 small Python jobs or 5 big Spark jobs. It auto-scales the Kubernetes nodes, launches the pods it needs, and scales down again. This saves cloud costs, and saves me the headache of worrying about infrastructure. I can focus on my data job.

There is a chronic shortage of data engineers

I hear you say: “Kris, Spark is much more than RDDs today”. Which is true. With Spark, you can now do Streaming, ML, and even SQL. It has become a very complete toolkit. And many data engineers love it for that. Including me. It’s just that sometimes I feel lonely.

https://giphy.com/gifs/VfyC5j7sR4cso/html5
  • Many data scientists prefer to stick with Python and its rich eco-system of ML libraries. By the way, have you heard of GPU processing?
  • Data analysts have mostly been ignored in the last couple of years. They were either creating dashboards in Tableau or building ETL pipelines with drag-and-drop tools. Teaching them to code is a bridge too far. They needed something closer to home.

Data warehouses today are much more powerful than they were in 2010.

It’s not just cloud compute infrastructure that has matured. Data warehouses in 2020 are completely different from the expensive, hard-to-maintain on-premise monoliths. All big cloud providers have a strong offering and Google BigQuery in particular makes it super convenient to get started. It is fully featured, it is highly performant, it is widely supported and, most of all, it has pure pay-as-you-go pricing. You pay per TB scanned. That makes the barrier to entry a lot lower. You don’t need to invest big piles of money anymore to set up a data warehouse. Although, obviously, data warehouses are still not cheap. But neither is running Databricks clusters. The price difference was an argument in 2010. It’s not in 2020. And scalability is definitely not an issue anymore either. Besides BigQuery, companies like Snowflake are gaining tremendous traction in the market and prove that they can execute at a massive scale.
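A quick back-of-the-envelope sketch shows why pay-per-TB-scanned lowers the barrier: the bill follows usage, with no upfront investment. The price and the scan volume below are hypothetical placeholders, not a quote of any provider’s actual rates, and `on_demand_cost` is a made-up helper.

```python
def on_demand_cost(bytes_scanned, price_per_tb):
    """Cost of one query under pure pay-per-TB-scanned pricing."""
    # Decimal terabytes; a real provider may define the unit differently.
    terabytes = bytes_scanned / 1e12
    return terabytes * price_per_tb


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
HYPOTHETICAL_PRICE_PER_TB = 5.0
daily_scan_bytes = 200e9  # e.g. a dashboard refresh that scans 200 GB a day

monthly_cost = 30 * on_demand_cost(daily_scan_bytes, HYPOTHETICAL_PRICE_PER_TB)
```

With these made-up inputs, a small team pays for 6 TB scanned in a month and nothing more. Check your provider’s current pricing before relying on any such estimate.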

Why DBT?

All the ingredients are there to have a disruption in the market. So why is DBT gaining so much traction? Here’s what I think they got right:

They executed very well on an “obviously” great idea.

With DBT, you build your data pipelines using SQL. You can build modular pipelines and reference other pieces of your data model through variables and macros. This is an idea that I’ve seen at least 5 times in industry, always as some kind of duct-tape solution put together to schedule a bunch of SQL jobs. I have to be honest, we built some of these solutions as well. But it never really dawned on us to productise this into something bigger and bring it to the market. It is an obviously great idea. But DBT actually executed on that. Kudos to them.
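To illustrate the idea (not DBT’s actual implementation), here is a toy sketch of modular SQL models that reference each other. It borrows the `{{ ref('...') }}` notation DBT uses, resolves the references with a regular expression, and builds the models as views in an in-memory SQLite database in dependency order. The model names, the `raw_orders` table and the helpers `build_order` and `run_models` are all hypothetical. It needs Python 3.9+ for `graphlib`.

```python
import re
import sqlite3
from graphlib import TopologicalSorter

# Hypothetical models, made up for illustration. Each model is a SELECT
# statement that may point at other models through {{ ref('name') }}.
MODELS = {
    "big_customers": (
        "SELECT customer FROM {{ ref('customer_totals') }} WHERE total >= 100"
    ),
    "customer_totals": (
        "SELECT customer, SUM(amount) AS total "
        "FROM {{ ref('stg_orders') }} GROUP BY customer"
    ),
    "stg_orders": "SELECT id, customer, amount FROM raw_orders WHERE amount > 0",
}

REF_PATTERN = re.compile(r"\{\{\s*ref\('(\w+)'\)\s*\}\}")


def build_order(models):
    """Return model names so that each model comes after the models it references."""
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


def run_models(conn, models):
    """Create one view per model, in dependency order."""
    for name in build_order(models):
        # Replace each {{ ref('x') }} with the plain relation name x.
        sql = REF_PATTERN.sub(lambda m: m.group(1), models[name])
        conn.execute(f"CREATE VIEW {name} AS {sql}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "acme", 80.0), (2, "acme", 40.0), (3, "globex", 30.0), (4, "globex", -5.0)],
)
run_models(conn, MODELS)
big = [row[0] for row in conn.execute("SELECT customer FROM big_customers")]
```

The models are deliberately listed in the “wrong” order. The references alone are enough to work out that `stg_orders` must be built first, which is what makes the pipeline modular: you describe each piece, and the tool figures out the schedule.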

They have an incredible momentum in the market

They announced their Series B just last month. That’s 7 months after their Series A. They are backed by some of the most famous VC firms, like Andreessen Horowitz and Sequoia Capital. Their biggest competitor (that I know of) is Dataform, which just got acquired by Google Cloud. Dataform was already lagging behind. The acquisition will only make Dataform even more of a niche player. Great if you’re on GCP. But I don’t think Google has any plans to make Dataform shine on Redshift, Synapse or Snowflake.

They are strong on the engineering aspect

As a data engineer, I’m always a bit sceptical of tools that claim they can take complexity away and now “everyone can build a data product in 3 easy steps”. Often, these are Excel lookalikes or drag-and-drop products with a shiny UI, which are very impressive in sales demos. But they do make you cry the day you need to maintain those dragons in production.

DBT is different. In DBT, you can easily work with variables, you can build modular code, you can add unit tests, you can commit all your code to Git and easily integrate DBT in your CI/CD pipeline. It even generates a documentation site for you, including lineage.
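The testing idea is simple enough to sketch: a test is a query that selects the rows violating an assumption, and it passes when that query returns nothing. The snippet below mimics this with SQLite. It is an illustration only; DBT declares its tests in its own project files rather than in Python, and the `customers` table, the test names and the `failing_rows` helper are hypothetical.

```python
import sqlite3


def failing_rows(conn, test_sql):
    """Run a test query; any row it returns counts as a failure."""
    return conn.execute(test_sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
# Made-up data with two deliberate problems: a missing email and a duplicate id.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, "c@example.com"), (3, "d@example.com")],
)

# Each test selects the offending rows, so an empty result means "pass".
TESTS = {
    "id_is_unique": "SELECT id FROM customers GROUP BY id HAVING COUNT(*) > 1",
    "email_is_not_null": "SELECT id FROM customers WHERE email IS NULL",
    "id_is_not_null": "SELECT id FROM customers WHERE id IS NULL",
}

results = {name: failing_rows(conn, sql) for name, sql in TESTS.items()}
failed = sorted(name for name, rows in results.items() if rows)
```

Because each test is just SQL, it can run in a CI/CD pipeline against the same warehouse as the models, and a non-empty `failed` list is a natural signal to block a release.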

https://docs.getdbt.com/docs/building-a-dbt-project/documentation/

All these things are boring to business people, but as an engineer, they give you a lot more confidence that you can actually support DBT workloads in production and you can actually build healthy release processes around it.

That’s why we were also quick to integrate DBT on Datafy. Scheduling DBT is no different from scheduling a Spark or a Python job. It’s a Docker container that can be built, deployed and executed. That Docker container just happens to contain SQL code, instead of Python or Spark code.

So Spark is dead?

Not at all! I think Spark is a great tool if you have big data workloads which need a lot of heavy lifting, and you have the engineers available to build the pipelines for you. Spark is still a lot more expressive than SQL, and you have much more control over how the processing is done in Spark than in SQL.

In general, the data landscape is constantly in flux. Technologies come and go. It’s a matter of combining them in a way that makes sense to your organisation and that works for the team that you have. Then you’ll be able to get insights from data, and that’s why we’re here, right? Spark, Python, DBT, and many other tools are just tools in our tool-belts. No great car was ever built with only a screwdriver, or only a hammer.

I do think that, because the barrier to entry for DBT is so much lower and a lot more people know SQL than know Spark, DBT will in the end see more adoption. It democratizes data analytics even more. We already dragged it out of the Finance department, now we’re dragging it out of the IT department. One day, analytics will actually live in business departments. Imagine that. 😂

datamindedbe

Better data engineering