How Fivetran + dbt actually fail

Let’s build railroads in the sky

Lauren Balik
11 min read · Sep 19, 2022
In ‘Marge vs. The Monorail’ a mysterious outsider comes to a town hall meeting about Springfield’s budget surplus. He uses FOMO, slick-talking, and salesmanship to convince the town to create an infrastructure project in the sky — Springfield’s very own monorail. Photo Copyright: ©1993 FOX BROADCASTING

This past week I hosted the Get Stuff Done with Data Conference.

We had 260+ people drop by for at least one session over two days, with an average of 72 people in the Thursday sessions and 35 in the Friday sessions. Our two most-attended sessions were those by Phil Hall and Ian Whitestone, on optimizations around BigQuery and Snowflake respectively, which should say something about where the discourse is going these days.

(I will post all videos and content shortly, separately.)

In a session with Decodable’s Eric Sammer called ‘Intro to Stateful Streaming’ we had a short talk at the end about what ETL/ELT means, after Eric mentioned he’d recently gotten into a ‘spicy’ debate on Twitter about it. I pulled Narrator’s Ahmed Elsamadisi, who had the next speaking slot and was waiting backstage, into the discussion. He and I have discussed ETL/ELT many times and had even chatted about it in person over dinner a few weeks back.

Further, the same day Benn Stancil of Mode published an article about how Fivetran fails.

Then, dbt Labs published an article justifying their use of heavy ‘T’ complexity within their own organization’s dogfooded internal project.

It’s safe to say the ELT Kool-Aid is in full swing and defending itself.

Now, the following opinions expressed here are my own as a data engineer, systems architect, analyst, investor, advisor, consultant, and a bunch of other titles and labels I hold because adding titles and labels is how you gain credibility in the dark and shadowy world that is ~doin’ data stuff on the cloud~.

_____

Ultimately, ELT is way more heavily rent-seeking than ETL, especially the Fivetran + dbt combination.

As a rent-seeking operating paradigm, ELT was able to spread during the bull market run/low interest rate boom of the last two years.

However, as markets compress and the cost of capital increases, many companies simply can’t afford ELT and will shift back to more ETL: essentially, more data management and schema management, and potentially even processing data before the lake/warehouse, not just in it.

Let’s break down the economics of how we got here, and take a look behind the scenes at the financials and incentives involved.

How Fivetran actually fails

Fivetran’s Shopify schema via fivetran.com

Here is how Fivetran works.

You, the customer, use Fivetran to connect databases and/or applications to a database/warehouse to replicate data to the target system.

Fivetran decides the table structure for how application data will be delivered to the cloud data warehouse.

Fivetran was historically priced based on the number of connectors one brought into the data warehouse. If you had, for example, Salesforce + Shopify + Marketo, you paid for three sources.

Fivetran then adopted a ‘Monthly Active Row’ pricing model while delivering a highly normalized set of tables from each connector, essentially double-, triple-, and quadruple-billing its customers for every single activity.

One of the more hysterical examples is the Shopify connector, which results in 60+ tables created upon usage.

If one order is placed, up to 12+ tables are affected immediately and Monthly Active Rows accumulate disproportionately to the actual real-world activity.
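To make the arithmetic concrete (illustrative numbers of mine, not Fivetran’s actual billing): a store doing 10,000 orders a month, where each order touches a dozen normalized tables, generates on the order of 120,000 Monthly Active Rows from 10,000 real-world events. That is a 12x billing multiplier before anyone has run a single query.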

It’s comedy.

But the problem here is that there is little inherent value in Fivetran’s landed tables: you are going to have to roll them up and inevitably do some amount of denormalization to use them for BI and reporting. Additionally, if you have entities spread across several systems, you will have to resolve those entities in the warehouse as well. An example would be orders and order lines.

If you have order lines spread across multiple systems, like Shopify plus, say, Braintree for a separate sales channel, you will also bring in Braintree via Fivetran.

Now you have to resolve the entities based on your business logic, in SQL.
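In practice, that entity resolution tends to become a dbt model along these lines. This is a minimal sketch of the pattern; the source and column names are assumptions for illustration, not Fivetran’s exact schemas:

```sql
-- models/int_order_lines_unioned.sql
-- Illustrative sketch: resolve the 'order line' entity across two source
-- systems. Source/column names are hypothetical, not Fivetran's exact output.

with shopify_lines as (
    select
        'shopify' as source_system,
        cast(order_id as varchar) as order_id,
        sku,
        quantity,
        price as unit_price
    from {{ source('shopify', 'order_line') }}
),

braintree_lines as (
    select
        'braintree' as source_system,
        cast(transaction_id as varchar) as order_id,
        product_sku as sku,
        qty as quantity,
        amount as unit_price
    from {{ source('braintree', 'transaction_line_item') }}
)

select * from shopify_lines
union all
select * from braintree_lines
```

Every downstream model now inherits this join-and-union layer, and it reruns on warehouse compute every time.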

Also, now you’re doing finance in the cloud data warehouse.

Also, now you’re probably using dbt to manage the SQL joins, and you’re joining all the time, making railroads in the sky. You’re paying someone a salary to make the railroads in the sky and you’re accruing costs on one of the most expensive compute cycles possible every time this is run.

Also, there is no such thing as ELT.

There is a ‘T’ before the L in ELT: the decision by the EL provider to create landing tables that are as highly normalized as possible.

This is beneficial to them because, when priced on Monthly Active Rows, more tables mean more rows filled, hence more revenue, and it creates more cloud credit attribution, which can be used to drive increases in valuation.

Don’t even get me started on Airbyte over-normalization…

How dbt — the ‘T’ of ELT — takes rent-seeking to the extreme

Photo Copyright: ©1993 FOX BROADCASTING

The amount of fiscal irresponsibility and intellectual dishonesty that follows dbt around is astounding. Its distribution has turned into a full-on, social-media-led, get-rich-quick multilevel marketing scheme, not unlike most other cults.

It’s so bad that dbt has content out there now telling people to change their job titles and come use them to play on the cloud.

No work, easy money, come play with a specific set of compute generating products to stick onto your cloud warehouse.

Of course, this accomplishes very little.

This is how the economics of the cloud work — the more people who don’t know what they are doing or are making tech debt on the cloud, the more money the cloud extracts.

Just by getting as many human beings as possible using dbt as much as possible, dbt becomes more valuable to the cloud as more compute is generated, and dbt Labs can use this to increase their valuation and raise more money to keep the party going.

The delta between people using dbt and people who know what they are doing is captured most directly by Snowflake or BigQuery or whatever engine is being run.

Further, by distributing bad practices, or by pretending that computationally expensive data practices and processes are normal, dbt Labs is able to fan more credit consumption.

This is all, of course, rent-seeking gone mad.

Data individual contributors seek rent. They want a bump in pay of $10k-$20k or so by adding ‘engineer’ to their ‘analyst’ title, then they use the pre-selected tools that were named for them.

Data teams seek rent. They want workflows offloaded to the cloud data warehouse in their jurisdiction, away from other operational teams. They want power.

Data team leaders seek rent. They want these workflows and they want to hire more employees underneath them to trade for compensation and/or title.

EL vendors seek rent. They are motivated to normalize landing tables as heavily as possible, for both revenue and cloud credit attribution.

dbt as the ‘T’ vendor seeks rent. When paired with an EL vendor, right off the bat dbt is used to roll back up at least some of the complexity caused by the fixed normalization structure.

Ultimately, this is all just the shuffling around of workflows and money, within individual companies and broadly across the market in aggregate.

Reports that could be done in an application are now passed through Snowflake, then put in a BI tool, incurring costs the whole way through.

With free money in the market this is allowed to happen, and the middleware Clueless layer is allowed to bloat in terms of complexity, products, the ‘T’ of transformation, and human beings, eventually devolving into a bloated buffer between Data Producers and Data Consumers.

As I noted two weeks ago, you simply have to turn your cloud warehouse/Modern Data Stack 90 degrees and put it against the MacLeod Hierarchy to see the game.

When it all comes crashing down — ELT edition

Ultimately, dbt and the ‘T’ lifestyle of transforming data do not yield great returns for most businesses.

What started out as a simple tool for shaping some data around to support better BI use cases was over-distributed.

How much ongoing transformation of data is actually needed at most businesses?

Even dbt Labs doesn’t know what they are doing with data management.

A few weeks back dbt Labs posted an embarrassing account of how slow and expensive their own ‘dbt+Snowflake’ setup actually is, claiming 1,700 ‘models’ of SQL sequences and running window functions over their largest table, slowing everything down and sacrificing cloud credits unnecessarily, which as a heavily VC-backed company they can afford, at least for now.

Yesterday, dbt Labs posted another peek behind their curtain, this time claiming 700+ models (1,700 is technically 700+?) and again showing off that they overdid the ‘T’ in the warehouse with dbt, and now they have an unmanageable mess that is as slow as molasses.

Additionally, there are questionable claims in this piece about using dbt for auditability with dbt snapshots and incremental models. I have referred to this type of thing many times as the ‘Shadow Finance’ that defines the operating paradigm — and it is a gigantic risk.

I am sorry if this is harsh or if it makes me a big meanie, but frankly dbt Labs should remove these recommendations, as this is not at all how audit and custody of data work in regulated and audited businesses. It is extremely reckless of dbt Labs to promote their incremental models and snapshots as functions that meet the criteria of being auditable and traceable. It’s completely intellectually dishonest.
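For reference, this is the sort of thing being sold as an audit trail. A minimal sketch using dbt’s documented timestamp-strategy snapshot, with a hypothetical `shopify.orders` source:

```sql
-- snapshots/orders_snapshot.sql
-- dbt snapshot using the documented 'timestamp' strategy: on each run it
-- writes type-2 history rows (dbt_valid_from / dbt_valid_to) into a
-- warehouse table. The source name here is hypothetical. Note the history
-- exists only from the first dbt run onward, lives in tables the same
-- tooling can rebuild or drop, and misses any change that happens between
-- two runs -- which is why it is not an audit-grade system of record.
{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('shopify', 'orders') }}

{% endsnapshot %}
```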

Trouble with a capital ‘T’

‘T’ is not a nuanced term at all; it refers to many different things, from joins to unions, to entity resolution, to aggregations, to applying functions, to ‘cleaning’ the data, to shaping it for use with business logic.

In reality there are only two major steps needed here: entity resolution and business logic.

The further you get away from these, the more issues accrue. You start patching fixes to ‘clean’ the data in SQL instead of enforcing contracts and charging back upstream suppliers. You start writing new SQL dependent on the previous SQL dependent on the previous SQL, and so on.
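The pattern looks innocuous one model at a time. Here is a hypothetical sketch of the anti-pattern, three dbt models shown together, each one a separate file, with every layer depending on the patch below it:

```sql
-- models/orders_cleaned.sql: the 'cleaning' patch layer
select
    *,
    coalesce(currency, 'USD') as currency_fixed  -- patching bad upstream data
from {{ ref('stg_orders') }}

-- models/orders_enriched.sql: depends on the patch
select
    o.*,
    c.segment
from {{ ref('orders_cleaned') }} as o
left join {{ ref('stg_customers') }} as c
    on o.customer_id = c.customer_id

-- models/orders_final.sql: depends on the dependency
select
    date_trunc('month', created_at) as order_month,
    sum(total) as revenue
from {{ ref('orders_enriched') }}
group by 1
```

Fix the contract upstream and the first layer disappears, along with every run of warehouse compute it triggers.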

Just like con artist Professor Harold Hill in The Music Man came along one day and sang a song about how there was an apparent ‘T’ problem in River City, so too have a number of opportunists come along and invented way, way, way over-solutioned fixes for what was frankly only ever a small problem to begin with.

Some of these opportunists are vendors, and some are VCs, yes.

But many are actually ‘Heads of Data’ working at companies where they want to make land grabs and win work away from other teams; most canonically, in my experience, Operations teams.

I’ve seen this over and over again in my career as an IC employee and manager and as a consultant.

Lots of data actually doesn’t need to be brought into a cloud data warehouse to be joined with other data.

The Cost of Doing Nothing is merely the cost of not knowing a thing, while the Cost of Doing Something is very high, especially when you’re making DAGs on the cloud data warehouse.

Essentially, ELT supplants systems integration via the SQL join. That’s all it really is at the end of the day.

Lauren, who are you and what do you know about anything?

This one is funny. I have been asked this question many times by lots of the Kool-Aid drinkers, and the answer is simple: I am the white knight of the Modern Data Stack.

See, here’s how the scam works.

A company wants to be data-driven.

They hire some functional head of data or a staff aug consulting firm.

Never mind the fact that the staff aug consulting firms are run by guys who sit on the cap tables of the products they vend; that’s not important.

*Shh, don’t say that too loud or the whole thing may unravel.*

So a company goes down the ELT/cloud warehouse/Modern Data Stack rabbit hole, with their new functional head of data or staff aug consulting firm setting up dbt-to-infinity and all the typical bells and whistles.

Half the time, these functional heads of data spend an entire 3 or 4 months (or more) just noodling around with vendor demos and trials, making $250k or more a year.

So in 4 months this person noodles around, makes $83,333.33 over those 4 months, and still nothing has been done.

Then, they make a blog featuring some of these ELT/cloud warehouse/Modern Data Stack toys they played around with.

Then it becomes time to get to work after 4 months.

They say they need to hire an analytics engineer to help them. So they get this approved because the business is hungry. They want to be a ‘data-driven’ operation.

Then, people who were Tableau or BI tool analysts read an article featuring Prefect CEO Jeremiah Lowin telling them that if they put the term ‘analytics engineer’ in their LinkedIn profile, they will immediately be flooded with job offers.

So then they do this.

People actually do this, because of course they do, they want a quick buck.

I have genuinely seen this happen twice, and that’s just the ones I know about.

Then, because the functional head of data has no idea what they are doing, they hire this newly self-minted analytics engineer.

Then, they dbt-around on the cloud.

Then, at some point, now that 9–12 months have passed, the CFO or CEO or sometimes COO gets involved.

“I thought we were becoming data-driven?” they ask the functional head of data.

Really, all that has been produced is a bunch of dbt models, and if fed into a BI tool, one of two things happens. Either there are hundreds to thousands of self-service BI reports or there is basically nothing.

It’s 50/50.

So at some point, I get called.

“We need a second opinion,” says the CFO or CEO or sometimes COO.

So I go in and start evaluating and doing open heart surgery. It doesn’t matter whether they hired a functional head of data or a staff aug consulting firm, it’s the same thing every time.

Then I go in, clean up a bunch of crap, kill the dependencies on dependencies on dependencies that have been made, and deliver Mount Olympus, which is literally just the basic set of dashboards and BI and analytics products they needed the whole time. I charge between $50k and $100k per project for this and snipe these engagements off all over the market, because at this point these companies have sunk hundreds of thousands to low millions of dollars to get nowhere.

You don’t need all that much dbt.

You can do a little bit of dbt, but not too much.

You don’t need all that much of Reverse ETL and everything else.

You just need to get work done and keep things simple, and you don’t need data teams sized in the dozens of people.

Go watch The Simpsons ‘Marge vs. The Monorail’ episode. It perfectly describes ‘analytics engineering’ on the cloud.

— — — —
How Fivetran + dbt actually fail: Part II is now published as well.

How Fivetran Fails: Part III is now published as well, after Fivetran’s CEO started Tweeting at me and then began making stuff up about uptime. It’s all comedy.

Part IV is now published as well. We break down some ridiculous NRR claims from Fivetran’s CEO and evaluate the Ponzi math and incentives around the whole Modern Data Stack.

— — — —


Lauren Balik

Owner, Upright Analytics. Data wrangler, advisor, investor. lauren [at] uprightanalytics [dot] com