I spoke to some of my fellow data friends about dbt, and how it made our ELT pipelines more manageable. Not everyone agreed with our approach.
“As data volume and variety grow, so do your pipelines. It will become too much to maintain, and bring you back to the mess you started with.” — Analyst A.
“Clean, reusable datasets are slow to build because they need everyone’s consensus to produce. It’s easier for analysts to create their own datasets.” — Analyst B.
They’re not wrong: our pipelines grew so big that building models takes forever. We’ve also learned valuable lessons and managed to address some of the issues, and I’ll be sharing those lessons here. If you’re new to dbt, please visit this 3-part series on how we moved our old pipelines to dbt.
Lesson #1: Scale models fast.
In the first 7 months of 2022, the analytics team headcount tripled from 6 analysts to 18 to meet the company’s growing data demands. We have more business units, more stakeholders, and more use cases that require clean datasets for dashboarding and performance analytics. That means more dbt models to build. A lot more.
There are two approaches to handle the demand: build dbt models carefully while meeting all best practices (the proper way), or build them fast and messy and worry about the tech debt later (as long as the data is correct).
NOTE: In our case, a messy pipeline is one where dbt models don’t follow the correct flow, column processing is duplicated across models, SQL queries aren’t fully optimised, etc.
Better to have messy pipelines than angry stakeholders. 💢
I learned that while it’s easy for me to follow best practices, it may not be the same for other analysts, especially in a big team. Analysts have different thought processes, different stakeholders (and demands), query complexity differs, and we each build models the way we know best. It would be counterproductive to force everyone to build pipelines in accordance with best practices 100% of the time; there are cases where tight deadlines have to be met, new analysts need time to learn dbt, etc.
Pushing for proper (not messy) pipelines would mean analysts need even more time to complete the task and are slower to meet our stakeholders’ requirements. When pipeline delivery is late, business decisions are late, or opportunities are missed.
Our dbt pipeline follows a strict model naming convention, i.e. `stg_`, `int_`, `final_`, and house rules such as `int_` dependencies can only come from `stg_` or other `int_` models, in order to make the pipeline readable and predictable. Sometimes, due to tight deadlines, we choose to build without following the convention, e.g. an `int_` table with a `final_` model as a dependency. This makes the pipeline confusing / difficult to read (especially for other analysts), and that’s OK.
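To illustrate the convention, here’s a minimal sketch of an `int_` model that follows the house rule by depending only on `stg_` models; the model and column names are made up for this example.

```sql
-- models/intermediate/int_orders_enriched.sql (hypothetical model and column names)
-- House rule respected: this int_ model only selects from stg_ models via ref().
select
    o.order_id,
    o.ordered_at,
    c.customer_segment
from {{ ref('stg_orders') }} as o
left join {{ ref('stg_customers') }} as c
    on o.customer_id = c.customer_id
```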
If you think about it, each dbt model serves as “technical documentation”, explaining how we process certain columns, so it is still possible to refactor the pipeline in the future (with some time to understand the mess we created). This puts us in the mindset of treating our dbt project as a codebase, the way software engineers do: we follow the best practices we set, we know when to occasionally break them, but we always set aside time to clear our tech debt.
Don’t skip quality checks. ✔️
In a way, encouraging analysts to ship dbt models fast yields better results, but this does not mean we’re OK with negligence. Minimum data quality checks are still in place, such as peer reviews and unit tests, and we document our thought process somewhere for reference.
Data quality checks reduce the chances of data discrepancies, and allow other analysts to safely reuse processed columns in other dbt models. Because we build them fast, we’re not a potential blocker to other analysts who need our data for their use cases, especially in an environment where collaboration between different teams happens frequently.
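As a concrete example, a check like this could be expressed as a dbt singular data test: a SQL file under `tests/` that selects rows violating an assumption, and fails the `dbt test` run if any rows come back. The test and model names below are hypothetical.

```sql
-- tests/assert_int_orders_enriched_unique_orders.sql (hypothetical file name)
-- Singular dbt test: returns duplicated order_ids; the test fails if any rows are returned.
select
    order_id,
    count(*) as n_rows
from {{ ref('int_orders_enriched') }}
group by order_id
having count(*) > 1
```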
For our team, we’ve produced almost 700 dbt models in 6 months, and we often `ref()` our colleagues’ dbt models to process something of our own, so eventually we get some spaghetti pipelines. The good news is stakeholders rarely raise issues about data discrepancies; if they do, it’s usually solved within a day. The outcome is that even as the company scales up teams fast, we’re able to keep up with the data demand at a reasonable pace.
UPDATE: the downside of this approach is that we built too many models too quickly, to the point where we noticed there would be significant cost if we continued to scale at this pace. Our long `dbt run` runtime made it obvious that it was time to clear some debt, which I will share in the next article.
Lesson #2: Document. Document. Document.
When we build pipelines fast (and ensure data quality), we need to write down the technical debt we created today, and remind ourselves to fix it in a timely manner.
Write technical documentation first… 🔧
Earlier, I mentioned that dbt models themselves are technical documentation. Other than keeping dbt models small and readable, we encourage each other to push a complete table schema in the `.yml` file along with any new dbt models we create, so other analysts have enough information to reference when they need to reuse columns from our models.
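For illustration, a minimal schema entry might look something like the sketch below; the model, columns, and descriptions are invented for this example.

```yaml
# models/intermediate/_int_models.yml (hypothetical file and model names)
version: 2

models:
  - name: int_orders_enriched
    description: "Orders joined with customer attributes, one row per order."
    columns:
      - name: order_id
        description: "Primary key; one row per order."
      - name: customer_segment
        description: "Segment assigned by the CRM team; null for guest checkouts."
```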
We use Amundsen as our data catalog tool for data engineers and analysts, and a Slackbot to remind ourselves to document any undocumented dbt models. We also encourage analysts to add comments in dbt models to clarify our thinking when processing certain columns. Rather than commenting on what or how a column is processed (we can infer that from the query), comment on why it was done that way.
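A hypothetical example of the kind of comment we aim for, where the query already shows what is happening and the comment captures why (the column logic and ticket ID are made up):

```sql
-- models/staging/stg_orders.sql (hypothetical)
select
    order_id,
    -- Why: finance recognises late refunds in the month they were issued,
    -- not in the original order month (see hypothetical ticket DATA-123).
    case
        when refunded_at > ordered_at + interval '30 days' then date_trunc('month', refunded_at)
        else date_trunc('month', ordered_at)
    end as revenue_month
from {{ source('shop', 'orders') }}
```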
The goal here is to minimise the number of occasions where an analyst reads someone’s dbt model, is unsure what a column represents, and has to `@here` our Slack channel to seek clarification, only to wait half a day for a response because we’re ̶n̶o̶t̶ ̶w̶o̶r̶k̶i̶n̶g̶ ̶d̶u̶r̶i̶n̶g̶ ̶W̶F̶H̶ slow to read Slack messages.
…and business documentation later. 📔
That covers the technical documentation, but it is difficult to write comments on the business context behind some column processing in a dbt model. Let’s say we need to process a column in the `transactions` table called `revenue`, where the calculation isn’t as straightforward as `commission - cost`. It has several if/else branches where you need to apply a different calculation if `commission_percentage IS NOT NULL`, but then you’ll also need to check the value in the `max_commission` column… you get the idea.
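The actual business rules aren’t spelled out in this article, so the sketch below is only an illustration of what that kind of branching logic might look like in a dbt model; the branch conditions and the `stg_transactions` model name are assumptions.

```sql
-- Illustrative only: the real revenue rules are more involved than this.
select
    transaction_id,
    case
        -- assumed rule: a percentage-based commission, capped at max_commission
        when commission_percentage is not null
            then least(commission * commission_percentage, max_commission) - cost
        -- assumed fallback: the simple calculation
        else commission - cost
    end as revenue
from {{ ref('stg_transactions') }}
```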
We choose to document business context in our Confluence space. It’s easier for everyone to describe the business context and calculation logic in as much detail as possible (including flowcharts) in a single space, so everyone knows where to find it, rather than in Google Docs, which can get disorganised, and it saves us the trouble of setting read permissions for everyone.
The order of documentation priority (high to low) I would recommend is:
- comments in the dbt model (include the JIRA ticket ID if necessary).
- table schema in the `.yml` file.
- business logic and calculation in Confluence (also include the JIRA ticket ID if possible).
Business documentation can take a while (maybe a full day), so it doesn’t make sense for it to block the release of the dbt model. Of course, it requires discipline to remember to document it as early as possible; sometimes we put it on hold to work on other deep dives or dbt models. For our team, documentation is actual work (with JIRA tickets and story points). Spending a day can save us a few days’ worth of trouble remembering why we built a model in a certain way, and embarrassingly pinging other colleagues with “remind me again why we did so and so…”.
Lesson #3: Clear tech debt periodically.
Now that we’ve created a significant amount of debt, it’s critical to prioritise some debt-clearing tasks once in a while. Ideally we don’t want to wait until the debt grows too big, data warehousing cost is affected, and the pipeline becomes too complicated, but in a typical fast-growing tech company there is very little chance data analysts / analytics engineers have time to address tech debt.
Translate tech debt to business cost. 💸
Translating tech debt into business cost is the most straightforward way for stakeholders to understand why we need to clear it, yet many of us (including me) fail to show why we need to do it now. A common response we get is “if it’s not breaking, then fix it later / when it breaks.”
One method we tried was quantifying debt into monetary terms. Some examples:
- if we complete task X, we can reduce data warehousing cost by 15%.
- improvement Y will shorten the average task completion duration by 1 day.
Tasks that reduce duration are a bit trickier to translate; people may think that if we can finish pipelines quicker, we should be able to handle more work, and that wasn’t our intention. Another way is to translate the outcome of not clearing the debt into actual cost, e.g. if we don’t perform this task, we will require additional headcount (cost) to meet the demand.
Perform debt clearing tasks in small, frequent batches. ⏱
If I were a stakeholder, what the analytics team does with their time would not be my biggest concern, but if they proposed a few tasks that would keep the team busy for a few sprints, that could mean my tasks get deprioritised. Breaking tasks down so they can be done quickly should allay some of your stakeholders’ concerns.
Of course, this also depends entirely on the kind of task, and whether we can actually break it down into smaller batches. That goes back to the original point of clearing tech debt periodically and making time for it. As we scale pipelines quickly, we have to be aware of the debt we’re creating (leave notes if possible) and set time to clear it rather than pushing it to the next sprint, so that we are able to scale more pipelines quickly again. This cycle is how we keep our pipelines lean enough, yet capable of meeting our stakeholders’ data requirements.
Next Lesson: keeping our pipelines lean.
I’ll be sharing in detail what projects we did to maintain lean pipelines, and how we address tech debt with tight deadlines in the next article.