Data about data from 1,000 conversations with data teams

Mikkel Dengsøe
7 min read · May 14, 2024


Since starting Synq two years ago, we’ve spoken to over 1,000 data teams. They have something to say about what’s on their minds.

Proof that we’ve, in fact, spoken to more than 1000 data teams

Three trends overshadow everything else.

  1. The data warehouse is no longer just for reporting
  2. Data teams (and their stacks) are getting larger
  3. There’s not a one-size-fits-all testing approach

Let’s dig in.

The data warehouse is no longer just for reporting

The primary use case of the data warehouse used to be reporting and analytics. While this use case hasn’t disappeared, business-critical use cases have exploded, from AI/ML to automated marketing and regulatory reporting.

In other words, the stakes of the data warehouse have increased by several orders of magnitude, from important to business-critical.

If you support important use cases, data issues often mean incorrect decision-making and the loss of trust in data over time. Not to be underestimated.

But if you own business-critical data, data issues are hair-on-fire business problems. An eCommerce company may spend millions of dollars weekly on advertising based on customer lifetime value prediction (CLTV). Even a few hours of incorrect data can result in $100,000s of lost revenue.

Data teams are paying attention to this: poor data quality was rated the top problem by data professionals in dbt’s 2024 State of Analytics Engineering. The clearest sign of companies investing here is the rise of dbt adoption, increasingly in the enterprise, with dbt reporting 76% year-over-year revenue growth.

Impact #1: The rise in business-critical data warehouse use cases is causing data teams to operate more like their software engineering colleagues.

The software engineering mindset data teams adopt can be broken down into five areas.

  • Diligent testing — data needs to be tested across several dimensions, such as correctness, completeness, consistency, and freshness
  • Ownership management — data assets must have clear owners assigned who are notified of issues and are expected to act
  • Incident process — severe issues should be treated as incidents with clear SLAs and escalation paths
  • Data product mindset — the entire value chain feeding into key data use cases should be considered as one product
  • Data quality metrics — the data team should be able to report on key metrics such as uptime, errors, and SLAs of their data products
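The diligent-testing dimensions above can be sketched in code. The following is a minimal, illustrative Python example (not from any specific team's stack): `rows`, `check_completeness`, and `check_freshness` are hypothetical names, and real pipelines would run such checks against a warehouse rather than in-memory dicts.

```python
from datetime import datetime, timedelta

# Hypothetical rows, standing in for records landed by a pipeline.
rows = [
    {"order_id": 1, "amount": 120.0, "loaded_at": datetime.now() - timedelta(hours=2)},
    {"order_id": 2, "amount": None,  "loaded_at": datetime.now() - timedelta(hours=3)},
]

def check_completeness(rows, column):
    """Completeness: return rows where a required column is missing."""
    return [r for r in rows if r[column] is None]

def check_freshness(rows, max_age_hours=24):
    """Freshness: True if the newest row is within the allowed window."""
    newest = max(r["loaded_at"] for r in rows)
    return datetime.now() - newest <= timedelta(hours=max_age_hours)

incomplete = check_completeness(rows, "amount")
print(f"{len(incomplete)} incomplete row(s); fresh={check_freshness(rows)}")
```

In practice these checks feed the other four areas: failures are routed to the asset's owner, severe ones open incidents, and pass rates roll up into data quality metrics per data product.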

Data teams (and their stacks) are getting larger

Gone are the days when one person could keep the context of every table, dashboard, and data product in their head. dbt shared a now-famous breakdown of dbt projects by model count: 20% of projects have more than 1,000 models, and 5% have more than 5,000. That’s more than 1,000 projects with 5,000+ models!

While the average number of data assets is growing fast, data teams are also consolidating, and very large teams in the enterprise are becoming more common. Siemens took the stage at last year’s Coalesce to share how they use dbt Cloud and data mesh, and the impact on their company-wide data evolution.

Watch their talk

The initial reaction from the audience was, “Oooh, this is a whole different scale.” A few things stood out from their setup:

  • 800+ dbt projects in a single dbt instance
  • 550+ active dbt developers
  • 2,500 daily jobs running
  • 85,000 dbt models

They’re not alone. The median data team as a percentage of the total workforce for select growth tech companies in the US and Europe is 3%, but many teams are starting to exceed this significantly.

Data team as % of the workforce for 50 European and 50 US tech scaleups. Source: LinkedIn data

Breaking this down by looking at data people relative to the number of engineers and designers gives an interesting perspective. Grouping the companies into B2B, B2C, Fintech, and Marketplace/physical goods, they align almost perfectly in clusters with just a few outliers. The contrast is stark between the groups; for example, the data-to-engineer ratio for marketplaces / physical goods is 2–3 times higher than that of B2B companies.

Data-to-engineers-and-designers ratio for 50 tech scaleups. Source: LinkedIn data

In B2B, companies have fewer, larger customers, and it’s more difficult to run A/B tests and deploy machine learning models at scale. Often, differentiation happens through sales and engineering efforts. However, in a marketplace company, margins are tight, and using data is how they stay ahead of the competition. A great example is Deliveroo, which created a dispatch engine to match the best combination of riders with customer orders automatically.

This is just the beginning, and other verticals are quickly moving to be data-first, too.

Tristan Handy, CEO of dbt, articulated in his article “The next big step forwards for analytics engineering” how rapid scale feels when working in a data team:

  • Velocity and agility have slowed, creating frustration inside and outside the data team.
  • Collaboration becomes harder as no one is familiar with the entire code base. Time spent in meetings goes up relative to time spent getting things done.
  • Quality becomes harder to enforce over a growing surface area, and user-reported errors increase.
  • SLA achievement declines as more jobs fail, but no amount of retros seems to reverse this trend.

Impact #2: Data becomes exponentially more difficult with scale. Top data teams invest in ownership and regular cleanups and are intentional about what data is most important.

The following four steps consistently have a high impact when addressing scaling issues.

  1. Use ownership intentionally — consider ownership both inside and outside the data team and start simple (read more: Data ownership: A practical guide)
  2. Clean up unused data assets — make it a habit to regularly evaluate and remove unneeded data assets such as dashboards with low usage, data models with no downstream dependencies, or columns in data models that are not used downstream (read more: The struggles scaling data teams face)
  3. Define your important data assets — be deliberate about which data assets are most important based on their downstream use cases and implement processes to build quality by design (read more: How to Identify Your Business-Critical Data)
  4. Define interfaces — put in preventative measures such as data contracts and versioning to prevent upstream teams from introducing unintended breaking changes (read more: The next big step forwards for analytics engineering)
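As an illustration of the cleanup step, stale models can be found mechanically from lineage metadata. The sketch below uses a simplified stand-in for dbt's `manifest.json` `child_map` (node id → downstream dependents); a real manifest has many more fields, and the `model.shop.*` node ids are invented for the example.

```python
# Simplified stand-in for the child_map in dbt's target/manifest.json.
manifest = {
    "child_map": {
        "model.shop.stg_orders": ["model.shop.orders"],
        "model.shop.orders": ["exposure.shop.orders_dashboard"],
        "model.shop.old_experiment": [],  # nothing depends on this model
    }
}

def unused_models(manifest):
    """Return model node ids that no other node depends on."""
    return sorted(
        node for node, children in manifest["child_map"].items()
        if node.startswith("model.") and not children
    )

print(unused_models(manifest))  # candidates for review and removal
```

A list like this is a starting point, not a delete list: a model with no downstream dependencies may still be queried ad hoc, so usage logs are worth checking before removing anything.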

There’s not a one-size-fits-all testing approach

When we start working with data teams, they look like this more often than not.

Dozens of failing tests, with no idea which ones are important

What started as good testing intentions has led to dozens of failing tests, and it’s unclear which issues matter or how long they’ve been failing.

The combination of more business-critical data and a growing number of data assets has caused this.

Worse, important issues still slip through the cracks despite hundreds of tests. While basic tests such as not_null and unique may help catch the most obvious problems, most real-life issues are more nuanced.

Impact #3: Testing should be done intentionally, increasing the importance of data professionals closely understanding their business domain.

Below are two examples of what intentional testing can look like in practice.

Example 1: A marketplace company relies on predicting prices before purchasing cars. Each week, Airflow extracts data from the final data mart and sends it to S3 for the machine learning model to train on. When retraining the model, accurate data matters more than fresh data to avoid model bias. To prevent the model from training on partially complete data, the team introduced a set of tests plus a pricing_model__export model, which is only updated if all tests pass on the upstream model, helping secure the accuracy of the input data.
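The test-gated export pattern in Example 1 can be sketched as follows. This is a hedged illustration under assumed names (`run_checks`, `refresh_export`, the row shapes); the company's actual implementation lives in dbt/Airflow, not plain Python.

```python
def run_checks(rows):
    """Illustrative checks: every row needs a price and a non-empty car id."""
    return all(r.get("price") is not None and r.get("car_id") for r in rows)

def refresh_export(upstream_rows, export_rows):
    """Promote upstream data into the export only when all checks pass;
    otherwise keep serving the last known-good snapshot."""
    if run_checks(upstream_rows):
        return list(upstream_rows)  # checks passed: refresh the export
    return export_rows              # checks failed: keep the old snapshot

good = [{"car_id": "a1", "price": 9500.0}]
bad = [{"car_id": "a2", "price": None}]
snapshot = refresh_export(good, export_rows=[])
print(refresh_export(bad, snapshot) == snapshot)  # prints True
```

The design choice is the key point: the export is allowed to go stale, because for model retraining a slightly old but accurate snapshot beats a fresh but partially complete one.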

Example 2: A media group relies on data from thousands of third-party providers, some of whom send a few thousand records daily and others hundreds of thousands. It’s key to detect if some partners are missing data on any given day so the partner managers can contact them. Only by monitoring the row count of each partner and comparing that against the expected range can the company catch these issues proactively.
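A per-partner volume check like the one in Example 2 might look like the sketch below. The partner names, counts, and expected ranges are made up; in practice the ranges would be derived from each partner's historical daily volumes.

```python
# Expected daily row-count ranges per partner (illustrative numbers).
expected = {"partner_a": (800, 1500), "partner_b": (90_000, 200_000)}
todays_counts = {"partner_a": 1200, "partner_b": 45_000}

def flag_partners(counts, expected):
    """Return partners whose daily row count falls outside the expected range."""
    flagged = []
    for partner, (lo, hi) in expected.items():
        n = counts.get(partner, 0)  # a missing feed counts as 0 rows
        if not lo <= n <= hi:
            flagged.append(partner)
    return sorted(flagged)

print(flag_partners(todays_counts, expected))  # prints ['partner_b']
```

Treating a missing feed as zero rows matters here: a generic `not_null` test would pass on an empty table, while this check flags the partner so the partner manager can follow up.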

These are just two examples, but they show the importance of thinking in a nuanced way about the data and how it’s used as you add tests.

What the future looks like

I believe we’ve only scratched the surface of the full potential for the data warehouse. The data warehouse has the opportunity to be the control center for companies. It will expand from analytics and become the core of sales, operations, finance, etc.

The demand is there, the use cases are there, and data teams are motivated to step up and own business-critical data. However, the real test will be whether they can give the rest of the organization faith that they can be relied on to operate these systems, or whether that responsibility will instead move to their engineering colleagues.
