Stop the Madness! 1000k data warehouse tables got created out of 6k source tables in 2.5 years — Part 1

The root cause of data explosion — Dimension tables are the trouble maker

Lori Lu
Kyligence
4 min readMar 15, 2022

--

The Muddy DW of an Internet Giant in China

The Madness behind the 1000k DW tables

Those are the real numbers shared by one of China’s tech giants. Their homegrown data lineage platform shows there are 6,000 source tables collected from upstream operational systems, but 1000k tables have been created in the downstream data warehouse since H2 2019. In other words, 99.4% of data stored in their data warehouse is redundant. They would expect tens of millions of cost reduction if proper data governance was in place.

In extreme cases, one of the base tables, Orders, has 10k direct descendants.

Why the Madness?

If you’ve read my previous blog— BI Dashboards are Creating a Technical Debt Black Hole, you’ve probably cottoned on to the fact that this case is just a real-world example of technical debt produced by BI workloads.

The direct cause of this problem is that business teams are working in silos, and data pipelines and metrics are barely reused across all downstream data applications.

However, this is counter-intuitive because, in the most general sense, most teams want to leverage other teams’ work so they can move faster and turn data into actionable insights almost instantly. However, the hard fact is telling us something opposite.

Now, let’s dig deep to find the root cause.

Root Cause: Dimension Tables are the Trouble Maker, rather than Fact Tables

Facts are facts — a fact table consists of facts of a particular business process, e.g., sales revenue by month by product. This means fact tables are less prone to change, and there is only one correct version at all times.

However, in the absence of dimension tables, a fact table is simply a cluster of meaningless numeric values. With the business context provided by dimension tables, business analysts own the magic bullets which empower them to see facts through different lenses. What they normally do is to create many dimensions of their unique style and then blend facts with dimensions to form metrics and support decision-making. This is definitely the most fun part of their job.

Now, their way of working brings the challenge to us — a dimension table may have many sources of truth as each business stakeholder keeps their own version. For example, a typical date dimension table may have the following variants: a standard 52-week year, a standard calendar plus fiscal date for the finance team, a 4–4–5 calendar for the US team, etc. These date dimension tables exist for good reasons - each of them has some unique fields others don’t hold even though the rest of the columns stay exactly the same.

Looking into this e-commerce giant’s data warehouse, we discovered that many dimension tables are just duplicates of another or a copy of the original one with some dimension attributes that are slightly twisted to meet business needs.

Chaos Theory shows how small alterations can lead to consequences of much greater scale…

This is why the underlying data pipelines cannot be reused — fact tables are shared and in sync, but dimension tables have many versions, and no single team can align all teams to follow their standard.

Shared Facts + Unlimited Non-Standardized Dimensions

=> No REUSE

=> Data Exploded

Solution: Dimensions are the first-class citizens to be “governed”

Talking to many other customers, we found that this is a universal but unaddressed issue. Many of them start to feel the pain of unmanageable data explosion caused by overlooking the impact of non-standardized dimension tables. In the process of making data pipelines “use-case-ignorant” and reusable, we believe that dimensions should be standardized and unified first. More importantly, with shared data pipelines in place, a solid foundation has been laid for standardizing metrics across the business.

Standardize Dimensions First, Then Standardize Metrics.

This echoes the idea that the Airbnb metrics team mentioned in this sharingDimensions are the first-class citizens. Here is their reasoning:

  1. A lot of metrics are just dimensional cuts of other metrics;
  2. A lot of metrics have hundreds of dimensions that people might care to cut them by.

Airbnb’s story highlights the feasibility and importance of standardizing dimensions:

  1. Dimensions are reusable because “a lot of metrics are just dimensional cuts of other metrics.”
  2. Dimensions might become unmanageable because “a lot of metrics have hundreds of dimensions that people might care to cut them by.”

What’s next

So far, I’ve barely scratched the surface of this data explosion issue or a broader challenge — how to standardize metrics in an organization. I’ll share some best practices for governing dimensions in the following blog. Stay tuned!

Leaving feedback:

I’d love to hear back from you on this topic.

RELATED ARTICLE

BI Dashboards are Creating a Technical Debt Black Hole

3 Key Principles for Reducing Dimension Tables in your Data Warehouse

Thanks for reading!

Please share, subscribe to my email list, or follow me on Medium for upcoming blogs.

--

--

Lori Lu
Kyligence

Data, Strategy & Planning | Restaurant Industry