How to Organize your Data Lake / Warehouse — Miniseries

Episode 1: Technical-Canonical-Analytical Data Modeling

Flavio Altinier
Pier. Stories
Jan 31, 2023



This is the first of a series of 5 episodes describing the TCA (Technical-Canonical-Analytical) Data Modeling approach, an evolution of the Star and Snowflake schemas for data analysis. Over the course of the series, we will cover each Region and Zone of the model in detail.

Introduction

If you have ever searched online for literally anything about “how to generate value using your company’s data”, you have probably stumbled across some variation of this diagram:

Common Extract-Load-Transform diagram

This is the usual approach to building a Data Analysis pipeline — an ELT/ETL process that moves data all the way from where it is generated to the business decisions made with it.

It is fairly easy to find a multitude of tools specialized in each of those arrows and boxes, but it is not always clear how to best use that tooling, especially at the Transform stage.

To make an analogy, imagine that you have decided to take up the hobby of painting. You could buy the best canvases, brushes and paint on the market, but if you do not know how to paint in the first place, your results will probably look the same as those of someone using a far cheaper set.

Likewise, you can have the best ELT tooling out there, but if you do not know which tables to create, how to relate them, and how to best organize your data, you will keep running into the same pitfalls as 90% of the companies out there: metrics showing up differently across teams, data scientists spending most of their time cleaning data instead of creating models, and an overall sense of mistrust in business insights.

That is the purpose of this miniseries: to dive deep into that Transform box and propose a new way of modeling data and organizing your LakeHouse, a modeling schema we call TCA (Technical-Canonical-Analytical) Data Modeling.

The Problems with Current Data Modeling

Star and Snowflake schemas are by far the most widely used techniques for preparing data for analysis in a Data Warehouse. Introduced by the genius Ralph Kimball some 30 years ago, these schemas, built on dimensional modeling, soon became the main way companies organize their data.

To better understand how those schemas work, have a look at Kimball’s masterpiece, The Data Warehouse Toolkit. For the rest of this series, we will assume the reader is familiar with those concepts.

As good as those schemas are, technology and data needs have evolved over the last few decades, and many feel that the Star and Snowflake schemas are beginning to show their age. Let us look at some examples.

Dimensional-evolution tracking

Say you are trying to train a time series statistical model (even a pretty simple one like an ARIMA): you have to make sure that all the data you feed the model for a particular timestamp is the data that was available at that time — not the data available at the current timestamp.

However, how can you do that when the data in your Star Schema’s Dimensions is changing all the time?

There are techniques people implement to keep mutating Dimensions from causing confusion, and the most famous one was popularized by Maxime Beauchemin, one of the best Data Engineers out there, with his Functional approach to Data Engineering. He proposes that a new snapshot of the Dimension be created at every ELT run and appended to the end of one gigantic Dimension table.

That works if you know what timestamp you are looking for in your subsequent analyses. However, if you need to know when a Dimension changed, you will find yourself down a deep rabbit hole making diffs between Dimension snapshots.
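
To make that pain concrete, here is a minimal sketch (using pandas, with an invented customer Dimension and invented column names) of what working with appended snapshots looks like: reading the Dimension “as of” a given run is a simple filter, but finding out when an attribute changed means diffing consecutive snapshots.

```python
import pandas as pd

# Hypothetical customer Dimension, snapshotted on every ELT run
# (functional approach: one full copy per run, keyed by snapshot_ts).
dim_customer = pd.DataFrame({
    "snapshot_ts": ["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-02"],
    "customer_id": [1, 2, 1, 2],
    "plan":        ["basic", "basic", "premium", "basic"],
})

# Point-in-time reads are easy: just filter on the snapshot you want.
as_of = dim_customer[dim_customer["snapshot_ts"] == "2023-01-01"]

# Finding *when* something changed is not: you have to diff consecutive
# snapshots for every attribute you care about.
pivot = dim_customer.pivot(index="customer_id", columns="snapshot_ts", values="plan")
changed = pivot[pivot["2023-01-01"] != pivot["2023-01-02"]]
print(changed)  # customer 1 moved from basic to premium between the two runs
```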

We believe deeply in Beauchemin’s idea of Functional Data Engineering, and it is the foundation of TCA Data Modeling — it just does not fit that well with the Star Schema.

What comes after the Star Schema

Current methodology looks closely at the Transform stage with a single focus: transforming systemic data (organized to make sense to a system) into analytical data (organized to make sense to a human). And it does that quite nicely.

The real problem comes after that stage: feeding such a normalized schema into most BI tools is infeasible. So data analysts take that carefully prepared schema and build their own tables to feed the tooling, and all that neat organization easily falls to pieces in those downstream data regions.

Without governance and clear organization rules, multiple tables created by multiple people to answer the same questions start to pop up, often with conflicting definitions of the metrics they are supposed to track. Pretty soon, the Data Warehouse starts looking something like this:

Common Data Warehouse implementation

Clearly, there is room for improvement at that last stage.

PS: in the diagram, we have named zones inside the Data Warehouse. That is also a common technique companies like to implement; however, there is no standard for that naming. Some call them bronze-silver-gold zones, others transient-raw-trusted zones. As there is no agreement on naming, we have used generic names in the diagram.

Adapting non-factual data to the Star Schema

One clear limitation of current approaches to dimensional modeling is that non-factual data is difficult to model.

Factual data can easily be fit into a Star Schema. Imagine an e-commerce business, for instance: purchases are Facts, and data about those purchases are Dimensions. Most analysis can be done with just that simple abstraction.

However, imagine a business that runs on a subscription model instead of on an item purchase model.

Transforming that data into Facts and Dimensions is pretty hard. The most obvious modeling suggests designing the purchase or cancellation of subscriptions as Facts, but answering even the most basic business questions with that design can be quite painful: say you want to know how many members were subscribed on a given date — with that design, you would need a non-trivial query to answer such a trivial question.
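
As a rough sketch of that pain (the table and column names below are hypothetical, not from any real schema), counting how many members were subscribed on a given date from subscribe/cancel Facts means reconstructing every member’s state from their event history:

```python
import pandas as pd

# Hypothetical Fact table of subscription events (one row per subscribe/cancel).
fact_subscription_events = pd.DataFrame({
    "member_id":  [1, 2, 3, 1],
    "event_type": ["subscribe", "subscribe", "subscribe", "cancel"],
    "event_date": pd.to_datetime(["2023-01-05", "2023-01-10", "2023-01-12", "2023-01-15"]),
})

target_date = pd.Timestamp("2023-01-18")

# "How many members were subscribed on 2023-01-18?" means replaying each
# member's events up to that date and keeping only the latest state.
events_so_far = fact_subscription_events[fact_subscription_events["event_date"] <= target_date]
last_event = (
    events_so_far.sort_values("event_date")
    .groupby("member_id")["event_type"]
    .last()
)
active_members = (last_event == "subscribe").sum()
print(active_members)  # 2 -> members 2 and 3 (member 1 cancelled on the 15th)
```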

As powerful as factual and dimensional modeling is, it is not a silver bullet for every business situation.

Access Level Governance

In terms of tooling, this is probably the area that has developed the most over the last few years: today, most specialized tools out there enable governing access at the level of individual rows and columns.

Technology has made Governance super granular; deciding which rows and columns each user should have access to, however, can be extremely cumbersome. It is easy to find yourself lost in a hard-to-govern sea of access policies created by multiple people. It is actually kind of a contradiction: Governance is easy, but governing Governance is almost impossible.

We believe the problem arises because, with so much freedom, it is tempting to create access levels freestyle, managing access as the need appears. But what if we could standardize governance, so that the moment a table is created, we know exactly who should have access to it?
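
A minimal sketch of that idea, with made-up Zone and role names: if read access is attached to Zones rather than to individual tables, a new table’s location alone already determines who can query it.

```python
# Hypothetical policy: access is decided per Zone, never per table.
ZONE_READERS = {
    "transient":  {"data-engineers"},
    "rebuilt":    {"data-engineers"},
    "canonical":  {"data-engineers", "analytics-engineers"},
    "analytical": {"data-engineers", "analytics-engineers", "data-analysts"},
}

def can_read(role: str, zone: str) -> bool:
    """A new table inherits its Zone's readers; no per-table policy is ever created."""
    return role in ZONE_READERS.get(zone, set())

print(can_read("data-analysts", "analytical"))  # True
print(can_read("data-analysts", "transient"))   # False
```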

Maintaining Transformations

Current approaches to Data Modeling usually oversimplify the Transform stage. Most organizations prepare their LakeHouses using a few gigantic scripts which are nearly impossible to maintain.

As a best practice, data transformation should go through multiple simple phases before ending up in a Star Schema. The end result is the same, but maintaining small, self-contained scripts is much easier than maintaining a single one with ten thousand lines of code.

The Transform stage should be done in steps

It is not always simple, however, to decide what those intermediate abstractions should be. With TCA Modeling, we propose a framework for data design in which each of those steps is well defined.
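
Purely as an illustration (the step and column names below are invented and not part of TCA itself), breaking a transformation into small, single-purpose functions keeps each step easy to test and reason about, while the chained result is the same table a monolithic script would produce:

```python
import pandas as pd

# Each step is a small, testable function with a single responsibility,
# instead of one ten-thousand-line script doing everything at once.
def deduplicate(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop_duplicates(subset=["order_id"])

def cast_types(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(order_date=pd.to_datetime(df["order_date"]))

def add_revenue(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(revenue=df["quantity"] * df["unit_price"])

def transform_orders(raw: pd.DataFrame) -> pd.DataFrame:
    """Chain the intermediate steps; each output could be materialized as its own table."""
    return raw.pipe(deduplicate).pipe(cast_types).pipe(add_revenue)

raw_orders = pd.DataFrame({
    "order_id":   [1, 1, 2],
    "order_date": ["2023-01-05", "2023-01-05", "2023-01-06"],
    "quantity":   [2, 2, 1],
    "unit_price": [10.0, 10.0, 25.0],
})
print(transform_orders(raw_orders))
```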

A New Alternative: TCA Data Modeling

Current modeling techniques are not terrible: they are just starting to get outdated, depending on your use case.

The purpose of TCA Data Modeling is not to try and reinvent the wheel. Its goal is to build a new perspective on dimensional modeling, keeping what is good about traditional techniques and improving on the problems described above.

TCA Modeling divides the LakeHouse into multiple Regions and Zones — and the data in each Zone must respect a very strict set of rules. Without further ado, a LakeHouse using TCA Modeling looks something like this:

Overview of TCA Data Modeling

It looks complicated, but over the course of the next Episodes, we will describe each of those Regions and Zones in detail. For now, let us just have a look at a quick summary of TCA Modeling’s assumptions and rules:

  • The Data LakeHouse is divided into three different Regions (Technical, Canonical and Analytical). Each of those Regions can be further separated into multiple Zones (see the sketch after this list).
  • The Transformations of data from Zone to Zone should follow reading order — from left-to-right, top-to-bottom.
  • Usually, every Zone should be composed only of materialized tables. Some very specific technologies may work best with Views in their place, but that would be uncommon.
  • Every Transformation in the Technical Region should be generic and absolutely no business logic should be needed there. If a new data source arrives in the Transient Zone, its data should transit flawlessly all the way to the Rebuilt Zone.
  • On the other hand, transformations to bring data into the Canonical and Analytical Regions are deeply dependent on business knowledge. That division makes it easier to decide which kinds of professionals should work on which kinds of data.
  • Data in the Canonical Region should be immutable. This is the most important Region in TCA, and we will dive deep into it in Episode 3 of this miniseries.
  • Data in the Canonical and Analytical Regions should be thoroughly documented and tested for inconsistencies, both schematic and for business logic.
  • The whole idea behind TCA Modeling is that it should be agnostic to ingestion and transformation frameworks — Lambda or Kappa architectures should fit nicely with it. It also aims to standardize transformations, making life easier for companies that want to maintain a true Data Mesh.
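
As a minimal, purely illustrative sketch of that structure (only the Zones named so far appear; the remaining Zones are described in later Episodes, so the empty lists below are placeholders, not facts):

```python
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    zones: list                  # ordered: Zone-to-Zone transformations follow reading order
    needs_business_logic: bool   # does moving data in require business knowledge?
    immutable: bool = False

# Only the Zones the article has named so far are listed.
LAKEHOUSE = [
    Region("Technical",  zones=["Transient", "Rebuilt"], needs_business_logic=False),
    Region("Canonical",  zones=[], needs_business_logic=True, immutable=True),
    Region("Analytical", zones=[], needs_business_logic=True),
]

for region in LAKEHOUSE:
    print(region.name, region.zones, "immutable" if region.immutable else "mutable")
```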

Hopefully, that quick overview of TCA has piqued the reader’s curiosity. Over the next articles in this series, we are going to dive deep into each one of these Regions and Zones. We are going to explore:

  • What kind of data should be in each Zone, with examples
  • The algorithms used to transform data from one Zone to the next
  • Which kind of professional should be responsible for each Zone
  • Access Level Policies for each Zone

In the last article, we will also explore some extra content:

  • Additional Zones that may fit well with TCA depending on your needs
  • More use cases and examples!

Up Next -> Episode 2: The Technical Region
