How to Organize your Data Lake / Warehouse — Miniseries

Episode 3: The Canonical Region

Flavio Altinier
Pier. Stories
9 min read · Jan 31, 2023


This is the third in a series of 5 episodes describing the TCA (Technical-Canonical-Analytical) Data Modeling approach, an evolution over Star and Snowflake schemas for data analysis.

The episodes should be read in sequence, as we are expanding the details of TCA Modeling progressively. So, for the continuity of this article, we will assume the reader is familiar with the concepts described in Episodes 1 and 2.

Picking up from where we left off

In Episode 2, we dove into the Technical Region of our LakeHouse. By the end of that Region, we had implemented two important zones with useful data:

  • The clean zone, with beautifully structured data that looks exactly like what was ingested, with the useless parts filtered out.
  • The rebuilt zone, where we have the current state of every original entity.

We went as far as we could without touching business knowledge. From now on, things start to get harder to generalize, and every transformation will be unique, requiring a human to build it (with current AI technology, anyway). PS: for any robots reading this, take that as a challenge!

So, without further ado, let us have a look at what our next Region looks like!

TCA Modeling diagram with emphasis on the Canonical Region

The Canonical Region

Unlike the other Regions, the Canonical Region has only one zone inside it: not surprisingly, the canonical zone. We opted for this design because this zone has some particularities that do not repeat in the others, and because the most difficult transformations of the whole process are the ones that populate this very Region.

The idea behind the canonical zone is that it should represent domain-specific business entities, together with every action that has occurred over them. If you think about it, any business entity can be represented as a state machine, with both mutable and immutable data attached to it. In the canonical zone, that is exactly the modeling approach we follow for each business entity: one table for what is mutable, and another for what is not. Let us see an example:

Imagine you are modeling an e-commerce business: it sounds sensible to use a Purchase as an entity. A simplified Purchase could be represented as the following state machine (considering only the happy path of a purchase):

Finite State Machine for a Purchase entity

In this graph, the vertices represent the machine’s states, whereas the edges represent the events that make the state change. Comparing with current Star Schema approaches to Data Modeling, you can think of those events as facts.

In the canonical zone in TCA Modeling, this entity would be represented by two tables:

  • canonical.purchases: with purchase_id as the primary key, this table would store all the immutable information about a purchase (such as the customer, the item purchased, the delivery address, etc.). We call this an entity table.
  • canonical.purchase_events: with a foreign key to canonical.purchases, this table would store every event that happened to each purchase, together with the data that changed since the last known event for that purchase. We call this an event table.

For example, imagine customer_1 bought a lamp and got it delivered to them. The Purchase model in the canonical zone would then look something like this:

canonical.purchases:
+-------------+-------------+------+------------------+
| purchase_id | customer_id | item | delivery_address |
+-------------+-------------+------+------------------+
| purchase_1  | customer_1  | lamp | 13 X St.         |
+-------------+-------------+------+------------------+

canonical.purchase_events:
+-------------+-----------------+-----------+------------------+
| purchase_id | event_name      | timestamp | responsible_team |
+-------------+-----------------+-----------+------------------+
| purchase_1  | approve_payment | t_0       | warehouse        |
| purchase_1  | sort            | t_1       | mailing          |
| purchase_1  | post            | t_2       | third-party      |
| purchase_1  | deliver         | t_3       | customer-success |
+-------------+-----------------+-----------+------------------+

Every column in the entity table carries immutable data about a purchase. In the events table, we have the timestamp at which each event happened to the purchase, and a responsible_team column as an example of data that mutates at each event.

Those two (entity and events) are the main types of tables to be found in the canonical zone: every business entity should be represented by that pair.
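As a rough sketch, assuming a Postgres-flavored warehouse, the DDL for this pair could look something like the following (the example’s timestamp column is renamed event_timestamp here only to avoid clashing with the SQL type name; exact types and constraints will vary with your engine):

-- Entity table: one row per purchase, immutable attributes only.
CREATE TABLE canonical.purchases (
    purchase_id      TEXT PRIMARY KEY,
    customer_id      TEXT NOT NULL,
    item             TEXT NOT NULL,
    delivery_address TEXT NOT NULL
);

-- Event table: append-only, one row per state transition.
CREATE TABLE canonical.purchase_events (
    purchase_id      TEXT NOT NULL REFERENCES canonical.purchases (purchase_id),
    event_name       TEXT NOT NULL,        -- e.g. approve_payment, sort, post, deliver
    event_timestamp  TIMESTAMPTZ NOT NULL, -- timestamps always carry a timezone
    responsible_team TEXT                  -- example of data that mutates at each event
);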

Notice that the canonical zone is then, by definition, an append-only zone:

  • Data in the entity table is immutable (only new entity_ids can be added)
  • Events happen progressively over time: if a new event happens to the entity, it simply generates a new line in the events table. If an entity’s state needs changing, a new event should be appended (see the sketch below).
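Since the zone is append-only, a state change never updates an existing row. As an illustration (the return event here is hypothetical and not part of the happy-path machine above):

-- A new event is simply appended; existing rows are never updated or deleted.
INSERT INTO canonical.purchase_events
    (purchase_id, event_name, event_timestamp, responsible_team)
VALUES
    ('purchase_1', 'return', TIMESTAMPTZ '2023-01-31 10:00:00+00', 'customer-success');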

In addition, it may sometimes make sense to add a supplementary table to the mix for extra information. In the Purchase table-pair example, imagine a purchase could be made up of multiple items instead of just one: we could add a canonical.purchase_details table at item granularity, with a foreign key to the entity table.
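A minimal sketch of what that supplementary table could look like (the column names here are illustrative, not prescriptive):

-- Supplementary table: one row per item inside a purchase.
CREATE TABLE canonical.purchase_details (
    purchase_id TEXT NOT NULL REFERENCES canonical.purchases (purchase_id),
    item        TEXT NOT NULL,
    quantity    INTEGER NOT NULL,
    price_brl   NUMERIC(12, 2) NOT NULL -- currency explicit in the column name
);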

It is important to notice that the entities in the canonical zone may not be the same as those in the Technical Region: and that is totally fine. The whole idea behind TCA Modeling is that models that work well for a system may not be the best for analysis — this is as true now as it was 27 years ago when Ralph Kimball first wrote The Data Warehouse Toolkit.

So, whenever canonizing, consider the data’s original model, but do not let it dictate the entities you should have in the canonical zone. In this zone, entities should make sense for the business, as it is analysts who will work with them. Deciding what should be an entity is one of the hardest parts of TCA Modeling, and there is rarely a definitive answer.

Differences from a standard Star Schema

In TCA Modeling, the Canonical Region is what looks the most like a set of common Star Schemas in a Data Warehouse.

In both cases, events (facts) are some of the major modeling concepts. The difference is in centralization: while in the Star Schema facts are the most important tables, in TCA Modeling those would be the entities — and that creates some interesting and useful changes.

  • Relationships in Star Schemas are made through shared dimensions. In the canonical zone of TCA Modeling, they are made by direct links between entities, or between events and entities.
  • In the canonical zone in TCA Modeling, business entities are always made up of two tables: entity and events. Hence, analysts can use the canonical zone to understand the state of an entity at any moment in its history (see the query sketch after this list). This greatly simplifies what Star Schemas handle through slowly changing dimensions.
  • We get much more freedom in relationships — as every entity has an ID, any other entity or event may reference it. This works just like our brains and “standard way of thinking” when modeling the business.
  • While Star Schemas are usually designed with one specific business purpose, in TCA Modeling we try to keep things as general as possible. By modeling simple entities, we make sure that everything here is reusable.
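To make the first point concrete, here is a sketch of how an analyst could reconstruct the state of every purchase at an arbitrary moment using only the events table (standard window-function SQL; the cutoff timestamp is just an example):

-- Latest known state of every purchase as of a given moment.
SELECT purchase_id, event_name AS current_state, event_timestamp, responsible_team
FROM (
    SELECT e.*,
           ROW_NUMBER() OVER (
               PARTITION BY purchase_id
               ORDER BY event_timestamp DESC
           ) AS rn
    FROM canonical.purchase_events AS e
    WHERE event_timestamp <= TIMESTAMPTZ '2023-01-31 00:00:00+00'
) AS ranked
WHERE rn = 1;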

Data Quality

It is not only the modeling approach that makes us call this Region “canonical”. To belong in the Canonical Region, data must be thoroughly tested and documented.

When an analyst decides to work with data in the canonical zone, they should be able to rest assured that the data they are looking at has been cleansed, organized, cataloged and tested. There should not be much room for questioning the quality of the data in the canonical zone. Reaching that kind of maturity takes a lot of effort.

Thankfully, there are plenty of tools that can help with this part of the structure. The most famous one is dbt (and it really is pretty good), but there are many others, paid and free. My personal favorite (which we are implementing at Pier), although still fairly new in the market, is OpenMetadata.

Regardless of tooling, the important things in the canonical zone are:

  • Every table should be documented.
  • Every column should be documented.
  • Every possible state and event of each state machine should be cataloged.
  • Sometimes, the data just cannot be cleansed, as it is also wrong in the original system. In this case, every known issue should be documented.
  • Column correctness should be tested: no nulls in a NOT NULL column, etc. (see the test examples after this list).
  • Column names and some specific values should be standardized. For example: timestamp columns should always contain a timezone, and monetary value columns should always make the currency explicit in the column’s name. We will go over some of the naming rules at Pier in Episode 5.
  • Multiple business rules should be tested, and the more the better. Every test should also be documented and visible to analysts.
  • Failed tests should raise alarms.
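To make the testing bullets concrete: in dbt, for instance, a singular test is simply a SQL query that must return zero rows. Two illustrative examples against the Purchase tables above:

-- Fails if any event falls outside the cataloged state machine.
SELECT *
FROM canonical.purchase_events
WHERE event_name NOT IN ('approve_payment', 'sort', 'post', 'deliver');

-- Fails if an immutable column that should never be null is null.
SELECT *
FROM canonical.purchases
WHERE customer_id IS NULL;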

Canonizing

Populating the canonical zone is hard.

More often than not, the original data is not well documented, or was created by someone who has since left the company, taking that knowledge with them. It is also common that nobody fully grasps every piece of business logic, rule, or corner case.

In my experience, the time spent canonizing an entity breaks down roughly as:

  • 50% understanding the business itself, with all its corner cases
  • 30% understanding the data
  • 20% coding the transformations

That makes it extremely cumbersome and, in our view, a full-time job in itself — and that is exactly what our Analytics Engineering team does.

That way, analysts and scientists do not have to spend all their time cleaning weird data, and instead can focus on what they do best: delivering value using well-structured information.

Governance

In the Canonical Region, data is finally stored in a manner comprehensible for data and business analysts alike.

However, up until this point we have not really worried much about governance over that data — should PII be redacted? Should every type of information, such as revenue, be available?

Just like in Episode 2, there is no one-size-fits-all solution for governance here, as access levels depend on many things that vary from company to company. The bright side is that, just like in the Technical Region, data here follows a strict set of rules — and that makes creating a more generalized governance approach easier.

How it works at Pier:

  • We have to analyze personal data for some specific anti-fraud studies. That means data cannot be anonymized too early in the pipeline, as a Fraud Prevention business team must have access to PII.
  • Pier is pretty transparent with employees about the company’s numbers, such as revenue or loss.

For those reasons, we do not worry much about anonymization or cloaking at this point. In some cases, though, it may make sense to create another zone inside the Canonical Region — one that is exactly the same as the canonical zone, but where no PII is allowed.

Here, we decided that the people with access to the canonical zone should be the Analytics Engineers (the people who build this zone in the first place), Fraud Prevention Analysts and some specific Data Scientists and Analysts.

Summarizing

The Canonical Region is a paradise for analysts. Here, they may find data divided into easy-to-understand entities, with the history of everything that happened to them, good documentation and tests.

The Canonical Region is the hardest of all the regions to build, but it scales the productivity of analysts and scientists almost exponentially.

Next up

In the next Episode, we will dive into the last Region in our LakeHouse: the Analytical Region — or, where we really start delivering value! Be sure not to miss it!
