How to Organize your Data Lake / Warehouse — Miniseries

Episode 4: The Analytical Region

Flavio Altinier
Pier. Stories
9 min read · Jan 31, 2023


This is the fourth in a series of 5 episodes describing the TCA (Technical-Canonical-Analytical) Data Modeling approach, an evolution of the Star and Snowflake schemas for data analysis.

The episodes are meant to be read in sequence, as we expand the details of TCA Modeling progressively. For continuity, this article assumes the reader is familiar with the concepts described in Episodes 1 through 3.

Picking up from where we left off

In Episode 3 we went into the details of the Canonical Region, where our mindset about data changed compared to previous zones: models should now follow business rules, not system rules.

In the Canonical Region, every business entity is represented by a pair of tables:

  • An entity table, where every piece of immutable data about an entity is stored.
  • An events table, storing every event that happened to an entity, changing its state.

The Canonical Region is the toughest to build and maintain, but it makes life a lot easier for users of the next Region, which we dive into in this episode: the Analytical Region!

So, without further ado, let us have a look at what this Region looks like!

TCA Modeling diagram with emphasis on the Analytical Region

The Analytical Region

This is where we finally start delivering value to the business using data. Every zone in the Analytical Region may be useful to a different kind of stakeholder, but the important thing is that these zones, just as before, must still follow a strict set of rules. That way, analysts know exactly what quality guarantees each zone offers.

The prepared zone

This is the first zone in the Analytical Region, and data here will likely serve many kinds of analyses well, requiring no further transformation.

While in the canonical zone every business entity was designed as a pair of tables storing everything that happened to it, in the prepared zone we are mostly interested in the entity’s current state. Hence, it can be designed as just one table, storing everything that is immutable plus the current state of everything that is mutable (plus some extra info, as we will see shortly).

Let us go back to the Purchase example of Episode 3. In that scenario, the entity purchase_1 was modeled as the following rows in the entity and events tables:

canonical.purchases:
+-------------+-------------+------+------------------+
| purchase_id | customer_id | item | delivery_address |
+-------------+-------------+------+------------------+
| purchase_1  | customer_1  | lamp | 13 X St.         |
+-------------+-------------+------+------------------+

canonical.purchase_events:
+-------------+-----------------+-----------+------------------+
| purchase_id | event_name      | timestamp | responsible_team |
+-------------+-----------------+-----------+------------------+
| purchase_1  | approve_payment | t_0       | warehouse        |
| purchase_1  | sort            | t_1       | mailing          |
| purchase_1  | post            | t_2       | third-party      |
| purchase_1  | deliver         | t_3       | customer-success |
+-------------+-----------------+-----------+------------------+

Here in the prepared zone, since we are mostly interested in the entity’s current state, every event prior to the one at timestamp t_3 is, at first, irrelevant to us.

Hence, the Purchase entity purchase_1 would look something like this:

prepared.purchases:
+-------------+-------------+------+------------------+---------------+--------------------------+-----------------+
| purchase_id | customer_id | item | delivery_address | current_state | current_responsible_team | last_updated_at |
+-------------+-------------+------+------------------+---------------+--------------------------+-----------------+
| purchase_1  | customer_1  | lamp | 13 X St.         | delivered     | customer-success         | t_3             |
+-------------+-------------+------+------------------+---------------+--------------------------+-----------------+

The prepared zone exists based on the belief that although historical data is useful, most of the time the analyst just needs the entities’ current state for their study, and preparing that data in advance makes things simpler. We bet on that premise when designing TCA Modeling, and it has proven true many times already here at Pier.

Notice that when modeling the current state of mutable data, we have added prefixes to the columns: current_* and last_*. Those prefixes are mandatory in TCA Modeling, as they make clear what kind of aggregation was applied over the historical data. Sometimes it may even make sense to add columns about other steps of the historical data: in our example, something like a first_event_at timestamp column could be useful for knowing when a Purchase was created in the first place.
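
To make this concrete, below is a minimal sketch of how prepared.purchases could be derived from the canonical pair of tables. The table and column names follow the examples above; everything else is illustrative, and in practice the raw event name would likely be mapped to a friendlier state label (e.g. deliver becoming delivered):

WITH latest_events AS (
    SELECT
        purchase_id,
        event_name       AS current_state,  -- in practice, mapped to a label such as 'delivered'
        responsible_team AS current_responsible_team,
        timestamp        AS last_updated_at,
        MIN(timestamp) OVER (PARTITION BY purchase_id) AS first_event_at,
        ROW_NUMBER() OVER (
            PARTITION BY purchase_id
            ORDER BY timestamp DESC
        ) AS recency_rank
    FROM canonical.purchase_events
)

SELECT
    p.purchase_id,
    p.customer_id,
    p.item,
    p.delivery_address,
    e.current_state,
    e.current_responsible_team,
    e.last_updated_at,
    e.first_event_at  -- the optional "extra info" column discussed above
FROM canonical.purchases AS p
LEFT JOIN latest_events AS e
    ON  e.purchase_id  = p.purchase_id
    AND e.recency_rank = 1  -- keep only each purchase's most recent event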

There is no right or wrong when modeling columns in the prepared zone. The important thing is that granularity must not be affected — in other words, in the prepared.purchases table, every row must represent a single purchase. What you do with the columns is completely up to you!

Given that, it is not uncommon to find tables in the prepared zone with a few hundred columns, and we do have some examples of those here at Pier. And that is totally fine! As long as granularity is maintained, just keep on adding columns as needed. We will go over governing such big tables in a bit.

Notice also that a column current_state was added to the prepared table. As we mentioned in Episode 3, TCA Modeling assumes every business entity can be represented as a state machine — hence, it makes sense to have a column in this zone with the entity’s explicit current state.

The curated zone

This zone exists pretty much for governance reasons, and it should be fairly easy to build.

The curated zone is just a copy of the prepared one, redacting everything that can be considered Personally Identifiable Information (PII). We do have some suggestions on how to do that, and we will go over those in Episode 5.

For now, just keep in mind that the curated zone is the same thing as the prepared one we described above, with no PII.
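
As a minimal illustration, assuming delivery_address is the only PII column in our running example, the curated table could be as simple as a projection of the prepared one with the sensitive column dropped or hashed (the hashing function available depends on your warehouse, and our actual suggestions come in Episode 5):

SELECT
    purchase_id,
    customer_id,
    item,
    SHA256(delivery_address) AS delivery_address_hash,  -- redacted: the raw address is PII
    current_state,
    current_responsible_team,
    last_updated_at
FROM prepared.purchases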

The analytical zone

Finally, the moment we have all been waiting for: the analysts’ paradise!

Sometimes, analyses cannot be produced using simple business entities as the granularity (as we have modeled in the prepared and curated zones). The analytical zone is where tables with complex primary keys, built specifically for analysis, should be stored.

One common example regards time-series analyses. Up until this point, we have only modeled business entities, making primary keys simple IDs. However, that kind of modeling falls short when questions such as “how has an entity evolved over time” need answering.

For example, say we are interested in knowing the operational efficiency of the teams described in the Purchase example of Episode 3: a sensible first analysis would be the total number of purchases being handled by each team daily.

To answer that question, we could create a table that looks something like this:

analytical.daily_purchases_handled_per_team:
+---------+------------------+-----------------+
| date    | responsible_team | total_purchases |
+---------+------------------+-----------------+
| day_n   | billing          | 120             |
| day_n   | warehouse        | 100             |
| day_n   | mailing          | 320             |
| day_n   | third-party      | 98              |
| day_n+1 | billing          | 114             |
| day_n+1 | warehouse        | 99              |
| day_n+1 | mailing          | 336             |
| day_n+1 | third-party      | 87              |
+---------+------------------+-----------------+

The primary key of such a table would be the pair [date, responsible_team], which is clearly not a simple business entity, but a granularity created in a form that is useful for analysis. With a table such as the one in the above example, it would be easy to see that the mailing team is for some reason a bottleneck in the operation, locking up purchases from previous dates.

That is the whole idea behind the analytical zone: a place where tables with complex granularity can be stored for current and future analyses.

Notice that sometimes it is not possible to build those analyses just using prepared or curated data, and it may be necessary to go all the way back to the canonical zone to find the historical data your analysis needs (that is why we have a dotted arrow from the canonical to the analytical zone in our TCA Diagram).
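
As a hedged sketch of that last point, the table above could be built straight from canonical.purchase_events, since the prepared table no longer knows which team was responsible on past dates. The dates table below is an assumed helper (a simple date spine with one row per day), and the timestamp-to-date comparison is simplified:

WITH events_up_to_each_day AS (
    SELECT
        d.date,
        e.purchase_id,
        e.responsible_team,
        ROW_NUMBER() OVER (
            PARTITION BY d.date, e.purchase_id
            ORDER BY e.timestamp DESC
        ) AS recency_rank
    FROM dates AS d  -- assumed date spine: one row per day
    JOIN canonical.purchase_events AS e
        ON e.timestamp <= d.date  -- all events up to (and including) each day
)

SELECT
    date,
    responsible_team,
    COUNT(*) AS total_purchases
FROM events_up_to_each_day
WHERE recency_rank = 1  -- the team holding each purchase on that day
-- in practice, already-delivered purchases would also be filtered out here
GROUP BY date, responsible_team
ORDER BY date, responsible_team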

It is also worth noting that, as this zone is where most analyses will be run, access to it should be broad rather than selective. For that reason, analysts are strictly forbidden from including PII in their analytical tables.

Data Quality

Quality rules in the Analytical Region should be just as strict as in the Canonical Region.

That means data here should also be heavily documented and tested, even the tables in the analytical zone. The idea is that the analyst or scientist should have peace of mind that the data they are using is of high quality, validated and tested.
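
For instance, one simple test we could run is a granularity check over prepared.purchases, a plain-SQL sketch (in practice this kind of check is usually declared as a uniqueness test in whatever transformation tool you use):

SELECT
    purchase_id,
    COUNT(*) AS row_count
FROM prepared.purchases
GROUP BY purchase_id
HAVING COUNT(*) > 1  -- any row returned breaks the "one row per purchase" rule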

That does not mean every analysis in the company should follow the long series of transformations we have described in this miniseries; it means the ones that do can be thoroughly trusted.

The goal of TCA Modeling is not to bureaucratize and slow down the company’s data development. Its goal is to bring confidence to what is important.

How, then, to create faster analyses? Well, there are some other zones we have not described in this series yet (they will be shown in Episode 5), and one of them is the Sandbox zone. Here at Pier, analysts first create their tables in the Sandbox zone; it is only when their analyses grow in importance that they consider moving them to the analytical zone, with all that implies.

In summary:

  • If data is in the Canonical or Analytical Regions, the Data team guarantees its quality.
  • That should not stop the company from running ad hoc analyses, though. Analysts can populate the Sandbox zone as they please, and we have developed a process for making important analyses official by bringing them into the TCA Model.

Governance

Just as with previous Regions, there is no single right answer for governance in the Analytical Region. What we do have is the guarantee that data in every zone of this Region follows a strict set of rules, which makes governance easier to generalize.

What we do here at Pier is:

  • Access Levels in the prepared zone are exactly the same as in the canonical zone, shown in Episode 3.
  • The curated and analytical zones are simply open to every analyst in the company (see the sketch below), because: (1) by definition they store no PII, and (2) Pier is pretty open about its business numbers, as discussed in Episode 3.
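
A minimal sketch of what that openness could look like, assuming a warehouse with Snowflake-style schema-level grants and a hypothetical analyst role (not necessarily what we run at Pier):

-- open both zones for read access to every analyst
GRANT SELECT ON ALL TABLES IN SCHEMA curated    TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA analytical TO ROLE analyst;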

Conclusion

It has been a long modeling journey from data ingestion to delivering value. Over the course of this series, we have done our best to show how we have been tackling Data Modeling at Pier, and hopefully the TCA approach we have adopted can serve as inspiration for other Analytical Engineering teams out there.

It is important to say that this Regions-and-zones approach was designed with Pier’s data and governance needs in mind. That means you should treat this design as inspiration for what would work best in your own scenario. Feel free to create your own Regions and zones if they feel necessary!

Next Up

We have already gone through the most important parts of our TCA Modeling design, but there is still some content we believe would be worth sharing.

In Episode 5, we will go over some extra cool content:

  • Other zones: we have mentioned the Sandbox zone here, but there are still some other zones we have created to make our lives easier, such as a Feature Store zone.
  • Standardization: the actual standards and conventions we follow for naming tables and columns, and which data types to use.
  • Real Examples: as we mentioned in Episode 1, modeling an insurance business is hard. In Episode 5 we will show some real examples of how we do that!
