How to Organize your Data Lake / Warehouse — Miniseries

Episode 2: The Technical Region

Flavio Altinier
Pier. Stories
8 min read · Jan 31, 2023


This is the second of a series of 5 episodes describing the TCA (Technical-Canonical-Analytical) Data Modeling approach, an evolution of the Star and Snowflake schemas for data analysis.

The episodes should be read in sequence, as we are expanding the details of TCA Modeling progressively. So, for the continuity of this article, we will assume the reader is familiar with the concepts described in Episode 1.

Picking up from where we left off

In Episode 1, we had a quick overview of how TCA Modeling works, with the three Macro Regions (Technical, Canonical and Analytical) in a diagram. In this Episode we will dive deep into the first of those regions: the Technical Region.

TCA Modeling diagram with emphasis on the Technical Region

The whole idea behind the Technical Region is that it should gather transformations that require no business logic whatsoever. Every transformation in this region should be generic and scalable — in such a way that when a new data source is added to the LakeHouse, it travels all the way down to the rebuilt zone flawlessly.

Over the course of this series, our focus will always be on what kind of data should lie inside of each zone. That means we will not concentrate on technologies or tools for storing and transforming that data — there is already plenty of material out there for that. So, although sometimes these articles may mention the tooling we use to make the examples clearer, understanding how that apparatus works will not be our focal point.

With that out of the way, let us dive further into each of those zones, to have a better understanding of how they work.

The Transient Zone

This is the simplest of the zones in the Technical Region: it is just where the data first lands when it reaches the LakeHouse.

This transient zone approach is fairly common in other Data Lake arrangements out there, as it is pretty much baked into the ELT concept itself: data should not be transformed before being loaded into the Lake.

So, in TCA Modeling, the transient zone behaves exactly as in common ELT approaches. Ingested data should be stored in the transient zone in whatever format, structure (or lack thereof), confidence or timeliness it arrives with. Once the data is safely stored, we can start playing around with it.
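Just to make this concrete, here is a minimal sketch of what a landing step into the transient zone could look like, assuming JSON payloads being dropped as-is into an S3 bucket with boto3 (the bucket name and key layout are illustrative only, not our actual setup):

import json
from datetime import datetime, timezone

import boto3


def land_payload(payload: dict, source: str) -> None:
    # Store the payload exactly as received: no parsing, no cleaning, no typing
    key = f"transient/{source}/{datetime.now(timezone.utc).isoformat()}.json"
    boto3.client("s3").put_object(
        Bucket="example-lakehouse-transient",  # illustrative bucket name
        Key=key,
        Body=json.dumps(payload),
    )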

The Raw Zone

The idea behind the raw zone is that, here, transient data should be better organized and standardized. Again, not that much different from common Data Lake organization approaches.

In the raw zone, every piece of data should follow the same standard:

  • No data from the transient zone should be altered (yet). No aggregations, filters, nothing. Even if it feels super messy, this is not the time to fix anything — there will be plenty of zones ahead for that.
  • From this zone on, every piece of data should be stored in the same format. As there are going to be multiple transformations ahead, it is always nice to choose something that scales well with your LakeHouse technology. Here at Pier we use Athena as our query engine — hence, it makes sense to save everything in parquet.
  • Everything must be SQL-readable. This is where most of the engineering focus in this region should be: creating translators from a mess of formats into a single, standard relational format.
  • Every column must be structured. Even if unstructured data was ingested, this is the time to structure it. Most tooling out there supports complex data types — use them.
  • Every column must be strongly-typed.

Just following this strict set of rules correctly already requires a lot of engineering work — and we have not changed a single row of data yet!

The good news is that this transient-to-raw transformation, though difficult to build, scales pretty nicely, as no business knowledge is needed. In other words, if done correctly, you can just build these transformations and forget they exist, as it is all pretty standard.
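As a rough sketch of what such a transformation can look like, here is a pandas-based transient-to-raw job for a hypothetical JSON-lines landing file; the column names and paths are purely illustrative, and only serve to show the rules above in action (same format everywhere, structured columns, strong types):

import pandas as pd


def transient_to_raw(landing_path: str, raw_path: str) -> None:
    # Read the landed file exactly as ingested: no filters, no aggregations
    df = pd.read_json(landing_path, lines=True)

    # Enforce explicit, strong types on every column
    df = df.astype({"id": "string", "column1": "string", "column2": "string"})
    df["created_at"] = pd.to_datetime(df["created_at"], utc=True)

    # Persist everything in the same columnar, SQL-readable format
    df.to_parquet(raw_path, index=False)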

The Clean Zone

This is where things start to get more interesting.

The idea behind the clean zone is that it should be a copy of the raw zone, filtering out everything that was ingested or loaded by mistake, or that carries no useful information at all.

Some common filters to use here would be:

  • Removing empty rows
  • Removing duplicate rows
  • Removing test rows — there is usually something the data generators use to flag this kind of data

Those are fairly basic, but there are plenty of other filters which could be useful depending on each use case. Have a look at this example:

Here at Pier, we ingest data from our back-end's Postgres databases using a Change Data Capture (CDC) approach, reading their write-ahead logs. That means that, for every table, we get the logs for every insert, update or delete operation, together with some metadata to make it easier to understand the operations' history.

Let us say we are ingesting an example_table with columns ID (which is the table’s primary key), column1 and column2. In the raw zone, example_table would then look something like this:

+-----------+---------------+----+---------+---------+
| Operation | CDC Timestamp | ID | column1 | column2 |
+-----------+---------------+----+---------+---------+
| insert    | t_1           | A  | value1  | value2  |
| update    | t_2           | A  | value1  | value3  |
| insert    | t_3           | B  | value4  | value5  |
| delete    | t_4           | B  |         |         |
+-----------+---------------+----+---------+---------+

In this example, the ID, column1 and column2 are the original table’s columns, and the Operation and CDC Timestamp columns represent metadata added by our CDC pipeline. Hence, we can understand the exact state of our databases at any given point in time — more on that in the following Episodes.

Sometimes, our back-end systems will invalidate their database caches by making a dummy update operation on a table: an operation that does not really update anything. In our CDC ingestion pipeline, that creates a new “update” row that changes nothing for that entity other than the CDC Timestamp column. That operation is useless for analysis and only makes transformations harder down the line — so, we filter those rows out here in the clean zone.
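Here is a hedged sketch of that dummy-update filter, again in pandas and reusing the column names from the example_table above (the real pipeline may look quite different, so treat this purely as an illustration):

import pandas as pd


def drop_dummy_updates(df: pd.DataFrame) -> pd.DataFrame:
    # Compare each operation against the entity's previous known state
    df = df.sort_values(["ID", "CDC Timestamp"])
    payload_columns = ["column1", "column2"]
    previous = df.groupby("ID")[payload_columns].shift()

    # An "update" that changes none of the payload columns is a dummy update
    unchanged = (df[payload_columns] == previous).all(axis=1)
    return df[~((df["Operation"] == "update") & unchanged)]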

Keep this example in mind, as it is also going to be useful for the next section.

The Rebuilt Zone

Entities usually go through multiple operations in their original systems. In our example_table, entity A went through an insertion and an update, and entity B went through an insertion and a deletion.

That historical data is extremely valuable for anyone trying to understand an entity's evolution over time. For most users, however, that history, while interesting, is not really that important: they are only interested in the entity's current state.

That is what the rebuilt zone is for: summarizing each entity (a primary key in the original system) into a single row per table, in such a way that its columns reflect the entity's current state. In our example, as we are ingesting operations from a relational database, the rebuilt zone would look like a snapshot of the current database state.

In the rebuilt zone, our example_table looks like this:

+----+---------+---------+
| ID | column1 | column2 |
+----+---------+---------+
| A  | value1  | value3  |
+----+---------+---------+

  • Entity A had two operations: an insertion and an update. Here in the rebuilt zone, we keep its last known state — in other words, its state after the update.
  • Entity B had two operations: an insertion and a deletion. As its last operation erased it, we simply ignore this entity in the rebuilt zone.

Transforming data from the clean zone into the rebuilt zone follows a pretty straightforward algorithm:

for every entity:
    row = row with greatest CDC Timestamp
    if row.Operation != 'delete':
        yield row
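In runnable form, and assuming the clean-zone rows are available as Python dictionaries keyed by the same column names as our example (an assumption made purely for illustration), that algorithm could look like this:

from itertools import groupby
from operator import itemgetter


def rebuild(clean_rows):
    # Group the clean-zone operations by entity (the original primary key)
    rows = sorted(clean_rows, key=itemgetter("ID"))
    for _, operations in groupby(rows, key=itemgetter("ID")):
        # Keep only the entity's last known state
        latest = max(operations, key=itemgetter("CDC Timestamp"))
        if latest["Operation"] != "delete":
            yield latest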

Notice that although we are dealing with the relational database use case, this approach works nicely with any kind of “operations over entities” data as input — data such as Events.

This is the only zone so far where data is mutable: while the transient, raw and clean zones keep an eternal history of every operation executed over data, the rebuilt zone stores only the data’s most recent state.

Access Level Policies

In TCA Modeling, data in each zone must follow a strict set of rules. That makes it easier to create a map of which kind of user should have access to each zone.

That mapping may vary a lot from company to company, depending on many internal and external factors: the corporation's org chart, local legislation, or other governance policies. The important thing is that, with the data divided into zones, that mapping becomes possible and far easier than the usual approach of only limiting access when the legal team asks you to.

Just as an example, here at Pier we enforce the following policies:

  • Transient zone: only the Data Engineering team has access to it. People should not be playing around with transient data.
  • Raw zone: a few very specific developers are added to the bucket, just so they can have a look at the data they are producing.
  • Clean zone: a few more developers, plus anyone responsible for populating the Canonical Region (more on that in the next Episode). Here, those would be our Analytical Engineers.
  • Rebuilt zone: the same as the clean zone.
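Just as a purely illustrative sketch (the group names are invented for this example), that kind of zone-to-group mapping can be kept as simple configuration and enforced by whatever access-control tooling you already use:

# Illustrative mapping; real policies would live in your IAM / governance
# tooling rather than in application code
ZONE_READERS = {
    "transient": {"data-engineering"},
    "raw":       {"data-engineering", "selected-developers"},
    "clean":     {"data-engineering", "developers", "analytical-engineers"},
    "rebuilt":   {"data-engineering", "developers", "analytical-engineers"},
}


def can_read(group: str, zone: str) -> bool:
    return group in ZONE_READERS.get(zone, set())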

Summarizing

At the end of the Technical Region, data has reached the finest state possible without any business logic being applied. This may still seem like a long way from truly delivering value, but make no mistake: many stakeholders can already take advantage of the data created in this region:

  • It is a bad practice to give developers direct access to production databases, even the most senior ones. They may be granted access to the Data Lake instead, where they will find a perfect copy of their database in the rebuilt zone, just waiting to be queried
  • In a micro-services back-end, having every database be "joinable" in the rebuilt zone is a great feature for developers
  • Having the history of operations is excellent for debugging problems in production
  • For regulated businesses (such as insurance), that historical data is worth gold!

In my experience, if you made it as far as the rebuilt zone, you are already ahead of ~80% of the market.

Next up

In the next Episode, things will start to get trickier as we begin to inject business logic into our transformations. Be sure not to miss it!
