Database Modeling: Relational vs. Transformational
Map meets territory in the definitive guide to relational and transformational modeling. How to use each and get the most out of both.
I image-googled “modeling” and this is what came up. While it has absolutely nothing to do with databases, it accomplishes two important things: it grabs attention (thanks for clicking) and illustrates that modeling is indeed a broad concept. You may have heard it in varying contexts: relational, dimensional, data, logical, physical, transformational, maybe even hand?
Do not be alarmed. While there are indeed many modeling-s, they tend to fall into two principal categories related to common BI workflows. This article will demystify what they are, where they are used, and most importantly, how to combine them to unlock their full potential.
Although the word is often thrown around without a qualifier, modeling usually refers to two related, though seldom overlapping, activities: relational modeling and transformational modeling. Relational modeling defines how data can be stored, organized, and manipulated, and Transformational modeling is the manipulation that follows (a.k.a. the “T” in “ELT”, or simply, data analytics.)
In Business Intelligence (BI), particularly in data warehousing, we often find ourselves upstream of original source data — constructing analyses or building new data models on top of existing tables. For many analysts, this type of transformational modeling may be the only kind they engage in.
However, this does not mean that relational modeling is unimportant. It means that in most cases, someone has already done it for us. First, let’s make sure we understand relational modeling and how to leverage it to make transformational modeling easier.
The previous article (The Lost Art of Dimensional Modeling) broke down how to plan and build a data warehouse, dimension by dimension. Relational modeling takes that one step further, providing a declarative method to describe both data and structure in a relational database.
Relational modeling is mainly concerned with the structure of data and its internal relationships — not the data itself. That is to say, it explicitly defines database objects but only describes data at a meta-level. It’s the DDL side of SQL, concentrating mainly on the CREATE statement and column specifications.
Ok, but why use relational modeling?
Because humans can’t natively read data or DDL (if we could, “data literacy” wouldn’t be a thing.) Humans can, however, understand entities (e.g., store, employee, sale) and associations (e.g., “employees at a store” or “sales by employee.”)
Although we tend to think in terms of tables when working with SQL, rarely do we try to envision multiple rows at a time — that will come later. The first question is always, “what does a single row of a given table represent, and how can I tie it to other tables?” The technical term for this is a tuple, and tuples are grouped into relations, or, simply, tables.
In practice, defining a relation looks something like this:
CREATE TABLE DimStore
StoreId integer NOT NULL,
Address string NOT NULL,
SizeCategory varchar(1) NOT NULL,
CONSTRAINT Pk_DimStore PRIMARY KEY ( StoreId )
This can also be represented visually:
This type of representation is called an Entity-Relationship (ER) diagram, and it makes relational models easier to understand and interpret. Cognition through visualization is pretty much a human superpower. Here, I am using SqlDBM — which is quickly becoming the industry standard for cloud-based database modeling — to visualize the tables.
The simple example above tells us that a single store in our database can be identified by a unique StoreId and contains Address and SizeCategory properties.
Next, we expand the diagram by adding the DimEmpoyee table. The relational diagram now tells us that an employee record can contain a single and unique StoreId or null. Therefore, an employee can only work in one, none, but not multiple stores at any given time.
Now let’s try and answer the same question without relational modeling — using only data.
Having access to the data in these two tables, I can easily answer questions like “how many stores are there?” and “what’s the average employee count by store size?” — I can even see that today, the total amount of employees working at multiple stores is zero. However, this doesn’t prove that such a thing can’t happen tomorrow.
Whether performing ad-hoc data analysis or constructing pipelines and data processes, we usually begin by selecting a data set — perhaps from a single table or multiple tables using a join condition. The result set may contain extra columns that have been calculated at the row-level or aggregated across the entire set.
Whatever the requirement, the DML (data manipulation language) side of SQL gives the tools we need to construct the result set using existing data. For example:
SELECT * FROM TABLE A WHERE COL_1 = TRUE
But that’s still data analysis. Transformational modeling happens when elements of DDL and DML are combined, like in the case of the CTAS (Create Table As Select.)
Like relational modeling, transformational modeling still creates tables (and views), but it is less concerned with defining data types and relationships. Using transformational modeling, a table can be defined using nothing but logic. For example:
CREATE TABLE B AS (SELECT * FROM TABLE A WHERE COL_1 = TRUE)
Table B inherits all the structural properties of table A, as well as a copy of Table A’s data that meets the selection criteria.
You need both
Although Table B inherits structure and data from Table A, it does not inherit A’s primary or foreign keys. Since relationships can not be defined through logic, Table B now sits adrift in the vast expanse of the database schema, unrelated to any other entity.
It might help to think of relational modeling as a city map, showing every house and neighborhood and including the streets and highways that connect them — perfect for planning an itinerary. Transformational modeling is the itinerary itself: a route from home to work, with a stop at Starbucks in between to pick up coffee and a muffin.
The map shows all possible routes, while the itinerary shows the one very detailed route you took to get somewhere specific, plus all the stops you made in between. Here’s what that looks like in practice.
Why you need both
Let’s build on the earlier example of employee and store by adding a sales fact table that includes the employee id as a foreign key. On top of that, let’s create a sales analysis table, with some added detail, as might be used for a reporting dashboard.
The analysis table is built directly on top of the sales fact table. It has the same grain and primary key. While the store and employee ids were not deemed necessary for reporting, the employee title and store size were.
Someone looking at the relational diagram for this scenario might (rightly) conclude that DimEmployee and DimStore are unrelated to ResellerSalesAnalysis. So how, then, did we pick up that coffee and muffin (StoreSizeCategory and EmployeeTitle) on the way from FactResellerSales to ResellerSalesAnalysis?
The beauty of transformational modeling is that it relies on the ingenuity and creativity of the developer to achieve its end. Just as reviewing a map allows one to plan the ideal itinerary, transformational modeling allows one to shape data to one’s exact needs.
If a reporting table is requested based on FactResellerSales but requires details from other tables, relational modeling can tell us which sources to look in, and how to link them. Like Google Maps for databases.
Using the relational model, we can quickly determine where the required data resides and relate the tables to one another. This makes creating transformational logic trivially easy. In this example, it would look something like this:
create table ResellerSalesAnalysis as (with join_emp_store as (select * from DimEmployee
inner join DimStore on
DimEmployee.StoreId = DimStore.StoreId
), join_emp_store_Sales as (select * from FactResellerSales
inner join join_emp_storeon
FactResellerSales.EmployeeKey = emp_store.EmployeeKey
), final as (select
SizeCategory as StoreSizeCategory,
Title as EmployeeTitle,
CustomerPoNumberfrom join_emp_store_Sales)select * from final)
While it’s true that ResellerSalesAnalysis table has no relationship to DimStore or Dim Employee, it does have a dependency on them in the transformation we created.
Although there are similarities between the two, transformational relationships can not be represented on an ER diagram. They are best viewed on a lineage graph, which tracks data flow from source to target entities. In this example, I am using dbt, the industry standard for transformational modeling to visualize the resulting model.
This is what the lineage graph looks like for the transformation above:
Granted, a lineage graph does not convey nearly as much detail as an ER diagram, but that’s because it doesn’t need to. The SQL embedded in the associated transformation is the best abstraction to express its complex analytic logic.
Of course, transformational modeling goes much deeper than the CTAS, with things like INSERTs, UPSERTS, streaming data, etc. which fall outside the scope of this simple example but present similar challenges for visualization on an ER diagram.
Now that we have created ResellerSalesAnalysis, we can close the loop between the two modeling types by reading the DDL associated with this table. Using SqlDBM’s “Reverse Engineering” functionality, this DDL can be automatically pulled into our ER diagram, and a real relationship can be established.
Doing this allows all the tables involved to exist in a single tool, expanding the borders of our map, if you will. This means we can leverage other features from SqlDBM, like documentation and version control, all in a single place.
What bout views?
Views are tricky because they encapsulate transformational (DML) logic inside a relational (DDL) database object, and despite all attempts at labeling, are fully supported in either modeling type.
In the previous example, ResellerSalesAnalysis could have been modeled as a view directly in SqlDBM. Simply by changing the word “table” to “view” in the associated DDL, we can bridge the gap between transformational and relational modeling and display the object right on the diagram.
Of course, the option of Reverse Engineering a view is always there for us as well.
When I hear developers say “I don’t need relational modeling, I can do all the joins myself,” what I really hear is, “I don’t need a map. I know my way around.” Which is often true — until it isn’t.
In the life-cycle of enterprise warehousing, it doesn’t take very long at all before no single developer can cope with the growing number of relational sources and resulting transformational entities. This is when teams begin forming knowledge silos to help distribute the load — until that load becomes too much for one team to bear.
Eventually, the growing system complexity forces the decision: scale knowledge or scale bodies (hint: one is drastically cheaper than the other.)
By combining transformational modeling (analytics) with relational diagrams, we transform a sprawling data grid into a Ctrl-F-able data map.
When the two are working in harmony, the organization sees a thriving data community: a growing data metropolis where new developments are constantly being created, and the street directory keeps pace. The alternate scenario is the data wild-west, where you had better find an expert data ranger to help you make your way.
Yes, transformational modeling delivers data to the business, but relational modeling helps deliver the transformation.