Agile Data Warehouse Modeling through Modelstorming

Rendy B. Junior
Traveloka Engineering Blog
7 min read · Jul 24, 2019

Authors: Rendy B. Junior (Data System Architect) & Dimas H. Adiluhung (Data Engineer)

Data warehouse development is commonly seen as a multi-year project. By the time it finishes, the model is often no longer relevant because the business has changed along the way. To illustrate, a user flow may change frequently because several A/B experiments show that a new flow works better for users; those changes in turn require changes in the data used to measure user conversion.

Living in the stateful realm of data is a constant challenge that demands agility, so that we can adapt to a fast-changing and competitive world. The Agile Manifesto was created to address exactly this situation, replacing an older way of working where things moved at a slower pace with less change. So the question becomes: how do we apply the same agile principles to the data warehousing discipline?

Dimensional modeling offers the possibility of agile development because the modeling is done per business process [1]. In contrast, other methods such as the normalized ER model [2] require a vast upfront investment to understand the entire business model in advance. We use dimensional modeling as our thinking framework, but how we should work with it is the next question. While learning how to do agile data warehouse modeling, we stumbled upon a reference [3] that coined the term “modelstorming”.

The Gist of Modelstorming and its Inherent Process

Disclaimer: What we are sharing here is our customized version of modelstorming, adapted to better fit our architecture’s needs and context; we don’t apply everything mentioned in the reference [3].

The term modelstorming comes from modeling and brainstorming. The idea is to brainstorm the model with the stakeholders and come up with a simple prototype at the end of the modelstorming activity.

The modelstorming process consists of several steps, from metrics definition to the deployment of the data pipeline to production.

  1. Define the metrics. Work with the analytics team to understand what insight business users actually need.
  2. Modelstorming session preparation. The data engineer identifies the sources of data and creates SQL views for the metrics.
  3. Modelstorming session. The data engineer shows the data, asks stakeholders to confirm that the definitions are aligned, and gets direct feedback on the model. The final output of this session is a model prototype in the form of SQL views.
  4. Prototype testing. The analytics team tests the prototype extensively and gives feedback if anything in the data is unexpected.
  5. Pipeline implementation. The data engineer then implements and schedules the processing.
  6. Final acceptance. The analytics team then does the final acceptance testing on the data.
Modelstorming process

To manage the process, we use a Kanban board with predefined steps. The process is repeated for each block. A block, a term we coined ourselves, denotes a collection of metrics for one business process. So in each iteration we focus on a few metrics related to one specific business process. In practice, the process is not strictly sequential: steps 1–3 can continue on to the next block without waiting for steps 4–5 to finish.

Business Process — Block — Metrics

Modelstorming Steps

Define the metrics

A data warehouse is subject-oriented, meaning information is stored around particular subject areas (e.g. customer, user, product, sales, payment, invoice). So the natural way to design a data warehouse is to list the business processes and their related entities. In the end, the tables in the data warehouse represent the company’s business processes and entities.

However, as mentioned before, what matters to business departments is the metrics rather than the business processes, because metrics are what they use to measure business efficiency and effectiveness. So naturally, what business stakeholders request from the data warehouse team is a set of metrics, plus the ability to drill up/down/through and to slice and dice them by some dimensions (the metrics’ point of view).

The analytics team defines the metrics, and data engineers may raise questions or seek clarification from the analytics team in order to arrive at clear metric definitions. We usually focus on a select few of the most important metrics of the business (north star metrics). These metrics are in turn used as one of the inputs to design the data model for business needs.
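To make this concrete, here is a hypothetical example of what a north star metric definition might look like once it is pinned down with the analytics team. The table and column names (search_session, booking) are made up purely for illustration, not an actual schema:

  -- Hypothetical definition of a north star metric agreed with the analytics team:
  -- daily booking conversion rate = users who booked / users who had a search session that day.
  SELECT
    s.activity_date,
    COUNT(DISTINCT b.user_id) * 1.0 / COUNT(DISTINCT s.user_id) AS booking_conversion_rate
  FROM search_session s
  LEFT JOIN booking b
    ON  b.user_id      = s.user_id
    AND b.booking_date = s.activity_date
  GROUP BY s.activity_date;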

Modelstorming Preparation

The reason we build the data warehouse based on business processes and entities instead of on metrics is that building a data warehouse is mostly about building a capability rather than delivering specific reports. It is a mistake to take the set of metrics in the requirements and build a data warehouse that only satisfies that reporting requirement: tomorrow, new metrics might arise that would fundamentally change the data warehouse (usually its level of detail, known as the grain of a fact table). So, after receiving the metrics requirements from the business side, the course of action is:

  1. Map the metrics requirements onto business processes and entities, then list the information/fields that naturally exist in each business process in order to have a logical model of the data warehouse.
  2. Define the fact table grain. It should be the lowest grain of transaction in the business process. Map the dimensions onto the enterprise bus matrix [4].
  3. Based on the logical model, create a physical table design suited to the technology or tools used in the data warehouse (see the sketch after this list).
  4. Map the fact table’s fields onto data in the source systems (source-to-target mapping).
  5. Design a summary table containing data that is ready to be consumed by the BI tool.
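As an illustration of steps 2 and 3, a fact table for a booking-like business process might be declared at the grain of one row per booked item, with keys pointing to the conformed dimensions on the bus matrix. This is only a sketch with made-up names, not our actual schema:

  -- Illustrative physical design at the lowest transaction grain:
  -- one row per booked item, with foreign keys to conformed dimensions.
  CREATE TABLE fact_booking (
    booking_item_id  BIGINT        NOT NULL,  -- degenerate dimension from the source system
    date_key         INT           NOT NULL,  -- references dim_date
    customer_key     BIGINT        NOT NULL,  -- references dim_customer
    product_key      BIGINT        NOT NULL,  -- references dim_product
    booking_amount   NUMERIC(18,2) NOT NULL,  -- additive measure
    booking_count    INT           NOT NULL   -- additive measure, 1 per row at this grain
  );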

After the design has been made, we create a prototype using one SQL view that queries the raw data directly into the dimensional model, and another SQL view that queries the dimensional tables into the summary table.
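A minimal sketch of what those two prototype views could look like, assuming the hypothetical fact design above and a raw source table named raw_booking; the real joins and transformations differ per business process:

  -- Prototype view 1: raw source data shaped into the lowest-grain fact.
  CREATE VIEW v_fact_booking AS
  SELECT
    r.booking_item_id,
    d.date_key,
    c.customer_key,
    p.product_key,
    r.total_amount AS booking_amount,
    1              AS booking_count
  FROM raw_booking r
  JOIN dim_date d     ON d.full_date   = CAST(r.created_at AS DATE)
  JOIN dim_customer c ON c.customer_id = r.customer_id
  JOIN dim_product p  ON p.product_id  = r.product_id;

  -- Prototype view 2: lowest-grain fact aggregated into the summary the BI tool will read.
  CREATE VIEW v_summary_daily_booking AS
  SELECT
    date_key,
    product_key,
    SUM(booking_amount) AS total_booking_amount,
    SUM(booking_count)  AS total_bookings
  FROM v_fact_booking
  GROUP BY date_key, product_key;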

Modelstorming Session

In this step, we present the prototype to the stakeholders to confirm whether the solution design fulfills the business requirements. If needed, after a modelstorming session the respective stakeholders can explore the prototype more thoroughly to further check whether the logic in the SQL views produces the desired data. Since the BI requirements will be fulfilled by the summary table, that table naturally becomes the main focus of discussion during the session. However, it is worth also paying attention to the low-grain table, since it will become the base for data exploration to derive new insights from the data.

Pipeline Implementation

There are two processing jobs that need to be designed. The first goes from the source to the lowest-grain data warehouse table and is made up of two parts (sketched below):

  1. Data cleansing: flagging or filtering out unexpected data patterns.
  2. Data transformation based on business requirements.
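As a rough sketch, and again using the hypothetical raw_booking source, the first job might be expressed in SQL roughly like this, with the cleansing rules shown purely as examples:

  -- Part 1: cleansing, filter rows that break basic assumptions and flag suspicious ones.
  CREATE TABLE stg_booking_clean AS
  SELECT
    r.*,
    CASE WHEN r.total_amount > 1000000000 THEN 'SUSPICIOUS_AMOUNT' ELSE 'OK' END AS quality_flag
  FROM raw_booking r
  WHERE r.booking_item_id IS NOT NULL
    AND r.total_amount >= 0;

  -- Part 2: business transformation into the lowest-grain fact table.
  INSERT INTO fact_booking
    (booking_item_id, date_key, customer_key, product_key, booking_amount, booking_count)
  SELECT
    s.booking_item_id,
    d.date_key,
    c.customer_key,
    p.product_key,
    s.total_amount,
    1
  FROM stg_booking_clean s
  JOIN dim_date d     ON d.full_date   = CAST(s.created_at AS DATE)
  JOIN dim_customer c ON c.customer_id = s.customer_id
  JOIN dim_product p  ON p.product_id  = s.product_id
  WHERE s.quality_flag = 'OK';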

The second job goes from the lowest-grain table to the summarized data warehouse table. To speed up BI tool queries against the data warehouse, summarized data should be provided. The grain of this table depends on the BI requirements; it can be daily, weekly, or data grouped by some dimensions.

In some cases, users need to read data dating several years back, and the data has to be refreshed on a daily basis. Building a processing pipeline that reads the entire source table every day is therefore not an option. One solution is a snapshot table created by merging the previous snapshot with the current day’s transactions or summary.
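One way to implement such an incremental refresh, assuming the warehouse engine supports a standard SQL MERGE and using the same hypothetical table names as above, is to fold only the freshly loaded day into the existing snapshot:

  -- Refresh the snapshot incrementally: merge only the freshly loaded day
  -- into the existing summary instead of rescanning years of history.
  MERGE INTO summary_daily_booking t
  USING (
    SELECT date_key, product_key,
           SUM(booking_amount) AS total_booking_amount,
           SUM(booking_count)  AS total_bookings
    FROM fact_booking
    WHERE date_key = 20190724   -- parameterize with the run date in the real pipeline
    GROUP BY date_key, product_key
  ) s
  ON t.date_key = s.date_key AND t.product_key = s.product_key
  WHEN MATCHED THEN UPDATE SET
    total_booking_amount = s.total_booking_amount,
    total_bookings       = s.total_bookings
  WHEN NOT MATCHED THEN INSERT
    (date_key, product_key, total_booking_amount, total_bookings)
  VALUES
    (s.date_key, s.product_key, s.total_booking_amount, s.total_bookings);

This keeps the daily refresh proportional to a single day of data; the full history only needs to be rebuilt when the schema or logic changes, which is the backfill case discussed later.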

Summary table

Final Acceptance

Before the data engineer deploys the pipeline to production, a user acceptance session needs to be performed to test the implementation result.

What We’ve Learned

There are both benefits and drawbacks to the modelstorming process. One major benefit we see is that the data modeling process becomes much faster. In the past, we needed to wait for some time before getting useful feedback on the data model we had built. By holding a session where we show a data prototype as SQL views, the analytics team can give feedback directly, making the process faster than before. Another benefit is that it helps us avoid changes to the data processing code, which is much more costly to change once it has been deployed to staging than an SQL view is.

However, the process also has some drawbacks. We are still trying to find a balance between the number of metrics and the number of iterations. When there are too many metrics involved, an iteration becomes too long. Conversely, when the work is split across too many iterations, the overhead becomes quite high, especially for data backfills. Applying agile to data engineering turns out to be quite different from software engineering, primarily because of the backfill: we work in a stateful realm where we need to manage the state (the data in the data warehouse) and backfill whenever the schema or logic changes. A multi-year backfill usually takes a long time to run and also proves to be quite costly.

Overall, the modelstorming process serves us well in delivering value to the business, and we continue to improve it iteratively over time.

References

[1] Kimball, Ralph & Ross, Margy. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. 2nd ed. Wiley, 2002.

[2] Inmon, Bill. Building the Data Warehouse. 4th ed. Wiley, 2005.

[3] Corr, Lawrence & Stagnitto, Jim. Agile Data Warehouse Design. DecisionOne Press, 2011. Video presentation: https://www.youtube.com/watch?v=ZNrTxSU5IQ0

[4] Enterprise Bus Matrix https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/enterprise-data-warehouse-bus-matrix/
