Does your Analytics data chain start with the weakest link?

Marcel Driesen
Sogeti Data | Netherlands
6 min read · Sep 28, 2021

Business Intelligence focuses on bringing analytics and insights to the business. In many cases a data warehouse application sits between the source data and the dashboards and reports presenting the information. Many books and blogs have the data warehouse enterprise layer and the presentation layer as their main topics. In this blog, however, we focus on the data delivery from the source system to the data warehouse. It is an underestimated ("just give me the data") but important part of your analytics solution, because your data chain is only as strong as its weakest link.

To load data into the data warehouse it is good to know what the provided data looks like and what the characteristics of the data set are. But it is even better to know how the data can best be shaped for loading into the data warehouse. Do not wait for a data set to be delivered by the source system before analyzing how to load it into the data warehouse. Since source data delivery is the first link in the business intelligence data chain, the source system and data warehouse experts should work on it together.

In this blog three guidelines are provided that work best for loading data into a data warehouse modeled with Data Vault**, although the principles are valid for any data warehouse solution.

Data processing: push versus pull and time- versus event-based

Generally there are two ways data can become available to the data warehouse. The data can be pushed from the source systems to the staging area of the data warehouse, or the data warehouse application retrieves (pulls) the data from the source systems via predefined queries. Often the pull that loads the data from the source to the staging area is scheduled in a specific timeframe, to avoid incomplete data sets due to locking and to minimize performance impact on the source system. However, it can be difficult for the data warehouse application to know when the source is ready. This can lead to postponing the daily load to a time much later than the moment the source system is actually available for extraction. Because this adds latency, it is not the best method for getting the data into the data warehouse and making it available to the business user.

Design your flow event-based instead of time-based

When the source system pushes the data to the data warehouse, the source system is in control of the moment of extraction. It has the best knowledge about the status of the system and can start the extraction as soon as the daily business processes are finished and locking and performance impact are minimal. Next, the data warehouse application has to start loading the data from staging as soon as it is provided. It should be designed to be event-based, i.e. the arrival of the data is the event that starts the processing into the data warehouse. In modern data platforms there might be an extra data lake layer between the source systems and the data warehouse; if so, the processing in this layer has to be designed in the same event-driven way. Only then can the data be provided to the business quickly, without the added latency and the uncertainty about data availability that come with time-based scheduling.
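To make the idea concrete, here is a minimal sketch of an event-based trigger, assuming the source system pushes extract files into a local staging directory. The directory path and the `load_into_raw_vault` routine are hypothetical; the watchdog library is just one way to react to the arrival event, and your platform (cloud storage events, message queues, orchestrator sensors) may offer an equivalent.

```python
import time
from pathlib import Path

from watchdog.observers import Observer            # pip install watchdog
from watchdog.events import FileSystemEventHandler

STAGING_DIR = Path("/data/staging")                 # hypothetical landing zone


def load_into_raw_vault(extract: Path) -> None:
    """Placeholder for the actual load of one extract file into the Raw Vault."""
    print(f"Loading {extract.name} into the Raw Vault ...")


class ExtractArrivedHandler(FileSystemEventHandler):
    """Start processing as soon as the source system pushes a file."""

    def on_created(self, event):
        if event.is_directory:
            return
        # The arrival of the extract *is* the trigger - no fixed schedule needed.
        load_into_raw_vault(Path(event.src_path))


if __name__ == "__main__":
    observer = Observer()
    observer.schedule(ExtractArrivedHandler(), str(STAGING_DIR), recursive=False)
    observer.start()
    try:
        while True:          # keep the listener alive
            time.sleep(5)
    finally:
        observer.stop()
        observer.join()
```

The same pattern repeats for an intermediate data lake layer: each layer listens for the output of the previous one instead of waiting for a clock.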

Data sets: align before design

Source data is usually delivered in data sets using files or views. The data provided in those data sets can be highly normalized or heavily denormalized, depending on the way the source system supports data extracts. An example of a highly normalized data set is the delivery of raw table content of an ERP system or any other generic configurable application. Information about a business object is then spread over multiple data sets, which can often only be linked using technical source system keys instead of the key known and used by the business. An example of a denormalized data set is the use of a reporting extension of the source system to deliver data to the data warehouse. These data sets were originally prepared for reports and combine multiple business objects in one data set. Often the business objects are only partially included, and other data sets are needed to gather all data elements of a business object. When loading the data into the data warehouse this results in suboptimal loads to many different Data Vault tables for one data set. Both extremes should be avoided if possible. Contact your source system subject matter expert to discuss the possibilities and design the data exchange to the data warehouse carefully.
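To illustrate the denormalized extreme, here is a small, hypothetical sketch: a single row from a reporting-style extract that mixes customer and order data has to fan out over several Data Vault targets, which is what makes such extracts suboptimal to load. All table and column names are illustrative only.

```python
# One row from a hypothetical denormalized reporting extract.
extract_row = {
    "customer_no": "C-1001",        # business key of the customer
    "customer_name": "Acme Ltd",
    "order_no": "O-2023-042",       # business key of the order
    "order_date": "2021-09-01",
    "order_status": "SHIPPED",
}

# The single extract row fans out over five Data Vault targets.
hub_customer = {"customer_no": extract_row["customer_no"]}
hub_order = {"order_no": extract_row["order_no"]}
link_customer_order = {
    "customer_no": extract_row["customer_no"],
    "order_no": extract_row["order_no"],
}
sat_customer = {
    "customer_no": extract_row["customer_no"],
    "customer_name": extract_row["customer_name"],
}
sat_order = {
    "order_no": extract_row["order_no"],
    "order_date": extract_row["order_date"],
    "order_status": extract_row["order_status"],
}
```

A highly normalized extract has the opposite problem: each Data Vault target needs several data sets joined on technical keys before it can be loaded.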

Data integration: keys

I usually model data warehouses using the Data Vault methodology. The Data Vault organizes data around business objects. Business keys are used to identify, track and locate information about business objects, and they are supposed to have meaning to the business. Business keys can have meaning across systems within an organization, e.g. the same business key identifies the customer in both the CRM system and the Order system. That provides the first level of data integration between CRM and Order data in the Data Vault. The opposite of business keys are surrogate keys, used by operational systems to identify the business object. These keys are internal to the system, not shown to or used by the business, and unlikely to be used across systems. To be able to organize data around the business keys in the Raw Vault it is important that the source system delivers data with the business key. Also, when a data set contains references to other data sets (a foreign key), the source system needs to provide this reference with the business key and not the surrogate key!
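As a small illustration of that first level of integration: when both systems deliver the same business key, their records resolve to the same hub entry. The sketch below is hypothetical and uses a hashed business key, a common (but not mandatory) Data Vault convention for hub keys.

```python
import hashlib


def hub_hash_key(business_key: str) -> str:
    """Hash of the (normalized) business key, often used as a Data Vault hub key."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()


# Both source systems deliver the same business key for the customer ...
crm_customer = {"customer_no": "C-1001", "source": "CRM"}
order_customer = {"customer_no": "C-1001", "source": "ORDERS"}

# ... so both resolve to the same hub record: integration comes with the key.
assert hub_hash_key(crm_customer["customer_no"]) == hub_hash_key(order_customer["customer_no"])
```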

If the source system cannot or will not deliver the business key for these references and only delivers the surrogate keys, the only option is to load the data into the Raw Vault (the first part of the Data Vault) using the surrogate key as the business key. Only in the Business Vault (the second part of the Data Vault, which extends the Raw Vault) can this be corrected to integrate the data around the business key. I do not recommend trying to replace the surrogate keys with business keys in the staging area (the temporary location on the way to the Data Vault). This creates a dependency between data sets in the staging area before the data can be loaded into the data warehouse. It also requires full sets to be delivered, because if only delta sets are delivered, the lookup for the business key in another delta data set is unlikely to find all the needed values.
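A small, hypothetical sketch of why that staging-area lookup breaks down with delta deliveries: the order delta references customers by surrogate key, but today's customer delta only contains the customers that changed today, so not every reference can be resolved. All keys and names are illustrative.

```python
# Today's customer delta: only changed customers (surrogate key -> business key).
customer_delta = {
    501: "C-1001",      # surrogate key 501 changed today
}

# Today's order delta references customers by surrogate key.
order_delta = [
    {"order_no": "O-2023-042", "customer_sk": 501},
    {"order_no": "O-2023-043", "customer_sk": 502},   # customer 502 did not change today
]

for order in order_delta:
    business_key = customer_delta.get(order["customer_sk"])
    if business_key is None:
        # The reference cannot be resolved from the deltas alone - the reason
        # not to swap surrogate keys for business keys in the staging area.
        print(f"{order['order_no']}: no business key found for surrogate {order['customer_sk']}")
```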

Final thoughts

The three guidelines above are not the only important ones when designing the data chain. But they are easily overlooked, for instance by taking the first proposal for every source data delivery for granted. The important takeaway of this blog is to design the data delivery from the source system to the data warehouse together with the source system experts. Share the needs, wishes and difficulties from both sides to design an optimal interface and data delivery, and make sure your data chain starts with a strong link.

** I assume the reader has some knowledge about Data Vault modeling and knows the concepts of Hubs to store business keys, Links to store relationships between business keys and Satellites to store the change history of attributes belonging to a business key.
