Basics of Building a Data Warehouse: Part 2

Seth Goldberg
Charting Ahead
5 min read · Jul 29, 2017


Now that we’ve gone over the basics of selecting your database, it’s time to move on to your approach to warehousing your data.

Data Modeling

There are three major options for data modeling: top-down (Inmon), bottom-up (Kimball), and DataVault (a hybrid). Below I’ve provided a quick summary for each one.

Third Normal Form Model

The top-down approach (Inmon) follows the classic 3NF model, which means reduced redundancy and tightly integrated data. One of the important elements of this approach is that data is cleansed BEFORE it goes into the warehouse. Unfortunately, this also increases the potential for error, since “cleaning” the data can itself introduce bad data. If you are going to take the top-down approach, I would advise having the relevant SMEs verify that your scrubbing looks correct.
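To make that concrete, here is a minimal, hypothetical 3NF sketch (the table and column names are purely illustrative): each attribute lives in exactly one place, so a customer’s address is never repeated on every payment row.

```sql
-- Hypothetical 3NF sketch: each piece of data is stored exactly once,
-- and related tables are tied together with foreign keys.
CREATE TABLE customer (
    customer_id  INTEGER PRIMARY KEY,
    first_name   VARCHAR(50),
    last_name    VARCHAR(50)
);

CREATE TABLE customer_address (
    address_id   INTEGER PRIMARY KEY,
    customer_id  INTEGER REFERENCES customer (customer_id),
    street       VARCHAR(100),
    city         VARCHAR(50),
    postal_code  VARCHAR(10)
);

CREATE TABLE payment (
    payment_id   INTEGER PRIMARY KEY,
    customer_id  INTEGER REFERENCES customer (customer_id),
    amount       DECIMAL(10, 2),
    paid_at      TIMESTAMP
);
```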

(Figure: example star schema. Source: https://www.logianalytics.com/wp-content/uploads/2015/09/star-schema.jpg)

The second major approach to data modeling is the bottom-up approach (Kimball), also known as “star schemas” or “dimensional models.” This approach is, by far, the most popular, due to its speed and simplicity. Peruse any Business Intelligence tool, and you’ll see that it supports dimensional models without any modifications required.

Dimensional models consist of two key parts: facts and dimensions. The facts contain what you are measuring, and the dimensions contain the metadata around the facts. Let’s examine this in the context of a payment system: the payment would be the “fact,” while the customer would be the “dimension.” The goal of the dimensional model is to make things as easy as possible for your users, which means you’ll likely end up transforming data into simple facts and clear labels that end users can pull into their queries. Unless you have a specific use case that demands otherwise, I highly recommend this approach for analytics. The goal of any technology should be to make your users’ lives easier, and this approach does a great job of that.
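As a rough sketch (the table and column names here are hypothetical, not from any particular system), a payment star schema might look like this: one fact table surrounded by dimension tables that reporting queries join to.

```sql
-- Hypothetical dimensional sketch: one fact table surrounded by dimensions.
-- (Table and column names are illustrative only.)
CREATE TABLE dim_customer (
    customer_key   INTEGER PRIMARY KEY,
    customer_name  VARCHAR(100),
    segment        VARCHAR(50),
    city           VARCHAR(50)
);

CREATE TABLE dim_date (
    date_key       INTEGER PRIMARY KEY,
    calendar_date  DATE,
    month_name     VARCHAR(20),
    calendar_year  INTEGER
);

CREATE TABLE fact_payment (
    payment_key    INTEGER PRIMARY KEY,
    customer_key   INTEGER REFERENCES dim_customer (customer_key),
    date_key       INTEGER REFERENCES dim_date (date_key),
    payment_amount DECIMAL(10, 2)
);

-- A typical report query: total payments by customer segment and month.
SELECT d.calendar_year,
       d.month_name,
       c.segment,
       SUM(f.payment_amount) AS total_paid
FROM fact_payment f
JOIN dim_customer c ON c.customer_key = f.customer_key
JOIN dim_date d     ON d.date_key = f.date_key
GROUP BY d.calendar_year, d.month_name, c.segment;
```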

Another approach to data modeling is DataVault, which focuses on creating “hubs,” “links,” and “satellites.” The focus of DataVault is on adaptability and agility, though in my opinion it deviates too much from the goal of designing a system for humans.
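For completeness, here is a minimal, hypothetical sketch of what those structures look like: the hub carries the business key, the satellite carries its descriptive history, and the link ties hubs together.

```sql
-- Hypothetical Data Vault sketch (names are illustrative only).
CREATE TABLE hub_customer (
    customer_hk    CHAR(32) PRIMARY KEY,      -- hash of the business key
    customer_id    VARCHAR(20),               -- business key from the source
    load_date      TIMESTAMP,
    record_source  VARCHAR(50)
);

CREATE TABLE sat_customer_details (
    customer_hk    CHAR(32) REFERENCES hub_customer (customer_hk),
    load_date      TIMESTAMP,
    customer_name  VARCHAR(100),
    city           VARCHAR(50),
    PRIMARY KEY (customer_hk, load_date)       -- keeps full descriptive history
);

CREATE TABLE link_customer_payment (
    link_hk        CHAR(32) PRIMARY KEY,
    customer_hk    CHAR(32) REFERENCES hub_customer (customer_hk),
    payment_hk     CHAR(32),                   -- would reference hub_payment
    load_date      TIMESTAMP
);
```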

In the end, it’s important to pick the modeling strategy that best fits YOUR requirements. As with anything, there is no one-size-fits-all data modeling technique. If you are unsure which one to use, I would recommend starting with the bottom-up approach, since it is the most popular and the most conducive to reporting.

Source Data Storage

Another important topic is that of source data storage. Although this may seem pretty mundane, properly storing your retrieved data is vital to its integrity. It might seem easy to simply transform your data in flight or store a limited amount of history in your data warehouse. However, I strongly advise against this. Instead, I believe you should keep as much history in your warehouse as possible for two primary reasons:

  1. To allow immediate access to data that has not yet been transformed into your warehouse model. It can take a while to get data into the warehouse model, so preserving the raw history ensures your users can still get at it in the meantime. Your users may also need to dig through the raw data to figure out what they actually need in the warehouse model; keeping it around lets them do that data mining.
  2. To reload your warehouse model. By retaining the raw data, you can easily and quickly reload or add new data points to your warehouse model (see the sketch after this list). Without it, you might have to resort to reloading data from your source systems, or worse, your backup tapes! Obviously, there will be cases where you simply cannot store all of your source data. In that event, I would highly recommend offloading the data to a hot store like Hadoop (see, I’m not anti-Hadoop :)).
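As a purely hypothetical example of point 2, suppose you kept the raw extract history in a table called raw_crm_customer. Rebuilding a warehouse dimension then becomes a single pass over data you already have, rather than a trip back to the source system.

```sql
-- Hypothetical reload: rebuild a dimension from retained raw history
-- instead of re-extracting from the source system or restoring backups.
-- (Table and column names are illustrative only.)
TRUNCATE TABLE dim_customer;

INSERT INTO dim_customer (customer_key, customer_name, segment, city)
SELECT customer_id,                        -- natural key carried from the source
       TRIM(cust_nm),                      -- light cleanup applied on reload
       COALESCE(segment_cd, 'UNKNOWN'),
       city_nm
FROM raw_crm_customer;                     -- raw extract history kept in the warehouse
```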

Data Modeling Tools

Data modeling is one of the most important activities for any database. Although you may be inclined to save money and skip a data modeling tool, investing in one is worth it. A good data modeling tool makes your life easier and helps convey the model to non-technical business users.

There are three key abilities of a good data modeling program:

  1. The ability to display the physical model,
  2. The ability to display the logical model, and
  3. The ability to generate the necessary SQL from your changes to be included in deployment scripts.

In order to further illustrate the importance of these abilities, I’ll give a brief overview of the difference between a logical and a physical model. The logical model is the business view of your data: it uses plain English names and spells out otherwise nondescript acronyms. This is the model intended for consumption by human beings.

The physical model describes what your data actually looks like in your database, and it is the only model your technical developers care about. The idea is that you build your logical model, translate it into your physical model, and then generate SQL from the model so that your database matches the model after deployment.
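As a hypothetical illustration, the DDL a modeling tool generates from the physical model might look like the following, with the business-friendly names from the logical model shown here as comments.

```sql
-- Hypothetical tool-generated DDL: the physical model uses terse,
-- abbreviated names, while the logical model carries the business names.
CREATE TABLE cust_acct (                     -- Logical: Customer Account
    cust_acct_id  INTEGER PRIMARY KEY,       -- Logical: Customer Account Identifier
    cust_nm       VARCHAR(100) NOT NULL,     -- Logical: Customer Name
    acct_open_dt  DATE,                      -- Logical: Account Open Date
    acct_stat_cd  CHAR(1)                    -- Logical: Account Status Code
);
```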

Cheaper tools will often forgo the logical model. However, if you have the budget, I would advise getting a tool that supports it! It will spare you a million questions about what the abbreviations and acronyms in your physical model mean. The most important benefit of using a modeling tool, though, is that it forces you to really think through what your data looks like, and then lets you see it visually.

Endnotes

Figuring out how you are going to model your data is vital to how your users perceive your warehouse: the model is what they interact with when all is said and done. Aside from query performance, the model will be the thing that makes analysis easier, or harder…

Stay tuned for part 3!
