A Complete Beginner’s Guide to Data Warehousing
When I was first about to meet Ahmed Elsamadisi, the founder and CEO of Narrator, I decided to actually learn about data warehousing, star and snowflake schema, and all that fun stuff. But then I realized, there weren’t any quick guides for beginners like me. So I decided to make one.
Before we go any further, I need to share something critical that I learned while talking to Ahmed about the entire data world. Definitions won’t get you anywhere. You need to understand the problems that are present in the data warehousing space and address them and knowing just knowing what a star schema is won’t help you do that.
So just want to learn some definitions and basic ideas? Read Part 1.
For a more in-depth look at the issues currently with the data warehousing field and how to shift your mindset to become better at analysis, go to Part 2. Or do both!
Part 1: The Bare Bone Basics
Data Warehousing: storing and connecting all crucial business/company data in a single space
The main purpose of organizing data in a go-to, easy-to-access way is so that it can be analyzed later on. Or in fancier terms, it should be able to support Business Intelligence (BI) actions later on, with the most important aspect being analysis. There’s no point in having huge amounts of data if it’s super messy and can’t be analyzed for optimization! That’s where the two most popular data warehousing models come in, star schema and snowflake schema.
Star Schema: a data warehousing model that consists of fact and dimension tables, where the dimension tables are denormalized
Let’s unpack that definition. In this model, you have pieces of data called measures and others called dimensions. The measures are numerical data like the number of units sold or the number of emails opened. Dimensions give context to the measures. The dimensions for the aforementioned examples could be UnitSales and OpenedEmails, respectively.
These measures are stored in fact tables and dimensions are stored in dimension tables. The fact tables have dimension key columns (that relate them to the dimension tables) and numeric measure columns, which store the measures.
Fact tables come in 3 distinct flavours.
- Transaction fact tables: these store data about a specific, one-time event so the data is only stored once
- (Periodic) snapshot fact tables: as the name suggests, these store snapshot data or data at a given point in time, like the number of purchases per customer at the end of the quarter. This means the data is stored multiple times at regular intervals.
- Accumulating snapshot tables: and these store accumulating data, or data over time, like the total number of units of a certain product sold over time
Dimension tables can vary a lot across industries and companies but here are some of the most popular types.
- Time dimension tables: these store the time at which the related measure in the corresponding fact table was true
- Geography dimension tables: these store, you guessed it, geographical data like city, country, postal/zip code, etc.
- Product dimension tables: these store information about the product like the productID, product name, and brand name
- Employee dimension tables: these store data about the employees like employeeID, employee name, address, etc.
- Range dimension tables: these store ranges of data like a time period or a range in price, which can help report later on
This is cool and all but what the denormalized thing from the definition before mean? Your data can be normalized or denormalized. Normalized data is when the data’s stored in multiple tables with relationships linking them to each other. It helps maintain more of the data’s integrity and reduces redundancy but it slows things down since the computer has to jump from one table to another, multiple times. Denormalized data which is how star schema’s organized is when everything’s organized into one main table along with some other dimension tables. The real difference is that there are only direct links to the fact table. However, normalized data can have multiple links.
Things should be clicking into place now as to why this model is called star schema. It’s because it looks like a star with a central fact table and dimension tables “fanning out” from it.
Snowflake Schema: a multidimensional data warehousing model that consists of fact and dimension tables, where the dimension tables are normalized
It’s helpful to think of a snowflake schema as having the same so-called base as a star schema. It also has a central fact table and dimension tables connected to it. However, to reduce redundancy, optimize storage, and generally clean things up, snowflake schemas have dimension tables connected to their first layer of dimension tables.
Don’t be scared off by the jargon. Here’s how Ahmed would explain a star schema to a 7-year-old!
Imagine having a bunch of excel sheets with information in columns. And you have a laser pointer that can connect the information in one table to the others.
Part 2: Understanding What’s Wrong
You’re here because you want to understand data warehousing on more than just a surface level. I’m proud of you!
Let’s start with something from the real world. Ahmed, what’s the #1 biggest problem you faced as a data engineer with the current models (star and snowflake schema)?
Think of the connections between the different tables as keys unlocking relationships. What happens when this key doesn’t exist, when it doesn’t have that connections info.
Let’s say one tab is calls and another is website visits. When you want to find the person who visited the website and then called, what do you do? Nowadays those keys aren’t really there.
But the issue goes deeper than that. It’s about asking the right questions from the data to get actually useful insights that can be used to garner growth.
Let’s do an exercise that Ahmed showed me to see how flawed your thinking really is.
What’s your most used app? I’m sure you picked something productive, but let’s go with Instagram this time. If you were working for Instagram, how would you get people to use it more often?
Think about it for a second then scroll.
Did it have something to do with adding a new feature? “If they added more reactions for posts, I’d use it more!” But take a minute to think it over. If that feature wasn’t there, would you stop using Instagram… and would it really increase your use?
Right now, a lot of us are thinking of features and little gimmicks but you need to think about what drives behaviour. Some of you might think we’re deviating from data warehousing but a large part of dealing with data is understanding it because like most anyone in the data analysis field will tell you, data will always be messy and you will always be asked questions like “how should we promote growth?” You need to comprehend the data to answer those questions!
To finish off, Ahmed what would your advice be to a beginner interested in the field?
It’s about understanding the problems with data engineering, not just the definitions. You have to actually work with data and see patterns between the gaps. That’s what separates a senior engineer from a junior one.
Cool Resources To Learn More
In computing, the star schema is the simplest style of data mart schema and is the approach most widely used to develop…
Transactional Fact Tables
Transactional fact tables are not as popular as periodic snapshot fact table. In this article, I would like to compare…
- Watch this great video (on 1.5x speed) for a quick review of what a data warehouse is
- Another quick vid, this time talking about the main differences between star and snowflake schema
- Understand what’s wrong with current structures and try to fix them. Play around with data and get your hands dirty!
- Shift your thinking from a gimmick-based one to one focused on behavioural shifts.
- And most importantly, follow your curiosity; may the data ever be in your favour!