What is a Data Warehouse?
Usually when we speak of Data Warehouse we have two different perspectives. One is the business perspective and the other is the technical perspective.
Business Perspective
An appropriate data warehouse should be able to support business activities. Business activities include: customers able to find the products when shopping, inventory staff should be able to stock or retrieve items for certain goods, HR team should be able to access staff performance and marketing team should be able find an effective sales channel. An effective Data Warehouse is able to store all these information but in different tables, where a table represents different business activity. These tables can be further divided into sub tables when data is normalised.
So, we know we need to separate the tables based on the business activity. But, on the higher level we can make separation of these tables easier by differentiating between Operational and Analytical business processes.
Operational business processes are simply the key activities which must be performed to ensure continuity of the business such as selling of products. Whereas, Analytical processes as the name implies means analysing the performance of the business activities. The motto of Operational processes is to serve the customer.
The motto of Analytical processes is to create reports and implement analytics. The motto of Operational processes is to serve the customer so the parties involved in Operational processes do not really care about reporting or analytics.
Data is created during the Operational process because if there no product or service was sold then no data is created to be analysed. Therefore, the data goes into the warehouse from Operational processes and data comes out of the warehouse to do the analysis.
However, having all the workload of data joined on the same database is hard and inconvenient because it would slow down Analytical processes. The solution to solve the problem of speed gave birth to “Data Warehouse”.
Data Warehouse is a system which helps to implement Analytical processes in the most efficient manner
So, Date warehouse in a nutshell pulls in data from Operational processes and then throws the information to another database system which would make it easier to do the analysis.
This is where OLTP and OLAP comes in place. OLTP stands for Online Transaction Processing which simply stores data from Operational activities. Online Analytical Processes (OLAP) stores simplified data to make it easier to analyse data for reporting. So, OLTP helps to bring in data and OLAP helps to take out data to external tools to analyse the data such as Tableau or Python.
Technical Perspective
Similar to business perspective, the technical perspective of a Data Warehouse is that it is a copy of transaction data that was made to query and analyse the data in the most efficient manner. In other words, Data Warehouse was made with Analytical Processes in mind.
Data Warehouse stores dimensional (qualitative data in columns) and normalised (not duplicated) data. Within the Data Warehouse the data is updated in batches not in real-time.
Within business perspective, we talked about OLTP and OLAP databases and we mentioned that how data is transferred from one system to another for analytics to take place.
However, we did not mention the names or how it is done. It will be explained in this technical section. Basically, the process to transform data from operational to analytical processes is called ETL which stands for Export Transform and Load. ETL basically exports data from the source system used for operations, transform the data and loads it into a Dimensional model.
The dimensional model’s main advantage is that it helps to improve query performance. Then the data loaded onto the Dimensional Model is used to analyse and visualise data using Business Intelligence tools like Tableau or Power BI or programming language like Python or R.
How ETL Works
The process is straightforward but how does it do it. I got a screenshot from online best describing the approach. So, before we explain ETL we need to understand what is Normalisation. Normalization is simply the process of eliminating data redundancy and organising the data in the database.
However, it is a multi-step process to removes the duplicates from the “table” not “data”. Notice I mentioned table not data because we do not want to remove/delete duplicates it just that we want to separate tables wherever we see a duplicate value. More about normalisation will written in my future article as the discussion on normalisation is a different topic in itself. But for now all you need to know is that Normalization occurs in 3 levels or forms.
Data in OLAP tool is not NORMALISED
So, the data we have in OLTP tool is normalised as it records every transaction. Now if we have to visualise data from normalised form it can be very difficult. So, what ETL does is that it changes the data from normalised to de-normalised by joining the tables and if necessary adding new columns then loading it into facts and dimensions tables. More about facts and dimensions tables will be mentioned later in this article.
Dimensional Model
The ETL process is straightforward in sense that it gets data and outputs it to another source. However, it is not clear what is a Dimensional Model or how it works.
The schema of OLTP is something like below which is all over the place. It is called 3NF Schema as it is normalised to 3rd normal form. It is great to keep record of transactions in real-time but to do any analysis it will be a horror story. If you want to find one specific information (e.g. customer email-address who shopped yesterday in Milan) you can imagine the horror you will have to go through by joining multiple tables. Thus, it is not adequate and time consuming for OLAP.
Instead, we want to work with a schema then it is easy to join tables easily to retrieve information quickly and we call this schema Star Schema.
Star Schema joins the tables based on dimensions. Dimensions is basically any qualitative data (customer name, address, product). Basically anything which is not a number is a dimension.
Fact and Dimension Tables
Fact Table
As we have lot of data to take into account it would be great to differentiate the data even further during OLAP process and we differentiate the data based on two types, Fact and Dimension tables. A fact table is something which has occurred in the past but it mostly records measures/numerical data like an order book.
In other words imagine you are a salesman who goes around each house to make a sell. Whenever you make a positive sale, you would record what product was sold at what price and how many quantities of that product was sold to give total figure.
Dimension Table
A dimension table keep records of business events by taking into account the 4Ws; who, what, where, why. In other words it takes into account non-numeric. Eg. the dimension table will record the name of the salesmen, the house address where he/she made the sale, name of the customer and their email address etc.
The dimension tables column names are known as attributes because it is describing an entity. E.g. who the customer is, where the purchase was made, by whom.
Previously it was mentioned that Star Schema is used in OLAP but it is based on dimensions. Not it does not mean that measures (non-dimensions) are not included but it just means that it revolves around dimensions.
So, how it works is that the fact table is in the middle and the tables surrounding the fact table are the dimension tables. E.g, say you have movies database, so you would put any quantitative values table aka fact table in the middle and have all the attributes describing this value around it.
More will be written on Data Warehousing later as it would be too much to write in one article.