Building Data Warehouse. Part 4: Using data warehouse
On my previous article we stopped on setting up the data warehouse database.
Today I will explain how you can use it to get a value for company business. Also we will discuss data pipeline architecture and its variations.
This is the last article of a series about company data warehouse. In this series I explain:
- WHY you should build a company data warehouse
- WHAT preparation you have to do
- HOW you can build it
- HOW you can use it. (👈 we’re here)
Layers of pipeline architecture
During this articles I would like to introduce some terms that I skipped at the beginning of these series as I didn’t want to make things complicated for you. But as you are almost a data engineer let deep dive into some new terms that will help you to understand the end of this article.
First is Layer. Layer is a logical part of data pipeline. Usually data pipeline is divided into 3 layers:
- Data source layer
- Data storage layer
- Presentation layer
As we discussed data source and storage layers in my previous articles, it’s time to get acquainted with presentation layer.
Presentation layer
Presentation layer is the last building and most important block in the architecture. Frankly speaking it’s a layer for what we built the entire data pipeline!
It is also called a business intelligence (BI) interface and gives access to our data for end users, that are usually data or business analysts that use a presentation layer to create dashboards with data visualisation using BI tools, build business-reports, use data for data science/machine learning, custom report tooling, etc.
Let’s discuss every tool in detail.
Business Intelligence tool
BI tool is an instrument that helps you work with data from a business perspective:
- Exploring data
- Aggregate data (e.g group sales by month and find average)
- Visualizing data in any way: build charts, trend lines, pivot tables
- Build dashboards with charts and metrics (e.g sales dashboard with sales by day, sales by month, top sales companies, pending sales, etc)
- Set alerts on critical data parts (e.g if sales count is less than 100 at the end of the day — send me email of slack notification)
There are plenty of tools on the BI market that you can use: Tableau, Microsoft Power BI, Metabase, Apache Superset, Qlik, etc.
As usual the choice depends on different factors:
- Price (there are great open source alternatives such as Metabase and Apache Superset)
- Ecosystem you use (e.g. if your company uses Microsoft products Power BI looks good choice for you)
- Provided set of functions (Tableau is very powerful and flexible)
- etc
Business/custom reports
BI tool is a powerful instrument but, of course, it cannot cover all possible needs. For example, if your client wants to get a specific report every Monday on email it’s the case.
As you have cleaned and well-organized data in one place you can build any reporting tools or systems using in-house team or hire a freelancer for this purpose.
Data Science
If you have ever built data science systems you know that first and the most important part of building system is collecting, cleaning and organizing data. But wait, we have already done it during building the data warehouse. That’s why it is prefect for data science tasks.
Architecture variations
In my second article I said that I would explain the standard way to build data pipeline architecture but I can leave you without briefly explaining the others.
Tiers of pipeline architecture
There is another term that I want to share: N-tier architecture. It says about number of stages that you have in your data storage layer.
Let’s overview every architecture by examples.
Two-tier architecture
That’s what we built in these series of articles. This architecture is called two-tier because we have 2 databases in storage layers: data lake and data warehouse.
One-tier architecture
In this architecture you build only a data lake and connect the presentation layer to it.
There are bunch of reasons to select this architecture:
- No need to create data warehouse (saves money, time and effort to build and support it)
- Modern BI tools can work with unstructured data (for example Amazon Athena can query data in json files stored in S3 buckets)
- ETL jobs can process unstructured data and store handled data back to data lake
This approach became very popular nowadays and has its own name Lakehouse.
Three-tier architecture
According to this architecture you should create the next stage in the data storage layer that is built from a data warehouse. This stage is called Data mart.
Data Marts are created for a special subdomain of your product, for business function or group of people. For example, there might be a data mart that contains information only about product and its deliveries and another data mart with whole company information for the financial department.
On the image above I drew data mart as a separate database, but in reality data mart is usually created by using data warehouse views.
Summary
Presentation layer gives access to data for end users through BI tools, reporting systems and data science.
Data architecture might be one, two and three tiers depending on how much stages your data storage layer has.
My congratulations, you are a data engineer now! I’m joking. This rabbit whole is deeper than you think and we didn’t cover many nuances and pitfalls but I gave you strong fundamentals which help you to build data architecture at any scale.
Thank you for reading! 🙏