Hello this is my first medium article, this article purpose is a notebook from my Data Warehousing in AWS course. So let’s begin 😊.
Before I’m going to data warehousing, the first thing we need to know is basic about databases. In databases there are 2 types of database which is OLTP and OLAP.
Online Transaction Processing (OLTP) is mostly a relational databases, collect and manages transactional and operational data. OLTP example is Point-of-Sale (POS) terminals.
Online Analytical Processing (OLAP) is mostly a relational databases, perform complex analytical queries, and the data is imported from transactional or systems.
Here is the main different of OLTP and OLAP:
So what is a Data Warehouse?
A data warehouse is:
- A central repository of business data from disparate sources.
- A type of relational database that enables analysis of data.
- A collection of approved and trusted historical corporate data.
Data warehouse also have a goals to provide access to data, ensure consistency and quality of data, and enable querying of the data in different ways using common BI tools.
Data warehouse have a small subset that called a data mart. Data mart is used to model a single subject scoped to a department or other small entity within the business. Data mart is often organized in a star or snowflake schema.
What is star and snowflake schema?
Star schema is organized around a central table — called the fact table — that contains measurements for a specific event, such as the sale of an item. The fact table contains foreign key relationships to one or more dimension tables, which contain primary keys and descriptive attributes for the sale, such as the customer or product.
Snowflake schema is similar to a star schema, but the dimensions are split up to further normalize database normally form to 3NF.
when we discussing about data warehouse we also talk about big data, you know what big data is?
Big data is high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. — Gartner
So now we can talk about data warehouse in AWS.
AWS have a tools that called Amazon Redshift that take care of many challenging task of deployment and ongoing maintenance of data warehouse. in Amazon Redshift we can integrate various data source into the Redshift.
There are 2 ways of interacting into Redshift
Amazon Redshift built on top of Massively Parallel Processing (MPP) architecture, MPP is a database architecture in which a master node (called a leader node in Amazon Redshift) distributes work to one or more slave nodes (called compute nodes in Amazon Redshift) in the database cluster.
When we create a database a traditional database store their data in a row-based.
Amazon Redshift have a different ways to store their data, Redshift using a columnar storage to make scanning larger data sets and complex analytics queries more fast and efficient.
This is an example how columnar storage perform faster than a row storage. Suppose you want to query on values in column_2 (represented in blue in the diagram). When data is stored by row, the database engine must scan through all of the values for each row to obtain the value from column 2 in each row. However, with columnar storage, Amazon Redshift must only locate the block (or blocks) containing data for column_2 to return the requested values.
When we create a data warehouse we must plan how big the cluster is and many thing you must consider like…
There are 2 type of cluster in AWS.
When using Amazon Redshift you can define how to launch your cluster, Amazon provide this settings using Amazon VPC
Maybe I will split this article in to 2 article, so the next article will talk about how to load data into the redshift and so on. You can find the next article here.
Thank you for reading
- Amazon Training Course