Understanding the foundations of Data Warehouse and AWS Redshift
In this article we will be focusing on the foundations of Data Warehouse like What is a Data warehouse? And then we will look at one of the managed Data Warehouse services in the cloud called Amazon Redshift. We will also be looking at some of the basics concepts for it and see some comparison for on-premise and cloud Data Warehousing solutions.

MS Excel is one of the most widely used tools by Analyst across the globe. It gives you the opportunity to store and analyze data in a reader friendly tabular format. It’s used by Analyst, Accountants, developers and Experts alike. But as we are storing more and more data MS Excel is reaching a limit. You can view up-to 1 million rows in excel. But when we need to store data consisting of many years Databases or Data Warehouse comes into play.
Before moving any further let’s look into these terminologies a bit and then we will move to Redshift and its capabilities.
Databases, Schema and table
Databases
As per Wikipedia definition
“In computing, a database is an organized collection of data stored and accessed electronically from a computer system.”
What it means is that database is a collection of data which can be in the form of text, images, files, audios, videos etc. Databases can be structured and unstructured. A structured database will be very similar to an Excel worksheet and can be manipulated using a high-level language for example SQL. Unstructured Databases require more in-depth knowledge of data and other than some exceptions can be converted to a Schema.
There are different types of Databases:
- Relational Databases
- NoSQL Databases
- Graph Databases
- Object Stores
- Distributed Databases
Our focus will be will be on Relational Databases for this article. Relation Databases follow a rigid structure on write. Meaning when data is being written to such systems it must conform to a structure to be ingested. This makes it most widely used type. It eliminates the possibility of some unknown value passing all the way to your end user application. Relational Databases use keys to identify a record.
Tables
A table consists of related data in the form of vertical and horizontal lines. Vertical lines are called columns and horizontal line are called rows. Each entry or row in the table is called a record. One or more columns help uniquely identify a record in a table.
Schema
A Schema in a database represents the logical relationships in databases for one or many parts of the business. If an organization has a production database which holds data across organization, then the logical relationship for one of the segments businesses for example marketing is called Schema for that organization. This logical representation helps us in complex analysis across multiple segments.

Data Warehouse
A data warehouse works as a central repository of historic data from one or more data sources. Generally, when talk about Data Warehouse, it works as an OLAP system on top of multiple OLTP systems across an organization.
It’s a strictly logical concept meaning we connect the data with the help of Schemas and tables. Data Warehouse are created to answer complex analytical queries with the help of a high-level language. Most commonly used language in a data warehouse is SQL.
Some of the popular data warehousing solutions are as follows:
- Amazon Redshift
- Snowflake
- IBM Db2
- Microsoft Azure Synapse
- Teradata
Our focus will be on Amazon Redshift.
There are mainly 3 types of schemas that are followed in Data Warehousing:
- Star Schema

2. Snowflake Schema

3. Galaxy Schema
Amazon Redshift
“Amazon Redshift is a fully managed petabyte-scale data warehouse service in the cloud.”
Being a fully managed service means that AWS is responsible for monitoring, scaling, security, patching, backups and underlying hardware. You can focus on making a robust architecture for your products and services. User is responsible for creation of databases, schema, settings and performance tuning.
Amazon Redshift is based on PostgreSQL. It’s specifically designed for online analytical processing (OLAP) and business intelligence applications which require complex queries against large datasets. It’s still different from some other PostgreSQL implementations. For example, one of the major differences is OLTP systems usually store data in rows but Amazon Redshift stores data in columns, using specialized data compression encodings for optimum memory usage and disk I/O.
Data in Amazon Redshift is automatically backed up to Amazon S3, and Amazon Redshift can asynchronously replicate your snapshots to S3 in another region for disaster recovery.
There are multiple features that enhance the reliability of your data warehouse cluster. For example, Amazon Redshift continuously monitors the health of the cluster, and automatically re-replicates data from failed drives and replaces nodes as necessary for fault tolerance. Clusters can also be relocated to alternative Availability Zones (AZ’s) without any data loss or application changes.
On an on-premise solution we manage the hardware, infrastructure, security and then upgrade or replacement of the machine. But in cloud since a customer doesn’t own any of these things everything happens behind a wall especially in managed services. Scalability is a very big plus point since these systems scale automatically based on your usage or you can manually intervene and scale them up if required. Since you don’t have to manage hardware operations you can better focus on your products and customers.
Conclusion
Organizations built a data warehouse for historic safe keeping of their data assets. There are many flavors in data warehouse solutions. And many different strategies to implement them. The choice relies on your need and the type of work you will be doing. Databases and tables are largely the building blocks of a data warehouse. The logical relation is presented in the form of schemas. You can either buy the machines and own everything in your possession and build a warehouse or you can build one through a cloud service provider. Amazon Redshift is a fully managed data warehouse solution by AWS on the Cloud.