Introduction to AWS Redshift for Modern Data Warehousing

Arlene R
25 min read · Jun 6, 2023


What is Big Data?


Big Data is a term used to describe extremely large and complex datasets that cannot be easily processed or analyzed using traditional data processing tools.

With the proliferation of digital technologies, we generate vast amounts of data from various sources such as social media, sensors, online transactions etc. Big Data encompasses all this information, and it continues to grow rapidly.

The characteristics of Big Data are often described using the 3Vs: Volume, Velocity, and Variety. Taking Big Data from social media as an example:

Volume refers to the sheer amount of data that is being generated and collected every day, with billions of posts, interactions, and user actions on social media platforms. The vast amount of information captured cannot be stored and processed using traditional methods like spreadsheets or regular databases.

Velocity refers to the speed at which data is being generated and collected. Data is continuously generated and collected in real-time, with users posting, liking, and sharing content at a rapid pace. This results in a constant stream of data that needs to be processed quickly to derive insights and value.

Variety refers to the different types of data that are being generated and collected, including structured, semi-structured, and unstructured data. Today, we are also collecting unstructured data such as images and videos, in addition to structured data like text and numbers. This variety makes it challenging to organize, process, and analyze the data effectively.

To manage and derive value from Big Data, specialized tools and techniques are required.

Importance of a Data Warehouse

In today’s world, businesses are generating more data than ever before. While this data offers a wealth of information, making sense of it and extracting valuable insights can be a challenge without effective organization and analysis. This is where data warehousing comes in.

For organizations that need to manage and analyze large amounts of data, a data warehouse is essential. It enables them to make informed decisions based on the data, by providing a comprehensive view of the organization’s data in one place.

A data warehouse provides the following:

Centralized Data Storage: With a data warehouse, all data is stored in one place, making it easier to manage and analyze. This eliminates the need for businesses to search through multiple sources to find the data they need.

Data Integration: A data warehouse allows businesses to integrate data from various sources, including applications, relational databases, and external sources. This makes it possible to combine data from different systems and gain a more complete view of business operations.

Efficient Analysis: With a data warehouse, businesses can perform complex queries, data analysis, and reporting to derive actionable insights. This enables them to make informed decisions based on their data.

Scalability and Performance: A data warehouse can handle large datasets and provide high-performance processing. This makes it possible for businesses to store and analyze vast amounts of data, even as their needs grow over time.

Traditional Data Warehouse Challenges

Traditional data warehousing solutions face many challenges that make them insufficient for managing and analyzing Big Data. Some of these challenges include:

  1. Lack of Scalability: Traditional data warehousing solutions were not designed to handle the increasing volume and velocity of data that is generated by Big Data sources. As a result, they often struggle to keep up with the demands of processing and analyzing large datasets.
  2. Lack of Data Integration: Big Data comes from various sources with different formats and structures. Traditional data warehousing solutions were not designed to handle this level of data integration, making it difficult to combine and analyze data from different sources.
  3. High Cost: Building and maintaining the infrastructure for traditional data warehousing solutions involves high upfront and ongoing costs. This can be a significant barrier for organizations that want to leverage Big Data for insights and decision-making.
  4. Lack of Agility: Traditional data warehousing solutions are often inflexible and not designed to adapt quickly to changing business requirements. This can make it difficult to keep up with the evolving needs of the organization.
  5. Low Performance: Traditional data warehousing solutions can struggle to process large and complex queries, leading to slow performance and delays.
  6. Lack of Real-time Processing and Analysis: Traditional data warehousing solutions are often not designed to provide real-time processing and analysis, which can lead to missed opportunities and delays in decision-making.

To address these challenges, new technologies and approaches have emerged.

Modern Data Warehousing

A data warehouse is a central repository of information that enables organizations to manage and analyze large amounts of data. It involves collecting, storing, and managing data from various sources into a centralized location. This allows businesses to make more informed decisions by analyzing data.

Modern data warehousing solutions have overcome the challenges associated with traditional data warehousing, such as scalability and cost-effectiveness. These solutions use cloud-based technologies to provide great flexibility, scalability, and cost-effectiveness in managing large amounts of data.

With modern data warehousing, organizations can store and analyze vast amounts of data, enabling them to make better decisions, respond quickly to changes in the market, and stay ahead of the competition.

Modern Data Warehousing Solutions

There are several popular options for building and managing a data warehouse, including Microsoft Azure Synapse Analytics, Snowflake, Google BigQuery, and AWS Redshift.


This blog will focus on Amazon Redshift.

When comparing AWS Redshift to other data warehousing solutions, it is important to consider factors such as scalability, cost, ease of use, and integration with other tools.

Introduction to AWS Redshift


AWS Redshift is a cloud-based data warehousing service that allows businesses to store and analyze large amounts of structured and semi-structured data in a scalable and cost-effective manner. It is designed to handle petabyte-scale data processing and analysis tasks and is a fully managed data warehouse service provided by Amazon Web Services (AWS).

Traditional relational databases are great for handling moderate amounts of data, typically in the range of 100GB. However, when it comes to processing extremely large datasets of petabyte-scale, they may struggle to handle and process the data efficiently. This is where data warehousing solutions like Redshift come into play.

Redshift is built on a columnar storage architecture, which enables faster query performance compared to traditional row-based storage systems. It uses techniques like data compression, parallel processing, and query optimization to deliver high-speed performance for complex analytical queries.

AWS Redshift Customers


Some of the companies that are actively using Redshift include Warner Brothers, KFC, McDonald’s, Toyota, Nasdaq, and Aramex.

The usage of Redshift is not limited to one industry; it has the flexibility to adapt and cater to diverse industry requirements.

● Pizza Hut uses Redshift to analyze revenue, transactions, and customer engagement scores, enabling data-driven decision-making for business growth and enhancing customer satisfaction.

● KFC uses Redshift to predict menu item demand, track sales trends, and make informed decisions to improve inventory management and optimize their menu offerings.

Leveraging Redshift, the aforementioned fast-food giants can analyze large amounts of data and make data-driven decisions that improve their business operations and enhance the customer experience.

Use Cases for AWS Redshift

AWS Redshift is suitable for a wide range of use cases, including:

Data Warehousing: Redshift can be used as a centralized repository for all enterprise data, enabling organizations to store and manage large volumes of structured and semi-structured data.

Business Intelligence: Redshift can help organizations process and analyze large volumes of data to uncover insights that can inform business decisions. The technology can handle complex queries and provide visualizations to help users better understand data.

Machine Learning: Redshift can be used as a data source for machine learning applications, providing access to large volumes of structured and semi-structured data that can be used to train machine learning models.

Data Analytics: With Redshift, organizations can analyze large volumes of data to identify patterns, trends, and anomalies. This can help with predictive modeling, fraud detection, and other applications that require processing and analyzing large amounts of data.

Reporting: Redshift can be used to generate reports that provide insights into business performance. These reports can be customized to suit different business needs and can be generated in real-time, providing up-to-date information to decision-makers.

AWS Redshift Options

AWS Redshift offers two options for businesses to choose from:

● Provisioned Cluster

● Redshift Serverless

Provisioned Cluster allows businesses to manage the infrastructure, including the hardware and software, giving them full control over the configuration and management of the cluster. This option provides maximum flexibility and control, and is ideal for businesses that have the expertise and resources to manage their own infrastructure.

Redshift Serverless, on the other hand, allows businesses to get started without needing to manage the infrastructure. AWS manages the infrastructure, including the hardware and software, and automatically scales capacity up or down based on usage. This option is ideal for businesses that want a more hands-off approach to managing Redshift.

The choice between a provisioned cluster and Redshift Serverless depends on the specific use case and requirements of the organization. Businesses should evaluate their needs and pick the option that best fits them, whether that is maximum control or a more hands-off approach.


AWS Redshift Architecture


The architecture of AWS Redshift is designed to provide businesses with a comprehensive solution that includes provisioned cluster and serverless options, integration services, and managed storage.

The cluster and serverless options let businesses choose the deployment model that best fits their needs, while the integration services make it easy to connect data in Redshift with other AWS services.

AWS Redshift Provisioned Cluster


The provisioned cluster architecture enables Redshift to deliver high-performance query execution and efficient data storage. It consists of a cluster with a leader node and multiple compute nodes.

Cluster Architecture

A Redshift cluster is the core component of the Redshift architecture. A cluster is a collection of computing resources called nodes, which are organized into a group called a “cluster”. The cluster is responsible for managing the data warehouse and processing queries. Each cluster has a unique endpoint, which is used to connect to the cluster and interact with the data warehouse. The client applications can connect via JDBC or ODBC connections to interact with Redshift.

Nodes and Node Types

There are two types of nodes in a Redshift cluster:

Leader node: The leader node is responsible for managing client connections, parsing and optimizing incoming SQL queries, and coordinating the execution of these queries across the compute nodes. There is only one leader node per cluster, and it is automatically provisioned by AWS.

Compute nodes: Compute nodes are responsible for storing data and executing queries. Each compute node has its own dedicated CPU, memory, and storage resources. The number of compute nodes in a cluster can be scaled up or down based on your data storage and processing requirements. Redshift offers different node types, each with varying levels of performance and storage capacity, allowing you to choose the most suitable option for your use case.

Node Slices

A compute node is partitioned into slices. Each slice is allocated a portion of the node’s memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation. The number of slices per node is determined by the node size of the cluster.
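On a provisioned cluster, you can see how slices map to nodes by querying the STV_SLICES system table. A minimal sketch (this assumes a provisioned cluster and a user with access to system tables):

-- View the current mapping of slices to compute nodes
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;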

AWS Redshift Serverless


AWS Redshift Serverless is a powerful solution that provides businesses with the flexibility and scalability they need to manage large amounts of data. By leveraging the intelligent and dynamic compute management of Redshift Serverless, businesses can optimize their infrastructure and gain insights from their data to make informed decisions.

Redshift Serverless separates compute and storage and introduces two new concepts: the workgroup and the namespace.

Workgroup

A workgroup is a collection of compute resources. It groups together compute resources such as RPUs, the VPC and subnets, and security groups.

The compute capacity of a workgroup is measured in Redshift Processing Units (RPUs):

  • Capacity is assigned in RPUs.
  • Configure a base capacity in RPUs — the minimum capacity a query will use. 1 RPU provides 16 GB of memory.

Namespace

A namespace is a collection of database objects and users. It groups together data objects, such as databases, schemas, tables, users.


AWS Redshift Performance Features

AWS Redshift achieves extremely fast query execution by employing these performance features:

● Massively Parallel Processing (MPP): MPP is a technique used by AWS Redshift to process large amounts of data quickly. It works by breaking up a large data query into smaller parts and processing them all at the same time. This allows AWS Redshift to process the data in parallel, which makes it much faster than processing the data all at once.

● Columnar Data Storage: Columnar data storage is a way of organizing data in a database. Instead of storing all the data for one row together, it stores all the data for one column together. This makes it faster to search for specific data because you don’t have to search through all the rows to find it. It also allows for more efficient data compression because similar data is stored together.

● Data Compression: Data compression is a technique used by AWS Redshift to reduce the amount of storage space needed for data. It works by encoding redundant or repeated values in a more efficient way. This makes it faster to process the data because there’s less of it to read.

● Query Optimizer: The query optimizer is a component of AWS Redshift that analyzes data queries and chooses the most efficient way to process them. It looks at all the different ways to process a data query and chooses the fastest and most efficient way to do it.

● Result Caching: Result caching is a technique used by AWS Redshift to store the results of a data query so that if you need to run the same query again, it can use the cached results instead of processing the data all over again. This can save a lot of time and resources because the data doesn’t have to be processed again.

● Compiled Code: Compiled code is a way of optimizing code so that it runs faster. Redshift translates each query into machine code that can be executed directly by the processor, so the query doesn’t have to be interpreted every time it runs. AWS Redshift compiles and caches this code to optimize the performance of repeated queries.

Columnar Data Storage

Redshift stores data in a columnar format that is optimized for performance and scalability. Data is stored in columns rather than rows, which means that only the necessary columns are read during query processing. This makes queries run faster and makes it easier to store and manage large amounts of data.

The following illustrations describe how columnar data storage works and how that translates into efficiency when retrieving data into memory.

First Illustration: Row Storage

This first illustration shows how records from database tables are typically stored into disk blocks by row.


Second Illustration: Columnar Storage

The next illustration shows how with columnar storage, the values for each column are stored sequentially into disk blocks.


Columnar storage also takes up less disk space, because each block contains the same type of data, meaning it can be compressed into a specific format. Columnar storage is an essential factor in optimizing query performance and producing quicker results.

AWS Redshift Other Features

● Data backup and restore: Redshift provides automated backups of data in the cluster, with the ability to restore it to any point in time within a retention period, ensuring that data is always available and recoverable in the event of a failure or user error.

● Redshift Spectrum: Redshift Spectrum is a feature of AWS Redshift that enables you to run queries against data in Amazon S3, without having to load the data into Redshift first. This makes it possible to analyze structured and semi-structured data stored in S3 in open formats.

● Query monitoring and logging: Redshift provides monitoring and logging capabilities to help you identify and troubleshoot performance issues.

● Concurrency scaling: Redshift can automatically scale resources up or down based on query traffic, allowing for improved performance during peak usage periods.

● User-defined functions: Redshift supports user-defined functions (UDFs) in SQL and Python, allowing you to extend the functionality of Redshift to meet business-specific needs (see the example after this list).

● Data Encryption: Redshift provides encryption of data at rest and in transit, using industry-standard encryption algorithms, helping to protect data from unauthorized access and ensure compliance with regulatory requirements.
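To illustrate the UDF support mentioned above, here is a minimal scalar SQL UDF sketch. The function name f_fahrenheit_to_celsius is hypothetical; a similar function could also be written in Python with LANGUAGE plpythonu.

-- Hypothetical scalar UDF: converts Fahrenheit to Celsius
CREATE OR REPLACE FUNCTION f_fahrenheit_to_celsius (float)
RETURNS float
STABLE
AS $$
SELECT (($1 - 32.0) * 5.0 / 9.0)
$$ LANGUAGE sql;

-- Usage in a query
SELECT f_fahrenheit_to_celsius(212.0);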

AWS Redshift Benefits


● High Query Performance: Redshift’s columnar storage and parallel processing enable fast query performance over large datasets, making it easy to analyze and derive insights from large volumes of data.

● Scalability: Redshift can scale easily to handle data volumes ranging from a few hundred gigabytes to petabytes, making it a flexible and scalable solution that can grow with your business needs.

● Cost-Effective: Redshift offers pricing options that include on-demand and reserved instances, allowing users to pay for only the resources they use, making it a cost-effective solution for data warehousing. Its pay-as-you-go pricing model allows for flexible scaling.

● Ease of Use: Redshift’s web-based console allows users to set up and manage data warehouses, configure security settings, and monitor performance easily, making it easy to use and manage.

● Security and compliance: Redshift provides advanced security features and compliance capabilities. Its security features include data encryption at rest and in transit, as well as integration with AWS Identity and Access Management (IAM) for user and resource access control, while its compliance capabilities enable it to meet various regulatory requirements such as HIPAA, PCI DSS, and SOC 2.

● Integration: Redshift integrates with other AWS services and third-party tools, enabling users to build custom data analytics solutions that fit their specific needs and to connect Redshift with their existing data infrastructure and preferred analytics tools. Supported integrations include AWS services like S3, EMR, Glue, SageMaker, and QuickSight, as well as third-party business intelligence and analytics tools.

● Data Ingestion: Redshift provides multiple options for ingesting data, including direct loading from Amazon S3 using the COPY command, or streaming data in real-time using Amazon Kinesis Data Firehose, making it easy to load large amounts of data into Redshift for analysis.


Limitations of Redshift

While AWS Redshift is a powerful data warehousing solution, it does have some limitations, listed below:

● Limited support for unstructured data. Redshift is optimized for structured data and may not be the best solution for analyzing unstructured data. Additionally, the Redshift Spectrum feature cannot support unstructured data.

● Limited support for real-time data. Redshift is designed for batch processing and may not be the best solution for real-time data processing and analysis. Businesses that require real-time processing may need to consider other solutions, such as Amazon Kinesis or Apache Kafka.

Pricing Options

Redshift offers different pricing options to suit various needs and usage patterns:

1. AWS Redshift On-Demand Pricing:

No commitments to long-term contracts or upfront payments. Instead, you pay on an hourly basis for the provisioned capacity of the specific node type and quantity you have running. This pay-as-you-go model allows for greater scalability and cost control, as you only pay for the resources you actually use.

2. AWS Redshift Reserved Instances Pricing:

This commitment-based model requires you to commit to using AWS Redshift for either one or three years, in exchange for discounts of up to 75% off the On-Demand pricing. Reserved Instances offer three pricing options:

a. All Upfront: With this option, you pay the full amount upfront and receive a substantial discount, making it an ideal choice for businesses with predictable workloads and long-term usage requirements.

b. Partial Upfront: Here, you pay a portion of the amount upfront and receive a smaller discount. This option suits businesses that have a moderate level of predictability in their workloads but also require some flexibility.

c. No Upfront: With this option, you pay nothing upfront, but still benefit from a small discount. This is suitable for businesses that prioritize flexibility and have dynamic workload patterns.

3. AWS Redshift Serverless Pricing:

Redshift Serverless is a unique pricing model that offers automatic startup, shutdown, and capacity scaling based on your application’s needs. With no upfront costs or commitments, you only pay for the compute capacity you actually utilize. Redshift Serverless charges you solely when it actively processes workloads, allowing for efficient cost management and scalability.

4. Concurrency Scaling in AWS Redshift:

Concurrency Scaling is a valuable feature that enables you to scale the compute capacity of your Redshift cluster up or down to match your workload demands. By dynamically adding or removing resources as needed, you can ensure optimal performance and cost efficiency. This feature automatically removes unnecessary resources once they are no longer required, resulting in cost savings.

Best Practices for Redshift

Query Optimization Techniques

AWS Redshift provides several query optimization techniques to improve performance.

Data Distribution Style

Redshift distributes a table’s rows across the compute nodes based on the chosen distribution style, which minimizes data movement during query execution. When creating a table, specify a distribution style that determines how the data is distributed. There are 4 distribution styles available:

  1. Even: The data is distributed to all data slices in a round-robin fashion. The even distribution is appropriate when you are not using the table in any kind of joins. You can choose even distribution in case you are not clear on using KEY or ALL distribution.
  2. Key: The rows are distributed according to the values in one column. If two tables are distributed on the same column and when you join those two tables on the distribution column then the required data is available in the same data slice thus making collocated tables. The collocated tables improve the performance.
  3. All: A copy of the entire table is distributed to every node. ALL distribution multiplies the storage required and so it takes much longer to load, update, or insert data into multiple tables.
  4. Auto: When using the AUTO distribution style in AWS Redshift, the optimal distribution style for a table is automatically assigned based on its size. For example, if AUTO distribution style is specified, AWS Redshift initially assigns the ALL distribution style to a small table. When the table grows larger, AWS Redshift might change the distribution style to KEY, choosing the primary key (or a column of the composite primary key) as the distribution key. If the table grows larger and none of the columns are suitable to be the distribution key, AWS Redshift changes the distribution style to EVEN.
CREATE TABLE Person (id int, name varchar(100), age int ) DISTSTYLE EVEN;
CREATE TABLE Person (id int, name varchar(100), age int) DISTSTYLE KEY DISTKEY(ID);
CREATE TABLE Person (id int, name varchar(100), age int) DISTSTYLE ALL;
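To let Redshift choose and adjust the distribution style automatically, as described in point 4:

CREATE TABLE Person (id int, name varchar(100), age int) DISTSTYLE AUTO;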

Data Sorting Keys

Redshift organizes data on disk based on sort keys, which can improve query performance by reducing the amount of data that needs to be read. You can define one or more sort keys for a table, which determine the order in which data is stored. To have AWS Redshift choose the appropriate sort order, specify AUTO for the sort key.
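For example, reusing the Person table from the distribution examples above, a sort key can be declared explicitly or left to Redshift:

CREATE TABLE Person (id int, name varchar(100), age int) SORTKEY (age);
CREATE TABLE Person (id int, name varchar(100), age int) SORTKEY AUTO;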

Data Compression

Redshift supports several compression algorithms and automatically compresses data, which helps to reduce storage costs and improve query performance by reducing the amount of data that needs to be read from disk. ENCODE AUTO is enabled by default for tables and manages compression encoding for all columns in the table.

You can also use column encoding to further optimize data compression and reduce storage requirements. If you specify compression encoding for any column in the table, the table is no longer set to ENCODE AUTO.
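To see what encodings Redshift would recommend for an existing, populated table, you can run ANALYZE COMPRESSION, which samples the table’s rows and reports a suggested encoding per column (shown here against the Person table from the earlier examples):

-- Report suggested compression encodings for each column
ANALYZE COMPRESSION Person;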

Data Loading and Unloading

Data Loading

The primary method to load data into Redshift:

● COPY command: The COPY command is the most common and efficient method for loading data into Redshift. It loads data in parallel from Amazon S3, Amazon EMR, Amazon DynamoDB, or remote hosts using SSH. The COPY command supports various data formats, such as CSV, JSON, Avro, and Parquet.
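A minimal COPY sketch (my_table is a hypothetical table, and the bucket and IAM role placeholders follow the conventions used in the demo later in this post):

-- Load Parquet files from S3 into a table, in parallel
COPY my_table
FROM 's3://{bucket-name}/data/'
IAM_ROLE '{IAM-Credentials}'
FORMAT AS PARQUET;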

Data Unloading

The primary method for unloading data from Redshift:

● UNLOAD command: The UNLOAD command exports the result of a SELECT query to one or more files in Amazon S3. This command is useful for exporting large datasets from Redshift to other systems or for backup purposes.
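A minimal UNLOAD sketch, using the same hypothetical placeholders:

-- Export a query result to S3 as Parquet files
UNLOAD ('SELECT * FROM my_table')
TO 's3://{bucket-name}/export/'
IAM_ROLE '{IAM-Credentials}'
FORMAT AS PARQUET;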

Best Practices for Data Loading and Unloading

Best practices for data loading and unloading include using compression, splitting large files, and optimizing data formats.

● Use compression: Compress your data files before loading them into Redshift to reduce the amount of data transferred and improve load performance.

● Split large files: Break large data files into smaller chunks to enable parallel loading and improve performance.

● Optimize data formats: Use columnar data formats like Parquet or ORC, which are more efficient for Redshift’s columnar storage architecture.

● Use manifest files: When using the COPY command, create a manifest file that lists all the data files to be loaded. This ensures that Redshift loads all the specified files, even if some are temporarily unavailable (see the sketch after this list).

● Clean up staging data: After loading or unloading data, clean up any temporary or staging data in Amazon S3 to avoid unnecessary storage costs.
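For the manifest-file practice above, here is a sketch of what the manifest and the corresponding COPY look like. The file names are illustrative; the manifest is a small JSON file stored in S3 that lists each data file to load:

-- Contents of s3://{bucket-name}/csv/cust.manifest (illustrative):
-- {"entries": [
--   {"url": "s3://{bucket-name}/csv/customer_part1.csv", "mandatory": true},
--   {"url": "s3://{bucket-name}/csv/customer_part2.csv", "mandatory": true}
-- ]}

COPY my_table
FROM 's3://{bucket-name}/csv/cust.manifest'
IAM_ROLE '{IAM-Credentials}'
FORMAT AS CSV
MANIFEST;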

By following these best practices and using the appropriate data loading and unloading methods, you can efficiently manage data in your AWS Redshift data warehouse.

Hands-on Demo — Getting Started

A hands-on demo can help you understand the practical aspects of AWS Redshift Serverless. The demo covers prerequisites, creating a serverless workgroup and namespace, loading sample data, querying the data, unloading data, and cleaning up resources.

Setup Prerequisites:

Make sure you have an AWS account with the necessary permissions to create and manage Redshift resources.

1. An AWS account.

2. Create the IAM role to access Redshift:

  • Go to the Console and select IAM.
  • Go to the Roles pane on the left side under Access management.
  • Click on Create Role.

Steps to Follow:

STEP 1:

● Log in to the AWS Management Console and navigate to the Redshift console to create a serverless data warehouse.

STEP 2:

In the Dashboard.

● Create the Workgroup:

  • A workgroup is a collection of compute resources

● Set the capacity (RPUs):

  • Set the base capacity used to process your workload.
  • To improve query performance, increase your RPU value.

● Network and security:

  • Provide the VPC, Security Group and the Subnets.

● Create the namespace:

  • A namespace is a collection of database objects and users.

● Set the Permissions:

  • Set the IAM Role

● Finally Review and Create.

● Once created, it will show under Namespaces / Workgroups.

STEP 3:

Additionally, the following options are available.

● Workgroup -> Limit -> Usage Limit -> Manage Usage Limit

  • Define the maximum Redshift processing units (RPUs)

● Workgroup -> Limit -> Query Monitoring Rules -> Manage query limits.

  • Limit the resources used by a query.

● Namespace -> Data backup -> Snapshots

  • Create a snapshot.

● Namespace -> Data backup ->Recovery Points

  • Recover from a point in time.

STEP 4:

Click on Query Data and it will take you to the Query Editor.

● Queries can be executed in this editor.

● AWS Redshift supports standard SQL, including SELECT, INSERT, UPDATE, DELETE, and CREATE statements, as well as many SQL functions and operators. Redshift also provides some extensions specific to the service, such as window functions, user-defined functions (UDFs), and stored procedures. This SQL support allows you to use familiar query languages and tools to interact with your data in Redshift.
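As a small taste of that SQL support, here is a window-function sketch that computes a running sales total per restaurant over the orders table created later in this demo (note that Redshift requires an explicit frame clause when an aggregate window function has an ORDER BY):

-- Running total of sales per restaurant, ordered by date
SELECT restaurant_id, sales_date, total,
SUM(total) OVER (
PARTITION BY restaurant_id
ORDER BY sales_date
ROWS UNBOUNDED PRECEDING
) AS running_total
FROM sales_related_schema.orders
ORDER BY restaurant_id, sales_date;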

STEP 5:

Take a look at the connection, databases, schemas, and tables.

STEP 6:

Create a database by running the command.

CREATE DATABASE redshift_demo;

STEP 7:

Create a Schema by running the command.

● There are 2 types of schemas:

1. Local schema

2. External schema

CREATE SCHEMA IF NOT EXISTS sales_related_schema;

STEP 8:

Create an external schema.

● An external schema will be useful if you want to create an external table.

CREATE EXTERNAL SCHEMA external_sales_schema
FROM DATA CATALOG
DATABASE 'redshift-demo'
IAM_ROLE '{IAM-Credentials}'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

STEP 9:

Create a Table in the default schema by running the command.

● Provide the distribution style and the sort key.

CREATE TABLE IF NOT EXISTS {TABLE_NAME} (
column1 {datatype},
column2 {datatype},
column3 {datatype},
column4 {datatype}
)
DISTSTYLE {distribution style};

Tmp_Customer Table

CREATE TABLE IF NOT EXISTS public.tmp_customer (
customer_id integer NOT NULL,
first_name character varying(256),
last_name character varying(256),
email character varying(256),
gender character varying(256),
phone character varying(256),
date_of_birth date,
signed_up_date date,
active_status boolean,
PRIMARY KEY (customer_id)
)
DISTSTYLE AUTO
SORTKEY(signed_up_date);

Tmp_Orders Table

CREATE TABLE public.tmp_orders (
order_id character varying(256) NOT NULL,
restaurant_id character varying(256),
no_of_item integer,
net_total integer,
discount integer,
total integer,
customer_id character varying(256),
order_status character varying(256),
order_type character varying(256),
source character varying(256),
sales_date date,
PRIMARY KEY (order_id)
)
DISTSTYLE AUTO
SORTKEY(sales_date);

Tmp_Restaurant Table

CREATE TABLE public.tmp_restaurant (
restaurant_id character varying(256) NOT NULL,
restaurant_name character varying(256),
city character varying(256),
timezone character varying(256),
PRIMARY KEY (restaurant_id)
)
DISTSTYLE AUTO
SORTKEY(city);

STEP 10:

Create a table in the schema by running the command.

● To create a table in a specific schema, qualify the table name with the schema name.

Customer Table

CREATE TABLE IF NOT EXISTS sales_related_schema.customer (
customer_id integer NOT NULL,
first_name character varying(256),
last_name character varying(256),
email character varying(256),
gender character varying(256),
phone character varying(256),
date_of_birth date,
signed_up_date date,
active_status boolean,
PRIMARY KEY (customer_id)
)
DISTSTYLE AUTO
SORTKEY(signed_up_date);

Orders Table

CREATE TABLE sales_related_schema.orders (
order_id character varying(256) NOT NULL,
restaurant_id character varying(256),
no_of_item integer,
net_total integer,
discount integer,
total integer,
customer_id character varying(256),
order_status character varying(256),
order_type character varying(256),
source character varying(256),
sales_date date,
PRIMARY KEY (order_id)
) DISTSTYLE AUTO
SORTKEY(sales_date);

Restaurant Table

CREATE TABLE sales_related_schema.restaurant (
restaurant_id character varying(256) NOT NULL,
restaurant_name character varying(256),
city character varying(256),
timezone character varying(256),
PRIMARY KEY (restaurant_id)
) DISTSTYLE AUTO
SORTKEY(city);

STEP 11:

Create a table with compression encoding for columns.

● By default, ENCODE AUTO is enabled for the table, and Redshift manages the compression encoding for all columns automatically.

● If needed, you can specify the encoding type for a column by defining it alongside the column name; the table is then no longer set to ENCODE AUTO.

● From the UI, define the column name, data type, and encoding, or run the following command.

CREATE TABLE public.tmp_customer_compressed (
customer_id integer NOT NULL ENCODE raw,
first_name character varying(256) ENCODE raw,
last_name character varying(256) ENCODE raw,
email character varying(256) ENCODE lzo,
gender character varying(256) ENCODE lzo,
phone character varying(256) ENCODE lzo,
date_of_birth date ENCODE raw,
signed_up_date date ENCODE raw,
active_status boolean ENCODE raw,
PRIMARY KEY (customer_id)
) DISTSTYLE AUTO
SORTKEY(signed_up_date);

STEP 12:

Create an external table on S3.

● To create an external table on S3, an external schema is required.

● External tables do not support distribution styles or sort keys.

CREATE EXTERNAL TABLE external_sales_schema.customer (
customer_id integer,
first_name character varying(256),
last_name character varying(256),
email character varying(256),
gender character varying(256),
phone character varying(256),
date_of_birth date,
signed_up_date date,
active_status boolean
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://{bucket-name}/customer';

STEP 13:

Load data from CSV files in S3 into the tables.

● Prepare a sample dataset in CSV format and upload it to an Amazon S3 bucket.

○ Prerequisites: customer.csv, orders.csv, and restaurant.csv

● Create a table that matches the sample dataset.

● Use the COPY command to load the data from the S3 bucket into the table by running the command.

COPY tmp_customer FROM 's3://{bucket-name}/csv/Customer.csv' IAM_ROLE '{IAM-Credentials}' FORMAT AS CSV DELIMITER ',' QUOTE '"' IGNOREHEADER 1 REGION AS 'us-east-1';
COPY tmp_orders FROM 's3://{bucket-name}/csv/Orders.csv' IAM_ROLE '{IAM-Credentials}' FORMAT AS CSV DELIMITER ',' QUOTE '"' IGNOREHEADER 1 REGION AS 'us-east-1';
COPY tmp_restaurant FROM 's3://{bucket-name}/csv/Restaurant.csv' IAM_ROLE '{IAM-Credentials}' FORMAT AS CSV DELIMITER ',' QUOTE '"' IGNOREHEADER 1 REGION AS 'us-east-1';

STEP 14:

Insert/merge the data from the temporary tables into the main tables.

INSERT INTO sales_related_schema.customer SELECT * FROM tmp_customer;
INSERT INTO sales_related_schema.orders SELECT * FROM tmp_orders;
INSERT INTO sales_related_schema.restaurant SELECT * FROM tmp_restaurant;

STEP 15:

Load updated data from a CSV file into the temporary table.

COPY public.tmp_customer FROM 's3://{bucket-name}/updated_csv/Customer.csv' IAM_ROLE '{IAM-Credentials}' FORMAT AS CSV DELIMITER ',' QUOTE '"' IGNOREHEADER 1 REGION AS 'us-east-1';

STEP 16:

Merge the data into the main table.

MERGE INTO sales_related_schema.customer USING tmp_customer B
ON (sales_related_schema.customer.customer_id = B.customer_id )
WHEN MATCHED THEN
UPDATE SET
first_name = B.first_name,
last_name = B.last_name,
email = B.email,
gender = B.gender,
phone = B.phone,
date_of_birth = B.date_of_birth,
signed_up_date = B.signed_up_date,
active_status = B.active_status
WHEN NOT MATCHED THEN
INSERT VALUES
(
B.customer_id,
B.first_name,
B.last_name,
B.email,
B.gender,
B.phone,
B.date_of_birth,
B.signed_up_date,
B.active_status
);

● The MERGE query checks whether the record already exists: if it does, it updates the record; otherwise, it inserts a new one.

STEP 17:

Query the Data

Run a few example queries on the sample data to demonstrate query capabilities.

1. Verify the Data in the tables.

● SELECT QUERY

SELECT * FROM "sales_related_schema"."customer";

● COUNT QUERY

SELECT count(*) FROM "sales_related_schema"."customer";

2. Based on the 3 tables created, generate a set of summary tables.

CREATE SCHEMA IF NOT EXISTS reporting;

Report 1 — Order by Restaurant

CREATE TABLE reporting.orders_by_restaurant AS
SELECT
restaurant_id, sales_date, count(*) AS no_of_orders
FROM
sales_related_schema.orders
WHERE order_status='CLOSED' AND
sales_date BETWEEN '2023-01-01' AND '2023-03-31'
GROUP BY restaurant_id, sales_date;

Report 2 — Sales by Restaurant

CREATE TABLE reporting.sales_by_restaurant AS
SELECT
restaurant_id, sales_date, sum(total) AS total_sales
FROM
sales_related_schema.orders
WHERE order_status='CLOSED'
GROUP BY restaurant_id, sales_date;

Report 3 — No of Orders by Customers

CREATE TABLE reporting.no_of_orders_by_customer AS
SELECT
customer_orders.customer_id,
count(customer_orders.order_id) AS no_of_orders
FROM
(
SELECT
A.customer_id, B.order_id
FROM
sales_related_schema.customer AS A JOIN sales_related_schema.orders AS B
ON A.customer_id=B.customer_id
WHERE B.order_status='CLOSED'
GROUP BY A.customer_id, B.order_id
) AS customer_orders
GROUP BY customer_id;

Report 4 — Find top 10 buyers by quantity.

CREATE TABLE reporting.top_10_customers_by_quantity AS
SELECT first_name, last_name, total_quantity
FROM (
SELECT customer_id, sum(no_of_item) as total_quantity
FROM sales_related_schema.orders where customer_id is not null AND customer_id!=''
GROUP BY customer_id
ORDER BY total_quantity desc limit 10
) Q, sales_related_schema.customer
WHERE Q.customer_id = sales_related_schema.customer.customer_id
ORDER BY Q.total_quantity desc;

STEP 18:

Unload Data

● Use the UNLOAD command to export data from Redshift to an S3 bucket.

UNLOAD ('select * from reporting.top_10_customers_by_quantity')
to 's3://{bucket-name}/'
IAM_ROLE '{IAM-Credentials}'
header
CSV DELIMITER AS ','
allowoverwrite
parallel off
EXTENSION 'top_10_customers_by_quantity_report.csv';

In this way, with Redshift as the data warehouse, data scientists can run queries, generate reports, and analyze data efficiently, gaining insights that are paramount to making informed decisions.

The next step is to automate this process by creating a data pipeline with AWS Glue. Users can easily create and run ETL jobs that extract data from various sources, transform the data, and load it into Redshift. This integration eliminates the need for manual data movement and allows users to focus on analyzing their data.

Conclusion

AWS Redshift is a powerful and flexible cloud-based data warehousing solution that helps organizations store, manage, and analyze large volumes of data in a cost-effective and efficient way. It provides the scalability, performance, and integration capabilities needed for efficient data storage, analysis, and decision-making. With its ease of use, Redshift is an ideal choice for businesses of all sizes and across various industries to gain valuable insights from their data, enabling them to make informed decisions and drive business success. It is suitable for a wide range of use cases, including data warehousing, big data analytics, real-time analytics, and machine learning. To learn more, explore the official AWS Redshift documentation and other resources.
