How to create a Redshift stack with AWS CloudFormation

Use this CloudFormation template to launch Redshift into your VPC subnet with S3 as the data source

This is the seventh and final article in our Infrastructure as Code blog series. You can read the first six here:

Our third and final template creates an Amazon Redshift stack. Redshift is a data warehousing solution that allows you to run complex data queries on huge data sets within seconds (it’s pretty awesome). You can use it to generate reports and analyze customer data. This stack will help you get up and running with Redshift.

There are a number of ways to get your data into Redshift. In this template, we use S3 as the data source.

For simplicity, we’ll put Redshift in a VPC subnet so that you can connect directly to it without setting up a VPN or proxy (note: we don’t recommend this for production environments). For some baseline security, Redshift will be locked down to your specific IP address.

Here’s the entire Redshift template:

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift Stack
Conditions:
SingleNode: !Equals [ !Ref RedshiftNodeCount, 1 ]
Parameters:
SubnetA:
Type: String
Type: AWS::EC2::Subnet::Id
Description: Make sure this belongs to the VPC specified below (e.g. 172.31.0.0/20)
SubnetB:
Type: String
Type: AWS::EC2::Subnet::Id
Description: Make sure this is different from the subnet above (e.g. 172.31.16.0/20)
VPCID:
Type: String
Type: AWS::EC2::VPC::Id
Description: Select a VPC (e.g. 172.31.0.0/16)
DataBucketName:
Type: String
Description: S3 data bucket name
DatabaseName:
Type: String
Description: Database name
MasterUsername:
Type: String
Description: Master user name for Redshift
Default: admin
MasterUserPassword:
Type: String
Description: Master password for Redshift (used mixed case and numbers)
NoEcho: true
DeveloperIPAddress:
Type: String
Description: Your public IP address (see http://checkip.dyndns.org/)
RedshiftNodeCount:
Type: Number
Description: Number of Redshift nodes
Default: 1
MinValue: 1
ConstraintDescription: Must be a number greater or equal to 1
Resources:
RedshiftCluster:
Type: AWS::Redshift::Cluster
Properties:
ClusterSubnetGroupName: !Ref RedshiftClusterSubnetGroup
ClusterType: !If [ SingleNode, single-node, multi-node ]
NumberOfNodes: !If [ SingleNode, !Ref 'AWS::NoValue', !Ref RedshiftNodeCount ] #'
DBName: !Sub ${DatabaseName}
IamRoles:
- !GetAtt RawDataBucketAccessRole.Arn
MasterUserPassword: !Ref MasterUserPassword
MasterUsername: !Ref MasterUsername
PubliclyAccessible: true
NodeType: dc1.large
Port: 5439
VpcSecurityGroupIds:
- !Sub ${RedshiftSecurityGroup}
PreferredMaintenanceWindow: Sun:09:15-Sun:09:45
DataBucket:
Type: AWS::S3::Bucket
Properties:
BucketName: !Sub ${DataBucketName}
RawDataBucketAccessRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: 2012-10-17
Statement:
-
Effect: Allow
Principal:
Service:
- redshift.amazonaws.com
Action:
- sts:AssumeRole
RawDataBucketRolePolicy:
Type: AWS::IAM::Policy
Properties:
PolicyName: RawDataBucketRolePolicy
PolicyDocument:
Version: 2012-10-17
Statement:
-
Effect: Allow
Action: s3:ListAllMyBuckets
Resource: arn:aws:s3:::*
-
Effect: Allow
Action:
- 's3:Get*'
- 's3:List*'
Resource: '*'
-
Effect: Allow
Action: cloudwatch:*
Resource: "*"
Roles:
- !Ref RawDataBucketAccessRole
RedshiftClusterSubnetGroup:
Type: AWS::Redshift::ClusterSubnetGroup
Properties:
Description: Cluster subnet group
SubnetIds:
- !Ref SubnetA
- !Ref SubnetB
RedshiftSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Enable JDBC port
VpcId: !Ref VPCID
SecurityGroupIngress:
-
CidrIp: !Sub ${DeveloperIPAddress}/32
FromPort: 5439
ToPort: 5439
IpProtocol: tcp
Description: IP address for your dev machine
-
SourceSecurityGroupId: !Ref AccessToRedshiftSecurityGroup
FromPort: 5439
ToPort: 5439
IpProtocol: tcp
Description: Access to redshift
AccessToRedshiftSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Access to Redshift access
VpcId: !Ref VPCID
InternalSecurityGroupIngress:
Type: AWS::EC2::SecurityGroupIngress
Properties:
IpProtocol: tcp
FromPort: 0
ToPort: 65535
SourceSecurityGroupId: !Ref RedshiftSecurityGroup
GroupId: !Ref RedshiftSecurityGroup
Outputs:
RedshiftClusterEndpointAddress:
Description: Redshift Cluster Endpoint Address
Value: !GetAtt RedshiftCluster.Endpoint.Address
RedshiftClusterEndpoint:
Description: Redshift Cluster Endpoint
Value:
Fn::Join:
- ""
- - 'jdbc:redshift://'
- !GetAtt RedshiftCluster.Endpoint.Address
- ':5439/'
- !Sub ${DatabaseName}

You can download this CloudFormation template by clicking here.

Let’s explore what’s going on here.

Creation of the Redshift cluster

The first thing we do is create the Redshift cluster.

Please note that the code snippet below is simplified for demonstration purposes and doesn’t yet match the code we provided in the overall template above. We’ll revisit and explain the additional Redshift properties in a later section.

.
.
.
Resources
RedshiftCluster:
Type: AWS::Redshift::Cluster
Properties:
ClusterType: SingleNode
NumberOfNodes: 1
DBName: !Sub ${DatabaseName}
MasterUserPassword: !Ref MasterUserPassword
MasterUsername: !Ref MasterUsername
PubliclyAccessible: true
NodeType: dc1.large

Here are the key aspects of this code:

  • “ClusterType”: This can be “SingleNode” or “MultiNode”. For now, we hard-code “SingleNode”.
  • “NumberOfNodes”: Since we’re using “SingleNode”, this has to be set to 1.
  • “DBName”: This refers to a parameter in the Parameters section called “DatabaseName”, which becomes the name of our Redshift database.
  • “MasterUsername”: This is another parameter that sets the master user name.
  • “MasterUserPassword”: This is also a parameter for setting the master password.
  • “PubliclyAccessible”: This is set to true so that you can connect to it easily.
  • “NodeType”: “dc1.large” is the least expensive node type

Like we mentioned prior, there are a few more Redshift properties that we’ve included in our overall template that we’ll explain in a later section titled “More Redshift cluster properties”.

Set up S3 as a data source

Redshift can load data from different data sources. In this example, we’ll be using S3.

To set this up, we have to create an S3 bucket and an IAM role that grants Redshift access to S3. This is what the code looks like:

Resources:
.
.
.
DataBucket:
Type: AWS::S3::Bucket
Properties:
BucketName: !Sub ${DataBucketName}
RawDataBucketAccessRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: 2012-10-17
Statement:
-
Effect: Allow
Principal:
Service:
- redshift.amazonaws.com
Action:
- sts:AssumeRole
RawDataBucketRolePolicy:
Type: AWS::IAM::Policy
Properties:
PolicyName: RawDataBucketRolePolicy
PolicyDocument:
Version: 2012-10-17
Statement:
-
Effect: Allow
Action: s3:ListAllMyBuckets
Resource: arn:aws:s3:::*
-
Effect: Allow
Action:
- 's3:Get*'
- 's3:List*'
Resource: '*'
-
Effect: Allow
Action: cloudwatch:*
Resource: "*"
Roles:
- !Ref RawDataBucketAccessRole

Here’s a quick overview of what’s going on:

  • “S3::Bucket”: The bucket name comes from a parameter called “DataBucketName”.
  • “IAM::Role”: This is the IAM role that allows access to S3. It doesn’t have any permissions yet but it allows the Redshift service to assume this role.
  • “IAM::Policy”: This contains a list of permissions for accessing S3 and Cloudwatch. The policy associates itself with the IAM Role.

So far, the architecture looks like this:

Create VPC and public subnets

You usually want to put databases in a private subnet, like we mentioned in our VPC template article. But in the early stages of a project, you might want direct access to Redshift from your development machine.

We don’t recommend this for production environments, but in this development case, you can start off by putting Redshift into your VPC subnet.

Resources:
.
.
.
RedshiftClusterSubnetGroup:
Type: AWS::Redshift::ClusterSubnetGroup
Properties:
Description: Cluster subnet group
SubnetIds:
- !Ref SubnetA
- !Ref SubnetB

We can’t put Redshift in a subnet directly, so here we put Redshift in something called a “ClusterSubnetGroup”. You can then add multiple subnets to the “ClusterSubnetGroup”. These subnets should be in different availability zones, which helps with high availability.

Subnet and VPC Parameters

The Redshift CloudFormation template doesn’t create any subnets or networks of its own. Instead, it asks you for parameters — two public subnets and a VPC.

Parameters:
SubnetA:
Type: String
Type: AWS::EC2::Subnet::Id
Description: Make sure this belongs to the VPC specified below (e.g. 172.31.0.0/20)
SubnetB:
Type: String
Type: AWS::EC2::Subnet::Id
Description: Make sure this is different from the subnet above (e.g. 172.31.16.0/20)
VPCID:
Type: String
Type: AWS::EC2::VPC::Id
Description: Select a VPC (e.g. 172.31.0.0/16)

You can just pick the VPC and public subnets that come by default in every region of each AWS account.

We wind up with this architecture:

Create a security group

So far, the Redshift cluster is in a public subnet. But before we can connect to it, we have to add a security group to allow port traffic to Redshift.

Resources:
.
.
.
RedshiftSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Enable JDBC port
VpcId: !Ref VPCID
SecurityGroupIngress:
-
CidrIp: !Sub ${DeveloperIPAddress}/32
FromPort: 5439
ToPort: 5439
IpProtocol: tcp
Description: IP address for your dev machine

This allows port 5439 traffic, which is the default TCP port for Redshift. This is locked down to the public IP address of your computer, which you provide via the CloudFormation parameter “DeveloperIPAddress”.

Configure security group access

During development, you’ll want to access Redshift directly from your development machine. But eventually, you want to make calls to Redshift from an application, such as AWS Lambda.

For this, you need to create other security groups and grant these access to Redshift.

Resources:
.
.
.
RedshiftSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Enable JDBC port
VpcId: !Ref VPCID
SecurityGroupIngress:
-
CidrIp: !Sub ${DeveloperIPAddress}/32
FromPort: 5439
ToPort: 5439
IpProtocol: tcp
Description: IP address for your dev machine
-
SourceSecurityGroupId: !Ref AccessToRedshiftSecurityGroup
FromPort: 5439
ToPort: 5439
IpProtocol: tcp
Description: Access to redshift
AccessToRedshiftSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Access to Redshift access
VpcId: !Ref VPCID
InternalSecurityGroupIngress:
Type: AWS::EC2::SecurityGroupIngress
Properties:
IpProtocol: tcp
FromPort: 0
ToPort: 65535
SourceSecurityGroupId: !Ref RedshiftSecurityGroup
GroupId: !Ref RedshiftSecurityGroup

This example builds off of the “RedshiftSecurityGroup” from the previous section. Here, we’re configuring two types of access:

  • “AccessToRedshiftSecurityGroup”: This is an additional security group that you might assign to an application, such as AWS Lambda. We add a security group ingress rule that allows inbound traffic on port 5439.
  • “InternalSecurityGroupIngress”: This is a standalone rule that allows resources in one “RedshiftSecurityGroup” to access another. It’s configured as a standalone ingress rule, because CloudFormation resources can’t reference themselves within their own properties.

More Redshift Cluster properties

As you wrap up development, you’ll want to start thinking about deploying to production. Here are a few tweaks to the Redshift cluster that we created in the first section that might come in handy:

.
.
.
Conditions:
SingleNode: !Equals [ !Ref RedshiftNodeCount, 1 ] . . .
Parameters:
.
.
.
RedshiftNodeCount:
Type: Number
Description: Number of Redshift nodes
Default: 1
MinValue: 1
ConstraintDescription: Must be a number greater or equal to 1
.
.
.
Resources:
.
.
.
RedshiftCluster:
Type: AWS::Redshift::Cluster
Properties:
ClusterSubnetGroupName: !Ref RedshiftClusterSubnetGroup
ClusterType: !If [ SingleNode, single-node, multi-node ]
NumberOfNodes: !If [ SingleNode, !Ref 'AWS::NoValue', !Ref RedshiftNodeCount ]
DBName: !Sub ${DatabaseName}
IamRoles:
- !GetAtt RawDataBucketAccessRole.Arn
MasterUserPassword: !Ref MasterUserPassword
MasterUsername: !Ref MasterUsername
PubliclyAccessible: true
NodeType: dc1.large
Port: 5439
VpcSecurityGroupIds:
- !Sub ${RedshiftSecurityGroup}
PreferredMaintenanceWindow: Sun:09:15-Sun:09:45

We add a parameter called “RedshiftNodeCount”. This represents how many Redshift nodes you want in your cluster.

We use a condition called “SingleNode” that checks if we have just one node. If so, we pass “single-node” to the “ClusterType” property. Otherwise, we pass in “multi-node” if more than one node was specified.

The “NumberOfNodes” property gets a little tricky. If there are multiple nodes, we can just pass in the “RedshiftNodeCount”. But if there’s just a single node, we get an error if we populate “NumberOfNodes” with any information, even if it’s just the number “1”.

The way around this is to use the pseudo parameter “AWS::NoValue”. If there’s just a single node, we pass “AWS::NoValue” to “NumberOfNodes” (which has the same effect as deleting that property).

Here’s the final architecture:

In this diagram, you can access your Redshift cluster from both your development machine, and an application such as AWS Lambda.

Outputs

Once you spin up a Redshift cluster, the first thing you want to do is connect to it. One useful piece of information to output would be the Redshift cluster endpoint.

.
.
.
Outputs:
RedshiftClusterEndpointAddress:
Description: Redshift Cluster Endpoint Address
Value: !GetAtt RedshiftCluster.Endpoint.Address
RedshiftClusterEndpoint:
Description: Redshift Cluster Endpoint
Value:
Fn::Join:
- ""
- - 'jdbc:redshift://'
- !GetAtt RedshiftCluster.Endpoint.Address
- ':5439/'
- !Sub ${DatabaseName}

Here we have the “RedshiftClusterEndpointAddress”, which gives you the DNS hostname of the Redshift cluster.

To make things even more convenient, we construct a JDBC url in the format of:

jdbc:redshift://examplecluster.cg034hpkmmjt.us-east-1.redshift.amazonaws.com:5439/dbname

which you can paste into your database client software.

Conclusion

Redshift is a really powerful data warehousing tool that makes it fast and simple to analyze your data and glean insights that can help your business. This CloudFormation template will help you automate the deployment of and get you going with Redshift.

Overall, there’s so much that you can do with CloudFormation and it’s difficult to review every little detail. But we hope that walking through these templates gives you a better idea of the power of CloudFormation and how you can use it to manage your AWS deployments.

Let us know if you have any questions in the comments below, we’d love to hear your thoughts.

Like this post? It likes you too. :)

If you enjoyed this post, please give it some claps so others can discover it more easily!

This article was originally published on ThornTech.com.

Like what you read? Give Thorn Technologies a round of applause.

From a quick cheer to a standing ovation, clap to show how much you enjoyed this story.