Amazon Redshift Setup — Tutorial for Use with AWS VPC, EC2 and PostgreSQL

A step-by-step walkthrough of this (too awkward and laborious) process.

Will Nowak
AWS Tip


I’m new to AWS Redshift (having used AWS RDS and Google BigQuery, but never AWS’s data warehouse solution), and I found it more painful than necessary to get up and running with a hello-world setup. Hopefully this walkthrough makes things easier on you than it was for me.

Let’s dig in and get right to the specific instructions, which is where, if anywhere, this post adds the most value. Strap in, as this will probably take ~30 minutes to complete.

Quick disclaimer: this is meant for getting up and running with a demo instance. I bear no responsibility for security flaws that might arise from this setup. That said, please do comment below with ways to improve this workflow!

(Step 0…have an AWS account and log in to the AWS console.)

I. Set up AWS VPC

  • Open the VPC console.
  • Start VPC Wizard
  • Select VPC with a Single Public Subnet
  • Add appropriate labels for VPC name and Subnet name
  • Create VPC
  • You should see “Your VPC has been successfully created.”
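
If you prefer the command line, roughly the same thing can be done with the AWS CLI. This is only a sketch: the CIDR blocks, IDs, and Name tag below are placeholder values I chose for illustration, and note that the wizard also wires up an internet gateway and route table for you, which these two commands do not.

    # Create the VPC (CIDR block is an example value)
    aws ec2 create-vpc --cidr-block 10.0.0.0/16

    # Tag it with a name, using the VpcId returned above (placeholder ID here)
    aws ec2 create-tags --resources vpc-0123456789abcdef0 --tags Key=Name,Value=redshift-demo-vpc

    # Create a single public subnet inside the VPC
    aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 --cidr-block 10.0.0.0/24

    # Unlike the wizard, you would still need to create and attach an internet
    # gateway and add a route to it (create-internet-gateway,
    # attach-internet-gateway, create-route) to make the subnet truly public.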

II. Create security group

  • Go to VPC console.
  • In the navigation pane, choose Security Groups.
  • Choose Create Security Group.
  • Give the group a name and description, and associate it with the VPC you just created.
  • Optionally, add a display name by clicking in the blank space in the “Name tag” column.
  • Select your group and click “Inbound rules”
  • Edit
  • Choose Add another rule, then select SSH (for Linux) or RDP (for Windows) from the Type list. Enter your network’s public IP address range in the Source field. (If you don’t know this address range, you can use 0.0.0.0/0 for testing purposes; in production, you authorize only a specific IP address or range of addresses to access your instance.)
  • Save and close.
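
For reference, here is a CLI sketch of the same security group setup. The group name, description, and IDs are placeholders; substitute your own VPC ID, and tighten the SSH source range for anything beyond a demo.

    # Create the security group inside the VPC from step I
    aws ec2 create-security-group --group-name redshift-demo-sg \
        --description "Demo security group for EC2 + Redshift walkthrough" \
        --vpc-id vpc-0123456789abcdef0

    # Allow inbound SSH (0.0.0.0/0 is acceptable only for throwaway testing)
    aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
        --protocol tcp --port 22 --cidr 0.0.0.0/0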

III. Create EC2 instance

  • “Launch instance”
  • Follow along with defaults, but be sure to:
  1. Use the VPC and subnet you created.
  2. Set Auto-assign Public IP to “Enable.”
  3. Choose the existing security group you made in step 2.
  • Review and launch. Ensure you have a .pem key for the connection.
  • Connect to the EC2 instance in the usual fashion: find your new instance, click “Connect,” and copy and paste the chmod and ssh commands into your terminal (from the directory where your .pem file is located), roughly as shown below.
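
The connect step boils down to two commands, along these lines. The key file name and hostname are placeholders (the “Connect” dialog shows the exact values for your instance, and ec2-user assumes an Amazon Linux AMI):

    # Restrict permissions on the key, then SSH in
    chmod 400 my-demo-key.pem
    ssh -i my-demo-key.pem ec2-user@ec2-12-34-56-78.compute-1.amazonaws.com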

IV. Redshift security

  • In the Redshift dashboard, click Subnet groups, then “Create cluster subnet group.” Make sure the VPC ID matches your VPC (if you have multiple VPCs).
  • Pick an availability zone and select your subnet ID.
  • Add
  • Create
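
The CLI version of this step looks roughly like the following; the group name, description, and subnet ID are placeholders:

    aws redshift create-cluster-subnet-group \
        --cluster-subnet-group-name redshift-demo-subnet-group \
        --description "Subnet group for the demo cluster" \
        --subnet-ids subnet-0123456789abcdef0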

V. Launch RS cluster

  • From the Redshift dashboard, “Launch cluster.” Fill out the information as desired, and remember your master password, as you’ll need it to connect to the data warehouse.
  • Select the number and type of nodes as desired. There is a lot of literature online about the pros and cons of these selections; that is beyond the scope of this tutorial.
  • Make sure you select your appropriate VPC and cluster subnet group.
  • Select the security group you made before as your security group.
  • Proceed. It typically takes ~10 minutes for the provisioned cluster to appear in the Redshift dashboard.
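
As a sketch, here is the same launch from the CLI. The identifier, credentials, database name, and node choices below are placeholders for illustration, not recommendations:

    aws redshift create-cluster \
        --cluster-identifier redshift-demo-cluster \
        --node-type dc2.large \
        --number-of-nodes 2 \
        --master-username awsuser \
        --master-user-password 'ChooseAStrongPassw0rd' \
        --db-name demo \
        --cluster-subnet-group-name redshift-demo-subnet-group \
        --vpc-security-group-ids sg-0123456789abcdef0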

VI. Connect permissions between EC2, Redshift

  • If you followed along, you should see that your Redshift endpoint warns you that it currently has no inbound permissions. You’ll have the option to edit your security group to fix this.
  • You want your VPC security group (listed in your cluster’s Configuration tab) to have a Custom TCP Rule that allows traffic to your database port (typically 5439) from the EC2 security group as the source. Enter the Group ID, which looks something like “sg-82ik2b13” (see the sketch below).
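
From the CLI, that rule looks roughly like this. The first group ID is a placeholder for the cluster’s VPC security group; the second is the EC2 security group from step II:

    # Allow the EC2 security group to reach the Redshift port
    aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
        --protocol tcp --port 5439 --source-group sg-82ik2b13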

VII. Connect to Redshift

  • On the EC2 instance, run sudo yum install postgresql
  • psql -h <endpoint> -U <username> -d <db_name> -p 5439
  • You should be prompted for your password. Upon successfully entering your password, you should now have access to the Redshift cluster through the psql command line interface.
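
Once connected, a trivial query is enough to confirm everything is wired up. The endpoint, username, and database name below are placeholders; the -c flag simply runs one statement and exits:

    # Quick sanity check from the EC2 instance (substitute your own values)
    psql -h redshift-demo-cluster.abc123xyz4.us-east-1.redshift.amazonaws.com \
        -U awsuser -d demo -p 5439 \
        -c "SELECT current_database(), current_user, version();"

If that returns a row, you are talking to your Redshift cluster.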


Hopefully this helped. Props to my friends Drew and Ian for helping me out. They provided the key insight: we want to make a private network, allow our local machine to connect to an EC2 virtual machine, and finally connect that virtual machine to a data warehouse inside the same private network. This is the essence of all of the above steps, as I see it.

Please comment below with tips or suggestions for me and/or other practitioners! Let’s try to keep this current — so please do respond if you had issues when following along.

Please clap if useful. Please contact me at will@unboxresearch.com if you’d like to speak more about me, you, data science, machine learning, or the like. I do machine learning consulting at Unbox Research, and we are always happy to discuss your business and how we might be of assistance.
