“A drone shot of an intersection with an overpass” by Edouard Ki on Unsplash

Kinesis Firehose to Redshift pipeline with Terraform

Andrej Ocenas

--

AWS is a great piece of technology. It can help you build powerful solutions by stacking up pieces like Lego bricks. Getting them to stack up right, though, can be challenging, and sometimes you end up reading about routing tables when all you wanted was to get some data into your database.

This should be a semi-complete example of how to create a pipeline with Firehose that pushes data into Redshift from a Node server, including the creation of the VPC and IAM permissions. There are obvious missing pieces in this setup, like better security, logging and tighter permissions, but it should get you started.

If you want to jump straight to the code, here is the repo. Otherwise I will try to give some explanation and context so the code is easier to modify and reuse for people not proficient with Terraform and AWS.

Disclaimer: This code creates resources that, while running, will be billed to your AWS account.

Overview

The architecture of this demo is simple: our Node server receives a POST request with a JSON payload and pushes the data into Firehose. Firehose stores the data in an S3 bucket and triggers a COPY operation on Redshift to load the data from S3.

Terraform config

Terraform helps you describe and create/modify/destroy your infrastructure. Running terraform apply creates a plan of execution, which is executed after confirmation. terraform destroy destroys all resources defined in the configuration files. Be aware that terraform destroy can sometimes time out or fail to destroy some resources. If you worry about incurring charges, make sure you understand what was created so you can destroy it manually if needed. In particular, the S3 bucket used in this example to store data will not be destroyed by Terraform if it is not empty.

Provider

We will use the AWS Terraform provider for creating the AWS resources:
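The provider block itself is embedded from the repo; a minimal sketch, with a placeholder region and profile name, looks roughly like this:

provider "aws" {
  region  = "eu-west-1"   # placeholder: region where the resources will be created
  profile = "my-profile"  # placeholder: named profile from your AWS credentials file
}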

We specify region and profile here to define where to create the resources and with what credentials. When you set up the AWS CLI it creates a default profile for you, which would be used here if profile were not explicitly specified. When working with multiple AWS accounts (like a company account and a personal account) I found it easier not to have a default profile, so I am forced to specify the profile explicitly.

S3 resources

We will create the S3 bucket where Firehose will store the data. Shipping the data to S3 first is necessary in this setup, as Redshift does not talk to Firehose directly.
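A sketch of the bucket resource (the bucket name is a placeholder and must be globally unique):

resource "aws_s3_bucket" "s3_bucket" {
  bucket = "firehose-redshift-demo-bucket" # placeholder, S3 bucket names are global

  # force_destroy defaults to false, so terraform destroy will not delete a non-empty bucket
  force_destroy = false
}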

We will also create an S3 object with some configuration. It will be used as a mapping from the JSON data in the S3 files to Redshift columns. This is not necessary if your JSON data are flat objects whose keys match the Redshift table columns.
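A sketch of that object, assuming the mapping is a Redshift JSONPaths file and using the older aws_s3_bucket_object resource (newer provider versions call it aws_s3_object):

resource "aws_s3_bucket_object" "jsonpaths" {
  bucket = "${aws_s3_bucket.s3_bucket.bucket}"
  key    = "config/jsonpaths.json"

  # maps JSON keys to Redshift columns by position
  content = <<EOF
{
  "jsonpaths": [
    "$.name",
    "$.value"
  ]
}
EOF
}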

IAM

Next we create a role and a policy so we can access the S3 bucket. We will use a single role for both Redshift and Firehose to simplify the setup, but otherwise you would probably want separate roles. We also did not set up any logging or encryption, so we only add S3 permissions. You can read more about the required permissions here.

"Principal": {
"Service": [
"firehose.amazonaws.com",
"redshift.amazonaws.com"
]
},

This part says which services are allowed to assume the role. We have to list both services here, otherwise the corresponding service would not be able to use the role even if you attached it to its resources.

"Action": [
"s3:AbortMultipartUpload",
...
],
"Resource": [
"${aws_s3_bucket.s3_bucket.arn}",
"${aws_s3_bucket.s3_bucket.arn}/*"
]

The policy allows only the actions Firehose and Redshift need to access the data.
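Put together, a sketch of the role and the inline policy; the resource names are placeholders and the action list follows the permissions AWS documents for Firehose S3 delivery, so the repo may differ slightly:

resource "aws_iam_role" "firehose_redshift_role" {
  name = "firehose-redshift-role"

  # both services are allowed to assume this role
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": [
          "firehose.amazonaws.com",
          "redshift.amazonaws.com"
        ]
      }
    }
  ]
}
EOF
}

resource "aws_iam_role_policy" "s3_access" {
  name = "s3-access"
  role = "${aws_iam_role.firehose_redshift_role.id}"

  # only S3 access to the delivery bucket, no logging or KMS permissions
  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:AbortMultipartUpload",
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:PutObject"
      ],
      "Resource": [
        "${aws_s3_bucket.s3_bucket.arn}",
        "${aws_s3_bucket.s3_bucket.arn}/*"
      ]
    }
  ]
}
EOF
}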

VPC

This section is somewhat optional, depending on whether you have a default VPC in your account or not. If you have an old account that still allows running instances outside of a VPC, you probably do not have a default VPC and will need to create one for the Redshift cluster. If you have a default VPC you can run the cluster there, or you can still create a separate VPC for this demo.

This creates a VPC with one public subnet. We need the subnet to be public because we want to access our Redshift cluster from outside, and because Firehose needs access to it.
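A sketch using the community VPC module from the registry; the CIDR ranges and availability zone are illustrative placeholders:

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name           = "redshift-demo-vpc"
  cidr           = "10.0.0.0/16"
  azs            = ["eu-west-1a"]
  public_subnets = ["10.0.1.0/24"]

  enable_dns_support   = true
  enable_dns_hostnames = true
}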

Then we create a security group that allows all traffic in and out. In a real-life scenario you would want to allow access only from trusted sources like Firehose or your own servers.
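A sketch of such a wide-open security group, assuming the VPC module above (demo only):

resource "aws_security_group" "redshift" {
  name   = "redshift-demo-sg"
  vpc_id = "${module.vpc.vpc_id}"

  # allow all inbound traffic - lock this down to trusted sources in real use
  ingress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # allow all outbound traffic
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}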

Finally, we create the resource needed to connect the subnet with Redshift.
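That resource is a Redshift subnet group; a sketch, again assuming the VPC module above:

resource "aws_redshift_subnet_group" "default" {
  name       = "redshift-demo-subnet-group"
  subnet_ids = ["${element(module.vpc.public_subnets, 0)}"]
}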

As a side note, this is the only part where I use Terraform modules. Modules are a nice way to structure your Terraform config and encapsulate logic into reusable parts with inputs and outputs, similar to putting your code into functions. You can find modules in the registry. You could probably shorten the config with more modules; my decision where to use them is mostly arbitrary.

Redshift cluster

We link our subnet, security group and role with the cluster. We need to set publicly_accessible = true so it gets a public IP address. The option skip_final_snapshot = true allows us to destroy the cluster faster, without any data hanging around afterwards. In production you would probably want it set to false. Similarly, you should not keep passwords in files committed to your VCS. For additional info about how to handle secrets with Terraform, look in the docs or here for a more comprehensive explanation.

We also use a local-exec provisioner to run SQL that creates a table to store the data. This runs only when the cluster is created. If you do not have psql locally you can comment it out and run the CREATE TABLE statement manually.
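A sketch of the cluster resource, including the provisioner; the identifiers, credentials and table definition are placeholders, not the values from the repo:

resource "aws_redshift_cluster" "cluster" {
  cluster_identifier        = "redshift-demo-cluster"
  database_name             = "demo"
  master_username           = "awsuser"          # placeholder - do not commit real credentials
  master_password           = "ChangeMe12345"    # placeholder - do not commit real credentials
  node_type                 = "dc2.large"
  cluster_type              = "single-node"
  cluster_subnet_group_name = "${aws_redshift_subnet_group.default.name}"
  vpc_security_group_ids    = ["${aws_security_group.redshift.id}"]
  iam_roles                 = ["${aws_iam_role.firehose_redshift_role.arn}"]
  publicly_accessible       = true
  skip_final_snapshot       = true

  # Runs only when the cluster is first created. Comment this out if you do not
  # have psql installed and run the CREATE TABLE statement manually instead.
  provisioner "local-exec" {
    command = "psql \"postgres://awsuser:ChangeMe12345@${self.endpoint}/demo\" -c \"CREATE TABLE events (name varchar(64), value float)\""
  }
}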

Firehose

As mentioned before, we still need to set up the S3 destination even though the final destination is Redshift. Again, we need to link the IAM role and specify access to the Redshift cluster. We use copy options to specify the mapping from the JSON to columns. An important thing to note is that by default Redshift will not understand the ISO date format, and you need timeformat 'auto' in the options.
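A sketch of the delivery stream, assuming the older AWS provider layout where s3_configuration sits at the top level (newer provider versions nest it inside redshift_configuration); the table name and credentials are placeholders matching the cluster sketch above:

resource "aws_kinesis_firehose_delivery_stream" "firehose" {
  name        = "redshift-demo-stream"
  destination = "redshift"

  # intermediate S3 delivery; Redshift loads the data from here via COPY
  s3_configuration {
    role_arn        = "${aws_iam_role.firehose_redshift_role.arn}"
    bucket_arn      = "${aws_s3_bucket.s3_bucket.arn}"
    buffer_interval = 60   # minimum value, so data shows up at most once a minute
  }

  redshift_configuration {
    role_arn        = "${aws_iam_role.firehose_redshift_role.arn}"
    cluster_jdbcurl = "jdbc:redshift://${aws_redshift_cluster.cluster.endpoint}/${aws_redshift_cluster.cluster.database_name}"
    username        = "awsuser"         # placeholder, must match the cluster credentials
    password        = "ChangeMe12345"   # placeholder, must match the cluster credentials
    data_table_name = "events"
    copy_options    = "json 's3://${aws_s3_bucket.s3_bucket.bucket}/config/jsonpaths.json' timeformat 'auto'"
  }
}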

Firehose creates files in S3 with a time prefix, plus a manifest file that is used in the COPY command issued from Redshift. This means that if you delete the manifest file, Firehose will still create new data files, but without creating another manifest, and the COPY will not work. It seems that recreating the Firehose stream or recreating the manifest manually is the only way to recover from that.

Client

The client is a fairly standard Node server that uses the AWS SDK to push received data into Firehose. After POSTing to the /data endpoint a payload in the format { "name": "test_value", "value": 1.0 }, you should see it in Redshift. This can take a minute, as the data is buffered by Firehose and is not delivered to S3 in real time. The buffer_interval setting controls this behaviour, but 60 seconds is the minimum value.

Again, the full code can be found here. Follow the readme to run the whole demo.

If you find any problems, comment here or create an issue in the repo.
