Running a Redis cluster on AWS with Terraform as IaC and GitHub Actions as CI/CD

Sanil Khurana · Published in Geek Culture · 9 min read · Mar 14, 2022

Preface

In this post, I attempt to create a production-level Redis cluster with full CI/CD using GitHub Actions, IaC using Terraform, and Docker Compose for container orchestration, deploying it all on AWS. There is definitely room for improvement, but I tried to cover as much as I could in this post.

Introducing the actors

Before we dive into the implementation, let’s introduce the actors on the stage!

Terraform

Terraform is used for Infrastructure as Code (IaC). What this means is that you describe your infrastructure as code, and that's it! The more interesting part is why you might need it.

Imagine you don't use Terraform, and this isn't a small project for a blog post but a huge project with a big development team, constant changes, sprints, deadlines, and, most critically of all, a running production environment that shouldn't go down. Most of the changes you make to the application are probably code-level changes that aren't very complex: a CI/CD pipeline automatically deploys them to the cloud, and if the application crashes or its health deteriorates, you roll back with the click of a button and investigate in a separate environment. Occasionally, though, you will need to change your infrastructure as well.

My question is: how do you make these changes? Do you log in to your console and click a few buttons? How do you roll back, then? Again, by clicking around the GUI? What if multiple people try to make the same infrastructure changes at the same time? What if people forget the changes they made in the GUI? What if you need a quick rollback of a large piece of infrastructure? Would you manually click a hundred buttons and type in “permanently delete” five times while your customers are filing tickets and the product folks are breathing down your neck? And what if you need to make a change at 3 AM, but the developer who actually remembers the change is fast asleep?

The solution is writing Infrastructure as Code. If you were to write the EC2 instances, the S3 bucket, and the RDS database as code, upload that code to version control, and use a CLI tool that takes the code and creates the infrastructure, you could fix all of the problems above. Not sure what the infrastructure looked like before this big change? Just run git checkout HEAD~1 and you have your answer. Not sure who made a change? git blame will tell you whom to blame (unless they use git-blame-someone-else!). You could also automate it as part of your CI/CD flow so you don't need to set everything up manually, and you could provision the same resources in dev or staging that you do in production, without accidentally forgetting the exact configuration. There are probably a hundred other reasons why IaC is a good idea, but for now, just know that it is a good idea and that we will use Terraform as the tool to do it.

Terraform is fairly straightforward, at least in how we plan to use it: write a .tf file that defines your infrastructure, then run terraform apply to deploy it. There are nuances to this, like how we use Terraform modules to make the config more manageable and modular, how variables and outputs work, and how to manage state, but if Terraform is new to you, don't worry too much about any of that. Just know that we define the resources we want (like EC2 instances or S3 buckets) in .tf files and deploy them with a single command.
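To make that concrete, here is a minimal, stand-alone example (not part of this project's config; the region, AMI ID, and instance type are placeholders):

```hcl
# Configure the AWS provider; the region is a placeholder.
provider "aws" {
  region = "us-east-1"
}

# A single EC2 instance described as code. The AMI ID is a placeholder.
resource "aws_instance" "example" {
  ami           = "ami-0c02fb55956c7d316"
  instance_type = "t2.micro"

  tags = {
    Name = "terraform-example"
  }
}
```

Running terraform init followed by terraform apply creates the instance, and terraform destroy tears it down again.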

Redis

Redis is an in-memory database that can also be used as a pub/sub system. There is a lot to be said about it, but I have been writing about Redis for the past few months, so check that out if you are new to it.

EC2, ECR

For those who don't know, EC2 is an AWS service that lets you provision virtual servers. You can SSH into these servers, run your code, or do almost anything else you'd be able to do on your laptop via a terminal.

ECR stands for Elastic Container Registry and is a Docker registry service on AWS. If you have Docker images and want to put them somewhere, you'd typically use a registry service. You create repositories in ECR that you can push your Docker images to and pull them from.
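Pushing a local image to an ECR repository typically looks like this (the account ID, region, and repository name below are placeholders):

```sh
# Authenticate Docker with the ECR registry (account ID and region are placeholders).
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# Tag the local image with the repository URI and push it.
docker tag redis-node:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/redis-node:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/redis-node:latest
```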

GitHub Actions

GitHub Actions is the CI/CD tool we will use for this system. Simply put, we define the steps needed to deploy the services and the infrastructure in a YAML file, and GitHub runs a workflow based on that file on every push, deploying our services and our infrastructure.

Implementation

Ok, so let’s start with the implementation. As usual, here is the source code for the impatient ones.

Part 1: The directory structure

Let's start with how we are going to organize our code. We create two directories in our root directory, services and terraform. The terraform directory contains all the code related to setting up our AWS infra, and the services directory contains the services we define. Apart from this, we need a .github directory for our CI/CD workflow as well.

This is what my root directory looks like right now
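Roughly, that is just the three top-level directories (a sketch; the files inside each come later):

```
.
├── .github/
├── services/
└── terraform/
```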

Part 2: Docker setup for clustering Redis

This is something I have previously talked about a lot here. Go through this before you continue.

I mostly used the same solution for clustering Redis that I defined in the previous blog posts, but I did make a few minor changes.

Since I want the docker-compose file to run on an EC2 instance, I need to use a Docker registry to store the images and use the image directive in my docker-compose file instead of the build directive.

This is what my new docker-compose.yml file looks like
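A sketch of the shape that file takes (I am assuming three Redis nodes here; the ECR image URIs are placeholders, and the real file in the repo may differ):

```yaml
version: "3.8"

services:
  # Each Redis node runs the same image, pulled from ECR rather than built locally.
  redis-node-1:
    image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/redis-node:latest
    environment:
      - REDIS_PORT=7000   # now passed at runtime instead of build time
    network_mode: host

  redis-node-2:
    image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/redis-node:latest
    environment:
      - REDIS_PORT=7001
    network_mode: host

  redis-node-3:
    image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/redis-node:latest
    environment:
      - REDIS_PORT=7002
    network_mode: host

  # One-shot container that joins the nodes into a cluster once they are up.
  cluster-setup:
    image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/cluster-setup:latest
    depends_on:
      - redis-node-1
      - redis-node-2
      - redis-node-3
    network_mode: host
```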

And since I can no longer introduce the REDIS_PORT variable at build time, I pass it in as an environment variable instead of a build-time argument.

This is what the new Dockerfile looks like for redis-node
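Again, a minimal sketch under those assumptions (the base image tag and config path are placeholders; the real Dockerfile is in the repo):

```dockerfile
FROM redis:6.2

# Copy in the cluster-enabled Redis config from the previous post.
COPY redis.conf /etc/redis/redis.conf

# REDIS_PORT is supplied at runtime (e.g. by docker-compose) rather than
# baked in with a build-time ARG, so expand it when the container starts.
CMD ["sh", "-c", "redis-server /etc/redis/redis.conf --port ${REDIS_PORT}"]
```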

All this goes into the services directory. This is what the services directory looks like now
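Roughly:

```
services/
├── docker-compose.yml
├── redis-node/
│   ├── Dockerfile
│   └── redis.conf
└── cluster-setup/
    └── Dockerfile
```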

I have previously explained the redis.conf file and the Dockerfile for cluster-setup in an earlier post.

Part 3: Writing the code that runs on EC2

Now that we have the Docker containers we want to run on our EC2 instance, how do we go about running them? To do that, we write a script that executes when the EC2 instance first starts up.

This is called a userdata script. We will configure EC2 to use this script when we set up the EC2 instance using Terraform.

This is what the server-userdata.sh file looks like
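A minimal sketch of such a script, assuming Amazon Linux 2 (the account ID, region, and docker-compose version are placeholders):

```sh
#!/bin/bash
set -e

# 1. Install Docker and docker-compose (Amazon Linux 2).
yum update -y
amazon-linux-extras install docker -y
systemctl enable --now docker
curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" \
  -o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose

# 2. Save the docker-compose.yml file onto the instance.
mkdir -p /app
cat > /app/docker-compose.yml <<'EOF'
# ... contents of the docker-compose.yml shown earlier ...
EOF

# 3. Authenticate Docker against ECR (requires an instance role or credentials).
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# 4. Run the compose file.
cd /app && /usr/local/bin/docker-compose up -d
```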

I have added comments to explain the code, but in case it is still not clear, here is what we are doing:

  1. Installing Docker and docker-compose
  2. Saving the docker-compose.yml file
  3. Setting up AWS ECR
  4. And finally, running the docker-compose.yml file on our EC2

This file lives in our terraform directory since we will link it in our main.tf file.

Part 4: A brief on the Terraform modules

Before we dive into the main.tf file, let me brief you on a couple of modules we are using.

For those who don't know, Terraform modules are smaller pieces of Terraform code that you can reuse across your projects, similar to a library in which you define functions or variables and then import them in multiple places in your code. This makes your code more reusable, more modular, and easier to read, and using modules is generally considered good practice.

The way I have organized it in my directory structure is as follows:

There is a main.tf file, a modules directory, and the server-userdata.sh we talked about before.

The idea is simple: I write a main.tf file that imports and uses these modules.

Let’s look at how these modules work under the hood as well.

Both of these modules have their own main.tf file, along with a variables.tf, an outputs.tf, and a README.md file. The purpose of the README.md should be obvious. The variables.tf and outputs.tf files are what make a module genuinely modular: they let the module take inputs and return outputs, similar to arguments and return values in a function.
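For instance, a hypothetical variables.tf and outputs.tf for an EC2 module could look like this (the names are illustrative, not necessarily the ones used in the repo, and aws_instance.this would be defined in the module's own main.tf):

```hcl
# variables.tf -- inputs the caller passes to the module
variable "instance_type" {
  description = "EC2 instance type to launch"
  type        = string
  default     = "t2.micro"
}

variable "subnet_id" {
  description = "Subnet in which to place the instance"
  type        = string
}

# outputs.tf -- values the module returns to the caller
output "instance_id" {
  value = aws_instance.this.id
}

output "public_ip" {
  value = aws_instance.this.public_ip
}
```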

Now, on to the project at hand.

There are two modules that I wrote and that we are using here: the ec2-instance module and the single-public-subnet module. Simply put, the ec2-instance module provisions an EC2 instance along with security groups and anything else it requires, and the single-public-subnet module provisions a new VPC with a single public subnet, along with route tables, an internet gateway, and so on.

If you are interested, you can check out the code on GitHub but it is not really important to what we are building right now.

Part 5: The root main.tf file

This is the file that Terraform actually runs; it imports all the modules and deploys (or destroys) our infrastructure.

This is what it looks like
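A condensed sketch of it (the bucket name, lock table, region, and module input names are placeholders or illustrative; the full version is in the repo):

```hcl
terraform {
  # Remote state in S3, with a DynamoDB table used for state locking.
  backend "s3" {
    bucket         = "my-terraform-state-bucket"      # placeholder
    key            = "redis-cluster/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"                # placeholder
  }
}

provider "aws" {
  region = "us-east-1"
}

# VPC with a single public subnet.
module "single_public_subnet" {
  source = "./modules/single-public-subnet"
}

# EC2 instance that runs the docker-compose setup via the userdata script.
module "ec2_instance" {
  source    = "./modules/ec2-instance"
  subnet_id = module.single_public_subnet.subnet_id        # illustrative input name
  user_data = file("${path.module}/server-userdata.sh")    # illustrative input name
}

# ECR repositories for the two Docker images.
resource "aws_ecr_repository" "redis_node" {
  name = "redis-node"
}

resource "aws_ecr_repository" "cluster_setup" {
  name = "cluster-setup"
}
```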

It is fairly simple, since most of the code is abstracted away in our modules. It sets up our backend as S3 and DynamoDB (this is where Terraform stores its state and lock information), uses the two modules we talked about, and creates two ECR repositories that we will use to store our Docker images.

Part 6: The CI/CD

Finally, the part that actually sets it all up: the CI/CD flow in GitHub Actions. It is fairly large, but don't worry, I will explain it in depth.

We have two jobs: build-services and update-infrastructure.

build-services configures our AWS credentials and ECR, builds our Docker images for redis-node and cluster-setup, and pushes them to ECR.

Once build-services is complete, update-infrastructure runs. It sets up Terraform, removes the existing EC2 instance, and then recreates the infrastructure based on the current state.
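A condensed sketch of what such a workflow can look like (the secrets, region, and the resource address in the taint step are placeholders; the real workflow in the repo is more complete):

```yaml
name: deploy

on:
  push:
    branches: [main]

jobs:
  build-services:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Login to Amazon ECR
        id: ecr
        uses: aws-actions/amazon-ecr-login@v1

      - name: Build and push images
        run: |
          docker build -t ${{ steps.ecr.outputs.registry }}/redis-node:latest services/redis-node
          docker push ${{ steps.ecr.outputs.registry }}/redis-node:latest
          docker build -t ${{ steps.ecr.outputs.registry }}/cluster-setup:latest services/cluster-setup
          docker push ${{ steps.ecr.outputs.registry }}/cluster-setup:latest

  update-infrastructure:
    needs: build-services
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: hashicorp/setup-terraform@v1

      - name: Recreate infrastructure
        working-directory: terraform
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          terraform init
          # Force the EC2 instance to be replaced; the resource address is illustrative.
          terraform taint module.ec2_instance.aws_instance.this || true
          terraform apply -auto-approve
```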

This is what my complete directory structure looks like right now
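Roughly (the workflow file name is an assumption):

```
.
├── .github/
│   └── workflows/
│       └── deploy.yml
├── services/
│   ├── docker-compose.yml
│   ├── redis-node/
│   │   ├── Dockerfile
│   │   └── redis.conf
│   └── cluster-setup/
│       └── Dockerfile
└── terraform/
    ├── main.tf
    ├── server-userdata.sh
    └── modules/
        ├── ec2-instance/
        │   ├── main.tf
        │   ├── variables.tf
        │   ├── outputs.tf
        │   └── README.md
        └── single-public-subnet/
            ├── main.tf
            ├── variables.tf
            ├── outputs.tf
            └── README.md
```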

Part 7: Running it all

To run the entire thing, just push your code. If everything is configured properly, your CI/CD should automatically build your Docker images, push them to ECR, initialize Terraform, set up your infrastructure, and deploy everything!

Conclusion

While the above solution works, there is still a lot of room for improvement. The fact that I am deploying to a single EC2 instance as opposed to a cluster of instances is probably the biggest flaw in the system, in my opinion, but for now, this is as far as the project goes. In the near future, I might add a proper container orchestration service like ECS to the stack as well.
