An approach to building infrastructure as code
How we went from an idea to cloud infrastructure using Terraform here at Sudo Labs.
Context
In this blog post we share our journey implementing the infrastructure of a cloud application whose core requirement is supporting multiple isolated environments of itself. We won't go too deep into the details; instead we focus on the decisions we made to make the process reliable.
The original application is, omitting a lot of features for the sake of this post, a dashboard for monitoring the status and data of medical devices. The devices communicate with the application via the MQTT protocol.
Here is a basic architecture diagram of it:
This app is delivered as an ISO image that is installed on a physical or virtual device. Omitting a lot of details, it is essentially a virtual machine that runs the app and its different components using Docker.
We now had the requirement to host this in the cloud and to be able to create multiple isolated environments of it. How do we even start?
First Step: Replicate the environment manually in the cloud
Our first decision was to move the different pieces of the app to the cloud. We picked AWS ECS to host the services: we took the Dockerfiles of the existing application and moved them to ECS.
We built all the infrastructure “manually” (without Terraform) and tried to get an MVP working with at least the basic features of the app (connecting a device, being able to log in, etc.). The logic of the base app didn’t change at all; we only updated how we generated the images and tweaked some environment variables.
During this process we also picked AWS IoT Core as our MQTT broker, which would live in a “global” environment and connect the devices to our manually created, isolated environment. The infrastructure looked like this:
We did all this work manually (creating the infrastructure through the AWS console) to understand how each part of the infrastructure worked and to iterate faster (without applying Terraform changes on each iteration). There was a lot of trial and error before the app worked, but this gave us a much better understanding of how the app could run in the cloud and was also a good exercise for refreshing some knowledge.
Second Step: Create the environment using Terraform
Before the whole manual environment was finished, we started to write Terraform code for the parts that were already working. For example, once the API was connected to the database and answering requests, we codified that piece in Terraform.
This approach let us work almost in parallel (one person manually creating the infrastructure, the other codifying the parts that were already completed). It had some drawbacks: there were occasions where we needed to change or refactor parts of the infrastructure and, in consequence, update the Terraform code.
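To make “codifying a completed part” more concrete, here is a minimal, hypothetical sketch of an ECS service for the API. All names, image URIs, and values are placeholders (not our real configuration), and the referenced cluster, role, and network variables are assumed to exist elsewhere in the code:

```hcl
# Hypothetical sketch: an ECS task definition + service for the API container.
# The image URI, ports, and variable references are illustrative only.

resource "aws_ecs_task_definition" "api" {
  family                   = "api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512
  execution_role_arn       = var.task_execution_role_arn # assumed IAM role

  container_definitions = jsonencode([{
    name         = "api"
    image        = "123456789012.dkr.ecr.us-east-1.amazonaws.com/api:latest"
    portMappings = [{ containerPort = 8000 }]
  }])
}

resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.main.id # assumed cluster resource
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [var.api_security_group_id]
  }
}
```

Writing small, self-contained resources like this for each finished piece made it easy to `terraform apply` incrementally as the manual environment progressed.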
Infrastructure as code
We stored the infrastructure code in GitHub like any other project, which helped us keep track of changes, share the code, work with feature branches and pull requests, etc.
One tool that was useful for getting a starting point for the code was https://former2.com/. We used it to export the handcrafted environment to Terraform code and then iterated on that baseline.
Workspaces
Sometimes we also needed to implement parts of the Terraform infrastructure in parallel. To verify that changes actually worked, we needed to test them in separate environments so we wouldn’t introduce changes into a coworker’s infrastructure. For this we used Terraform workspaces.
Each workspace’s state is saved in a different folder. In our case we opted to store them in S3:
With this we could test our changes before merging them into the main branch. This was also the base strategy for supporting multiple environments of the app: each environment would be stored in a different workspace. More on that later in the post.
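A sketch of the backend configuration that produces this kind of layout follows. With the `s3` backend, every non-default workspace stores its state under `<workspace_key_prefix>/<workspace_name>/<key>`. The bucket name, region, and prefix here are placeholders:

```hcl
# Hypothetical S3 backend configuration. Each workspace's state ends up
# under env/<workspace_name>/terraform.tfstate in the bucket.
terraform {
  backend "s3" {
    bucket               = "our-terraform-states" # placeholder bucket name
    key                  = "terraform.tfstate"
    region               = "us-east-1"
    workspace_key_prefix = "env"
  }
}
```

New workspaces are then created and switched with `terraform workspace new <name>` and `terraform workspace select <name>` before running `plan`/`apply`.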
Once we could get an environment fully working by running a single Terraform command, it was time to move to the third step: triggering it from a user-facing app.
Third Step: Integrate Terraform
We created an app that allows new customers to complete a flow for requesting new environments. This app was implemented using Django and is the one we mention in Deploy Django App using Copilot.
We are not going to describe this app in detail, but at the end of the flow it calls an API that is in charge of running the Terraform code. In this post we focus on how we built the solution that executes that Terraform code.
Architecture
The solution was built using three main AWS services:
- API Gateway: exposes the API that allows the app to request a new environment.
- Step Functions: a state machine in charge of orchestrating the whole process: calling Terraform, invoking Lambdas, handling errors, sending notifications, etc.
- ECS: runs our Terraform code in a Dockerized environment.
The diagram above shows how this process usually executes (when there are no errors). The step function is also designed to handle errors, and its possible states look like this:
This let us design a workflow where we can take a specific action if the creation of the environment fails, if there is an error when notifying the failure, etc.
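To make the orchestration more concrete, here is a trimmed, hypothetical sketch of such a state machine, written in Amazon States Language and embedded in Terraform. The state names, variables, and catch/notify structure are illustrative, not our exact workflow:

```hcl
# Hypothetical sketch of the orchestration: run Terraform inside an ECS
# task and wait for it (.sync), catching any failure to handle it.
resource "aws_sfn_state_machine" "create_environment" {
  name     = "create-environment"
  role_arn = var.step_functions_role_arn # assumed IAM role

  definition = jsonencode({
    StartAt = "RunTerraform"
    States = {
      RunTerraform = {
        Type     = "Task"
        Resource = "arn:aws:states:::ecs:runTask.sync"
        Parameters = {
          Cluster        = var.cluster_arn
          TaskDefinition = var.terraform_task_definition_arn
          LaunchType     = "FARGATE"
        }
        Catch = [{ ErrorEquals = ["States.ALL"], Next = "NotifyFailure" }]
        Next  = "NotifySuccess"
      }
      NotifySuccess = { Type = "Succeed" }
      NotifyFailure = { Type = "Fail" }
    }
  })
}
```

The `.sync` integration pattern is what lets the state machine block until the Dockerized Terraform run finishes, so success and failure branches can react to the actual outcome.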
Identifying Environments
To identify each environment, we used the organization name and hospital name fields we collect from the user in the subscription form. These are included as variables in our Terraform code and used as the key for the environment’s workspace. Here is how they look in S3:
With this we can manage the state of each environment separately, and we can also try infrastructure changes in one environment without affecting the others. In the future we could build several services on top of this for managing infrastructure updates, cloning environments, etc.
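A hypothetical sketch of how those two form fields could feed into the Terraform code (the variable names, the key format, and the sample resource are ours, not the real ones):

```hcl
# Hypothetical sketch: the organization and hospital names arrive as
# variables and are combined into one environment key, which doubles as
# the workspace name and a prefix for resource names.
variable "organization_name" {
  type = string
}

variable "hospital_name" {
  type = string
}

locals {
  environment_key = lower("${var.organization_name}-${var.hospital_name}")
}

resource "aws_ecs_cluster" "environment" {
  name = "${local.environment_key}-cluster"
}
```

The orchestration then selects (or creates) the workspace named after `environment_key` before applying, so each customer’s state lands in its own S3 folder.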
Conclusion
Some lessons we learned in this project are:
- Sometimes it is better to start with something really simple that has a high impact, and then iterate on that solution.
- You don’t need to know the low-level details of how a component is going to work when designing a solution.
- No solution is perfect, and you will need to rethink and redesign parts of the components after you start implementing them.
- The ability to “zoom in” and “zoom out” over the different parts of the solution is essential when designing a big system.
- When you are working on something difficult that requires a lot of collaboration, try having a “war room” meeting where you pair program and screen share with the rest of the team for a couple of hours or the whole working day. This helped us a lot in making the Terraform code work as expected.
Do you like this implementation? Do you have any comments or suggestions? Please leave us a comment; we would love to hear from you.
Interested in getting a serverless environment set up to help scale your business? Reach out to us at Sudo Labs to see if we can help.