Docker, ECS, Splunk — making them work together and lessons learned
As we've mentioned before, at Springworks we run our system as a set of microservices, mostly written in Node.js.
We’ve been outsourcing our infrastructure to AWS since day one, and we’re pretty happy with that decision. However, a few issues just kept bugging us. Being a group of people who love to improve things, we started thinking about how to solve them.
One of our goals was to speed up deployment of our services and simplify setting up the infrastructure for a new one. It was important for us to keep using the services we need and love (e.g. Splunk, New Relic) and to automate as much as possible with our build management tool.
Probably the main benefit of using containers and Docker is that we no longer need to create and maintain provisioning scripts for our services. With that came a second benefit: no more custom AMIs for each service.
ECS handles the orchestration of our Docker containers in production. It helps with scheduling their placement across the cluster and running them in a highly available manner across multiple availability zones. All of this works without the need to install or operate a separate scheduler system like Mesos, Kubernetes or similar.
Docker and ECS
When we create a git repository for a service that will run in a container all that needs to be done in addition to our usual setup is adding a Dockerfile and a deployment configuration file. The configuration file describes the service and the task definition.
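A Dockerfile for one of these Node.js services might look something like this (a minimal sketch; the base image, port and entry point are illustrative, not our actual setup):

```dockerfile
# Hypothetical Dockerfile for a Node.js microservice
FROM node:4

WORKDIR /usr/src/app

# Copy the manifest first so npm install is cached between builds
COPY package.json .
RUN npm install --production

# Copy the rest of the source
COPY . .

EXPOSE 3000
CMD ["node", "server.js"]
```

The deployment configuration file then only has to describe service-level details such as CPU, memory and port mappings for the task definition.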
With the easy part done, the challenges start. Setting things up in the AWS console or via the CLI is all fun and games while exploring what ECS offers, but that won’t fly for a serious production system. We use CloudFormation to manage our AWS resources. We identified the need for: a Docker registry, ECS cluster(s), ECS-optimized instances, and tooling to build Docker images, register task definitions, and create or update services.
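As a rough sketch, two of those resources expressed in CloudFormation could look like this (resource and repository names are hypothetical, not our actual templates):

```yaml
# Illustrative CloudFormation fragment: an ECR repository for a
# service's images and the ECS cluster they run on.
Resources:
  ServiceRepository:
    Type: AWS::ECR::Repository
    Properties:
      RepositoryName: my-service

  DefaultCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: default
```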
First we created an ECS cluster through the AWS Console. Initially we decided to have only one default cluster; we’ll revisit that decision once we have a better understanding of how we want to group our services.
Next we needed a place to host our Docker images. For simplicity we went with ECR rather than Docker Hub or a self-hosted registry: it integrates nicely with ECS, and being hosted on AWS makes it familiar territory for us.
We currently run an Auto Scaling Group (ASG) with three EC2 instances in a VPC spanning three availability zones. We use Amazon’s ECS-optimized AMI, which is configured to start the ECS container agent on boot. If you go with a custom AMI instead, keep in mind that the ECS container agent has to be installed and started on instance start-up.
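For example, if the instances should join a cluster other than the default one, a User Data snippet like this tells the agent which cluster to register with (the cluster name is illustrative):

```shell
#!/bin/bash
# User Data sketch for the ECS-optimized AMI. The agent starts on
# boot; it only needs to be told which cluster to join.
echo "ECS_CLUSTER=my-cluster" >> /etc/ecs/ecs.config
```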
The process from source code to a running task in ECS is orchestrated by our CI tool of choice, TeamCity. First the image is built and pushed to ECR with a set of Docker commands. Then a tool we built in-house registers a new task definition and updates the service (or creates it if needed). And that’s it.
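Stripped of TeamCity specifics, the pipeline boils down to something like the following (account ID, region, service name and file paths are placeholders, not our real values):

```shell
#!/bin/bash
# Sketch of the build-and-deploy steps a CI job runs.
set -euo pipefail

REGISTRY=123456789012.dkr.ecr.us-east-1.amazonaws.com
SERVICE=my-service
TAG=$(git rev-parse --short HEAD)

# Log in to ECR, then build, tag and push the image
$(aws ecr get-login --region us-east-1)
docker build -t "$SERVICE:$TAG" .
docker tag "$SERVICE:$TAG" "$REGISTRY/$SERVICE:$TAG"
docker push "$REGISTRY/$SERVICE:$TAG"

# Register a new task definition revision and point the service at it
aws ecs register-task-definition --cli-input-json file://taskdef.json
aws ecs update-service --cluster default \
  --service "$SERVICE" --task-definition "$SERVICE"
```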
After that last step ECS takes over: it checks the service’s desired count and spins up new tasks running the new revision.
But, what happened to the logs?
By this point we had our infrastructure in place, but we still hadn’t found a way to forward our logs to Splunk, and that was a deal breaker for us. We use file-based logging and forward the data to Splunk with Splunk’s Universal Forwarder (SUF). This was easy to set up when we had our custom AMIs, but turned out to be a challenge once we decided to go with the AWS pre-baked AMI.
Since we were not going to commit to ECS before solving this, we had to come up with a way to get those log files to Splunk’s indexer.
We chose to stick with SUF and run it in a Docker container on each EC2 instance that runs ECS tasks. We created a wrapper around SUF’s official Docker image and pushed it to our private ECR repository. Next we had to make sure the forwarder is always running on every instance in the cluster, even as we scale up under load. After looking at several options, we settled on running a shell script as part of the ASG Launch Configuration, by passing the script in the Launch Configuration’s User Data parameter.
We set the restart policy of our SUF container to always by running it with the --restart flag. This way, if the SUF container dies for whatever reason, it restarts itself and keeps doing its job of forwarding logs to the Splunk indexer.
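Put together, the command the User Data script runs on each instance looks roughly like this (image name and mount paths are illustrative; our wrapper image lives in our private ECR repository):

```shell
# Start the forwarder container with a restart policy of "always",
# mounting the host directory where services write their log files.
docker run -d \
  --name splunk-forwarder \
  --restart=always \
  -v /var/log/services:/var/log/services:ro \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/splunk-forwarder:latest
```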
What we’ve learned so far
Once we had solved all the crucial issues, we decided that all new services will run as Docker containers on ECS, and we’ll soon start migrating our existing services. That process will probably bring new challenges and lessons. With a few services already running in this setup, we can say that ECS lives up to its promise: flexible options for running services as containers, with great reliability for service uptime.
We still feel that we can do a lot of fine-tuning to our setup and reduce the amount of boilerplate so that team members are efficient and confident when setting up a new service.
However, having services no longer tied to specific EC2 instances is a huge step forward for us: better utilisation of instance capacity, less infrastructure to set up and maintain, faster deployments, faster and easier scaling, and an easier rollback mechanism.