Experiences with Amazon ECS/ECR & CloudFormation
Recently I’ve been experimenting with two Docker-based infrastructure solutions for migrating a larger web app that we’ve run on-premises for a number of years.
This post will focus on Amazon ECS/ECR. I’ll cover Docker for AWS in a future article – it is also really cool and has some great advantages thanks to the newer Docker 1.12 swarm/orchestration features, but it was in private and then public beta at the time of investigation. I really liked it and found the community and developers great; I’m just keen for it to be out of beta and in production use for a while before adopting it.
Overall ECS/ECR worked great – it’s really well integrated with the rest of the AWS infrastructure and tooling. At re:Invent 2016 ECS was described as a core technology that other AWS services are built on top of, and I think it’s likely a great choice for your Docker-based infrastructure too.
As a great starting point, watch Paul Maddox’s excellent re:Invent 2016 talk, Operations Management with ECS/ECR. He steps through some great insights into using ECS/ECR and related AWS infrastructure for Docker-based apps, and lays the foundations for getting you started with an excellent CloudFormation template, available on GitHub to fork/clone, that codifies all the resources in use.
The template sets up a VPC with public and private subnets across two availability zones in a single region. The private subnets host the ECS cluster, and the public subnets provide outbound internet access for the private subnets via NAT gateways. An ALB is configured to route incoming HTTP(S) requests to your containers within the cluster, using the request path as the routing key.
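The path-based routing comes down to ALB listener rules; a minimal sketch of one such rule (resource names like `LoadBalancerListener` and `ProductTargetGroup` are illustrative, not taken from Paul’s template):

```yaml
# Forward requests matching /products* to one service's target group.
ProductServiceListenerRule:
  Type: AWS::ElasticLoadBalancingV2::ListenerRule
  Properties:
    ListenerArn: !Ref LoadBalancerListener
    Priority: 1
    Conditions:
      - Field: path-pattern
        Values:
          - /products*
    Actions:
      - Type: forward
        TargetGroupArn: !Ref ProductTargetGroup
```

Each service in the stack gets its own target group and rule, so the one ALB can front the whole cluster.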
I found the templates to be well structured into logical groups defining the VPC, security groups, ECS cluster, and the services being deployed. In addition, parameters and outputs are used to share configuration across all the templates in the stack, making it easy to extend. All resources in the stack are also tagged with the environment name, which is great for things like cost monitoring and diagnosis.
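As a rough illustration of how a master template passes shared configuration into a nested stack via parameters and outputs (the template URL and logical names here are placeholders, not Paul’s exact layout):

```yaml
# Nested stack: the VPC stack's outputs feed the ECS stack's parameters.
ECS:
  Type: AWS::CloudFormation::Stack
  Properties:
    TemplateURL: https://s3.amazonaws.com/example-bucket/ecs-cluster.yaml
    Parameters:
      EnvironmentName: !Ref AWS::StackName
      VPC: !GetAtt VPC.Outputs.VPC
      Subnets: !GetAtt VPC.Outputs.PrivateSubnets
```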
The example deploys two independent services that appear as a single web app. If you have a microservices architecture based on HTTP APIs this will suit you well – in our case we ended up with ~11 services, although not all of them served HTTP requests.
If you’re using a non-HTTP communication protocol between services, e.g. AMQP for queuing/consuming messages, things are a bit harder but not impossible. The main issue we found was defining a highly available RabbitMQ cluster across both availability zones within the ECS cluster, and discovering it from any service that required access to it.
Creating a RabbitMQ cluster within the ECS cluster is a bit of work but definitely possible. Ultimately, though, a provider like CloudAMQP can help immensely, and you can even have it install a RabbitMQ setup on AWS directly within your VPC/zone – a few clicks and we were underway, management interface and all.
Service discoverability (i.e. how can I communicate with this service and what is its hostname?) is an area where Docker for AWS excels with its 1.12 swarm mode and overlay network features. With ECS and this particular template everything is accessible via the ALB, so again if you’re HTTP/API based all is good.
The only other clustering need we had was the database, which RDS Aurora suited perfectly. We extended the CloudFormation template to create the RDS Aurora cluster within the VPC. Originally we created it within the private subnets, but later moved it to the public subnets, at least during development, so that we could communicate with it from our office IP ranges and install snapshots etc. for testing.
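Our Aurora extension looked roughly like the following sketch (parameter and security-group names are examples, and the instance class is illustrative):

```yaml
# Aurora cluster placed in the stack's subnets; a cluster plus at least
# one DB instance is required.
AuroraSubnetGroup:
  Type: AWS::RDS::DBSubnetGroup
  Properties:
    DBSubnetGroupDescription: Subnets for the Aurora cluster
    SubnetIds: !Ref PrivateSubnets
AuroraCluster:
  Type: AWS::RDS::DBCluster
  Properties:
    Engine: aurora
    MasterUsername: !Ref DBUser
    MasterUserPassword: !Ref DBPassword
    DBSubnetGroupName: !Ref AuroraSubnetGroup
    VpcSecurityGroupIds:
      - !Ref DBSecurityGroup
AuroraInstance:
  Type: AWS::RDS::DBInstance
  Properties:
    Engine: aurora
    DBClusterIdentifier: !Ref AuroraCluster
    DBInstanceClass: db.r3.large
    DBSubnetGroupName: !Ref AuroraSubnetGroup
```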
We also added a Bastion host which we could SSH into – as a side effect it acted as a gateway for office terminal access to the RDS and ECS cluster in the private subnets via security groups. Particularly useful was enabling IAM role-based SSH access to the Bastion, so that IAM users with SSH keys could automatically access the Bastion rather than using a named/shared key pair.
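The security-group plumbing for this can be sketched as follows (the `OfficeCidr` parameter, group names, and the MySQL port are hypothetical examples):

```yaml
# SSH into the Bastion from the office range only; the DB then accepts
# connections originating from the Bastion's security group.
BastionSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: SSH access from the office range
    VpcId: !Ref VPC
    SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: 22
        ToPort: 22
        CidrIp: !Ref OfficeCidr
DBIngressFromBastion:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: !Ref DBSecurityGroup
    IpProtocol: tcp
    FromPort: 3306
    ToPort: 3306
    SourceSecurityGroupId: !Ref BastionSecurityGroup
```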
At this point our app was up and running in ECS, and we were able to access it from the ALB URL output by the master template. Awesome.
Taking things further into productionizing the environment, we added CloudWatch alarms to the CloudFormation templates for various metrics across the ECS cluster, the ALB, and each service. In addition, if your containers log to STDOUT, all of that output is captured and sent to CloudWatch as well. You can also define metrics for particular strings within logs, and set CloudWatch alarms based on the occurrence of those metrics, which is great (top advice from Paul’s talk).
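The log-string alarms pair a metric filter with an alarm; a minimal sketch, assuming a `ServiceLogGroup` and an `AlertTopic` SNS topic are defined elsewhere in the stack:

```yaml
# Count occurrences of "ERROR" in the service's logs and alarm on spikes.
ErrorMetricFilter:
  Type: AWS::Logs::MetricFilter
  Properties:
    LogGroupName: !Ref ServiceLogGroup
    FilterPattern: '"ERROR"'
    MetricTransformations:
      - MetricNamespace: !Sub ${EnvironmentName}/app
        MetricName: ErrorCount
        MetricValue: "1"
ErrorAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: !Sub ${EnvironmentName}/app
    MetricName: ErrorCount
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 5
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
      - !Ref AlertTopic
```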
We then added a CloudFront distribution to act as our CDN for static assets, and shared its endpoint outputs with the web service to automatically set the Rails asset host settings.
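A minimal sketch of such a distribution fronting the ALB (origin and behaviour settings are illustrative; the distribution’s DomainName attribute is what gets shared with the web service as the asset host):

```yaml
# CloudFront CDN with the ALB as its origin.
CDN:
  Type: AWS::CloudFront::Distribution
  Properties:
    DistributionConfig:
      Enabled: true
      Origins:
        - Id: alb-origin
          DomainName: !GetAtt LoadBalancer.DNSName
          CustomOriginConfig:
            OriginProtocolPolicy: http-only
      DefaultCacheBehavior:
        TargetOriginId: alb-origin
        ViewerProtocolPolicy: redirect-to-https
        ForwardedValues:
          QueryString: false
```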
For scheduled work, i.e. recurring time-based job processing like the background tasks you’d run from cron in a non-ECS world, there are a few approaches we looked at, with varying degrees of convenience.
To start with we created a “schedule” service that ran cron itself (using the -f flag, from memory, to keep it in the foreground rather than daemonizing), which allowed us to move our tasks across piecemeal. However, it wasn’t ideal, since it ran many background jobs with varying compute requirements on only one host within the cluster.
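The “schedule” service amounts to an ECS task definition whose container just runs cron in the foreground; roughly (the image URI, names, and memory figure are illustrative):

```yaml
# One-container task that runs cron in the foreground so the
# container (and hence the ECS task) stays alive.
ScheduleTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: !Sub ${EnvironmentName}-schedule
    ContainerDefinitions:
      - Name: schedule
        Image: !Sub ${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/app:latest
        Command: ["cron", "-f"]
        Memory: 256
```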
What would be more ideal is a native scheduler that could send jobs to any host in the cluster able to run them, allowing the cluster to balance the compute requirements across all hosts. AWS Blox is a step in this direction, but the equivalent of a cron-based scheduler isn’t there as of writing. Something like Chronos in the future, maybe?
In terms of workflow, ECS with CloudFormation allows many options – one example we experimented with was having pushes to GitHub trigger an AWS CodePipeline task, run an AWS CodeBuild definition for our automated test suite amongst other things, and push the resulting image to ECR. CodePipeline can then deploy that new image to an environment automatically. For extra fun, a small SNS topic can send a notification when it’s all done.
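The CodeBuild stage is driven by a buildspec along these lines (a sketch; the `$REPOSITORY_URI` variable and the `rake test` command are placeholders for whatever your project uses):

```yaml
version: 0.2
phases:
  pre_build:
    commands:
      # Authenticate the Docker client with ECR
      - $(aws ecr get-login --region $AWS_DEFAULT_REGION)
  build:
    commands:
      # Build the image, then run the test suite inside it
      - docker build -t $REPOSITORY_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION .
      - docker run --rm $REPOSITORY_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION rake test
  post_build:
    commands:
      # Only reached if the tests passed
      - docker push $REPOSITORY_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION
```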
Taking things further, ECS supports auto scaling at the service level, allowing you to scale up the number of tasks for each service based on various metrics – and you can also set up auto scaling for the ECS cluster itself. This was a big attraction, since most of our traffic is during business hours, and portions of our application can have their compute reduced significantly overnight or on weekends.
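Service-level auto scaling uses Application Auto Scaling resources; a minimal sketch, assuming `ECSCluster`, a `WebService` ECS service, and an `AutoScalingRole` exist elsewhere in the stack:

```yaml
# Let Application Auto Scaling adjust the service's DesiredCount.
WebServiceScalableTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    ServiceNamespace: ecs
    ScalableDimension: ecs:service:DesiredCount
    ResourceId: !Sub service/${ECSCluster}/${WebService.Name}
    MinCapacity: 2
    MaxCapacity: 10
    RoleARN: !GetAtt AutoScalingRole.Arn
WebServiceScaleUpPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: web-scale-up
    PolicyType: StepScaling
    ScalingTargetId: !Ref WebServiceScalableTarget
    StepScalingPolicyConfiguration:
      AdjustmentType: ChangeInCapacity
      Cooldown: 60
      StepAdjustments:
        - MetricIntervalLowerBound: 0
          ScalingAdjustment: 1
```

The policy is then wired to a CloudWatch alarm (e.g. on service CPU utilisation) via the alarm’s actions.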
Getting this right requires learning about the memory and CPU reservations your tasks need, so that the cluster task scheduler can make good decisions about where to place your containers given the capacity currently available, and can also provide good feedback about when the cluster can be scaled back.
Overall we found ECS/ECR, and CloudFormation in particular, to be a great experience and to work well. The new system is faster, costs less, can be updated easily at both the infrastructure and application level, and can be cloned into as many environments as we’d like.
Based on our experience, I’d recommend it, particularly with Paul’s CloudFormation templates as a starting point. I’m keen to hear your thoughts and experiences with ECS/ECR as well – let me know!
Note: each environment consumes two EIPs, or three if you wish to use one for the Bastion. Each Amazon account has a default limit of five, so if you’re planning on creating many environments it’s something you’ll want to request an increase for.
Note: Another interesting idea Paul mentions in his talk is utilising the spot market for development/test environments for cost savings.
Note: Incidentally, we also found ECR to be much faster to push images to than Docker Hub, presumably due to location.
Note: Paul’s CloudFormation templates did require one update – adding the environment name to a setting – to allow creation of multiple environments within a single account. I’m intending to submit a PR for the change; until then, if you hit this problem, sub in EnvironmentName to make the setting unique.