Pre-baking AWS AMIs — a real-life story
Immutable infrastructure — the goal we were recently striving for across all our environments, from DEV, through INT, NFT and PP, down to PROD. With our whole infrastructure in AWS, we thought the easiest way to achieve it would be to bake AMIs and promote them down the line. That was a perfectly correct approach. However, to keep things consistent and leverage Jenkins multi-branch pipeline capabilities, we decided to create images during each pipeline run of a feature branch, deploy them in a dev environment, run some tests and tear the environment down for further reuse. And this is where the story begins.
Stage 1

It all started naively; the first runs and tries went smoothly. We were bold and went all in, rebuilding all images (12 microservices = 12 AMIs) on each pipeline run. We could, so why not? We had a base image that was periodically rebuilt for system updates etc. and used as the source AMI for our Packer job. At that stage I don't know if we even had 2 parallel pipeline runs, but the Packer step, which included deploying packages and a Puppet apply, was taking around 10 minutes — acceptable, so let's go Broadway! Let's have 6 parallel pipeline runs, as we were able to support 20 fully separate dev environments in total. That was supposed to be the final test before the official roll-out of the new, consistent and fully automated CI pipeline. And this is where it all started to go wrong.
Stage 2 and 3 and all the things!

The can of worms was already open and we were facing the following issues:
- AMI build times spiking up to 27 minutes,
- failures caused by exceeding various AWS API limits.
We didn't have much time to take all that new knowledge back to the design phase, rethink everything and maybe find a better solution. We had to fix it with a bunch of managers breathing down our necks, so we rolled up our sleeves and got to work.
1st conclusion (and the most important one)
During the pipeline run, do only what is absolutely necessary; everything else should already be there. So instead of installing the middleware with Puppet every time, have it already installed in your base image(s) and deploy only the packages that reflect your feature changes. This is how we reduced the Packer build time. Unfortunately, the AWS API limits problem was still there.
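To make that conclusion concrete, here is a minimal sketch of what a slimmed-down Packer template could look like under this approach. The `amazon-ebs` builder and `shell` provisioner are standard Packer (JSON) components, but the user variables, region and package name are purely illustrative:

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "source_ami": "{{user `base_ami_with_middleware`}}",
    "instance_type": "t2.medium",
    "region": "eu-west-1",
    "ssh_username": "ec2-user",
    "ami_name": "my-service-{{user `git_sha`}}"
  }],
  "provisioners": [{
    "type": "shell",
    "inline": [
      "sudo yum install -y my-service-{{user `version`}}"
    ]
  }]
}
```

Because the middleware is already baked into the source AMI, the only provisioning step left at pipeline time is installing the feature package itself.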
2nd conclusion
Do not build everything! What for? Most of the time a single feature touched only one or two microservices, so there was no need to rebuild them all. Reuse! Use previously built images instead: save time and, above all, save AWS API calls! It is very important to know that AWS API limits are per AWS account and that the allowed numbers differ between AWS regions (depending on the amount of available compute resources, etc.). Moreover, they are not defined publicly anywhere; you can't find documentation of any explicit limit per call type, per account or anything like that. But that seems understandable.
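The rebuild-or-reuse decision itself can stay very simple. Below is a Python sketch of the kind of logic described above; the service names, path layout and AMI ids are made up for illustration:

```python
# Sketch: decide which AMIs to rebuild based on which services a feature
# branch actually touched. Everything here is illustrative, not our exact code.

def plan_builds(changed_paths, services, latest_amis):
    """Return (to_build, to_reuse) for a feature branch.

    changed_paths -- iterable of repo paths changed vs. the main branch
    services      -- mapping of service name -> its source directory prefix
    latest_amis   -- mapping of service name -> AMI id of the last good build
    """
    touched = {
        name for name, prefix in services.items()
        if any(path.startswith(prefix) for path in changed_paths)
    }
    # Rebuild only touched services; reuse the last good AMI for the rest.
    to_build = sorted(touched)
    to_reuse = {name: ami for name, ami in latest_amis.items()
                if name not in touched}
    return to_build, to_reuse
```

Every service that lands in `to_reuse` is a whole Packer run's worth of AWS API calls that never happens.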
Packer vs. AWS
To understand the problem you should first fully understand the AMI creation process, or at least its crucial part. Let me clarify it for you…
At the very end, after all your provisioners have done their job, the system creates a snapshot of the EBS volume, and the snapshot time depends on the following 3 factors:
- total volume size,
- dirty ratio (how many blocks were changed) — % dirtiness,
- total traffic to S3.
Yes, there is a limited amount of bandwidth for traffic to S3, so depending on the workload it can increase the total snapshot time, thereby delaying the overall AMI creation process. Because of this constraint, AWS doesn't give out an ETA for a snapshot or an AMI. AWS makes sure that the snapshot completes and that the data is available when a volume is created from it; it does not focus on the completion time. In other words, the traffic from EBS to S3 is the least prioritised on the AWS end.
While waiting for a snapshot to become available, Packer periodically queries the AWS API (aws ec2 describe-images) to find out whether the image is ready. Then it usually applies tags, which is another type of API call you can suffer from if you execute too many of them at once.
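If you write your own waiters around such calls, backing off between polls is what keeps the number of describe-style requests down. Here is a small, generic Python sketch; the timings are illustrative, and `check` / `sleep` are injectable so the loop can be tested without touching AWS:

```python
import time

def wait_until(check, timeout=1800, initial_delay=5, max_delay=60,
               sleep=time.sleep):
    """Poll check() until it returns True, backing off exponentially.

    Longer waits between calls mean fewer DescribeImages-style requests,
    which is exactly what keeps you under the (undocumented) API limits.
    Returns False if the timeout is exhausted first.
    """
    delay, waited = initial_delay, 0
    while waited < timeout:
        if check():
            return True
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # cap the back-off
    return False
```

In a real pipeline, `check` would wrap something like a `describe-images` call testing for the `available` state.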
Back to the story…
So we did it: we applied logic that let us build images only when necessary. After a month of testing, our stats showed that we were reusing images in ~70% of cases. A great improvement!
Another thing we had to take care of (Packer was not the only source of AWS API calls) was making sure that every other mechanism had proper retry capabilities, so that even if we exceeded a limit, the operation would be repeated successfully.
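As a sketch of what "proper retry capabilities" means in practice, here is a minimal jittered exponential backoff helper in Python. The exception class is a stand-in for whatever throttling error your AWS client actually raises (e.g. a request-limit-exceeded error); in real code you would catch that instead:

```python
import random
import time

class ApiLimitError(Exception):
    """Stand-in for a throttling error such as RequestLimitExceeded."""

def with_retries(fn, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with jittered exponential backoff on throttling."""
    for attempt in range(attempts):
        try:
            return fn()
        except ApiLimitError:
            if attempt == attempts - 1:
                raise  # out of attempts, let the caller see the failure
            # 1s, 2s, 4s, ... plus jitter so parallel pipelines desynchronise
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
```

The jitter matters with parallel pipeline runs: without it, six pipelines that got throttled together retry together and get throttled again.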
Extra hint! On an unexpected failure, Packer tends to leave behind running instances and security groups created in the VPC you let it use. It's good to have scheduled jobs that clean those leftovers for you. Running instances cost money, and there is a default limit of 500 security groups per VPC, so watch out. A simple set of Python scripts is one way to go here.
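Here is a sketch of the selection half of such a cleanup script. The input shapes are simplified versions of what describe-instances / describe-security-groups return (in a real script this data would come from boto3 or the AWS CLI, where tags arrive as Key/Value pairs and launch times as datetimes); the `Packer Builder` name tag and the `packer_` group-name prefix match Packer's defaults:

```python
# Sketch: identify Packer leftovers old enough to be considered abandoned.
# Plain dicts with epoch-second timestamps keep the logic easy to test;
# actual termination/deletion calls would follow in the real script.

def find_leftovers(instances, security_groups, max_age_hours, now):
    """Return (instance_ids, security_group_ids) safe to clean up."""
    stale_instances = [
        i["InstanceId"] for i in instances
        if i.get("Tags", {}).get("Name") == "Packer Builder"
        and i["State"] == "running"
        and (now - i["LaunchTime"]) > max_age_hours * 3600
    ]
    stale_groups = [
        g["GroupId"] for g in security_groups
        if g["GroupName"].startswith("packer_")
    ]
    return stale_instances, stale_groups
```

The age threshold is the safety valve: anything still tagged `Packer Builder` hours after the longest possible build is a leftover, not a build in progress.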
Why baking?

Who said that baking images in the way described above is the only way? No one! Maybe if we had known that we would face so many issues along the way, we wouldn't have thought about the time and who was behind our backs, and we would have done it differently. Or maybe not. Different projects have different roadmaps, follow different rules, require different solutions and have different priorities, so… yes, who knows how it would have gone?
However, jumping through hoops let us gain more experience and taught us more about various aspects of our work, about technology constraints and limits! That is how we came to know why one solution is better than another.
So, what would be a better solution here? Baking is fun, so why not keep it? Yes, but in our scenario only as a prep step. Instead of fighting with Packer we could just have a set of base images with all that middleware already installed, with or without an initial configuration applied. Then there are user-data scripts / cloud-init directives that can be parametrised and that, on boot, can obtain the appropriate artefacts from yum, an S3 bucket or anywhere else and apply everything the way we want.
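As an illustration, here is a hypothetical cloud-init user-data fragment for such an instance. The package, bucket and service names are made up; `packages` and `runcmd` are standard cloud-config directives:

```yaml
#cloud-config
# Hypothetical: the base AMI already has the middleware baked in, so on
# first boot we only pull the feature-specific artefacts and start the app.
packages:
  - billing-service-1.4.2
runcmd:
  - [aws, s3, cp, "s3://example-artefacts/billing/config.tar.gz", "/tmp/"]
  - [tar, -xzf, "/tmp/config.tar.gz", -C, /etc/billing]
  - [systemctl, restart, billing]
```

The same base image then serves every feature branch; only the parametrised user-data changes, and no AMI needs to be built at pipeline time at all.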
And that’s the next step we will be looking into!
There are of course some drawbacks, e.g. you can't watch these things happen live the way you can watch the Packer log during its execution, but it's all about properly testing and staging that solution, right?
The flexibility we could gain, which I can already see on the horizon, is worth giving it a try.
We’ll see how it goes and hopefully, I’ll be able to continue this story.
Big up!
