Designing and Developing a Serverless Infrastructure
Serverless at center stage
At the second-day keynote at the Amazon re:Invent conference in 2014, Amazon made a pretty substantial bet on a new compute model: AWS Lambda was launched, and it has since been Amazon’s go-to technology for integrating cloud services and slashing server management costs. After all, in CTO Werner Vogels’s own words: “No server is easier to manage than no server”.
Now, a couple of years later, AWS Lambda is the centerpiece of Amazon’s bid for dominance in a world increasingly hungry for IT services, not servers. Lambda is the glue that knits AWS services together: from easily creating new Amazon Echo skills, to processing streams of data from hot IoT devices, to catering to an unlimited, spiky appetite for image sharing from users on mobile devices. There are also more mundane, traditional enterprise uses for AWS Lambda: as asynchronous queue workers and database triggers, for processing web requests to backend services, and as scheduled jobs running within the Amazon cloud.
AWS Lambda generally offers the deepest integration on the AWS platform and is a great driver for adopting new, higher-level AWS services, which is a great way of tying customers closer to unique AWS offerings. So is it a vehicle for vendor lock-in or a superior technology model?
No matter the viewpoint, the AWS Lambda value proposition for enterprise customers is a compelling one: write simpler code, leverage higher-level cloud services, avoid initial capital expenditure on servers, and reduce project complexity by removing server management.
Server management, from procurement to maintenance, is traditionally the realm of the enterprise IT organization, and is more often than not a headache and a critical dependency for new projects. Server procurement times measured in multiple weeks are common, and it’s a general view that lots of enterprise IT organizations are not only slow-moving but also bureaucratic and siloed beyond comprehension. Add overworked or just outright dysfunctional to the list of unflattering characteristics of IT service delivery organizations, and it becomes apparent why enterprise architects might jump at the chance of simplifying the process from business need to value delivered.
There are a lot of reasons why it makes sense for large enterprises to be exploring Serverless architectures, and my company has been fortunate enough to be part of a Serverless implementation that went well beyond proof-of-concept and all the way to production, with a lot of hard, fun lessons learned. We learned so much that we had to split the blog post into two parts!
So let’s dive into some of the lessons on the infrastructure side of Serverless, starting with the pitfalls and surprises of the design and development phase. The second part will go over the travails of integrating a Serverless architecture into an existing corporate IT structure, and the move to go-live and beyond!
Designing the Serverless infrastructure
The system that we were going to design and build was responsible for providing marketing and logistics data about every product that the company produces, in a variety of formats (JSON/XML), to other systems that need that data (websites, leaflet generator, shops, resellers and more).
Our customer had already put some thought into the high-level flow of the data through the system when we started designing the infrastructure:
① An authoritative enterprise Master Product Data Management system writes changes irregularly to a set of DynamoDB tables that act as a shadow of the source product data, mirroring its data model.
② Writing to the shadow tables triggers a cascade of changes, updates, denormalizations and transformations to an internal product data model.
③ Filtered data from the internal model is sent out to receivers when required, either batched or as changes appear.
④ Parts of the product data model are exposed using Amazon API Gateway.
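Step ② can be sketched as a Lambda function fed by a DynamoDB stream on a shadow table. This is a minimal illustration, not the project's actual code: the attribute names (`product_id`, `name`, `weight_kg`) and the shape of the internal model are made up for the example, and it assumes the stream is configured to include new item images.

```python
from typing import Any, Dict, List


def from_dynamodb(attr: Dict[str, Any]) -> Any:
    """Convert one DynamoDB-typed attribute value into a plain Python value."""
    (type_tag, value), = attr.items()
    if type_tag in ("S", "BOOL"):
        return value
    if type_tag == "N":
        # DynamoDB transmits numbers as strings.
        return float(value) if "." in value else int(value)
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [from_dynamodb(item) for item in value]
    if type_tag == "M":
        return {key: from_dynamodb(item) for key, item in value.items()}
    raise ValueError(f"unsupported DynamoDB type: {type_tag}")


def to_internal_model(image: Dict[str, Any]) -> Dict[str, Any]:
    """Map a shadow-table item onto the (hypothetical) internal product model."""
    item = {key: from_dynamodb(value) for key, value in image.items()}
    return {
        "sku": item["product_id"],
        "title": item.get("name", ""),
        "weight_grams": item.get("weight_kg", 0) * 1000,
    }


def handler(event: Dict[str, Any], context: Any = None) -> List[Dict[str, Any]]:
    """Lambda entry point: turn each stream record into an internal-model change."""
    changes = []
    for record in event.get("Records", []):
        if record["eventName"] == "REMOVE":
            key = from_dynamodb(record["dynamodb"]["Keys"]["product_id"])
            changes.append({"action": "delete", "sku": key})
        else:  # INSERT or MODIFY
            product = to_internal_model(record["dynamodb"]["NewImage"])
            changes.append({"action": "upsert", "product": product})
    return changes
```

In a real system the returned changes would be written to the internal model rather than returned; returning them keeps the sketch easy to test locally.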
Some changes will impact a lot of receivers — other changes might end up having no impact at all.
As an analogy: coming up with a new name for your company will require updates to marketing material, e-mail addresses, business cards, etc. — many modifications at every level — all based on one small change.
Other changes, such as an update to an employee’s title, have a lot less impact. So not all changes are alike.
The product data management system will also occasionally write out all its data, in effect refreshing everything in the Serverless system. The ability to process a flood of data in parallel, in a timely manner, is essential.
The architectural benefits of the internal data model are that it fits the needs of the receivers better, and that it provides a decoupling that will make it possible to swap out the product data management system without having to rewrite the receiver end later on. A potential new system would just have to figure out how to do updates to the internal model, and the 20+ planned receivers of Master Product Data would not need to change.
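The idea that one change reaches many receivers while another reaches none can be sketched as a lookup from changed fields to interested receivers. The receiver names and field names below are invented for illustration and are not taken from the actual system.

```python
from typing import Dict, List, Set

# Hypothetical receivers and the internal-model fields each one consumes.
RECEIVER_FIELDS: Dict[str, Set[str]] = {
    "website": {"title", "description", "image_url"},
    "leaflet_generator": {"title", "description"},
    "reseller_feed": {"title", "weight_grams", "price"},
}


def changed_fields(old: Dict[str, object], new: Dict[str, object]) -> Set[str]:
    """Return the names of fields whose values differ between two versions."""
    return {key for key in old.keys() | new.keys() if old.get(key) != new.get(key)}


def affected_receivers(
    old: Dict[str, object],
    new: Dict[str, object],
    receiver_fields: Dict[str, Set[str]] = RECEIVER_FIELDS,
) -> List[str]:
    """List the receivers that consume at least one of the changed fields."""
    changed = changed_fields(old, new)
    return sorted(name for name, fields in receiver_fields.items() if fields & changed)
```

A title change would fan out to every receiver in this table, while a change to a field nobody consumes produces an empty list and no outgoing traffic.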
Developing the infrastructure
From the earliest versions of the design, it became apparent that we would not have a “finished” infrastructure design anytime soon. The design would morph and change during development. So we had to decide on flexible tooling that would allow us to evolve the infrastructure as the design got worked out, accepting that development of the infra-code would be ongoing as the application code got written.
Even though we were all partial to choosing Amazon tools, it didn’t take long to decide that CloudFormation was not going to be the infrastructure-as-code tool of choice. Developers didn’t like the naming scheme of CloudFormation resources, and we didn’t believe that CloudFormation would be able to keep up with evolving the infrastructure safely and swiftly (this was before the introduction of Change Sets for CloudFormation).
We also had reservations about the ways we could extend CloudFormation with new types of resources, as we would be working with new resources from the very day they were released. Amazon as a company prioritizes getting new services into the hands of its users as quickly as possible, often at the expense of having them supported in CloudFormation on launch day.
We had experience in using Lambda functions to create Custom Resources in CloudFormation but seriously disliked the required Node.js callback hell (this was before Python was introduced as a supported language as well).
We considered just using shell scripts or writing infrastructure code in Java/Python, but we realized that we would just end up re-inventing something like CloudFormation. So we settled on Terraform.
Working with Serverless infra-as-code
So off we went, with a handful of developers firing up their IDEs, adding features and chewing their way through requirements, requesting environments for testing and continuous integration, and figuring out which infrastructure components to use.
I can tell from experience that it occasionally takes a lot of hard work to keep every developer and tester reasonably happy and not blocked. Tweaks to IAM security roles, changes to DynamoDB tables, and additions of Lambda functions are required on short notice.
Fall behind and developers will start tweaking the AWS resources themselves, which invariably leads to differences between environments and lengthy sessions comparing system details, triggered by cries of “it works in dev but not in test”. Or the infra-as-code equivalent, where Terraform complains about conflicts with cloud resources not known to or managed by Terraform.
There’s also another aspect of infra-coding: it does at times feel quite unglamorous. It’s at least one step removed from the business people and business requirements, making it mostly a support function. It’s both about maintaining a handful of coherent environments and keeping up with the resources that the developers constantly need to add and modify. It does sometimes feel like the project equivalent of a plumber: doing a great job means that infrastructure is “just there.”
The constant trundle of small, important tweaks doesn’t always jibe well with sudden requests for bigger changes:
We decided to start using the API Gateway, so we set it up manually on the devtest environment — here’s the swagger file, please add it to the infra code to be added to all future envs. Oh, and the API Gateway has to be delivered in this sprint / kthxbye!
Well, gee! Thanks a lot! That has to be the infra-coder’s equivalent of a clogged toilet.
At least, that’s the feeling I got when I paged through the Terraform documentation and found out that Terraform (at least then) lacked quite a few of the API Gateway resources that we used.
So onto learning Go and discovering how to write custom Terraform providers. Which, admittedly, is a more pleasant way of getting your hands dirty than unclogging toilets.
In general, you should expect that new AWS features throw you curveballs that affect your setup. For instance, the addition of VPC support for Lambda functions suddenly introduced a dependency between subnet size and Lambda scaling.
That became a problem because the subnet had been carefully sized and new IP ranges could not easily be procured (IP-range stinginess is often the case in more traditional enterprises). Key takeaway: Strive to control as much of the infrastructure in code as possible to make it easier to adapt to change.
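To make the subnet dependency concrete, here is a rough back-of-the-envelope sketch. It relies on the fact that AWS reserves five addresses in every VPC subnet. Everything else is an assumption for illustration: the function names, the example CIDR ranges, and the simplification that each concurrent Lambda execution needs its own address (the real ratio depended on function memory size, and AWS has since changed how Lambda attaches to VPCs).

```python
import ipaddress
from typing import Iterable

# AWS reserves five addresses in every VPC subnet (the first four and the last).
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr: str) -> int:
    """Number of addresses in a subnet that workloads can actually use."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def lambda_headroom(
    cidrs: Iterable[str],
    other_hosts: int,
    concurrency_per_address: float = 1.0,  # assumption: one execution per address
) -> int:
    """Rough upper bound on concurrent VPC Lambda executions for the given subnets."""
    free = sum(usable_addresses(cidr) for cidr in cidrs) - other_hosts
    return max(int(free * concurrency_per_address), 0)


if __name__ == "__main__":
    # Two hypothetical /26 subnets that already host 20 other network interfaces.
    print(lambda_headroom(["10.0.0.0/26", "10.0.1.0/26"], other_hosts=20))
```

Under these assumptions a pair of small, carefully sized subnets caps concurrency well below what a full data refresh would want, which is the kind of surprise described above.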
Venture outside acceptable service use at your own risk
During the development of this system, we were always very cost-conscious, so it was decided to implement a service that would watch CloudWatch metrics and do on-demand scaling of DynamoDB tables and Kinesis streams.
Let’s consider this idea from three perspectives: From the business perspective, this means paying for fewer i/o operations in general — leading to a cheaper system. Everybody loves cheap.
From the Terraform perspective, this means that someone other than Terraform is messing with the resources. So initially, every time we ran Terraform to update resources, it dutifully reported the changes to the infrastructure and attempted to fix it by scaling it back to the initially configured values.
How about the Amazon perspective? They are providing the DynamoDB service and expecting that the users will buy DynamoDB tables with i/o operations corresponding to the max-load of the table (with the expectation that tables only occasionally reach that value). Overprovisioning means more money in Amazon’s pockets — so from the AWS perspective, doing on-demand scaling is gaming the system — borderline cheating. It probably makes table placement in the underlying DynamoDB infrastructure harder and might negatively affect other customers.
So what does Amazon do about it? They protect their systems (and their revenue) by imposing a rule: Customers can only scale down a table 4 times per day. It’s a rule on the table-name level, so you can’t even get around it by deleting a table and recreating it with the same name.
This would occasionally put us in a situation where Terraform couldn’t modify a DynamoDB table because it had already been scaled down four times during the last day. Oooops! (In fairness: by the end of the project, we did manage to get Terraform to ignore scaling of I/O and changes to the number of Kinesis streams.)
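One way to avoid painting yourself into this corner is to have the scaling service keep track of its own scale-down budget and leave some of it unused. The sketch below is an illustration of that idea, not the service we built: the class, the `headroom` factor and the `reserve` parameter are hypothetical, and the limit of four is the one described above, so check the current AWS limits before relying on it.

```python
import math
from dataclasses import dataclass

MAX_DECREASES_PER_DAY = 4  # the limit described above; verify against current AWS limits


@dataclass
class TableState:
    provisioned: int          # currently provisioned capacity units
    decreases_today: int = 0  # scale-downs already used in the current day


def next_capacity(
    state: TableState,
    consumed: float,
    headroom: float = 1.25,  # hypothetical safety margin above observed load
    reserve: int = 1,        # decreases kept back for operators or the provisioning tool
) -> int:
    """Pick the next provisioned capacity and update the table state in place."""
    target = max(1, math.ceil(consumed * headroom))
    if target >= state.provisioned:
        # Scaling up (or staying put) is not limited in this sketch.
        state.provisioned = target
        return target
    if state.decreases_today + reserve >= MAX_DECREASES_PER_DAY:
        # Out of budget: stay overprovisioned rather than block later changes.
        return state.provisioned
    state.decreases_today += 1
    state.provisioned = target
    return target
```

With `reserve=1` the service uses at most three of the four daily decreases, so a deployment that needs to lower capacity can still go through. The counter would have to be reset once per day, which is left out here.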
There are at least a couple of things to learn here:
First: AWS does protect its services against what it considers unfair gaming of its pricing model and platform.
Second: Not all changes to the system go through Terraform. In this case, the reason was a cost-cutting feature of the system, but changes could just as well be made by Operations people in the console to handle a performance bottleneck in the production system.
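The underlying mechanism is a comparison between desired and actual state, with an explicit list of attributes that someone else is allowed to own. Terraform offers the `ignore_changes` lifecycle setting for this purpose. The sketch below is not Terraform's implementation, just a small illustration of the concept with made-up attribute names.

```python
from typing import Any, Dict, Iterable, Tuple


def detect_drift(
    desired: Dict[str, Any],
    actual: Dict[str, Any],
    ignore: Iterable[str] = (),
) -> Dict[str, Tuple[Any, Any]]:
    """Return {attribute: (desired, actual)} for every difference not ignored."""
    ignored = set(ignore)
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        if key in ignored:
            continue  # owned by another actor, e.g. an autoscaling service
        if desired.get(key) != actual.get(key):
            drift[key] = (desired.get(key), actual.get(key))
    return drift


if __name__ == "__main__":
    desired = {"hash_key": "product_id", "read_capacity": 100, "write_capacity": 50}
    actual = {"hash_key": "product_id", "read_capacity": 25, "write_capacity": 50,
              "stream_enabled": True}
    # Capacity is managed elsewhere, so only the manual console change is reported.
    print(detect_drift(desired, actual, ignore=["read_capacity", "write_capacity"]))
```

Anything that still shows up as drift after the ignore list is applied is a candidate for being fed back into the infrastructure code.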
Maybe there is a reason why continuous integration caught on?
During this project, we did strive to automate as much of the infrastructure as possible, with the goal of automating the creation of new environments to minimize configuration drift and do away with human error. It’s an ambitious goal and, at times, we struggled to meet it. Whenever we did fix a stability problem, we immediately went on to add new resource types with new, exciting ways of messing up reliable environment creation.
So I’m not ashamed to say that I hated the idea of running full integration tests (including Terraform provisioning and decommissioning the environment) from our build/integration/test Jenkins server. There was no doubt in my mind that it would fail often, and that the developers pushing for this wouldn’t be the ones that would have to investigate failures and do the janitorial work with cleanups afterward. It felt like just one more thing to babysit, right?
But we did start running Terraform as part of the integration tests, and in hindsight, it was a pretty good idea. It did fail quite often, but the integration task also included full logs to aid with improving stability. The resulting cleanups weren’t that bad, especially after writing a Bash cleanup script that could be executed to get rid of partially built infrastructure.
To the surprise of absolutely no one except me, the CI/CD credo about eliminating pain by doing painful things more often, with a higher degree of automation, also applies to infra-as-code. It did help us improve the stability in environment creation.
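The core of such a cleanup is deciding what counts as a leftover and in which order to remove it. The sketch below shows that selection logic only. It is a stand-in written for this post, not the Bash script we used: the naming convention (an environment prefix such as `ci-`), the resource types and the deletion order are assumptions, and the inventory would in practice come from the AWS APIs.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Tuple

# Hypothetical deletion order: things that depend on other resources go first.
DELETE_ORDER = ["lambda_function", "api_gateway", "kinesis_stream",
                "dynamodb_table", "iam_role"]


def find_leftovers(
    resources: List[Dict[str, Any]],
    env_prefix: str,
    now: datetime,
    max_age: timedelta = timedelta(hours=2),
) -> List[Dict[str, Any]]:
    """Select stale resources of a test environment, sorted into deletion order."""

    def rank(resource: Dict[str, Any]) -> Tuple[int, str]:
        kind = resource["type"]
        # Unknown types are deleted last.
        position = DELETE_ORDER.index(kind) if kind in DELETE_ORDER else len(DELETE_ORDER)
        return (position, resource["name"])

    stale = [
        resource for resource in resources
        if resource["name"].startswith(env_prefix) and now - resource["created"] > max_age
    ]
    return sorted(stale, key=rank)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    inventory = [
        {"type": "iam_role", "name": "ci-42-worker-role", "created": now - timedelta(hours=5)},
        {"type": "lambda_function", "name": "ci-42-transform", "created": now - timedelta(hours=5)},
        {"type": "dynamodb_table", "name": "prod-products", "created": now - timedelta(days=30)},
    ]
    for resource in find_leftovers(inventory, "ci-", now):
        print("would delete", resource["type"], resource["name"])
```

The age threshold keeps the script from deleting an environment that a currently running build is still using.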
To sum it up
Implementing Serverless architectures with infra-as-code requires significantly closer collaboration between developers and infra-coders than implementing a traditional web application infrastructure.
Given how frequently AWS introduces new services, there are probably always going to be gaps to close. Choose a tool chain that’s easily adaptable.
The tools to develop the infrastructure parts of a serverless architecture have improved dramatically during the last year and are ready now. Full automation with <5 min provisioning time of complex environments is actually attainable.
- Choose tooling that makes it possible to evolve your infrastructure quickly and safely. You should be comfortable with extending the tool to implement special cases or the newest resources.
- Keep as much as possible of the system in code. That includes VPCs and subnets, to make it easier to adapt to changes in requirements or changes required to use new AWS features.
- Be aware of how changes made outside of your provisioning tool are handled. Hard knowledge gained by operating the system in production should be fed back into the master configuration, not overwritten at the next deployment.
- Consider how you’re going to handle naming for the production system: Unique domain names such as www.mydomain.org or api.mydomain.org or the apex record (mydomain.org) might need a bit of extra attention.
- Establish lightweight (self-)governance around introducing new AWS services: it’s understandable that application developers are excited about the very newest AWS resources and will want to leverage them, but the question “Does it deploy?” should be answered positively before new types of resources are accepted into the architecture.
- Set up automated continuous integration of your infrastructure: have a CI/CD server deploy your infrastructure, do integration testing and then tear it down again. If you’re not comfortable with doing it at every commit, then at least do it on a nightly schedule. Consider monitoring the AWS account for resources not deleted correctly. Create a script that helps with cleaning up half-built environments!
- Expect to be much more directly involved in the development process than in a traditional web application setup, and be open to learning the needs of developers while educating them about what you need to do your job.
I’ll put out another blog post about what we learned from putting this system in production. If you don’t want to miss it, then hit Follow.