As the title suggests, we had an interesting experience, learned a few things from it, and now we’ve got to share it on the www.
What you can expect from this post:
* How testing environments aren’t risk-free (cost-wise, at least).
* How multiple unrelated changes can morph into a massive bug.
Because no one escapes Conway’s law, let me start with a quick intro to the team structure and the project.
My team (called Peas) is responsible for finalising Gousto customers’ orders and routing them to the most suitable fulfilment centres. We are in the process of developing a new service which looks at orders and, based on various metrics, makes a routing decision in advance. The goal is to have a better idea of which orders we will have to fulfil before they actually reach our factories, which helps with planning and stock levels. This process is repeated several times for each order to adjust to changes.
The routing service is a container running on ECS, woken up by a cron schedule. Its job is pretty simple: grab all relevant orders for a given date and, with the help of some additional data (various configurations), decide which order should go to which site.
Importantly, we store all previous routing decisions for orders.
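To make that concrete, here is a minimal Python sketch of what one cron run conceptually does. All names here are hypothetical, not our actual code; the point is that every run appends a new decision to each order’s history:

```python
from datetime import datetime, timezone

def pick_site(order, config):
    # Stand-in for the real metric-based decision logic.
    return config["default_site"]

def route_orders(orders, config):
    """One cron run: decide a site for every relevant order and
    append the decision to the order's history (we keep them all)."""
    for order in orders:
        site = pick_site(order, config)
        order["routing_decisions"].append({
            "site": site,
            "decided_at": datetime.now(timezone.utc).isoformat(),
        })
    return orders
```

Run this every minute against the same orders and each record accumulates one decision per run, which matters later.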
Where did it go wrong?
During development, the database to scan was nearly empty: at most one or two orders for testing. Most of the testing is done in memory, using mocks or local AWS resources. As a consequence, the service would wake up and almost immediately shut down because there was no work to be done. So we set the cron to wake the service up every minute. After all, who likes to wait several minutes to see their changes in the dev environment?
In the meantime, another pair of devs was working on a Lambda function and a tool to import orders. Such a tool is crucial when developing, when fixing bugs, and to avoid missing orders modified before the service went live.
When we ran this tool in the dev environment, that’s when everything went south. We imported a subset of almost-real orders for testing purposes. Suddenly the database was very much not empty anymore, and the routing service had some real work to do. Every minute, a new task would spawn, scan all the orders, and augment them.
DynamoDB charges by the KB read and written, and on each iteration every order record would become slightly bigger, so reads and writes would become gradually more and more expensive.
This resulted in a rapid linear increase in DynamoDB billing. Here’s the read capacity consumed:
The write graph shows a similar curve, with higher numbers because writes are more expensive than reads.
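A back-of-the-envelope illustration of why each run costs more than the last (the item and decision sizes below are made up, and we assume writes are billed in 1 KB units, as with DynamoDB write request units):

```python
import math

DECISION_BYTES = 200      # assumed size of one routing decision
BASE_ITEM_BYTES = 1_000   # assumed base size of an order record
WRITE_UNIT_BYTES = 1_024  # writes billed per 1 KB of item written

def write_units_for_run(run):
    # Item size after `run` decisions have been appended to it.
    size = BASE_ITEM_BYTES + run * DECISION_BYTES
    return math.ceil(size / WRITE_UNIT_BYTES)

# Per-run cost grows linearly with the number of runs so far,
# so the cumulative bill grows roughly quadratically.
total_units = sum(write_units_for_run(r) for r in range(1, 1001))
```

With these made-up numbers, a record written once a minute for a day is already two orders of magnitude more expensive to write than it was at the start.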
As soon as we realised what was going on, we shut down the service, reduced the cron frequency and purged the table. Most orders had more than 1,000 routing decisions attached to them. Between DynamoDB, ECS, a few hundred million messages published and a few other bits of infrastructure, this cost our team around £4,500.
This incident is interesting on a number of fronts.
Detection tooling in development environments
Although we have better tooling in production, we don’t have much in terms of anomaly detection for other environments. AWS cost alerts also tend to lag a bit behind: we received the alarming reports a day or two after the incident started.
Interplay between complex systems
Although we are a team of 5 engineers and we pair quite often, no one saw this coming. When working on small individual tasks within complex systems, it’s very easy to lose sight of the big picture, especially on bigger projects, where simply keeping everything in one’s head is challenging. This is more of a prompt for everyone to reflect on their development practices.
We had a very high cron frequency because we hadn’t invested in development tools: we had no way to see our changes in AWS other than waiting for the cron to trigger the task. This incident helped drive home the importance of these tools, which usually pay for themselves many times over, even though they can feel like dead weight at the beginning of a project.
Kill switch mechanism
You know the big red button present on most (all?) big machines to immediately stop their operation when something goes horribly wrong? It’s good to have something similar in software systems; bonus points if it can be triggered automatically. Not all systems have that luxury, but in our case it’s completely acceptable to be down for a few hours until the team can fix things properly.
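A minimal sketch of such a switch, assuming a flag the task checks on startup. It’s stubbed here with an environment variable to stay self-contained; in practice it might live in a parameter store or a feature-flag service:

```python
import os

def kill_switch_engaged():
    # Hypothetical flag name; in a real setup this would be fetched
    # from a parameter store or feature-flag service, not an env var.
    return os.environ.get("ROUTING_KILL_SWITCH", "off") == "on"

def run_routing_task():
    """Entry point for the cron-triggered task."""
    if kill_switch_engaged():
        print("Kill switch engaged, exiting without doing any work.")
        return False  # nothing processed
    # ... normal routing work would happen here ...
    return True
```

The important property is that the check happens before any work (or spend) occurs, so flipping one flag stops the bleeding while the team investigates.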
We’ve had bills higher than £4.5k in the past, although it’s way above our usual team AWS budget. The good news is that this incident highlighted quite a few blind spots in our observability and alerting tooling. It also served as a valuable reminder that some bugs come from the unforeseen interaction of otherwise simple components, and that a more holistic view is often required.
* Don’t forget to monitor and alert on development environments.
* Don’t neglect your own tooling when developing systems.
* Can your system tolerate a kill switch (i.e. some downtime)? If yes, add one.
Author: Grégoire Charvet