How to burn $4500 in 20 hours and get no pleasure from it

GoustoTech
Sep 15

As the title suggests, we had an interesting experience, learned a few things from it, and now we’ve got to share it on the www.

What you can expect from this post:

* How testing environments aren't risk-free (cost-wise, at least).
* How multiple unrelated changes can morph into a massive bug.

[Image: not what you want to see for something in a dev environment]

Some context

My team (called Peas) is responsible for finalising Gousto customers’ orders and routing them to the most suitable fulfilment centres. We are in the process of developing a new service which looks at orders and, based on various metrics, makes a routing decision in advance. The goal is to have a better idea of which orders we will have to fulfil before they actually reach our factories, which helps with planning and stock. This process is repeated several times for each order to adjust to changes.

The routing service is a container running on ECS, woken up by a cron. Its job is pretty simple: grab all relevant orders for a given date and, with the help of some additional data (various configurations), decide which order should go to which site.

Importantly, we store all previous routing decisions for orders.
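
To make the shape of the job concrete, here is a minimal sketch of what one cron run does. The table, attribute and function names (order_routing, routing_decisions, decide_site) are hypothetical stand-ins for our actual implementation:

```python
# A minimal sketch of one cron run. Table, attribute and function names
# (order_routing, routing_decisions, decide_site) are hypothetical.
from datetime import date

import boto3

dynamodb = boto3.resource("dynamodb")
orders_table = dynamodb.Table("order_routing")  # hypothetical table name


def decide_site(order: dict, config: dict) -> str:
    """Placeholder for the actual routing logic."""
    return "fulfilment-centre-1"


def route_orders_for(delivery_date: date, config: dict) -> None:
    # Grab all relevant orders for the given date (simplified: a real job
    # would query by date rather than scan the whole table).
    for order in orders_table.scan()["Items"]:
        site = decide_site(order, config)
        # Store this run's decision alongside the previous ones. Every run
        # makes the item a little bigger, which is what eventually bit us.
        orders_table.update_item(
            Key={"order_id": order["order_id"]},
            UpdateExpression="SET routing_decisions = list_append(routing_decisions, :d)",
            ExpressionAttributeValues={
                ":d": [{"date": delivery_date.isoformat(), "site": site}]
            },
        )


if __name__ == "__main__":
    route_orders_for(date.today(), config={})
```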

Where did it go wrong?

In the meantime, another pair of devs was working on a Lambda function and a tool to import orders. Such a tool is crucial when developing, when fixing bugs, and for backfilling orders that were modified before the service went live.

Everything went south when we ran this tool in the dev environment. We imported a subset of almost-real orders for testing purposes. Suddenly the database was very much not empty anymore, and the routing service had some real work to do. Every minute, a new task would spawn, scan all the orders, and append a fresh routing decision to each of them.

DynamoDB charges by the amount of data read and written (reads are billed in 4 KB units, writes in 1 KB units). On each iteration, every order record would become slightly bigger, so reads and writes would gradually become more and more expensive.
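
To see why this compounds so quickly, here is a back-of-the-envelope sketch. The order count and item sizes below are illustrative, not our real figures; the 4 KB read unit and 1 KB write unit are DynamoDB's billing granularity:

```python
# Rough model of how per-run cost grows as items accumulate routing decisions.
# Order count and item sizes are illustrative, not our real numbers.
import math

ORDERS = 10_000            # hypothetical number of imported orders
BASE_ITEM_BYTES = 1_000    # hypothetical starting size of an order item
DECISION_BYTES = 300       # hypothetical size of one stored routing decision
READ_UNIT_BYTES = 4_096    # DynamoDB bills reads in 4 KB units
WRITE_UNIT_BYTES = 1_024   # ...and writes in 1 KB units

read_units = write_units = 0
for run in range(1, 1_201):  # one run per minute for roughly 20 hours
    item_bytes = BASE_ITEM_BYTES + run * DECISION_BYTES
    # A scan bills on the aggregate data read; each write bills per item.
    read_units += math.ceil(ORDERS * item_bytes / READ_UNIT_BYTES)
    write_units += ORDERS * math.ceil(item_bytes / WRITE_UNIT_BYTES)

print(f"read units consumed:  {read_units:,}")
print(f"write units consumed: {write_units:,}")
```

Because every run makes every item bigger, the cost per run grows steadily and the cumulative bill grows much faster than you would expect from a "harmless" dev table.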

This resulted in a rapid, roughly linear increase in DynamoDB billing. Here's the consumed read capacity:

[Graph: consumed read capacity]

The write graph shows a similar curve, with higher numbers because writes are more expensive than reads.

As soon as we realised what was going on, we shut down the service, reduced the cron frequency and purged the table. Most orders had more than 1,000 routing decisions attached to them. Between DynamoDB, ECS, a few hundred million messages published and a few other bits of infrastructure, this cost our team around £4500.

Learnings

Detection tooling in development environments

Although we have better tooling in production, we don't have much in the way of anomaly detection for other environments. AWS cost alerts also tend to lag behind: we received the alarming reports a day or two after the start of the incident.
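
A cheap first step would have been a CloudWatch alarm on consumed capacity for the dev table, which reacts within minutes instead of the day or two a cost report takes. A rough sketch, with a hypothetical table name, threshold and SNS topic:

```python
# Sketch: alarm on consumed read capacity for the dev table so a runaway job
# gets flagged within minutes. Table name, threshold and SNS topic are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="dev-order-routing-read-capacity",
    Namespace="AWS/DynamoDB",
    MetricName="ConsumedReadCapacityUnits",
    Dimensions=[{"Name": "TableName", "Value": "order_routing"}],
    Statistic="Sum",
    Period=300,                # 5-minute windows
    EvaluationPeriods=1,
    Threshold=100_000,         # hypothetical: far above normal dev usage
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:dev-cost-alerts"],
)
```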

Interplay between complex systems

Although we are a team of five engineers and we pair quite often, no one saw this coming. When working on small individual tasks within complex systems, it's very easy to lose sight of the big picture, especially on bigger projects, where simply keeping everything in one's head is challenging. This is more of a prompt for everyone to reflect on their development practices.

Development tooling

We had a very high cron frequency because we hadn't invested in development tools: we had no way to see our changes in AWS other than waiting for the cron to trigger the task. This incident drove home the importance of these tools, which usually pay for themselves many times over, even though they can feel like dead weight at the beginning of a project.
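
Even a tiny script that starts the ECS task on demand would have removed the need for a one-minute cron in dev. A sketch, with hypothetical cluster, task definition and network values:

```python
# Sketch: trigger the routing task on demand instead of waiting for the cron.
# Cluster, task definition and network values are hypothetical.
import boto3

ecs = boto3.client("ecs")

ecs.run_task(
    cluster="dev-peas",
    taskDefinition="order-routing-service",
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
)
```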

Kill switch mechanism

You know the big red button present on most (all?) big machines that immediately stops them when something goes horribly wrong? It's good to have something similar in software systems. Bonus points if it can be triggered automatically. Not all systems have that luxury, but in our case it's completely acceptable to be down for a few hours until the team can fix things properly.
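
In practice this can be as simple as the job checking a flag before doing any work, so anyone can flip it off without a deploy. A sketch using an SSM parameter as the switch (the parameter name is hypothetical):

```python
# Sketch: a software "big red button". The job refuses to do any work if the
# flag is off. The parameter name is hypothetical.
import sys

import boto3

ssm = boto3.client("ssm")


def routing_enabled() -> bool:
    param = ssm.get_parameter(Name="/peas/order-routing/enabled")
    return param["Parameter"]["Value"].lower() == "true"


if __name__ == "__main__":
    if not routing_enabled():
        print("Kill switch is on, exiting without touching any orders.")
        sys.exit(0)
    # ... run the routing job ...
```

Flipping the parameter stops the next run without a deploy, and an alarm like the one above could flip it automatically.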

Conclusion

In summary:

* Don’t forget to monitor and alert on development environments.

* Don’t neglect your own tooling when developing systems.

* Can your system tolerate a kill switch? If yes, build one.

Author: Grégoire Charvet
