Inside Gousto’s Chaos Engineering Hackathon

James Carson
Dec 13, 2019 · 5 min read
How to avoid this?

At Gousto we have the mentality of “you build it, you run it”. This means our developers are responsible for the design, implementation, testing, running and monitoring of their services on the Gousto Platform, in all environments including Production. This mentality also enables the Platform team to focus on improving our tools and capabilities to make the lives of Gousto developers easier.

This is great in principle, but in practice it requires more than just a statement of intent. It requires:

  • Knowledge of AWS.
  • Knowledge of the Gousto Platform.
  • Knowledge of building and deploying code.
  • Knowledge of testing and releasing changes.
  • Knowledge of logging, monitoring and alerting.
  • Experience of debugging and resolving issues.
  • Applying learning from incidents.
  • Providing feedback on platform tools.

This post introduces the Chaos Engineering Hackathon we recently held, which aimed to improve our squads' “Experience of debugging and resolving issues”.

Why a Chaos Engineering Hackathon?

After brainstorming a number of ideas, most of which involved walking through previous Production issues on a whiteboard, we settled on a Chaos Engineering Hackathon. We believed a hackathon would be more practical, engaging and fun for our engineers than the alternatives. It would involve simulating failures on our squad environments (scaled-down versions of Production) and having our engineers investigate and resolve them.

Chaos Engineering Hackathon Challenges

Example challenges:

Example Hackathon Challenges

We also agreed upon a scoring system based on the level of difficulty.

Difficulty level per challenge

Chaos Engineering Hackathon Day

We gathered the squads in the morning to explain how the day would run. First, each squad had to start up its development environment, which had been intentionally switched off. Once the environment was running, the engineers had to demonstrate a successful sign-up event to the Platform team to receive their first challenge.

The squad would then select a challenge category and difficulty level, for example ‘Compute level 2’. The Platform team would break the environment and notify the squad when it could start investigating. From there, the squads competed to see how many challenges they could get through before the day ended at 4pm.

The engineers were very engaged, which made the competition fierce! We had a central leaderboard and rang a bell each time a squad resolved an issue. Squads also had to think tactically about which challenges to take on in the time remaining. Should they tackle fewer, harder challenges for more points, or complete a higher volume of simpler ones?

By 4pm our winners were as follows:

Points Scored by Squad

What the engineers discovered by participating in the Hackathon:

  • “More alerts around specific services [would help diagnose issues quicker].”
  • “[Gained a better understanding of] how [AWS] ALB works, how to configure DNS and that rushing can cause more harm than good.”
  • “To slow down when trying to solve issues. There was situations we wasted time by trying to rush things which caused us to either miss necessary information which meant we had to circle back.”
  • “I also think we should have spent a bit more time up front before each task eliminating unlikely causes and then focusing on the likely ones more. Instead, we explored the unlikely ones a bit too much.”
  • “I learned how you can scale up instances quickly from Autoscaling group, I learned how to check DNS rules. I learned how to update triggers…and much more”
  • “[Gained] More knowledge of the state of our AWS estate.”
  • “Discuss the options more as a team before all heading into the console.”

Feedback on the running of the Hackathon:

  • AWS-savvy engineers used tools such as CloudTrail to see what changes the Platform team had made to break their environments. Although this was a valid and creative way to work out what had happened, it was not the investigation path we were trying to create.
  • Some of the challenges took the Platform team 10 minutes to set up on squad environments. If two or three squads finished a challenge at the same time, they could wait up to 30 minutes before starting their next challenge.


More of this!

I believe the biggest benefit of the day was that the engineers improved their knowledge and practical experience of the Gousto Platform and AWS. This is vital for debugging and resolving issues on our Platform. Gamifying the day and setting practical challenges produced a higher level of engagement from our engineers than we have seen from other activities, such as talks or workshops.

Since the hackathon, the squads and the Platform team have had better discussions about how to make issues on our Platform easier to debug and solve. So far, engineers appear more confident when diagnosing and fixing issues on their squad environments, which has reduced the support they need from our Platform team. We hope this will also carry over to supporting Production issues in the future.

I believe the next step would be to start automating some of the hackathon scenarios and running them on our environments more regularly, as well as automating recovery from those scenarios. This would free up the time we currently spend running hackathons for other areas of improvement.
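The sketch below shows what one automated scenario with built-in recovery might look like. It is a hypothetical Python example, not the scenarios or tooling the Platform team actually used. It models a ‘Compute’-style failure by scaling an AWS Auto Scaling group down to zero instances. It records the group's original sizes first, so the `finally` block can restore them even if the observation step fails. The function names, the `observe` callback and the choice of fault are all illustrative assumptions. The `client` is expected to behave like boto3's `autoscaling` client, which provides `describe_auto_scaling_groups` and `update_auto_scaling_group`.

```python
from dataclasses import dataclass


@dataclass
class AsgSnapshot:
    """Sizes of an Auto Scaling group captured before injecting a fault."""
    name: str
    min_size: int
    max_size: int
    desired: int


def snapshot(client, asg_name):
    """Record the current sizes of the named Auto Scaling group."""
    resp = client.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    group = resp["AutoScalingGroups"][0]
    return AsgSnapshot(asg_name, group["MinSize"], group["MaxSize"],
                       group["DesiredCapacity"])


def break_compute(client, asg_name):
    """Hypothetical fault: scale the group to zero instances.
    Returns the snapshot needed to undo the change."""
    before = snapshot(client, asg_name)
    client.update_auto_scaling_group(
        AutoScalingGroupName=asg_name, MinSize=0, DesiredCapacity=0)
    return before


def recover_compute(client, before):
    """Restore the sizes recorded before the fault was injected."""
    client.update_auto_scaling_group(
        AutoScalingGroupName=before.name,
        MinSize=before.min_size,
        MaxSize=before.max_size,
        DesiredCapacity=before.desired)


def run_scenario(client, asg_name, observe):
    """Inject the fault, call `observe` (e.g. wait for an alert to fire),
    and always recover afterwards, even if `observe` raises."""
    before = break_compute(client, asg_name)
    try:
        return observe()
    finally:
        recover_compute(client, before)
```

Against a real squad environment, `client` would be `boto3.client("autoscaling")`, and `observe` might wait for the expected alert before recovery runs. Passing the client in as a parameter also lets the scenario be exercised against a fake client without touching AWS.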

Gousto Engineering & Data

Gousto Engineering & Data Blog
