At Gousto we have the mentality of “you build it, you run it”. This means our developers are responsible for the design, implementation, testing, running and monitoring of their services on the Gousto Platform, in all environments including Production. This mentality also enables the Platform team to focus on improving our tools and capabilities to make the lives of Gousto developers easier.
This is great in principle but in practice, it requires more than just a statement of intent:
- Knowledge of AWS.
- Knowledge of the Gousto Platform.
- Knowledge of building and deploying code.
- Knowledge of testing and releasing changes.
- Knowledge of logging, monitoring and alerting.
- Experience of debugging and resolving issues.
- Applying learning from incidents.
- Providing feedback on platform tools.
This post introduces the Chaos Engineering Hackathon we recently held, which aimed to improve our squads’ “Experience of debugging and resolving issues”.
Why a Chaos Engineering Hackathon?
With September being a particularly busy time of the year for Gousto, we wanted to assess how ready our Platform and squads were for an anticipated increase in traffic. A key area we identified was how well our engineering teams could respond to unexpected incidents on our platform.
After brainstorming a number of different ideas, which mainly involved going through previous Production issues on a whiteboard, we agreed on a Chaos Engineering Hackathon. We believed a hackathon would be more practical, engaging and fun for our engineers than the alternatives. The hackathon would involve simulating failures on our squad environments (scaled-down versions of Production) and having our engineers investigate and resolve them.
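As a rough illustration, a “Compute” failure injection could be as simple as stopping a squad environment’s instances. This is a sketch, not our actual tooling: the `Environment` tag and the environment name are assumptions, and the boto3 client is passed in so the logic can be exercised with a stub.

```python
def inject_compute_failure(ec2, environment):
    """Stop every running EC2 instance tagged with the given squad environment.

    `ec2` is a boto3 EC2 client (or a test stub with the same interface).
    """
    resp = ec2.describe_instances(
        Filters=[
            # Hypothetical tagging convention for squad environments.
            {"Name": "tag:Environment", "Values": [environment]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids

# In real use: inject_compute_failure(boto3.client("ec2"), "squad-dev")
```

Taking the client as a parameter keeps the injection logic trivially testable, and means the same code could later target different accounts or regions.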
Chaos Engineering Hackathon Challenges
We went through previous Production incidents and categorised them into the following generic areas: Networking, Compute, Database and Config. We then created a number of challenges in each of these areas.
We also agreed upon a scoring system based on the level of difficulty.
Chaos Engineering Hackathon Day:
We gathered the squads in the morning to discuss how the day would be run. First, the squads had to start up their development environment, which had been intentionally switched off. Once the environment had started, the engineers would demonstrate a successful sign-up event to the Platform team to get their first challenge.
Each squad would select a challenge category and level of difficulty, for example ‘Compute level 2’. The Platform team would then break the environment and notify the squad when they could start investigating. From there, the teams competed to see how many challenges they could get through before the day ended at 4pm.
The engineers were very engaged, making the competition fierce! We had a central leaderboard and rang a bell each time a squad resolved an issue. Choosing which challenges to attempt in the remaining time became tactical: should they take on fewer, more difficult challenges for more points, or focus on completing a higher volume of simpler challenges?
By 4pm our winners were as follows:
What the engineers discovered by participating in the Hackathon:
Below are some of the responses we received from our developers on their learning outcomes:
- “More alerts around specific services [would help diagnose issues quicker].”
- “[Gained a better understanding of] how [AWS] ALB works, how to configure DNS and that rushing can cause more harm than good.”
- “To slow down when trying to solve issues. There [were] situations [where] we wasted time by trying to rush things, which caused us to miss necessary information and meant we had to circle back.”
- “I also think we should have spent a bit more time up front before each task eliminating unlikely causes and then focusing on the likely ones more. Instead, we explored the unlikely ones a bit too much.”
- “I learned how you can scale up instances quickly from Autoscaling group, I learned how to check DNS rules. I learned how to update triggers…and much more”
- “[Gained] More knowledge of the state of our AWS estate.”
- “Discuss the options more as a team before all heading into the console.”
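One of the learnings quoted above, scaling up instances quickly from an Auto Scaling group, can be sketched with boto3. The group name below is hypothetical, and the client is again passed in so the sketch can be verified with a stub; capping at the group’s `MaxSize` mirrors what the API would enforce anyway.

```python
def scale_up(autoscaling, group_name, extra=1):
    """Raise an Auto Scaling group's desired capacity by `extra`, up to MaxSize.

    `autoscaling` is a boto3 Auto Scaling client (or a test stub).
    """
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )["AutoScalingGroups"][0]

    desired = min(group["DesiredCapacity"] + extra, group["MaxSize"])
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=desired,
        HonorCooldown=False,  # scale immediately rather than wait out a cooldown
    )
    return desired

# In real use: scale_up(boto3.client("autoscaling"), "squad-web-asg", extra=2)
```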
Feedback on the running of the Hackathon:
- Some of the challenges took longer to resolve than others. For example if the challenge dropped a database column, it could take up to 30 minutes for a backup to be restored. This meant a squad had to wait 30 minutes to start their next challenge.
- AWS-savvy engineers used tools such as CloudTrail to see what changes the Platform team had made to break their environments. Although this was a valid and creative way to figure out what had happened, it was not the investigation path we were trying to create.
- Some of the challenges took the platform team 10 minutes to set up on squad environments. If two or three squads finished a challenge at the same time it meant they would have to wait up to 30 minutes before they could start their next challenge.
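For readers curious about the CloudTrail trick mentioned above, a minimal sketch of looking up what a given user changed recently might look like this. The username is illustrative, and the client is passed in so the query logic can be tested without AWS credentials.

```python
from datetime import datetime, timedelta, timezone

def recent_user_changes(cloudtrail, username, minutes=30):
    """Return (event time, event name) pairs recorded for a user recently.

    `cloudtrail` is a boto3 CloudTrail client (or a test stub).
    """
    start = datetime.now(timezone.utc) - timedelta(minutes=minutes)
    events = cloudtrail.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "Username", "AttributeValue": username}
        ],
        StartTime=start,
    )["Events"]
    return [(e["EventTime"], e["EventName"]) for e in events]

# In real use: recent_user_changes(boto3.client("cloudtrail"), "platform-team")
```

An event name like `StopInstances` appearing minutes before an outage is exactly the breadcrumb trail the squads were following.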
I believe the biggest benefit of the day was that the engineers improved their knowledge and practical experience of the Gousto Platform and AWS. This is vital for debugging and resolving issues on our Platform. Gamifying and setting practical challenges promoted a higher level of engagement from our engineers than we have seen with other activities such as talks or workshops.
Since the hackathon, there have been better discussions between the squads and the Platform team on how we can make our platform easier to debug and fix. So far, engineers appear more confident diagnosing and fixing issues on their squad environments, which has reduced the support needed from our Platform team. We hope this will also translate into supporting Production issues in the future.
I believe the next step is to start automating some of the hackathon scenarios and running them on our environments more regularly, as well as automating recovery from such scenarios. This would free up the time we spend on hackathons to focus on other areas of improvement.
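A sketch of what such automation could look like, assuming a simple registry of inject/recover pairs. The scenario names and behaviour below are purely illustrative, not real Gousto tooling.

```python
import random

# Registry of named chaos scenarios; each entry pairs inject with recover.
SCENARIOS = {}

def scenario(name):
    """Decorator registering a class with inject/recover hooks under a name."""
    def register(cls):
        SCENARIOS[name] = cls
        return cls
    return register

@scenario("compute-level-1")
class StopAnInstance:
    @staticmethod
    def inject(env):
        # Placeholder: a real scenario would stop an instance in `env`.
        return f"stopped one instance in {env}"

    @staticmethod
    def recover(env):
        # Placeholder: a real scenario would restart what inject broke.
        return f"restarted instance in {env}"

def run_random_scenario(env, rng=random):
    """Pick a registered scenario, inject it, and return its recovery hook."""
    name = rng.choice(sorted(SCENARIOS))
    chosen = SCENARIOS[name]
    chosen.inject(env)
    return name, chosen.recover
```

Keeping recovery alongside injection is the key design choice: a scheduled run can always undo its own damage, which is what makes running this regularly (rather than only on hackathon days) safe.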