Be bold, to better prevent outages
If you want to understand our approach to chaos engineering at ManoMano, or what a GameDay is, you are reading the right article.
Our brand has huge momentum and we obviously cannot afford to lose our online presence. - Bob from the Business
Let me introduce myself: I’m Clement. I’ve been working at ManoMano for more than 2 years, and I’m in charge of an operations team whose mission is to help with the governance of the following topics: observability, system & web application performance, FinOps, incident management & chaos engineering.
We are the tech and we are here to support the business!
Production Stability First
We are in hypergrowth and have recruited more than 200 people since the beginning of the year. To foster innovation, we give the feature teams great freedom… all while quietly migrating about 150 microservices to fresh Kubernetes clusters.
The management of our production is faced with ever greater challenges!
The problems revolve around the paradox of flexibility and productivity in how production is organized, namely changes and deployments:
- How to compete while retaining the capability to respond quickly with new releases?
- How to develop new services faster without jeopardizing production integrity?
- How to increase quality & reliability while reducing costs?
- How to build resilient systems that can handle changes?
- How to model capacity & traffic profile?
At ManoMano, our systems have scaled beyond what any single server, or team, could reasonably be expected to handle. This threw us into the realm of distributed systems and emergent architectures, and redefined how we look at availability and resilience. How do you make colleagues aware of production resilience, reliability & availability issues? Our answer: take away the stress of real production incidents, let people practice and talk more freely about production problems, and mix the teams to spark informal conversations.
And why not a big event, fully remote of course, since we are often at home at the moment… - My boss & me in a 1:1 meeting.
We chose to hold regular events built around the principles of incident management, failure detection and architecture improvements that prevent failures… We recently ran the second edition and are already planning the next one for December! GameDays bring together people from across the organization to collaboratively break, observe and recover some of our systems, while staying conscious of the impact on customer experience.
Apart from learning how the technical systems respond under stress, some of the main benefits come from the shared understandings and process improvements that are generated. GameDays should be more than just an event or a one-off exercise — they embody an enduring mindset and a culture.
GameDays allow us to exercise our observability and incident management procedures. Lessons taken from the GameDays also help us rethink our infrastructure, strengthen our monitoring and evolve our policies.
Failure classification
Hope is not a strategy! Unfortunately, we have to be ready to deal with incidents because we will always have them… We have been collecting production incident reports for 2 years and have identified and classified our failures in the following way.
It is from the history of our failures that we were able to work on something fun!
The Villains are working for… great events!
We built a team of ~20 people at the beginning of September 2020 to work on our first GameDay experiments. We defined a set of Villain roles and hold workshops every 15 days.
GameDay #1 — Vaccine our system (December 2020)
After 5 or 6 sessions with the Villains, we had identified some sensitive points of the architecture and developed 4 experiments to submit to our heroes (~100 attendees) taking up the challenge on D-Day. To our amazement, they accepted all our challenges on a dedicated Slack channel, and chose to run the failure experiments on the production environment… NOOOOOOO!
Heroes!
Kick-off of the experiment #4 — Synchronization issues!??
Our beloved CTO Stéphane noticed a weird behaviour on the website: sometimes the prices displayed on the product page and on the product listing page are not the same! This bad synchronization is driving him, and probably other customers, crazy! Investigate, find and fix this issue as fast as possible!
Target: Search / Product page / Product listing page
Duration: 20min
Please vote for the environment: no vote, will be PRD
Enjoy and good luck!
Did you know?
Search engine: find items across millions of products
Filter: it lets us filter the product listing as efficiently as possible by category, price, brand and so on!
In order to have resilient indexing, we use RabbitMQ so that when the synchronization process fails for some reason, products can still be indexed later, once it is repaired by our teams. We get 20k messages every 5 minutes.
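To make the idea of resilient indexing concrete, here is a minimal sketch of the principle. It is an assumption-laden illustration, not our actual setup: `IndexingQueue` and `FlakyIndexer` are hypothetical names, and the queue is a plain in-memory stand-in rather than RabbitMQ itself. It only shows that messages stay queued while the consumer is down and are delivered once it is repaired, and it works out the average rate implied by 20k messages every 5 minutes.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

# Figures quoted in the article: about 20k messages every 5 minutes.
MESSAGES_PER_WINDOW = 20_000
WINDOW_SECONDS = 5 * 60


def average_rate_per_second() -> float:
    """Average message rate implied by 20k messages per 5 minutes."""
    return MESSAGES_PER_WINDOW / WINDOW_SECONDS


class IndexingQueue:
    """Hypothetical in-memory stand-in for a broker queue (not the RabbitMQ API)."""

    def __init__(self) -> None:
        self._pending: Deque[Dict] = deque()

    def publish(self, message: Dict) -> None:
        """Store a product update until a consumer has processed it."""
        self._pending.append(message)

    def drain(self, index_product: Callable[[Dict], None]) -> int:
        """Deliver pending messages in order and stop at the first failure.

        A message is removed only after the consumer succeeds, which mimics
        an acknowledgement: failed messages remain available for a later retry.
        """
        delivered = 0
        while self._pending:
            message = self._pending[0]
            try:
                index_product(message)
            except RuntimeError:
                break  # consumer is down: keep the message for later
            self._pending.popleft()
            delivered += 1
        return delivered

    def __len__(self) -> int:
        return len(self._pending)


class FlakyIndexer:
    """Hypothetical consumer that fails while the sync process is broken."""

    def __init__(self) -> None:
        self.healthy = True
        self.indexed: List[int] = []

    def __call__(self, message: Dict) -> None:
        if not self.healthy:
            raise RuntimeError("synchronization process is down")
        self.indexed.append(message["product_id"])


if __name__ == "__main__":
    queue = IndexingQueue()
    indexer = FlakyIndexer()

    indexer.healthy = False  # the failure injected during the experiment
    for product_id in (1, 2, 3):
        queue.publish({"product_id": product_id})
    print("delivered while down:", queue.drain(indexer), "pending:", len(queue))

    indexer.healthy = True  # the teams repaired the synchronization
    print("delivered after repair:", queue.drain(indexer), "pending:", len(queue))
    print("average msg/s:", round(average_rate_per_second(), 1))
```

With a real broker, the same behaviour would typically rely on durable queues and explicit acknowledgements; the sketch only captures the "nothing is lost while the consumer is down" property.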
Postmortem reports take the same form as those of real incidents, emphasizing, where possible, what we learned and being transparent about how the issue was discovered.
During the 2 months following this first GameDay, we saw a nice improvement in the quality of the incident reports. We also made sure that incident management is understood by more people outside of Tech. We have improved communication during incidents with our customer service and technical account managers, who are the external relays to our customers and sellers.
GameDay #2 — Chaotic Commencement (May 2021)
Based on this experience we organized a second GameDay in May with AWS.
We decided to play one of the oldest GameDays, launched by AWS in 2015: Chaotic Commencement. The goal of this GameDay is to have participants take over an infrastructure they do not know while making sure it keeps responding as well as possible to ever-increasing traffic… On a more technical level, participants were able to see what the SRE team in charge of operations may have to deal with. We worked on basic components such as EC2 and ASG, as well as more complex ones like ECS.
What was GameDay #2?
This edition of AWS GameDay was a fun time where teams were able to test their AWS skills in a fun and risk-free environment (~50 attendees). We stepped outside the boundaries of typical workshops through open-endedness & ambiguity.
How was it done?
The participants were divided into teams. Each team was responsible for its cloud architecture and had to adapt it over the course of the day. There was no one right answer; the path GameDay participants took was up to them, using the provided AWS resources. This was a fantastic opportunity to learn, in a hands-on way, about AWS best practices, new services and architecture patterns.
“Chaotic Commencement” Overview
In this game, participants were introduced to a rental market. They were asked to take over a sub-optimal application architecture built and maintained by a team that had recently (and abruptly) left the company. Each team had to understand what was really going on in their AWS account and decide what to do to ensure they were building a robust and scalable application…
Who attended?
The Chaotic Commencement version of GameDay focuses on debugging and operating EC2 infrastructure, so it is aimed more at Tech operations teams or people who are new to AWS. It was designed so that participants can play without any coding or development experience, though developers or more recently formed DevOps teams will likely find it challenging and useful. Teams were not limited to EC2 services and could use container solutions (ECS, EKS), which allowed advanced players to flex their muscles and find creative solutions.
You can read the testimony of one of the participants below:
https://nextjs-blog-pfongkye.vercel.app/posts/aws_gameday
Why, no but why are we doing this?
To recap, we have tested two styles of GameDay. The first one (Informed in Advance) was played directly on production. Giving information in advance allowed us to include people from outside Tech in our experiments, which led to many enriching conversations & outcomes.
The second GameDay style (Dungeons & Dragons) took place in secure non-production AWS accounts, but without telling the teams about the challenges they would face! This allowed the participants to be even bolder and to test patches and ideas without the stress of production.
This type of event is meant to help us know ourselves better and see how we react to the unexpected… and to better understand how applications behave on the battlefield by pushing them to their limits.
Expected outcomes
We organise GameDays to improve the robustness of our platform and guarantee the best customer experience, yes, but not only that. They also embody our human vision and culture: collaboration and learning. More specifically at ManoMano, it’s about bringing people together on fake incidents to learn, but also to vaccinate our system and be ready to put out the fire. All of this builds the confidence and resilience we need to reach operational excellence together!
- Find extensive, hidden interdependencies
- Confront surprises that challenge
- Know our boundaries, identify weaknesses
- Look at the saturations, and other signals
- Build trust & confidence
- Train incident response teams
- Have fun learning together
- Validate incident management processes
- Improve internal & external incident communication
- Validate fallbacks (Disaster recovery)
- Improve our mean time to detect incidents (MTTD)
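As a small illustration of the last item, here is one way mean time to detect could be computed from incident records. This is only a sketch under assumptions: the `Incident` fields and the sample timestamps are made up for the example, and it takes MTTD as the average delay between the moment a failure starts and the moment it is detected.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Incident(NamedTuple):
    """Hypothetical incident record; field names are assumptions."""

    started_at: datetime   # when the failure actually began
    detected_at: datetime  # when an alert or a person noticed it


def mean_time_to_detect(incidents: List[Incident]) -> timedelta:
    """Average of (detected_at - started_at) over all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (incident.detected_at - incident.started_at for incident in incidents),
        timedelta(),
    )
    return total / len(incidents)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        Incident(datetime(2021, 5, 1, 10, 0), datetime(2021, 5, 1, 10, 10)),
        Incident(datetime(2021, 5, 2, 14, 0), datetime(2021, 5, 2, 14, 20)),
    ]
    print(mean_time_to_detect(sample))
```

Tracking this number before and after each GameDay is one simple way to check whether detection is actually getting faster.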
Resources
To go into more detail on the concepts explained in this article, I suggest you read the following resources.
- AWS Well-architected framework
- Principles of Antifragile Software
- Resilience engineering papers
- How complex systems fail
- Fallacies of distributed computing
- Adopting chaos engineering in your organisation
- Slack DisasterPiece Theater
- AWS re:Invent 2019 — Improving resiliency with chaos engineering
- ManoMano’s journey with EKS (Elastic Kubernetes Service)
- ManoMano’s Incident management with a bot