EXPEDIA GROUP TECH — SOFTWARE
London Chaos and Resilience Engineering Community
Meetup at Expedia Group
On 6 November 2019, Expedia Group™ hosted the London Chaos and Resilience Engineering Community meetup at its offices in Angel. Speakers from Expedia™, Hotels.com™, and Vrbo™ shared their journeys in chaos engineering.
Dina Abu Khader, DevOps Engineer at Expedia™, introduced resilience and chaos engineering with a discussion of the benefits and challenges of a microservices architecture. Her advice for improving systems’ resilience included deploying frequently with fewer changes per deployment and adopting chaos engineering practices and tools. She traced the history of chaos engineering from its Netflix roots to recent tools such as Gremlin, which makes onboarding and failure injection easy. Dina described an architecture that uses Elasticsearch and explained the concepts of known-knowns, known-unknowns, unknown-knowns, and unknown-unknowns. Finally, she suggested experiments that could be run on such an architecture.
Sasidhar Sekar from Hotels.com presented the company’s tech stack, focusing on its cloud setup and in particular on these design decisions:
- Self-hosted Kubernetes on a multi-AZ cloud environment (AWS)
- Minimum number of replicas
- Replica distribution
- Traffic distribution among replicas
He introduced AZKilla, a tool built by Hotels.com that simulates availability zone (AZ) network failures using network access control lists (ACLs). He presented five scenarios, each with a clear hypothesis, that tested the infrastructure for AZ resiliency. Each one gave invaluable feedback that led to actions. The issues revealed related to blast radius, mean time to recovery (MTTR), AZ failover, the even distribution of pods across AZs, and other factors. Sasidhar closed his presentation with learnings and next steps, which included further experiments and chaos tests.
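To make the blast-radius and pod-distribution findings concrete, here is a minimal Python sketch. It is not AZKilla and does not touch network ACLs. It only models, under assumed names (`spread_replicas`, `surviving_fraction`, `worst_case_zone`, and the zone labels are all hypothetical), how much capacity remains when one AZ becomes unreachable.

```python
def spread_replicas(replicas, zones):
    """Place `replicas` pods across `zones` round-robin (an assumed policy)."""
    if not zones:
        raise ValueError("at least one zone is required")
    placement = {zone: 0 for zone in zones}
    for i in range(replicas):
        placement[zones[i % len(zones)]] += 1
    return placement


def surviving_fraction(placement, failed_zone):
    """Fraction of replicas still reachable after `failed_zone` is cut off."""
    total = sum(placement.values())
    if total == 0:
        raise ValueError("no replicas placed")
    lost = placement.get(failed_zone, 0)
    return (total - lost) / total


def worst_case_zone(placement):
    """Zone whose loss removes the most replicas (largest blast radius)."""
    return max(placement, key=placement.get)


if __name__ == "__main__":
    # Hypothetical zone names, for illustration only.
    even = spread_replicas(6, ["az-a", "az-b", "az-c"])
    skewed = {"az-a": 4, "az-b": 1, "az-c": 1}
    for name, placement in (("even", even), ("skewed", skewed)):
        zone = worst_case_zone(placement)
        remaining = surviving_fraction(placement, zone)
        print(f"{name}: losing {zone} leaves {remaining:.0%} of replicas")
```

With the same total number of replicas, the skewed placement loses far more capacity in its worst-case AZ than the even one, which is why replica distribution matters as much as the replica count.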
Guy Keren and Raamnath Mani from Vrbo introduced their Chaosbox toolkit. Their tech stack includes Mesos, Consul, and Linkerd. They explained why they built an advanced toolkit to use in addition to other tools such as Toxiproxy: it needed to be developer-friendly and HTTP-path based, and to support HTTPS and Layer 4. Chaosbox is a config-driven and platform-agnostic tool that can be used to emulate network failures on Docker containers. They even showed some pretty fascinating demos with failure injections on Cassandra and HTTP dependencies and their impact on the website. The engineers from Vrbo closed by sharing their long-term vision for Chaosbox, which includes integrations with monitoring, logging, continuous integration (CI), and even messaging tools.
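As a rough illustration of what "config-driven" and "HTTP-path based" failure injection can mean, here is a small Python sketch. It does not reproduce Chaosbox's configuration format or behaviour, which the talk summary does not specify. The `FaultRule` fields, the `rules` key, and the longest-prefix matching policy are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FaultRule:
    """One hypothetical fault: delay and/or abort requests under a path."""
    path_prefix: str
    latency_ms: int = 0
    abort_status: Optional[int] = None


def load_rules(config):
    """Build rules from a plain dict, e.g. parsed from a JSON or YAML file."""
    return [FaultRule(**entry) for entry in config["rules"]]


def match_rule(rules, path):
    """Return the rule with the longest matching prefix, or None."""
    candidates = [r for r in rules if path.startswith(r.path_prefix)]
    return max(candidates, key=lambda r: len(r.path_prefix), default=None)


if __name__ == "__main__":
    # Hypothetical configuration: slow down /api, fail /api/search outright.
    config = {
        "rules": [
            {"path_prefix": "/api", "latency_ms": 200},
            {"path_prefix": "/api/search", "abort_status": 503},
        ]
    }
    rules = load_rules(config)
    for path in ("/api/search/hotels", "/api/book", "/health"):
        print(path, "->", match_rule(rules, path))
```

Keeping the faults in data rather than code means an experiment can be changed, reviewed, and rolled back without redeploying the service under test.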
Thanks to Dina, Sasidhar, Guy, and Raamnath for presenting!