London Chaos and Resilience Engineering Community

Details of the January 2020 London Chaos and Resilience Engineering event held at Expedia Group, including slides and talk outlines

Nikos Katirtzis
Expedia Group Technology


Here at Expedia Group™ we have a strong commitment to supporting the tech community, and our newly refurbished London office has a purpose-built area for internal and external meetups. On 28 January 2020, Expedia Group hosted the first London Chaos and Resilience Engineering Community meetup of 2020 at its offices in Angel. This time we had Florian Rathgeber from Google Cloud, Russ Miles from ChaosIQ, and Crystal Hirschorn from Condé Nast sharing their experience in resilience and chaos engineering.

How to SRE — Florian Rathgeber, SRE at Google Cloud.

Florian Rathgeber, SRE at Google Cloud, introduced SRE through four key principles. The first is to have Service Level Objectives (SLOs): he explained what SLOs are, introduced error budgets, and talked about the need for SLOs to have consequences. The second principle is to make tomorrow better than today, which can be achieved by eliminating toil through automation so that SRE teams can focus on project work. The third principle is that SRE teams are able to regulate their workload. This relies on a shared responsibility model and on leadership buy-in, and is achieved by considering reliability and consistency up front. The last principle sees failure as an opportunity to improve. It is implemented by building a blameless culture where failures are embraced. This is where Chaos Engineering comes into play.
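To make the error budget idea concrete, here is a minimal sketch of the arithmetic. It is not taken from the talk: it assumes a request-based availability SLO (the fraction of requests that must succeed over a window), and the function name and sample numbers are hypothetical.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute the error budget for a request-based availability SLO.

    slo_target is a fraction, e.g. 0.999 means "99.9% of requests succeed".
    All names here are illustrative assumptions, not part of any real tool.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests < 0 or failed_requests < 0 or failed_requests > total_requests:
        raise ValueError("request counts are inconsistent")

    # The budget is the share of requests that are allowed to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining": remaining,
        "exhausted": remaining < 0,
    }


# Hypothetical month: 1,000,000 requests, 250 failures, 99.9% SLO.
budget = error_budget(0.999, 1_000_000, 250)
print(f"allowed failures: {budget['allowed_failures']:.0f}")
print(f"remaining budget: {budget['remaining']:.0f}")
print(f"budget exhausted: {budget['exhausted']}")
```

A team could treat an exhausted budget as the trigger for whatever consequence it has agreed on, for example pausing feature releases in favour of reliability work. That policy is an example, not something stated in the talk outline.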

Slides: https://speakerdeck.com/frathgeber/how-to-sre-lessons-learned-from-16-years-of-google-sre

From Chaos to Verification — Russ Miles, CEO of ChaosIQ.

Russ Miles, CEO of ChaosIQ, talked about the limitations and challenges of Chaos Engineering. Which experiments to run, why take the risk of running chaos experiments, and what to invest in are questions that current tooling cannot answer. Having started with low-level conditions such as disruptions and attacks, and then moved to mid-level orchestration with the Chaos Toolkit, his team has recently built ChaosIQ, which aims to provide continuous verification to help with high-level decisions. ChaosIQ aims to act as a single pane of glass by providing a user-friendly UI that includes dashboards with results, objectives, and even actions that need to be taken.

Slides: https://www.slideshare.net/russmiles/from-chaos-to-verification-at-expedia-group-london

Embedding a culture of experimentation and resilience at Condé Nast — Crystal Hirschorn, VP Engineering, Global Strategy & Operations at Condé Nast.

Last but not least, Crystal Hirschorn from Condé Nast explained how they’ve embedded a culture of experimentation and resilience at the American mass media company. System architectures have evolved from monoliths to microservices, and from running on physical machines to containers and, more recently, serverless functions, which has increased the complexity of the systems we run. When it comes to resilience and chaos engineering, teams need to experiment effectively using the right tools and practices. Crystal has introduced Game Days at her company with great success. With tools like Gremlin or Chaos Mesh, Game Days can reveal issues related to observability or even an engineering team’s readiness to respond to production incidents. Crystal’s closing message was that we need to focus on actions such as better understanding our services’ architectures, simplifying and reducing noise in our metrics dashboards, or even organizing more Game Days to fix our incident management processes.

Slides: https://speakerdeck.com/chirschorn/embedding-a-culture-of-experimentation-and-resilience-at-conde-nast

Thanks to our presenters and to everyone who joined us. Stay tuned for the next meetup!

London Chaos and Resilience Engineering Community meetup at the Expedia Group offices.

