Why should Chaos be part of your Distributed Systems Engineering?

Why do we need some Chaos?

We have all seen software systems grow in size and complexity over time. Traditionally, the complexity of colossal on-premises enterprise systems has been a deterrent to frequent system changes, which in turn led to a more conservative approach to software development and release management. With the rapid adoption of, and innovation in, cloud computing, new systems architectures and software delivery practices have evolved to meet the fast-changing scalability and availability requirements of Big Data, AI/ML, IoT, gaming, social and mobile applications, sometimes at global scale.

The agility and resiliency needs of such systems also mean they have to be highly distributed and loosely coupled, with redundancy built in to avoid single points of failure. In addition, a microservices architecture can add further complexity in terms of security, networking, service discovery, management, distributed tracing, monitoring and insights.
A traditional approach to monitoring might not give the level of visibility required to manage such highly complex distributed systems. Monitoring metrics need to strike a balance between low-level granularity and a high-level summarized view, so that SRE teams avoid alert fatigue and stay efficient.
Many controls are put in place to help SREs react quickly to unwanted incidents in production systems. Beyond those, we need controls that help us avoid incidents proactively as far as possible.

Chaos engineering principles can help us build resilient systems by introducing controlled failures as part of system design and operations. They let us induce failure scenarios at various layers, components and services of the overall system. With possible failures built into the system, controlled disruption events give SREs better visibility into the outage and service disruption scenarios that can occur. This in turn helps teams make the systems more resilient.

How to introduce Chaos into your Systems Engineering discipline?

If you are not already familiar with the basic principles of chaos engineering, https://principlesofchaos.org/ is a good place to start.
How you design chaos for your specific systems and platforms depends on the tools and technologies that you use. That said, if your system runs on public cloud infrastructure like AWS, Google Cloud or Azure, the design principles and the areas in which to introduce controlled failures are likely to be very similar across providers. Some points to consider while introducing chaos engineering into your organization or systems engineering practice are:

  1. Chaos engineering, being a new discipline, needs some mindset shift, developer/SRE training and stakeholder onboarding. Plan for a period of absorption and adoption. Acceptance of controlled failures needs to be encouraged.
  2. As with other software engineering best practices, start small and increase the scope iteratively. Although the end goal should be to induce controlled failures in production systems, always start in development environments and move across environments iteratively with every new failure scenario that is introduced into the system.
  3. Most likely, you will be dealing with a moving target to a large extent. So it is important to define a reasonably steady state and design towards it. Also gather data points over time to learn how the definition of the steady state changes, so that the failure scenarios can be adjusted accordingly. For example, a website that is accessed only in a specific geographic region may need to be expanded for global customers. This would require changes to the current architecture, and so the failure scenarios would also change to a great extent.
  4. Build automation for the chaos experiments. Automation lets you scale to your needs and run experiments more frequently in controlled environments. With ad hoc manual experiments, chances are that the effort will fizzle out over time.
  5. Data points and insights from chaos experiments need to be shared among the stakeholders and used to build a better system with each iteration.
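
To make points 3 and 4 concrete, here is a minimal sketch of an automated experiment loop: check the steady state, inject a failure, check again, and always roll back. Everything in it is hypothetical and for illustration only. `run_experiment`, `ExperimentResult` and `FakeCluster` are made-up names, and the "cluster" is an in-memory stand-in, not a real cloud API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    name: str
    steady_before: bool
    steady_during: bool
    rolled_back: bool


def run_experiment(
    name: str,
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Verify steady state, inject a failure, re-verify, then roll back."""
    if not steady_state():
        # Never inject a failure into a system that is already unhealthy.
        return ExperimentResult(name, False, False, False)
    inject()
    try:
        steady_during = steady_state()
    finally:
        rollback()  # Always restore the system, even if the check raises.
    return ExperimentResult(name, True, steady_during, True)


class FakeCluster:
    """Hypothetical in-memory stand-in for a service with several replicas."""

    def __init__(self, replicas: int = 3, min_healthy: int = 2) -> None:
        self.total = replicas
        self.healthy = replicas
        self.min_healthy = min_healthy

    def is_steady(self) -> bool:
        # Assumed steady-state definition: enough replicas are healthy.
        return self.healthy >= self.min_healthy

    def kill_one(self) -> None:
        self.healthy = max(0, self.healthy - 1)

    def restore(self) -> None:
        self.healthy = self.total


cluster = FakeCluster()
result = run_experiment(
    "kill-one-replica", cluster.is_steady, cluster.kill_one, cluster.restore
)
print(result)
```

In a real setup, the three callables would wrap your monitoring queries and your infrastructure tooling, and the result objects are the data points you would share with stakeholders.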

With these fundamental considerations in mind, you might be wondering where to introduce failure scenarios. There can be numerous areas in your system where you might expect failures to happen. Let’s say your system is on AWS. Some pointers that might be helpful are:

  • EC2 instance gets terminated.
  • EC2 Auto Scaling group not responding to scaling events.
  • S3 API-level failure (see the AWS incident summary at https://aws.amazon.com/message/41926/).
  • Docker container failures.
  • Latency of a service.
  • Slow Lambda function response times.
  • Slow database queries.
  • Database failovers.
  • Failures in distributed data processing jobs.
  • CPU, memory, disk I/O or network hogs.
  • Disk failures.
  • Process/Service failures.
  • Availability Zone becomes unreachable.
  • Regional outage.
  • … and so on.
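
Scenarios such as service latency and process failures can be rehearsed at the application level before touching any infrastructure. The sketch below is one possible way to do that with a decorator that adds delay and random errors to a call. The names `chaos`, `InjectedFault`, `fetch_profile` and `fetch_with_fallback` are all hypothetical, and the failure rate and latency values are arbitrary assumptions.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def chaos(failure_rate=0.0, latency_s=0.0, rng=None, enabled=True):
    """Wrap a function so that calls are delayed and randomly fail."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if enabled:
                if latency_s > 0:
                    time.sleep(latency_s)  # Simulated slow dependency.
                if rng.random() < failure_rate:
                    raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical downstream call; a seeded generator keeps runs repeatable.
@chaos(failure_rate=0.5, latency_s=0.01, rng=random.Random(42))
def fetch_profile(user_id):
    return {"id": user_id, "name": "example"}


def fetch_with_fallback(user_id):
    """Caller under test: it should degrade gracefully, not crash."""
    try:
        return fetch_profile(user_id)
    except InjectedFault:
        return {"id": user_id, "name": None, "degraded": True}


results = [fetch_with_fallback(i) for i in range(20)]
degraded = sum(1 for r in results if r.get("degraded"))
print(f"{degraded} of {len(results)} calls served from the fallback path")
```

The `enabled` flag is there so the injection can be switched off per environment, in line with starting small in development before moving towards production.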

Another factor in deciding where to introduce controlled failures is the availability requirements (SLA, SLO, SLI, RTO, RPO, etc.) of your system.
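
As a rough illustration of how an availability target constrains experiments, the sketch below converts an availability SLO into the downtime it allows over a period. The function name `allowed_downtime_minutes` and the 30-day window are assumptions made for this example, not a standard your organization necessarily uses.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability SLO permits."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


# Compare a few example targets over an assumed 30-day window.
for slo in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

A 99.9% target over 30 days leaves roughly 43 minutes, so an experiment that could plausibly cause a 30-minute outage deserves far more caution there than on a system with a 99% target.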

Concluding thoughts

Chaos engineering is a new area of systems engineering that is yet to be widely adopted. Companies like Netflix, Google, Amazon, Microsoft and Facebook have been pioneers in the area (I am sure I am missing some great names here). In case you are not already aware of it, Netflix’s Simian Army is a popular tool suite in the chaos engineering space; see https://medium.com/netflix-techblog/the-netflix-simian-army-16e57fbab116.
Many companies have started incorporating chaos engineering practices into their systems engineering as they transition to cloud platforms, which have undoubtedly given organizations new capabilities to innovate and explore new ways of operating their systems. Of course, chaos engineering might not be feasible on legacy-heavy enterprise systems. But as systems are migrated to newer cloud-native, serverless or microservices architectures, we can expect much wider adoption, with chaos engineering becoming part of overall systems engineering and SRE practice.

For further reading, I found https://github.com/dastergon/awesome-chaos-engineering to be a very good collection of references to related material.