Is your cloud application resilient?

Shailesh Hegde
Tech Talk & Travel Tales
Mar 12, 2020

Testing a distributed system such as a cloud application poses its own set of challenges. Interactions between different parts of the system need to be monitored and validated for correct functionality. The challenge grows when the system must be tested for resiliency, i.e., for exhibiting correct behavior under faulty conditions. According to one definition [1], a reliable distributed system has the following properties:

  • Fault-Tolerant: It can recover from component failures without performing incorrect actions.
  • Recoverable: Failed components can restart themselves and rejoin the system, after the cause of failure has been repaired.
  • Highly Available: It can restore operations, permitting it to resume providing services even when some components have failed.

Failures are inevitable

Designers of distributed systems tend to make assumptions that do not hold in practice. These are so well known in the field that they are commonly referred to as the “8 Fallacies” of distributed computing:

  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn’t change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous

In distributed systems, failures happen all the time. They may be hardware or software failures, hard or partial, and so on. Hardware failures may occur due to overheating chips, failing disk drives, mechanical issues, etc. Software failures, on the other hand, are caused by bugs in the code of the various components of a distributed system.

A detailed list of possible failure types is given below:

  • Halting failures: A component simply stops. There is no way to detect the failure except by timeout: it either stops sending “I’m alive” (heartbeat) messages or fails to respond to requests. Your computer freezing is a halting failure.
  • Fail-stop: A halting failure with some kind of notification to other components. A network file server telling its clients it is about to go down is a fail-stop.
  • Omission failures: Failure to send/receive messages primarily due to lack of buffering space, which causes a message to be discarded with no notification to either the sender or receiver. This can happen when routers become overloaded.
  • Network failures: A network link breaks.
  • Network partition failure: A network fragments into two or more disjoint sub-networks within which messages can be sent, but between which messages are lost. This can occur due to a network failure.
  • Timing failures: A temporal property of the system is violated. For example, clocks on different computers that are used to coordinate processes are not synchronized, or a message is delayed longer than a threshold period.
  • Byzantine failures: This captures several types of faulty behaviors including data corruption or loss, failures caused by malicious programs, etc.
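
Since a halting failure can only be detected by timeout, a heartbeat monitor is the usual building block. The following is a minimal Python sketch of that idea, not part of Goblin; the class name `HeartbeatMonitor`, the component names, and the 3-second timeout are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Flags components whose last heartbeat is older than a timeout."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be exercised without sleeping
        self.last_seen: Dict[str, float] = {}

    def beat(self, component: str) -> None:
        """Record an "I'm alive" message from a component."""
        self.last_seen[component] = self.clock()

    def suspected(self) -> List[str]:
        """Return components that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(c for c, t in self.last_seen.items() if now - t > self.timeout_s)


# Usage with a fake clock (hypothetical component names).
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: fake_now[0])
monitor.beat("zk-1")
monitor.beat("zk-2")
fake_now[0] = 2.0
monitor.beat("zk-2")         # zk-2 keeps reporting, zk-1 has gone quiet
fake_now[0] = 4.0
print(monitor.suspected())   # ['zk-1']
```

Note that a timeout alone cannot tell a halted component from a slow one or from a broken link, which is why the monitor reports components as suspected rather than failed.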

Resilience

the capacity to recover quickly from difficulties; toughness.

Any of these failures may occur in the system, and one must have solutions in place to deal with them. We at BlueJeans Network have developed an automated resiliency test framework called Goblin that helps track the state of resiliency, release over release.

Goblin

The testing challenge is to determine whether the system has in fact been hardened against these failures. Goblin helps simulate the failures and understand their impact, including any bugs they expose. To do this, the following steps are necessary:

  • Load the system under test with the required functional/load test
      • This varies from system to system, so Goblin leaves this to the user to hook into
  • Introduce failure(s) in the system reliably
      • Use Goblin to stop services, create network errors, etc.
  • Capture system behavior under the fault condition
      • This varies from system to system, so Goblin leaves this to the user to hook into
  • Compare observed behavior with expected behavior
      • This varies from system to system, so Goblin leaves this to the user to hook into
  • Recover the system to its normal state
  • Publish results for report generation
      • Goblin provides hooks to generate JUnit-like reports
  • Continuous integration
      • Goblin can be configured to be executed from Jenkins and run as CI/CD or as a nightly job
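
The steps above can be sketched as a small test driver with user-supplied hooks. This is an illustrative Python sketch under stated assumptions, not Goblin's actual code (Goblin is written in Ruby); the names `ResiliencyCase`, `CaseResult`, and `run_case` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ResiliencyCase:
    """One resiliency test; every callable is a hook supplied by the user."""
    name: str
    apply_load: Callable[[], None]    # drive functional/load traffic
    inject_fault: Callable[[], None]  # e.g. stop a service, break a link
    observe: Callable[[], Any]        # capture behavior under the fault
    expected: Any                     # behavior the system should exhibit
    recover: Callable[[], None]       # restore the system to normal


@dataclass
class CaseResult:
    name: str
    passed: bool
    observed: Any = None
    error: str = ""


def run_case(case: ResiliencyCase) -> CaseResult:
    """Load, inject, observe, compare, and always recover."""
    try:
        case.apply_load()
        case.inject_fault()
        observed = case.observe()
        return CaseResult(case.name, observed == case.expected, observed)
    except Exception as exc:
        # A hook that raises counts as a failed case, not a crashed run.
        return CaseResult(case.name, False, error=repr(exc))
    finally:
        case.recover()  # runs even if a hook raised, so later cases start clean
```

Putting recovery in `finally` is a deliberate choice in this sketch: a failed validation should not leave the fault in place for the next test case.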

What is Goblin?

Goblin is a JUnit-style test framework written in Ruby on top of the test-unit harness. It uses an SSH gem to execute commands on the various nodes of the system under test, and it takes care of collecting and reporting test results. We used it at BlueJeans Network to great effect.
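
To make the idea of running fault commands on remote nodes concrete, here is a rough Python sketch that shells out to the OpenSSH client. It is not how Goblin does it (Goblin uses a Ruby gem); the function names, the `tester` user, the host name, and the `pkill` command are assumptions for illustration only.

```python
import shlex
import subprocess
from typing import Callable, List, Sequence


def ssh_argv(host: str, remote_cmd: Sequence[str], user: str = "tester") -> List[str]:
    """Build the argv for running remote_cmd on host via the ssh client."""
    return [
        "ssh",
        "-o", "BatchMode=yes",       # fail instead of prompting for a password
        "-o", "ConnectTimeout=5",    # do not hang on an unreachable node
        f"{user}@{host}",
        shlex.join(remote_cmd),      # quote arguments for the remote shell
    ]


def run_on_node(host: str, remote_cmd: Sequence[str], user: str = "tester",
                runner: Callable = subprocess.run):
    """Run remote_cmd on host; runner is injectable for dry runs and tests."""
    return runner(ssh_argv(host, remote_cmd, user),
                  capture_output=True, text=True, timeout=30)


# Hypothetical fault injection: kill a process on one node.
print(ssh_argv("node1.example.com", ["pkill", "-f", "zk"]))
```

Keeping command construction separate from execution makes it possible to log or review exactly what a test will run before pointing it at a real environment.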

Salient Features

  • Provides an automated test framework for comparing different releases of the system
  • Compares resiliency standards across different builds installed on the system
  • For live testing, a group chat application (HipChat, for example, is supported) may be integrated so that test steps and results are immediately visible to interested parties
  • Abstracts the chat application settings so that any API-based chat application (for example, Slack, Webex, or Gtalk) may be integrated
  • Is a test framework that runs against any environment
  • Executes predefined test cases one by one
  • A test case:
      • Simulates customer usage with some system load
      • Causes a set of predefined failures
      • Validates the impact of the failure
      • Recovers the system from failure
      • Reports the results (JUnit style)
  • Is used for regression testing (comparison with the previous release)
  • Also supports test bashes with real participants

What does Goblin do?

  • Runs like functional tests:
      • Start test
      • Cause the predefined failure (say, zk process kill)
      • Validate result
      • Recover the system from failure
      • Report the results (JUnit style)
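
As an illustration of the final reporting step, the sketch below writes results in the common JUnit XML layout (`testsuite`, `testcase`, and `failure` elements) that CI tools such as Jenkins typically consume. It is a Python approximation, not Goblin's reporter; the function name `junit_xml` and the sample test names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def junit_xml(suite_name: str, results: Iterable[Tuple[str, bool, str]]) -> str:
    """Render (test name, passed, failure message) tuples as JUnit-style XML."""
    rows = list(results)
    failures = sum(1 for _, passed, _ in rows if not passed)
    suite = ET.Element("testsuite", name=suite_name,
                       tests=str(len(rows)), failures=str(failures))
    for name, passed, message in rows:
        case = ET.SubElement(suite, "testcase", classname=suite_name, name=name)
        if not passed:
            # A failed case carries a <failure> child; a passing case has none.
            failure = ET.SubElement(case, "failure", message=message)
            failure.text = message
    return ET.tostring(suite, encoding="unicode")


# Hypothetical results from two resiliency test cases.
print(junit_xml("resiliency", [
    ("zk_process_kill", True, ""),
    ("network_partition", False, "service did not recover within the expected time"),
]))
```

Because the output has the same shape for every build, reports from two releases can be compared test by test, which is what makes release-over-release regression tracking practical.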

Summary

Keeping a cloud application running at all times is a challenge, and simulating real-world failure conditions is of paramount importance. While Goblin proved useful at BlueJeans Network, newer companies such as Gremlin are helping do this in a scalable manner across different workloads and conditions.
