Chaos Engineering comes to Ruby

Laszlo Papp
Mar 31, 2020

In a microservice environment, a service outage usually comes as a surprise, and when it happens it is stressful for engineers, managers, and clients. If you want to prepare your Ruby application for the next surprise, this post is for you. I will show you how to simulate a connection failure with a timeout in your Ruby application and check whether it can handle a service outage and recover from it.

Chaos Engineering is the discipline of experimenting on software by injecting harmful behaviors into it, to prepare the system and the engineering team for unknown situations. The idea is similar to flu prevention: doctors inject a weakened form of the flu virus into the human body so that its defense mechanism can learn and practice the recovery before the real virus arrives. In a multi-service environment, the viruses are network blips, network latency, memory and CPU spikes, and other unlikely situations. Injecting these sorts of issues into a production environment in a controlled way helps you set up and test backup plans, minimize the downtime of your service, and put less stress on you in the future.

Flu Shot

Flu Shot is an open-source Ruby gem that allows you to inject harmful behaviors into your application and control them externally, like a control panel for a train layout. Using Flu Shot, you can simulate unusual events in your production environment in a controlled way and test your app’s resiliency. You can find the project on GitHub.

First, install the gem with gem install flu_shot, or add it to the Gemfile of your application and run bundle install.

# Gemfile
gem 'flu_shot'

FluShot.inject defines a point in the execution flow where harmful behaviors can be added later. The following example adds the user_controller_show injection point to the UsersController#show method, right before the user is fetched from the Users Service. It does nothing at first; we will configure it later.

class UsersController < ApplicationController
  def show
    FluShot.inject(:user_controller_show) # injection point
    user = UsersService.find(params[:user_id])
    # ...
  end
end

We will inject vaccines, that is, weak harmful behaviors, into the system. FluShot::Vaccine subclasses define the behaviors that can be executed at the injection points. Every vaccine needs to be labeled with label :label_name. The behavior of the vaccine goes into the constructor, and additional parameters can be passed in a hash argument.

In my example, the Latency vaccine sleeps for a random length of time in the [min..max] range, simulating a slow service. The min and max values of the range must be passed in the params hash; because they go straight to sleep, they are in seconds.

class Latency < FluShot::Vaccine
  label :latency
  def initialize(params = {})
    sleep(rand(params[:max] - params[:min]) + params[:min])
  end
end
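The range arithmetic in this vaccine can be checked in isolation. The following is a minimal sketch and not part of Flu Shot: random_delay is a hypothetical helper that uses the range form of Ruby's rand instead of rand(max - min) + min. With integer arguments, rand(n) returns an integer below n, so the original expression never reaches max; converting the bounds to floats gives a delay anywhere in the range.

```ruby
# Hypothetical helper, not part of the Flu Shot gem.
# Returns a random delay in seconds between min and max.
# With float bounds, Kernel#rand returns a float inside the range.
def random_delay(min, max)
  raise ArgumentError, 'min must not exceed max' if min > max

  rand(min.to_f..max.to_f)
end

delay = random_delay(1, 3)
puts format('would sleep for %.2f seconds', delay)
```

A vaccine could then call sleep(random_delay(params[:min], params[:max])) in its constructor.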

You can also raise exceptions from a vaccine by using FluShot::Sneeze, which encapsulates an exception. Flu Shot catches every exception raised in a vaccine to make sure a faulty vaccine does not abort your app, so an exception that you raise on purpose needs to be wrapped in a Sneeze object. The following vaccine raises a Faraday::Error::ConnectionFailed exception.

class ConnectionError < FluShot::Vaccine
  label :connection_error
  def initialize
    raise FluShot::Sneeze.new(
      Faraday::Error::ConnectionFailed.new('Connection Failed')
    )
  end
end
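To make the role of the wrapper concrete, here is a minimal sketch of the pattern described above: swallow every error except the ones that were wrapped on purpose. It is not Flu Shot's source code. ToySneeze and run_vaccine are hypothetical stand-ins, IOError stands in for the Faraday exception, and the gem's internals may differ.

```ruby
# Hypothetical stand-in for FluShot::Sneeze: carries the exception
# that the vaccine wants the application to see.
class ToySneeze < StandardError
  attr_reader :wrapped_exception

  def initialize(wrapped_exception)
    @wrapped_exception = wrapped_exception
    super(wrapped_exception.message)
  end
end

# Hypothetical runner: re-raises the wrapped exception, and swallows
# any other error so a buggy vaccine cannot abort the app.
def run_vaccine
  yield
rescue ToySneeze => e
  raise e.wrapped_exception
rescue StandardError
  nil
end

# An accidental error inside a vaccine is swallowed.
run_vaccine { raise 'bug inside a vaccine' }

# A wrapped exception reaches the application.
begin
  run_vaccine { raise ToySneeze.new(IOError.new('Connection Failed')) }
rescue IOError => e
  puts "application sees: #{e.message}"
end
```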

FluShot::Prescription associates vaccines with injection points. Its .for method takes the injection point and yields a prescription object to the block. Calling .add on this object specifies which vaccine to execute and with what parameters.

The next example simulates a Faraday connection failure with a random timeout in UsersController#show. The prescription is written for the user_controller_show injection point that we defined earlier. Using the prescription object, we add a latency vaccine with a random delay of [1..3] seconds, followed by the connection_error vaccine, which raises a Faraday::Error::ConnectionFailed exception.

FluShot::Prescription.for(:user_controller_show) do |prescription|
  # min and max are in seconds: the Latency vaccine passes them to sleep
  prescription.add(:latency, {min: 1, max: 3})
  prescription.add(:connection_error)
end

Now, if you call the UsersController#show method, it raises a connection failure error, simulating an unreachable Users service. If you have test accounts, you can add filters to execute the vaccine only for those accounts, so your clients are not affected at all.
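One way to restrict an experiment to test accounts is to guard the injection point itself. The sketch below assumes a hypothetical allowlist, TEST_ACCOUNT_IDS, and a hypothetical helper, chaos_allowed?; neither is part of Flu Shot, and the gem may offer its own filtering mechanism.

```ruby
require 'set'

# Hypothetical allowlist of accounts that may receive injected failures.
TEST_ACCOUNT_IDS = Set[101, 102].freeze

# Hypothetical guard: true only for allowlisted test accounts.
def chaos_allowed?(account_id)
  TEST_ACCOUNT_IDS.include?(account_id)
end

# In the controller, the guard would wrap the injection point, e.g.
# (current_account_id is also a hypothetical name):
#   FluShot.inject(:user_controller_show) if chaos_allowed?(current_account_id)
puts chaos_allowed?(101) # test account
puts chaos_allowed?(555) # regular client
```

Keeping the check outside the vaccine means regular requests never reach the injection point.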

Leaving the block empty will reset the prescription to no-op.

FluShot::Prescription.for(:user_controller_show) do |prescription|
end

By default, Flu Shot stores the prescriptions in memory, which works fine for single-process applications. Multi-process apps need shared storage so that every process sees the same prescriptions; therefore, a service such as Memcached or Redis is necessary. If you use Redis, you can pass your Redis connection instance to FluShot::Config.storage by wrapping it in FluShot::Storage::Redis. If you put the following lines into your initializers, new prescriptions will be applied automatically in each process of your app.

require 'redis'
FluShot::Config.storage = FluShot::Storage::Redis.new(Redis.new)

If you use the Redis storage, you can control your application directly from a Ruby console by writing prescriptions there. Otherwise, you need to prepare your application to receive commands over HTTP and define new prescriptions there.

Contribution

Flu Shot is looking for contributors to write vaccines for popular third-party libraries like Faraday and Rails, and to create a “pharmacy” for the vaccines. The project is in an early phase, so your ideas and recommendations are welcome. If you like the project but don’t know how to contribute, please star it on GitHub.

You can reach me with questions on Twitter or leave a comment below this post.

Zendesk Engineering
