Creating Reliability Through Chaos With Azure VMs and Gremlin

Published in Microsoft Azure · 6 min read · Oct 8, 2018

Sometime in the early 2000s I worked for a company that provided datacenter co-location and managed server hosting. Before midnight on a Saturday, a fairly violent storm knocked out the local utility power to the datacenter, affecting the 3,000 servers hosted there. Whenever this happens, an automatic transfer switch (ATS) is supposed to recognize that power is no longer flowing to the datacenter, then immediately start the generator and switch over to it. That didn’t happen. All 3,000 servers lost power, and almost 60% of them required manual intervention to recover. That meant running fsck, chkdsk and many other recovery methods for almost the entire night. Some servers simply never came back online and needed full restores. Hours of work were spent getting our customers’ servers back up and running.

The postmortem of this outage was absolutely brutal. We learned that no one had tested the power transfer process before the datacenter went online. Because of that, no one had considered what the impact on the datacenter would be if the ATS were not functional. Thousands of dollars were lost in customer credits, as was the faith our customers had in us to provide the uptime they depended on. This was really my first introduction to what is now called “Chaos Engineering”: the idea of deliberately destroying portions of your infrastructure and observing the result.

When building distributed systems, we must always consider that failure is almost certain. One of the better speakers on cloud computing, Werner Vogels, likes to say, “Everything fails all the time,” and he’s right. The constant potential for failure is something that is almost built into the cloud now. You may be building apps in multiple datacenters across several regions. You may be putting faith in a third party, trusting that you’ve picked the right provider and the proper configuration so that, no matter what, your app will survive an outage.

Introducing Chaos Engineering

The idea of “Chaos Engineering” isn’t just about putting faith in a provider to stay online; it’s about finding ways to simulate failure in order to determine whether your application will withstand an outage of any kind. This means that if a number of your app servers take on a large portion of traffic and are heavily CPU taxed, you’ll know how to properly scale your application to withstand it. If portions of your application infrastructure were to suffer a massive amount of packet loss, how would your team respond?

Chaos engineering helps answer some of these questions by allowing you to simulate what a failure may look like in your production environment. For some, tools like Chaos Monkey have helped produce load and service failures to create attack simulations. Lately I have been working with Gremlin, which acts as “Chaos-as-a-Service” through a simple client-server model.

In this tutorial, I will provide you with a short demo of how to install the Gremlin agent on a production cluster and then create an attack using the web portal. In this case I will be using a set of three Azure VMs running Ubuntu Linux and MongoDB.

Getting Started

Getting started requires you to sign up for a Gremlin demo account on the Gremlin website. You can also sign up for a free Azure account with $200 in trial credits.

To configure our three-node replica set within MongoDB, check out the Azure docs and the MongoDB docs. Once your Azure VMs and your database are configured, you can create some dummy data to help generate load on the servers. I like doing something like this:

dbcluster01:PRIMARY> use demo
switched to db demo
dbcluster01:PRIMARY> for (var i = 1; i <= 250000000; i++) {
... db.testData.insert( { x : i } )
... }

By doing this I’ve created a loop that generates documents on the primary, and these will be replicated over to our secondary servers. Now I can run top and take a look at the impact.
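If you’d rather capture the same signal without an interactive top session, a small shell sketch like the one below can help. It assumes a Linux host that exposes /proc/loadavg; the function name load_summary and its optional arguments are my own invention for illustration and are not part of MongoDB, Azure or Gremlin.

```shell
# Hypothetical helper: print the 1-, 5- and 15-minute load averages
# alongside the number of CPU cores.
# Both arguments are optional and exist mainly so the function can be
# exercised with sample data:
#   $1 - file in /proc/loadavg format (default: /proc/loadavg)
#   $2 - core count (default: output of nproc)
load_summary() {
  loadavg_file="${1:-/proc/loadavg}"
  cores="${2:-$(nproc)}"
  # The file looks like: "0.52 0.41 0.30 2/345 6789"
  read -r one five fifteen _rest < "$loadavg_file"
  printf 'load1=%s load5=%s load15=%s cores=%s\n' "$one" "$five" "$fifteen" "$cores"
}
```

Running load_summary with no arguments on each node before and after generating the dummy data gives a quick before/after comparison.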

Installing Gremlin

Now I am going to install Gremlin so I can begin creating attacks (follow the tutorial in the Gremlin docs for more detail; it’s pretty straightforward).

SSH into the server, configure the deb repository and install the GPG key
Run a Gremlin syscheck
Register the nodes with the Gremlin service
Add tags to name your nodes

After I install the Gremlin app, I will register the node and tag it. I’ll use the following across the nodes with the appropriate name:

gremlin init \
  --tag service=api \
  --tag service-version=1.0.0 \
  --tag service-type=http \
  --tag tag_name1=`echo $HOSTNAME`
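Since the same command has to be run on every node with a different name, it can help to wrap it. The sketch below only prints the command so it can be reviewed first. The function name print_gremlin_init and the choice to pass the node name as an argument are assumptions of mine, while the tags simply mirror the ones used above.

```shell
# Hypothetical wrapper: print the `gremlin init` command for a given node name.
# It prints rather than executes, so nothing is registered by accident.
#   $1 - node name for the tag_name1 tag
#        (default: $HOSTNAME if set, otherwise the output of `uname -n`)
print_gremlin_init() {
  node_name="${1:-${HOSTNAME:-$(uname -n)}}"
  printf 'gremlin init --tag service=api --tag service-version=1.0.0 --tag service-type=http --tag tag_name1=%s\n' "$node_name"
}
```

Calling print_gremlin_init with no argument on a node fills in that node’s hostname; once the output looks right, run the printed command on that node.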

You will be prompted to provide your Gremlin Team ID and Secret Key. These are found in the “Company Settings → Team” section of your Gremlin control panel.

My nodes are now added and active in Gremlin, and various details about them are available as tags in the Gremlin control panel. Now it’s time to create an attack!

Creating an attack

To create an attack on one of our nodes, go to the “Attacks” section of the Gremlin control panel and click the green “New Attack” button. You’ll be shown the hosts on which you’ve installed the agent.

I am going to pick the primary server, 10.1.1.4, to execute my attack on, so I am going to click the checkbox next to it.

Now I can choose a “Gremlin,” that is, the specific attack I will unleash on this server. In this case I will perform a “Resource” based attack on CPU. This attack requires me to specify the length of the attack in seconds and the number of CPU cores to hog. In this case, I will set the length of the attack to 300 seconds and have it use four CPU cores (the total available on the B4ms VM size).
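To get a feel for what a CPU attack with these two parameters does to a host, here is a rough stand-in written in plain shell. This is not how Gremlin implements its attack; it is only an illustration of “hog N cores for T seconds.” The function name cpu_hog is made up, and the sketch assumes the coreutils timeout command is available.

```shell
# Illustrative only (NOT Gremlin's implementation): keep a number of CPU
# cores busy for a fixed number of seconds.
#   $1 - duration in seconds
#   $2 - number of busy-loop workers (roughly one per core to hog)
cpu_hog() {
  duration="$1"
  workers="$2"
  i=0
  while [ "$i" -lt "$workers" ]; do
    # Each worker spins in an empty loop until timeout stops it.
    timeout "$duration" sh -c 'while :; do :; done' &
    i=$((i + 1))
  done
  # Block until every worker has been stopped by its timeout.
  wait
}
```

Something like cpu_hog 300 4 would roughly mirror the settings above, though it has none of the scheduling or reporting that the control panel provides, so try it on a throwaway VM with a short duration first.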

When I am ready to run, I can schedule the attack for a later time or execute it now:

I will start the attack now and click “Unleash Gremlin.”

The attack is now active! The load average on our primary node has jumped through the roof.
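“Through the roof” can be made a little more concrete by comparing the 1-minute load average with the number of cores. The check below is a sketch under that rule of thumb. The function name and the threshold are my assumptions, and it leans on awk because POSIX shell arithmetic has no floating point.

```shell
# Hypothetical check: succeed (exit status 0) when the given load average
# is greater than the given core count, fail (exit status 1) otherwise.
#   $1 - load average, e.g. the first field of /proc/loadavg
#   $2 - number of CPU cores
load_exceeds_cores() {
  awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load + 0 > cores + 0) }'
}

# Example use on a Linux node while an attack is running:
#   load_exceeds_cores "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" \
#     && echo "1-minute load is above the core count"
```

A load above the core count means more runnable work is queued than the CPUs can serve at once, which is what a CPU attack on all four cores of a busy database node should produce.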

I can also re-run old attacks I have created. In this case I am also running a 60-second memory attack.

Once the attacks are completed, I can review a team report on the progress of the attack I have executed and the information on the total number of active, revoked and suspended attacks:

Conclusion

That’s it. You’re now able to use Gremlin on the Azure cloud. If you’re looking for more complicated application-level attacks, check out Gremlin’s new ALFI attacks, which focus on targeting specific parts of your application to understand how your system and its operators respond.

Originally published at jaydestro.org on October 8, 2018.