What’s new in BOSH — Special edition: Adding some Turbulence

As the PM of the BOSH OpenStack CPI I get to play around with BOSH a lot and have fun with its quirks and subtleties. The learning curve is quite interesting, as pointed out by smarter people than me before, so I’d like to compile a few things that hopefully can make your life easier.

This is a post in a series. If you just found this, I recommend also reading some of the previous posts.

Chaos and Turbulence

Netflix gained some fame for “testing in production” with their Simian Army quite a few years ago. If you’re not familiar with the project: in a nutshell, it injects failures into your production system so you can learn about your system’s qualities during business hours and not at 3 a.m. when you get paged. One example is their Chaos Monkey, which “turns off” VMs and services.

In this post, we will add some chaos to our BOSH deployments. To this end, we make use of the turbulence-release. Along the way, we will see some great new BOSH features in action, such as cloud-config/manifest-v2, addons/runtime-config, and links. I’ve uploaded some example configurations and manifests on GitHub.

turbulence-release

The turbulence-release consists of an API server, deployed as a regular BOSH release, and an agent, installed as an addon. The API server talks to the BOSH Director to find out details about the deployed VMs and has a CPI co-located, so that it can shut down VMs directly. By sending a `.json` file to the API server, you can schedule incidents or start them directly. Here are the currently supported incident types:

  • kill
  • kill Process
  • stress
  • firewall
  • control network
  • fill disk
  • shutdown

Configuring the BOSH Director

The turbulence API server only speaks over a secured channel to your Director, so you have to deploy the director with a valid SSL certificate. The documentation has a script to generate a root CA and signed certificates for the director and two UAA endpoints. Just add the turbulence API endpoint with its public IP address at the bottom of the script and execute it. Remove the UAA endpoints — unfortunately the turbulence-release doesn’t work with a Director using the UAA for user-management yet.

#!/bin/bash
set -e
certs=`dirname $0`/certs
rm -rf $certs && mkdir -p $certs
cd $certs
generateCert director 10.244.4.2 # ←- Replace with public Director IP
generateCert turbulence-api 10.244.4.100 # ←- Replace with public turbulence API IP
echo "Finished..."
ls -la .

Then place the contents of `certs/director.crt` into the property `director.ssl.cert` and the contents of `certs/director.key` into `director.ssl.key` in your bosh-init manifest.

Additionally, we set up a separate user for the turbulence-api, called `turbulence`, with a defined password. That user only needs read permissions on the deployments, because terminating VMs is done by the turbulence-api using the CPI directly.

Deploy the Director, and from now on target it by specifying the root CA: `bosh --ca-cert certs/rootCA.pem target 10.244.4.2`. You should now be able to log in with your new user `turbulence` and the password you specified during deployment.

Configuring turbulence-release

There are three important things to configure for the deployment of turbulence-release: connectivity to the Director, a valid SSL certificate for the turbulence agents connecting to the API, and a CPI to interact with your IaaS layer. In this example, we set up turbulence-release on OpenStack. If you use a different IaaS, replace the CPI-specific properties with the ones for your infrastructure.

Take the example manifest from the repository and put the contents of `certs/turbulence-api.crt` into `properties.certificate`, the contents of `certs/rootCA.pem` into `properties.ca_cert`, and the contents of `certs/turbulence-api.key` into `properties.private_key`. To validate the Director’s certificate, also place the contents of `certs/rootCA.pem` into `properties.director.ca_cert`. Be sure to configure `properties.director.username` and `properties.director.password` with the user we created above for the turbulence API.
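Pasting PEM files into a YAML manifest usually means indenting every line so it sits under a block scalar (`certificate: |`). Here is a small sketch of a helper for that. The function name `indent_file` and the default indentation of 6 spaces are my own assumptions, so adjust them to the nesting level in your manifest.

```shell
#!/bin/bash
# indent_file FILE [SPACES]: print FILE with every line indented by SPACES
# spaces (default 6), ready to paste under a YAML block scalar such as
#   certificate: |
# Hypothetical helper, not part of BOSH or turbulence-release.
indent_file() {
  local file="$1" spaces="${2:-6}" pad="" i=0
  # Build the padding string one space at a time (portable across shells).
  while [ "$i" -lt "$spaces" ]; do
    pad="$pad "
    i=$((i + 1))
  done
  sed "s/^/$pad/" "$file"
}

# Example (paths follow the certs/ directory used in this post):
#   indent_file certs/turbulence-api.crt 4
```

The same helper works for `certs/rootCA.pem` and `certs/turbulence-api.key`.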

Use the `bosh-openstack-cpi` release and tell the turbulence API to use it by setting `properties.cpi_job_name` to `openstack_cpi`.

Configuring bosh-openstack-cpi

The turbulence-api uses a co-located CPI (in our case the OpenStack CPI) to shut down VMs for *kill* incidents. The CPI needs configuration very similar to what you put into your bosh-init manifest, so you can most likely copy that. Check the corresponding section in my example manifest.

cloud-config and deploying turbulence-release

If you had a close look at the example manifest we used as a starting point for the turbulence-api, you might have seen that a lot of things are missing from it. There are no resource pool or network definitions; instead, they are just referenced by name. Additionally, `jobs:` are now called `instance_groups:`. This is because the manifest uses schema version 2 and references a cloud-config that is updated separately on the Director.

So before deploying turbulence-api, we should upload a cloud-config first. You can start using my example and modify it to your needs, or create one from scratch using the bosh.io documentation. Just make sure to provide a `network` and `vm_type` with name `default` — those are being used within the turbulence-api deployment manifest.
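Before uploading, a quick sanity check that your cloud-config really contains a `network` and a `vm_type` named `default` can save a failed deploy. The following is only a rough sketch: it is a naive text match, not a YAML parser, and it assumes top-level keys start in column 0 and list items are written as `- name: ...`. The function name `has_default_entry` is made up for this example.

```shell
#!/bin/bash
# has_default_entry FILE SECTION: succeed if the top-level SECTION
# (e.g. networks, vm_types) in FILE has a list item "- name: default".
# Naive text check under the layout assumptions described above.
has_default_entry() {
  awk -v section="$2" '
    # A line starting in column 0 that is not a list item or comment
    # opens a new top-level section.
    /^[^ #-]/ { in_section = ($0 ~ ("^" section ":")) }
    in_section && /^ *- name: default *$/ { found = 1 }
    END { exit found ? 0 : 1 }
  ' "$1"
}

# Example:
#   has_default_entry cloud-config.yml networks || echo "no default network"
#   has_default_entry cloud-config.yml vm_types || echo "no default vm_type"
```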

Don’t worry if other releases you deploy with this Director don’t use the cloud-config yet. With the newest version, BOSH can now host both manifest types on the same Director. Upload your cloud-config with `bosh update cloud-config`.

As of now, turbulence-release hasn’t been published in version 0.5 yet. That means you have to check out the git repository, create a release, and upload it:

$ bosh create release --with-tarball
$ bosh upload release dev_releases/turbulence/turbulence-0.4+dev.1.tgz

Also upload the cpi-release for your infrastructure.

Choose the deployment with `bosh deployment manifests/api.yml` and `bosh deploy` your turbulence-api. Afterwards, you can access the turbulence-api at `https://<turbulence-api-ip>:8080/` using the user and password you specified in the deployment manifest.

Introduction to Addons

Addons are part of the BOSH runtime-config and can be used to co-deploy the same release job on every VM installed by a BOSH Director. Popular options include forwarding logs via syslog, adding users, or customizing the login banner. In this case, we will install the agent of the turbulence-release, which allows us to inject certain failures into a running VM.

Updating runtime-config and configuring the turbulence agent

To have the turbulence agent installed on each VM deployed by BOSH, we need to update the runtime-config with turbulence as an addon. Take the example runtime-config from the turbulence-release repository as a starting point.

Use the turbulence-release version you used in the deployment manifest above for the turbulence-api.

The turbulence agent consumes its configuration automatically from the turbulence-release via links. This way, the addon takes its configuration (e.g. username, password, and certificates) from the deployed release, and you don’t have to change it manually in two places.

So how do the properties from the turbulence-api get into the turbulence agent’s templates? The job’s template for the configuration file has the statement `link("api")`, referring to the `consumes` statement in the runtime-config. A few things are automagically available for users of a link, such as the IP address. If you’re like me and your turbulence-api manifest assigns a floating IP in addition to the private IP, you need to specify the network name in the `consumes` statement, so BOSH knows which IP you are linking to.

Upload your new runtime-config with `bosh update runtime-config`. All following deployments will make use of this configuration.
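To make this concrete, here is a sketch of a helper that writes such a runtime-config to a file. The `addons` part mirrors what the Director prints during `bosh deploy` later in this post; the `releases` block, the function name `write_runtime_config`, and the file name are my assumptions, and the example in the turbulence-release repository may look different. Drop the `network: vip` line if your API has only one IP.

```shell
#!/bin/bash
# write_runtime_config PATH VERSION: write a runtime-config that installs the
# turbulence agent as an addon and links it to the "api" job of the
# "turbulence" deployment. Hypothetical helper for illustration only.
write_runtime_config() {
  local path="$1" version="$2"
  cat > "$path" <<EOF
releases:
- name: turbulence
  version: "$version"

addons:
- name: turbulence_agent
  jobs:
  - name: turbulence_agent
    release: turbulence
    consumes:
      api:
        from: api
        deployment: turbulence
        network: vip
EOF
}

# Example (version taken from the dev release created earlier):
#   write_runtime_config runtime-config.yml "0.4+dev.1"
#   bosh update runtime-config runtime-config.yml
```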

Deploy something

Deploy a release of your choice and see how the addon also gets installed during `bosh deploy`. In my case, I use a simple dummy-release.

$ bosh deploy
Detecting deployment changes
----------------------------
(…)
jobs:
- name: dummy_z1
  template: dummy
  instances: 2
(…)
addons:
- name: turbulence_agent
  jobs:
  - name: turbulence_agent
    release: turbulence
    consumes:
      api:
        from: api
        deployment: turbulence
        network: vip

The Director gives us a hint that we’re using a `networks` section in the manifest, and are therefore not using `cloud-config`:

Deprecation: Ignoring cloud config. Manifest contains 'networks' section.

This shows that we can now deploy manifests that make use of cloud-config on the same Director as our existing deployments, which don’t use a cloud-config yet.

Schedule some incidents

Now we are ready to schedule at least two types of incidents:

  • a kill incident, which uses the CPI deployed with the turbulence-api
  • a stress incident, which uses the turbulence agents deployed on the VM to introduce additional load

Send a kill incident for all instance_groups ending in ‘z1’, limiting to 1 instance per instance_group.

$ cat kill_incident.json
{
"Tasks": [{
"Type": "kill"
}],
"Deployments": [{
"Name": "dummy",
"Jobs": [{
"Name": "*_z1",
"Limit": "1"
}]
}]
}
$ curl --cacert certs/rootCA.pem https://turbulence:p@172.18.x.y:8080/api/v1/incidents -H 'Accept: application/json' -d@kill_incident.json

The Web UI on the turbulence-api node tells us that the incident has been accepted and successfully executed.

Send a 30-second stress incident for all instance_groups ending in ‘z1’, limiting to 1 instance per instance_group and using 1 process for CPU stress and 1 process for IO stress.

$ cat stress_incident.json
{
"Tasks": [{
"Type": "stress",
"Timeout": "30s",
"NumCPUWorkers": 1,
"NumIOWorkers": 1
}],
"Deployments": [{
"Name": "dummy",
"Jobs": [{
"Name": "*_z1",
"Limit": "1"
}]
}]
}
$ curl --cacert certs/rootCA.pem -X POST https://turbulence:p@172.18.x.y:8080/api/v1/incidents -H 'Accept: application/json' -d@stress_incident.json
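If you send incidents often, writing the JSON files by hand gets tedious. Below is a minimal sketch of a shell function that builds the same payload shape as the kill and stress examples in this post. The function name `incident_json` and its argument order are my own invention; only the JSON keys come from the examples. It does no escaping, so keep the arguments free of quotes.

```shell
#!/bin/bash
# incident_json TYPE DEPLOYMENT JOB_PATTERN LIMIT [EXTRA_TASK_FIELDS]
# Print a one-line incident payload. EXTRA_TASK_FIELDS is an optional raw
# JSON fragment appended to the task, e.g. '"Timeout":"30s"'.
# Hypothetical helper; values are inserted verbatim without escaping.
incident_json() {
  local type="$1" deployment="$2" jobs="$3" limit="$4" extra="${5:-}"
  printf '{"Tasks":[{"Type":"%s"%s}],"Deployments":[{"Name":"%s","Jobs":[{"Name":"%s","Limit":"%s"}]}]}\n' \
    "$type" "${extra:+,$extra}" "$deployment" "$jobs" "$limit"
}

# Example: pipe the payload straight into curl, which reads it from stdin
# when given -d@- (URL and credentials as in the examples in this post):
#   incident_json stress dummy '*_z1' 1 '"Timeout":"30s","NumCPUWorkers":1,"NumIOWorkers":1' \
#     | curl --cacert certs/rootCA.pem -X POST \
#         https://turbulence:p@172.18.x.y:8080/api/v1/incidents \
#         -H 'Accept: application/json' -d@-
```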

And we can check that something is actually happening:

$ bosh instances --vitals
+----------------------+-----+--------------------+
| Instance             | ... | CPU %              |
|                      |     | (User, Sys, Wait)  |
+----------------------+-----+--------------------+
| dummy_z1/0 (d8b...)* | ... | 45.4%, 54.5%, 0.0% |
| dummy_z1/1 (0b4...)* | ... | 0.0%, 0.0%, 0.1%   |
+----------------------+-----+--------------------+
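With more than a handful of instances, spotting the stressed one by eye gets harder. Here is a rough sketch of a filter that reads the vitals table and prints the instances whose user plus sys CPU is above a threshold. The function name `busy_instances` is hypothetical, and the parsing assumes the `|`-separated table layout shown above, with the CPU column holding three percentages.

```shell
#!/bin/bash
# busy_instances [THRESHOLD]: read `bosh instances --vitals` output on stdin
# and print the name of every instance whose user + sys CPU exceeds
# THRESHOLD percent (default 50). Hypothetical helper, not part of the BOSH CLI.
busy_instances() {
  local threshold="${1:-50}"
  awk -F'|' -v threshold="$threshold" '
    {
      for (i = 1; i <= NF; i++) {
        # The CPU column is the one holding three percentages: user, sys, wait.
        if ($i ~ /^ *[0-9.]+%, *[0-9.]+%, *[0-9.]+% *$/) {
          cpu = $i
          gsub(/[% ]/, "", cpu)
          split(cpu, parts, ",")
          # The instance name sits in the first non-empty column.
          name = ($1 ~ /^ *$/) ? $2 : $1
          sub(/^ +/, "", name)
          sub(/ .*$/, "", name)   # keep "job/index", drop "(uuid)*"
          if (parts[1] + parts[2] > threshold) print name
        }
      }
    }'
}

# Example:
#   bosh instances --vitals | busy_instances 80
```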

Now go ahead and inject some failures!

Additional reading
