You Don’t Need $1MM for a Distributed System

Daniel Ellis
Feb 17, 2015 · 9 min read

Setting up containers, load balancing, and service discovery on light hardware

The biggest problem in software is a human one.

If programming were just a matter of issuing commands once a day down from on high, we could write our fancy new app in machine language, order a pizza, and go back home to binge out on Netflix. Unfortunately, in the real world, someone is going to have to come back to maintain that mess of code, and they’re going to be pissed when they find out you’re watching House of Cards for the third time.

The real problem we deal with, then, isn’t whether or not the machine is capable of doing what we want—it is. The real problem is that we have terrible memories and poor attention to detail. Our biggest challenge is still how to stop ourselves from screwing everything up.

If this is true for software, it’s doubly true for distributed systems. The biggest challenge in a distributed system for the average jo(e) won’t be a Byzantine Generals problem—it’ll be whether or not they can even find the damn computer that request X was processed on for Y date and oh did you remember that box Z was taken out for maintenance because we ran out of letters in the alphabet and we only allocated one character for the names?

But wait!

I hear you say.

I can’t really learn those skills without having a lot of hardware!
It’s a catch-22!

First, stop talking so loudly. You’re taking up more space than I am. Second, it’s simply not true.

Sure, experience with Big Fat Boxes can teach those skills, but only in the same way owning Jimi Hendrix’s 1.3 million dollar Stratocaster can make you a great guitarist. You might start playing 64th note solos, but more than likely you’re just going to make the same mistakes you would have made with a $100 guitar (and who are we kidding, the thing will be so expensive you’ll probably just be too afraid to touch it).

My point: once you get over a handful of anything, you have to start thinking differently. Whether it’s 10 or 100 or 1000 servers — a large number is a large number. If you can’t coordinate 10 servers and handle the problems that come with them, what makes you think you’ll be any better when a few thousand are involved?

I say this not to discourage. My point is that you can still build some of the most important skills on commodity hardware — skills such as managing complexity across systems and handling it gracefully when a node inevitably fails.

Where We’re Going

My aim with this piece is to begin to show the possibilities that are available to you today. All of the below was done in a single afternoon, on a $20/month Digital Ocean instance (i.e. about $0.50 worth of computing time).

We’ll go through creating containers with Docker, setting up service discovery with Consul, and dynamically balancing the load with Nginx. We’ll start with creating a simple web application to use as our example.

Making the Application

For our application, let’s do something simple: create an HTTP server that does some work lasting a few seconds. The first thing to do (after installing Docker) is to launch a container. I’m working on Ubuntu, so I’m using that as my base:

sudo docker pull ubuntu
sudo docker run -i -t ubuntu /bin/bash

This should drop you into a fresh root shell in your new container. Now, let’s install Python:

apt-get update
apt-get install -y vim python python-dev python-distribute python-pip

As well as Flask, for a simple server:

pip install flask

Next, let’s create our Flask application, making it calculate a prime number and return the result (I’m aiming for ~5 seconds of processing time here):
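The gist embed didn’t survive here, so below is a sketch of what /root/work.py plausibly looks like. The search range and the deliberately naive primality test are my assumptions, tuned so a request burns a few seconds of CPU (it’s Python 2, to match the apt packages above):

# /root/work.py (sketch; the original embed is missing)
import random
from flask import Flask

app = Flask(__name__)

def is_prime(n):
    # Deliberately naive: trial-divide all the way up to n, so
    # confirming one 8-digit prime takes a few seconds of CPU.
    return n > 1 and all(n % k for k in xrange(2, n))

@app.route("/")
def work():
    # Pick a random 8-digit number and walk up to the next prime.
    # Composites are rejected quickly; the final check is the slow part.
    n = random.randint(10**7, 10**8)
    while not is_prime(n):
        n += 1
    return "%d\n" % n

if __name__ == "__main__":
    # Bind to all interfaces so the host can reach the container.
    app.run(host="0.0.0.0", port=5000)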

Now let’s get the container’s IP:

root@dockbox:/home/ubuntu# docker inspect 6b0262f0fe5c
[...]
"IPAddress": "172.17.0.13",
[...]

Now, simply curling the endpoint should give us a fresh, randomly generated prime number:

root@dockbox:/home/ubuntu# time curl 172.17.0.13:5000
63386263
real 0m5.948s
user 0m0.000s
sys 0m0.009s

And it took about 5 seconds — the sweet spot I was aiming for.

Cranking Up the Containers

Let’s kick it up a notch and launch 20 more of these things. First, let’s commit our container to an image so we can launch more:

docker commit 6b0262f0fe5c prime:v1

Next, a quick sanity check, launching from our new image:

root@dockbox:~# docker run -d prime:v1 python /root/work.py
root@dockbox:~# docker inspect 09a72800be9e | grep "IPAddress"
"IPAddress": "172.17.0.16",
root@dockbox:~# curl 172.17.0.16:5000
55098829

Perfect. Now launch the rest:

for i in $(seq 1 20); do docker run -d prime:v1 python /root/work.py; done

And just like that, we have 20 web servers ready for handling our application. Let’s move on to actually serving requests to our new, distributed application.

Statically Load Balancing the Containers

First, we’re going to need nginx on our host:

apt-get install nginx

Then, we’ll need our containers’ IP addresses:

for i in $(docker ps -q); do docker inspect $i | grep "IPAddress" | cut -d" " -f 10 | sed 's/[^0-9\.]*//g'; done

Finally, we’ll use the output to create an nginx config (/etc/nginx/nginx.conf on my system):
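The embedded config didn’t make it here, but it’s a stock nginx.conf whose http block defines an upstream with one server line per container. The upstream name matches what consul-template will generate for us later; the IPs below are examples, so substitute the output of your loop:

# /etc/nginx/nginx.conf (sketch; the original embed is missing)
events {
    worker_connections 1024;
}

http {
    upstream primes {
        # One line per container IP from the loop above.
        server 172.17.0.13:5000;
        server 172.17.0.14:5000;
        # ...and so on for the rest.
    }

    server {
        listen 80;

        location / {
            proxy_pass http://primes;
        }
    }
}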

In a bit, we’ll come up with a way to register this with service discovery, but for now this works. After reloading nginx:

service nginx reload

And loading it up in the browser, we see our beautiful prime number:

If you didn’t know any better, you’d think it was just an everyday number.

Not too shabby. Keep in mind that without a firewall, this leaves whatever you’re load balancing open to the public. Right now I don’t really care, but if you’re serving something more sensitive, you should set up some iptables rules or have your server listen only locally.

What now?

One of the most frequent problems you have to deal with in distributed systems is how to manage where traffic is sent. This sounds like a great use case for service discovery.

Running Consul in the Containers

If you aren’t familiar with Consul, or service discovery, check out this nice introduction. Otherwise, I’m assuming you have a cursory understanding. Let’s dive right in.

Setting Up the Server

I’m using this ppa for convenience’s sake:

apt-add-repository ppa:bcandrea/consul
apt-get update
apt-get install consul

By default, this package starts consul in agent mode, and I’d rather have the host be a server for this project. It also binds to eth0, and I don’t particularly want it open to the public. Let’s change both of those things.

Here’s my modified /etc/init/consul.conf:
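The gist is missing here, but the important changes are running in server mode and binding to the docker0 bridge address (172.17.42.1, which is where the containers will reach us) instead of eth0. The stock job file from the ppa may differ in its details; the result looks roughly like:

# /etc/init/consul.conf (sketch of the modified upstart job)
description "Consul server"

start on runlevel [2345]
stop on runlevel [!2345]

respawn

# -server/-bootstrap-expect 1: run as a single-node server cluster.
# -bind: listen on the docker0 bridge instead of eth0.
exec /usr/bin/consul agent -server -bootstrap-expect 1 \
    -data-dir /var/lib/consul -config-dir /etc/consul.d \
    -bind 172.17.42.1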


Now, after stopping and starting the server, we’re in business:

service consul stop
service consul start

Setting up the Agents

Let’s kill our running containers from before, and modify our base image to set up the consul agent. You can stop a container by simply calling:

docker stop container_id

Now, let’s launch our image again:

docker run -i -t prime:v1 /bin/bash

And inside the container, follow the previous steps for installing Consul, with an extra step to get access to apt-add-repository and curl for doing health checks:

apt-get install software-properties-common curl
apt-add-repository ppa:bcandrea/consul
apt-get update
apt-get install consul

Since the container doesn’t have a fully functioning init system, we’re going to write our own incantation to start the Consul agent. Here’s what I came up with (saved to /root/start-consul.sh):

#!/bin/bash
# Bind to the container's eth0 address and join the Consul server
# listening on the docker0 bridge (172.17.42.1).
BIND=`ifconfig eth0 | grep "inet addr" | awk '{ print substr($2,6) }'`
/usr/bin/consul agent -config-dir="/etc/consul.d" -retry-join 172.17.42.1 -bind=$BIND

We’ll want to register a service to take full advantage of Consul, so let’s create a file called /etc/consul.d/web.json (this will allow its config to be automatically picked up, since we specified its directory as the config directory):

{
  "service": {
    "name": "web",
    "tags": ["nginx"],
    "port": 5000,
    "check": {
      "script": "curl localhost:5000 >/dev/null 2>&1",
      "interval": "30s"
    }
  }
}

Next, since Docker has a single entry point for containers, let’s make a generic /root/run.sh file that launches both the Consul agent and the web server:

#!/bin/bash
# Start the Consul agent in the background, then run the web server
# in the foreground to keep the container alive.
/root/start-consul.sh &
python /root/work.py

Give both scripts execute permissions (chmod +x /root/start-consul.sh /root/run.sh), exit the container, and commit it to a new image:

docker commit 18feadc515f7 prime:v2

Start it up:

docker run -d prime:v2 /bin/bash /root/run.sh

And check that our consul server is aware of the container:

root@dockbox:~# consul members
Node          Address            Status  Type    Build  Protocol
dockbox       172.17.42.1:8301   alive   server  0.4.1  2
ea66b40276c3  172.17.0.131:8301  alive   client  0.4.1  2

Smooth. Let’s launch more!

root@dockbox:~# for i in $(seq 1 19); do docker run -d prime:v2 /bin/bash /root/run.sh; done
root@dockbox:~# consul members | wc -l
21

That includes the header line, but you get the idea. The new containers were automatically picked up by Consul. Pretty cool, eh?

Dynamic Load Balancing

Now let’s take the information we have from service discovery and use it to automatically update our nginx config. We’re going to use consul-template to do this.

Preparing the Nginx Template

Once you’ve followed the instructions for installing consul-template, create a template file for consul-template to work off of:
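The template embed is missing here, but the idea is our earlier nginx config with the hard-coded server lines replaced by a consul-template loop over instances of the "web" service we registered (by default, the service function only returns instances whose health checks pass). A sketch:

# /etc/nginx/nginx.ctmpl (sketch; the original embed is missing)
events {
    worker_connections 1024;
}

http {
    upstream primes {
        # consul-template expands this to one line per passing
        # instance of the "web" service.
        {{range service "web"}}server {{.Address}}:{{.Port}};
        {{end}}
    }

    server {
        listen 80;

        location / {
            proxy_pass http://primes;
        }
    }
}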

I stored this as /etc/nginx/nginx.ctmpl. This will serve as the base that consul-template uses to regenerate the nginx config every time it detects changes to running services.

Starting Consul Template

To start consul-template watching the file, run:

./consul-template -template "/etc/nginx/nginx.ctmpl:/etc/nginx/nginx.conf:service nginx reload"

You should see that the config is automatically reloaded:

* Reloading nginx configuration nginx                       [ OK ]

Looking at the new config file should confirm that it has been updated with any running containers you have. After a few seconds, having launched a few more containers and given their service checks time to pass, here’s what my nginx config shows:

upstream primes {
    server 172.17.0.152:5000;
    server 172.17.0.154:5000;
    server 172.17.0.155:5000;
    server 172.17.0.163:5000;
    server 172.17.0.152:5000;
    server 172.17.0.161:5000;
    server 172.17.0.156:5000;
    server 172.17.0.158:5000;
    server 172.17.0.168:5000;
    server 172.17.0.167:5000;
    server 172.17.0.157:5000;
}

Wahoo! This is exciting stuff. Loading up the website shows that my load balancing still works correctly. If you continue to launch containers and monitor your nginx config, you should start to see more of your containers being automatically added into rotation once their service checks start to pass.
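If you want to watch them enter the rotation, something like this will poll the generated config and count the upstream entries (the grep pattern assumes the 172.17.x.x bridge addresses shown above):

watch -n 2 'grep -c "server 172" /etc/nginx/nginx.conf'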

Setting Up the Web UI

The web UI is essentially just a wrapper around all of the HTTP API endpoints, but is much less cumbersome for humans to use. If you’re following along (using the ppa I mentioned earlier), you can easily install the web-ui to get a little better insight into what’s going on:

apt-get install consul-web-ui

When you do this, a new file called 10-web-ui.json will be created in /etc/consul.d/. If you don’t want the whole world peeking in, you’ll want to change the default binding of 0.0.0.0 to 127.0.0.1.
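I won’t swear to the exact contents the ppa ships, but with the bind flipped to loopback the file plausibly ends up looking something like this (ui_dir and client_addr are standard Consul config keys; treat the path as an assumption):

{
  "ui_dir": "/usr/share/consul-web-ui",
  "client_addr": "127.0.0.1"
}

Once you’ve made that change, you can restart consul and tunnel traffic from your machine to the web UI over SSH: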

ssh root@dockbox -L 8500:localhost:8500 -N

Now you can load up the web UI (localhost:8500/ui/) and check out your containers:

*Sniff* they grow up so fast

All in all, not bad for an afternoon’s work!

Where to Go from Here

I hope you see the possibilities of what we’ve explored here. There are plenty more things we could do on this box to build these skills further, such as:

  • Simulate network outages and see how our system responds.
  • Randomly take down containers while under a stress test and see how many requests fail (see the sketch after this list). How can this number be reduced? What might be an acceptable compromise?
  • What is the maximum number of requests our system can handle at a time? Could we tweak some operating system settings to make this higher?
  • How could we tag and release different versions of our software (e.g. development, staging, and production), while having them all co-exist peacefully in their own containers?
  • One box can only go so far. How can we get multiple boxes running a set of containers and manage them easily?
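For the stress-test bullet, here’s one rough way to get going. ab ships with the apache2-utils package; the request counts are arbitrary:

# Drive 200 requests, 10 at a time, at the load balancer...
ab -n 200 -c 10 http://localhost/ &
# ...then knock out a random container mid-test and check
# the "Failed requests" line in ab's summary.
docker stop $(docker ps -q | shuf -n 1)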

The possibilities are endless. If you’re creative and curious enough, you can easily find a hundred more things to test out. Get a database cluster involved. Try some caching layers. Let your imagination go wild! You don’t need crazy hardware. Just your brain.

Helpful Commands

Here are some commands I found myself running regularly throughout this process.

List all containers

docker ps -a

Launch 15 containers

for i in $(seq 1 15); do docker run -d prime:v6 /bin/bash /root/run.sh; done

Stop all running containers

for i in $(docker ps -q); do docker stop $i; done

Delete all containers

for i in $(docker ps -a -q); do docker rm $i; done

Remove a container after it’s stopped

docker run --rm [...]
