Intuition Engineering with Docker

This post presents a suggestion for a Docker monitoring solution that combines two awesome tools already at your disposal: Netflix’s Vizceral and your brain. We bring them together and adapt them to a Docker environment. Scroll down for technical details.

Your phone rings. It’s not the usual tone, though. Nope, you quickly realise it’s your “best friend”, PagerDuty, nagging you again.

Ah, the standard routine: acknowledge the incident, open dashboards, acknowledge more calls, head to Slack to alert everyone, start bringing up logs…

I’ve had my fair share of situations like this. Working on Prezi’s Infra team, we had to make sure the core infrastructure was always ready to serve our 75 million users. This meant several hundred servers and a few dozen microservices had to work together seamlessly. But there is this Murphy guy: things will inevitably go wrong at some point.

Sure, we try to build highly available systems. Systems where an outage of one service can be tolerated by the rest. But trust me, you can’t prepare for everything. And … It’s ok, it’s fine. You shouldn’t. Unless we are building rocket control systems, bugs are ok. But we should have tools to understand what’s happening in our systems.

Why is it so hard to find the root cause of an outage right now?

We’ve built it, we know it. It’s our precious. We rigorously instrumented our code to expose a wide variety of metrics. We have all the tools at hand to get data about our system. Still, when the alerts come, we are not really sure what to check first.

It’s called analysis paralysis. Obviously, we want to fix the problem as soon as possible. We want to figure out the best place to look for issues. But this takes time, because current tools don’t help us build an intuition of our system. Sure, they expose a gazillion very precise metrics. Yet the best intuition we can get from them is whether a single value is considered green, orange or red.

How does Netflix do Intuition Engineering?


Netflix recently started talking about something called Intuition Engineering. Their use case is simple: they want you to enjoy your movie, even if an entire Amazon region dies. When that happens, they need to redirect the traffic to the other regions very carefully. Every step in this process must be flawless. They can’t wait for the first alert to let them know they screwed up. They can’t check 100 different metrics after each and every step. And of course, they don’t need to know the average heap size in a certain region. No, instead they want a single dashboard where their engineers can have an intuition of how traffic flows through their systems. They have an idea of what every step should do, and they need to make sure it really happens.

Intuition Engineering for the rest of us

We surely aren’t at Netflix scale. Yet clustering solutions and modern microservice methodologies have made monitoring and reasoning about our systems much more difficult. We feel the need for similar observation tools for our systems.

For example, there are now great options for orchestrating our clusters: Docker 1.12 has these features built right in, Kubernetes and Rancher are there if you need something more established, and of course there’s Amazon’s Elastic Beanstalk if you are already a big AWS user.

On the other hand, if we look around the monitoring tools market today, there aren’t any good solutions tightly integrated with the technologies above. New Relic, Ruxit, DataDog and WeaveScope are all powerful tools: we can dig deep into all our metrics, visualize them on nice dashboards and even get alerts. There is one thing we miss, though: understanding the behaviour of our entire system just by looking at a dashboard.

Like it or not, this is a brand new era for designing systems. If our engineering team is big enough, our architecture will change faster than we can follow. We need a new crop of tools to help us intuitively understand our systems.

A simple Intuition Engineering as a Service prototype

We need real-time information not only on how our services behave on their own, but also on their relationships. We wanted to:

  • have an up-to-date knowledge of our system architecture,
  • see how errors propagate from one system to another,
  • spot bottlenecks and latency issues at a glance.

I proudly present the first, very simple working prototype of an idea I’ve built together with a few bright guys from Founders. As we thought about this more and more, we came up with a few core concepts:

  • Containerized services should be first class citizens. You care about your services, not about your hosts.
  • It should be up and running even on a complex infrastructure in minutes. No special cumbersome configuration should be necessary. You should focus on your business, not on monitoring.
  • Any kind of container should be supported, even ones pulled from Docker Hub or any other registry whose internals you have no control over.
  • Manual instrumentation of applications shouldn’t be necessary. Everything should be done automatically, out of band, without interfering with the application code at any level.

Check out a video of our current simple prototype, which shows real-time traffic on Docker’s example voting app demo.


Our prototype is very simple for now, but it already satisfies all the requirements mentioned above.

It provides us with valuable information during a debugging session or while onboarding a new engineer.

It shows you how the traffic flows through your system. It can pinpoint service and connection errors as well as different bottlenecks. All of that is done completely automatically, without any configuration.

The installation is quite trivial. We run a lightweight agent on every node we have in our cluster. The agent runs in a container and it has two main responsibilities:

  • It collects generic information about the host (containers, services, networks, IP addresses, etc…).
  • It spawns new containers to analyze the traffic of each container we run (right now we use Elastic’s Packetbeat for this).

We aggregate separate containers into services based on the information we collect from the Docker engine. Packetbeat works at the TCP level: it intercepts TCP packets, parses HTTP, Redis and SQL requests, and stores them in Elasticsearch. From these two streams of data, we can pass the right information to Vizceral to visualize your cluster.
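To give a feel for the sniffer side, here is roughly what a Packetbeat configuration for this setup could look like. This is an illustrative sketch, not our actual agent config: the field names follow Packetbeat’s configuration format, but the ports and the Elasticsearch host are assumptions based on the voting demo.

```yaml
# Illustrative Packetbeat config — ports and output host are assumptions
interfaces:
  # sniff every interface visible in the container's network namespace
  device: any
protocols:
  http:
    ports: [80, 5000]
  redis:
    ports: [6379]
  pgsql:
    ports: [5432]
output:
  elasticsearch:
    # the demo backend exposes Elasticsearch on port 9200
    hosts: ["vizdemo-backend:9200"]
```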

We don’t just show the quantity of the requests being sent; we also visualize their quality. When things are stable, we see white dots travelling between services; when there are faulty requests, we see red or orange dots.
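To make that concrete, here is a hand-written miniature of the kind of traffic document Vizceral consumes, using the voting app’s services. The field names follow Vizceral’s traffic format, but the numbers are invented for illustration: `normal` counts healthy requests and `danger` the faulty ones that show up as red dots.

```shell
# Write a minimal Vizceral-style traffic document for the voting demo
# (the request counts are made up for illustration)
cat > /tmp/vizceral-traffic.json <<'EOF'
{
  "name": "swarm-cluster",
  "renderer": "region",
  "nodes": [
    { "name": "vote" }, { "name": "redis" },
    { "name": "worker" }, { "name": "db" }
  ],
  "connections": [
    { "source": "vote",   "target": "redis", "metrics": { "normal": 120, "danger": 2 } },
    { "source": "worker", "target": "redis", "metrics": { "normal": 60,  "danger": 0 } },
    { "source": "worker", "target": "db",    "metrics": { "normal": 60,  "danger": 1 } }
  ]
}
EOF
# Three edges in the graph, one per line containing "source"
grep -c '"source"' /tmp/vizceral-traffic.json   # → 3
```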

Also, we have a few other ideas to enhance this prototype in the future. Just a few of them to whet your appetite:

  • We want to show some basic metrics about the service itself. You should know if any service doesn’t behave as it should.
  • We want to show access logs on links. You should be able to quickly dig deeper right from this UI and see what’s actually behind these dots.
  • We want to show details on each and every container behind a service.

Test the prototype on your own

Enough talking. The good part is that you can actually bring the prototype up on your machine or on AWS and just play with it a bit.

1| Set up a Docker cluster. We’ve built support for Docker 1.12 Swarm mode for now, so first you’d need a Swarm cluster. We have prepared a simple step-by-step guide to get such a cluster on VirtualBox or on AWS. Of course, if you are part of the Docker for AWS/Azure private beta program, it’s even easier for you.
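If you’d rather bootstrap the cluster by hand, Swarm mode in Docker 1.12 only takes two commands. This is a sketch: substitute your own manager IP and the join token that `swarm init` prints.

```
# On the node that should become the manager (use that node's own IP):
docker swarm init --advertise-addr <MANAGER-IP>
# On every other node, paste the join command printed by the step above:
docker swarm join --token <TOKEN> <MANAGER-IP>:2377
```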

2| Bring up services. You can deploy whatever you prefer (we encourage you to play with this actually, the Docker 1.12 orchestration is super easy to use). If you want to deploy Docker’s voting example, just follow these steps:

# Start a database instance
docker service create --name db -p 5432:5432 postgres
# Create a Redis instance
docker service create --name redis -p 6379:6379 redis:alpine
# Start the voting frontend
docker service create --name vote --replicas 2 -p 5000:80 ghoranyi/docker-example-vote
# Start the vote processing worker; `--network ingress` attaches it to the
# default overlay network instead of the bridge, even though the service
# doesn't expose any ports
docker service create --name worker --network ingress ghoranyi/docker-example-worker
# Start the results frontend
docker service create --name result -p 5001:80 ghoranyi/docker-example-result

Test your app in the browser: open ports 5000 and 5001 on any of your Swarm nodes (Docker’s new routing mesh feature takes care of redirecting your request to a node which actually has the container running).

3| Deploy the backend. This is only for the prototype; in the long run, this would be hosted on our side, so you wouldn’t have to worry about it.

docker service create --name vizdemo-backend -p 8080:8080 -p 8878:8878 -p 9200:9200 ghoranyi/docker-intuition-backend

4| The final step. This is the only step you’d need to do on your infrastructure: install the agent with this single command.

docker service create --name docker-agent --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock --mode global --network ingress ghoranyi/docker-agent

Now you should open your browser and generate some traffic on your app, then head to the UI to get the overview. Feel free to bring up new services; as soon as there is traffic on a service, it should show up on the UI. Alternatively, you can check out the UI just by clicking here.
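If you want dots on the screen quickly, a simple request loop does the trick. A sketch, assuming `NODE_IP` points at any of your Swarm nodes:

```
# Fire 100 requests at the voting frontend; the routing mesh spreads them
# across the vote replicas
for i in $(seq 1 100); do
  curl -s -o /dev/null "http://$NODE_IP:5000/"
done
```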

Brace yourselves, MVP is coming

We think it’s a shame that there are no out of the box solutions for this right now. We want to change this.

But first, we want to understand whether this makes sense to any of you out there. That’s why we are actively seeking more companies currently running on Docker.