It’s June 2018, and I’m walking around DockerCon looking for tools to help me see what’s happening in my production clusters. We’d been in production for over a year, and I knew what I didn’t know: I had only superficial visibility into clusters, components and containers. With many applications on-boarding, I was starting to get worried. It was really hard to diagnose issues across the environment. I couldn’t see across the layers of the tech stack in any contextualized way, beyond “containers are running right now and they’re passing health checks.” No ability to really see what was happening in and across my clusters. All 75+ of them.
I wanted to see what was happening across worker nodes in a cluster. Filesystem and space usage. IO. Network activity and what was talking to what. CPU contention on platform components. Graphs of inodes and file descriptors in use. I wanted to see which master nodes in which clusters were the most overloaded. I wanted to see when our ingress routers were going to fall over and which cluster components were becoming hot spots. And of course, which containers were affecting others on the same host.
Grafana and Prometheus were a fine start, but woefully insufficient for diagnosing and troubleshooting across the platform. We created all kinds of dashboards, but it was painful to do so, and not very useful for troubleshooting and on-the-fly diagnosis. Navigation, usability, cohesiveness — just not there.
At DockerCon, I ended up at the Sysdig booth.
While standing there looking at huge screens showing what the tool can do, I remembered my first encounter with Sysdig. In December 2015, I attended a Kubernetes conference in NYC called Tectonic (by CoreOS). I was really excited about containers and very interested in Kubernetes. It was gaining momentum, but it was still unclear where things would go.
I sat there and watched the conference speakers. Scaling pods based on CPU. Cool! Taints and Tolerations. Awesome!
And then I saw top. Or rather, super top. Someone named Loris was presenting Sysdig, a command line tool that allows an operator to see into containers via a top-like interface. What Loris showed simply blew me away. It was by far the most impressive piece of technology I had ever seen. He navigated from the host to a container and looked at network connections, CPU shares, file usage. Wait, what? A mere mortal can see all that? I soon realized that I was actually more impressed by Sysdig than I was by Kubernetes.
I don’t have to hire Brendan Gregg? (Google him.) Did the container ninjitsu I just saw mean that I didn’t have to send my team to SRE training courses and memorize a bunch of Linux commands that didn’t come naturally to my developer brain, which not that long ago was primarily JVM-based and shielded from all the stuff under the JVM?
Take off your glasses or your contact lenses. That’s what you can see on your clusters. Now put them back on. That’s what I see with Sysdig. Things I didn’t know were there. Colors, textures, sizes, shapes. Interactions I didn’t know existed.
Sysdig sees into the Linux kernel via a kernel module or eBPF. It can therefore see everything that is happening on a Linux box. All processes. All IO. All users, all commands, all args. All containers. Since I was running a container platform, I didn’t want to see *just* the customer containers. I wanted to see everything at each layer of the stack. The platform. The host. And I wanted to see what happened last week too. You know, when that host fell over and we had to bounce the Docker engine.
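To give a flavor of the “last week too” part: the open-source sysdig CLI can record a host to a trace file and replay it later. A minimal sketch (the trace filename is a placeholder; chisel names are from the sysdig tool):

```shell
# Record everything happening on this host to a trace file
sudo sysdig -w friday-incident.scap

# Replay it later, e.g. top processes by CPU at the time of the incident
sudo sysdig -r friday-incident.scap -c topprocs_cpu
```

The hosted product builds this capture-and-rewind workflow into the UI, which is what made the “what happened Saturday” scenarios below possible.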
I want to see
- the host OS, user space and the kernel
- platform components, like Docker and Nomad and Consul
- customer containers and all their activity
Perhaps most importantly, I wanted to see the interaction between all of the above. Nomad talking to the Docker engine. Consul health-checking a container. At scale. At the level of granularity that I want. And grouped and aggregated in the way I wanted to see things.
How can we do this?
The most important thing is to have a good labeling strategy. Schedule containers with labels that allow you to find and organize them later.
Sysdig really takes labeling to another level. It’s core to navigation and filtering at every level.
I can have labels like
datacenter=azure-east-1, zone=A, app-env=prod, org=T10, family=treasury, app=payments, service=swiftgateway, nomad_alloc_name=t10.treasury.payments.swiftgateway, cpu=0.5, mem=4
These are provided via nomad job labels or kubernetes labels.
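As a sketch, with Nomad’s Docker driver, labels like these can be attached in the task’s config stanza. The image name, label keys and values below are purely illustrative, mirroring the example above:

```hcl
task "swiftgateway" {
  driver = "docker"

  config {
    image = "registry.example.com/treasury/payments/swiftgateway:1.4.2"

    # Docker labels that Sysdig can later group, filter and aggregate on
    labels {
      "app-env" = "prod"
      "org"     = "T10"
      "family"  = "treasury"
      "app"     = "payments"
      "service" = "swiftgateway"
    }
  }

  resources {
    cpu    = 500  # MHz; surfaces as the cpu sizing label in our scheme
    memory = 4096 # MB
  }
}
```

Kubernetes labels on pods serve the same purpose; the point is that the scheduler, not a human, stamps every container consistently.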
I can view and navigate through my clusters, hosts, containers by
- Env → Cluster → Namespace → Service → Container Name
- Host → Containers
- Env → Host Kind → Host → Container Name
- Namespace → Env → Container Name → Container Image
- and so on and so forth.
You get the idea. I can group and navigate in any conceivable way and Sysdig allows me to do that. I can create a new navigation in seconds. New labels, new groupings and aggregation possibilities.
Great. Containers and hosts are labeled. But maybe the out-of-the-box dashboards are not meeting my exact needs. I have more labels and metrics that I want to see in a given view. I want it to be custom. Here is where Sysdig really shines. Dashboards and panels are incredibly easy to create or extend. This is very important, because when you are faced with a production issue you may not have seen before, you will want to create a dashboard on the fly. The Sysdig UI makes this simple and intuitive. Which makes me go faster. And curse much less, because I am not handcuffed. The limit is my brain.
There’s a lot more to Sysdig than just this (and in fact, there’s Sysdig Secure, which brings more visibility goodness and security). There are system captures, seeing “back in time” and inside the kernel, and many other useful features.
But the point of this post is not to make your thumb tired scrolling on your iPhone; go to the Sysdig site to learn more. I want to explain how I’ve found this tool to be invaluable, in the real world and at scale, for cluster operations.
Problems I’ve Solved
I manage a large team, so it’s actually fun to jump into an issue here and there to assist. It keeps me grounded in reality. Here are some issues I have personally solved. I’m limiting it to just three, since I’m trying to complete this blog in less than 40 minutes.
What is DoSing our service
You get a call from another team: something on your platform is hitting another service heavily. Maybe two of your shared hosts are accounting for 60% of all DNS traffic. They give you the IP and port of their service, and they give you some offending hostnames that they saw doing bad things… on Saturday. Two days ago.
Great. What to do? I’ve got possibly hundreds of different containers on any given host and they can move between hosts. This is not a static environment.
Use or extend the Connections Auditing table dashboard. Filter by the hostnames in question, then filter by the net.connection.endpoint of the service being affected. Boom. Those are your containers. Oh, wait, let’s see that from Saturday. Adjust the time range. OK, most of those containers have not moved. But one has. I can see that, account for it, and watch it exhibit the same behavior on another host. We’ve identified which app team’s containers are doing this.
i.e. which client containers are hitting a certain other service, based just on that service’s IP and/or port.
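On a single host, you can ask the same question of the open-source sysdig CLI. A sketch, to the best of my recollection of the field and chisel names (the IP and port are placeholders for the affected service):

```shell
# Top network connections to the affected endpoint, attributed to containers
sudo sysdig -c topconns "fd.rip=10.1.2.3 and fd.rport=53"

# Or print each matching event with its container and process name
sudo sysdig -p "%container.name %proc.name %fd.name" "fd.rip=10.1.2.3 and fd.rport=53"
```

The hosted product is what adds the fleet-wide view and the “rewind to Saturday” part on top of this.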
java.lang.OutOfMemoryError: unable to create new native thread
You get clients complaining that they are getting out-of-memory errors in their JVMs. You think it’s their app, but you must rule it out as a platform issue. You wonder why these apps cannot create threads. It must be host thread contention caused by someone’s container. Blast radius in effect.
Enter a thread dashboard. Using the system.threads metric, and with a few panels (graphs and tables), I was able to see the hosts with a lot of threads in use. And for the hosts in question (and for the containers having issues), I could see a clear thread leak in one client’s app that was causing thread starvation on our hosts. And see this across hosts and even clusters. And correlate the high thread usage with container exit events of other containers dying due to JVM OOM.
Great, problematic app container found. But what about the next client? How do we prevent this from happening again, and how do we reduce the blast radius of any given container, thereby providing some level of QoS? Well, let’s add a default for Docker’s --pids-limit to our containers. But wait, what do we set the default to? Let’s use Sysdig to look at all containers on all hosts over the last two weeks and see what most containers fall under…
I chose 4096 OS threads as the container default limit and added a pids-limit override in case clients needed it.
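The sanity check behind that number can be sketched in a few lines of shell. The thread counts below are made-up samples standing in for the dashboard export; --pids-limit is the real Docker flag, but the image name is a placeholder:

```shell
# Per-container thread counts exported from the threads dashboard
# (one count per line; these numbers are made-up samples)
printf '%s\n' 180 220 310 3400 95 410 150 270 > /tmp/thread_counts.txt

# How many containers would the candidate default of 4096 actually clip?
awk '$1 > 4096 { over++ } END { print over+0, "containers above 4096" }' /tmp/thread_counts.txt

# Apply the chosen default at launch (image name is a placeholder):
# docker run --pids-limit=4096 registry.example.com/some-app:latest
```

Since Linux counts each thread as a pid, --pids-limit caps a container’s threads too, which is exactly the blast radius we wanted to contain.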
Cluster Capacity and Container Rightsizing
The non-prod clusters are full again. So we add more “worker” nodes. Three weeks later, we are full again. Success is not without its issues.
But I just know, I have a strong feeling, that clients are not sizing their containers properly. Why would they? These are mostly on-prem Java devs who simply deploy whatever they have to whatever VMs they have. And oh, we used a 4-core VM with 16G of memory, so we’ll size the container as that. They don’t quite get (yet) that a container is a process and not a VM. With container limits, we are sizing (constraining) a Linux process. And because we are on-prem, we don’t have a good chargeback story to incentivize behavior. Argh. How do we get a handle on this?
Enter two awesome metrics and our custom labels. Metrics: container memory limit utilization percent and container CPU share percent. Labels: container.cpu, container.mem. I created a dashboard with a table panel that showed all containers against all hosts, exportable as a CSV or JSON.
Kinda looks like this (Medium doesn’t do tables, so bear with me here):

env | cluster | host | namespace.service (nomad alloc_name) | container.label.cpu | container.metric.cpu_shares_%(avg) | container.label.mem | container.metric.memory_limit_%(max)
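Once exported, even a shell one-liner turns that table into a resize hit-list. A sketch with made-up sample rows and illustrative column names (the 25% threshold and halving rule are arbitrary choices for the example):

```shell
# Sample of the dashboard's CSV export (rows and column names are made up)
cat > /tmp/utilization.csv <<'EOF'
service,cpu_limit,cpu_used_pct,mem_limit_gb,mem_used_pct
t10.treasury.payments.swiftgateway,0.5,80,4,75
t3.retail.web.frontend,4,6,16,11
t7.ops.batch.reconciler,2,9,8,14
EOF

# Containers using under 25% of both CPU and memory are resize candidates;
# suggest halving their limits as a starting point
awk -F, 'NR>1 && $3<25 && $5<25 {
  printf "%s: cpu %s->%.2g, mem %sG->%.2gG\n", $1, $2, $2/2, $4, $4/2
}' /tmp/utilization.csv
```

Run against a whole cluster’s export, this is where the “double your capacity” numbers come from.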
Boom. Instant container utilization reports. Resize containers and double your infrastructure capacity. $Millions saved. Achievement unlocked.
Good tools matter, and great tools make all the difference in the world. Use them wisely and reap the benefits (which in this space usually means you can sleep well).
I can’t publish screenshots, unfortunately, which is not doing this blog post justice, because what I could show is impressive. Also, the metric names are approximate; I am not looking at Sysdig while typing this.
I’m not being paid here. :) This tool has improved my life and that of my team, and I think the world should know about goodness, wherever and whatever that might be.