Sysdig — Why good tools really matter

Trevor Samaroo
Jun 26 · 8 min read
A good tool goes a long way

Enter Sysdig

While standing there looking at huge screens showing what the tool can do, I remembered my first encounter with Sysdig. In December 2015, I attended a Kubernetes conference in NYC called Tectonic (by CoreOS). I was really excited about containers and very interested in Kubernetes. It was gaining momentum, but it was still unclear where things would go.

15/15 Vision

Take off your glasses or your contact lenses. That’s what you can see on your clusters. Now put them back on. That’s what I see with Sysdig. Things I didn’t know were there. Colors, textures, sizes, shapes. Interactions I didn’t know existed.

  • platform components, like Docker and Nomad and Consul
  • customer containers and all their activity

How can we do this?

Labels

The most important thing is to have a good labeling strategy. Schedule containers with labels that allow you to find and organize them later.

  • Host → Containers
  • Env → Host Kind → Host → Container Name
  • Namespace → Env → Container Name → Container Image
  • and so on and so forth.
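To make the grouping idea concrete, here is a minimal Python sketch of nesting containers by an ordered list of labels. It is plain Python over made-up records, not a Sysdig API; the label keys (`env`, `host`, `name`) and the helper `group_by` are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical container records. The label keys are illustrative and are
# not taken from any real cluster or from Sysdig's own schema.
containers = [
    {"env": "prod", "host": "node-1", "name": "api"},
    {"env": "prod", "host": "node-1", "name": "worker"},
    {"env": "dev", "host": "node-2", "name": "api"},
]


def group_by(records, keys):
    """Nest records into a tree that follows the given label order."""
    if not keys:
        return records
    buckets = defaultdict(list)
    for rec in records:
        # Containers scheduled without the label land in one visible bucket.
        buckets[rec.get(keys[0], "unlabeled")].append(rec)
    return {label: group_by(items, keys[1:]) for label, items in buckets.items()}


# Env → Host → containers, mirroring one of the hierarchies listed above.
tree = group_by(containers, ["env", "host"])
```

Changing the order of the keys changes the view, which is why the labels have to be on the containers at scheduling time: a label that was never set cannot be grouped on later.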

Dashboards

Random screenshot, since I can’t show my actual clusters

There’s More

There’s a lot more to Sysdig than just this (in fact, there’s Sysdig Secure, which brings more visibility goodness and security). There are system captures, seeing “back in time” and inside the kernel, and many other useful features.

A capture of all activity in the kernel for a period of time

Problems I’ve Solved

I manage a large team, so it’s actually fun to jump into an issue here and there to assist. It keeps me grounded in reality. Here are some issues I have personally solved. I’m limiting it to just three, since I’m trying to complete this blog post in less than 40 minutes.

What is DoSing our service?

You get a call from another team: something on your platform is hitting another service heavily. Maybe two of your shared hosts are accounting for 60% of all DNS traffic. They give you the IP and port of their service and they give you some offending hostnames that they saw doing bad things…on Saturday. Two days ago.
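The shape of the question is simple once you have per-container connection data for that time window: filter on the destination IP and port, then add up traffic per host and container. Below is a rough Python sketch of that aggregation, assuming the data has already been exported as records. The field names, the sample values, and the helper `top_talkers` are all hypothetical, not Sysdig output.

```python
from collections import Counter

# Hypothetical connection records for the time window in question.
# Field names and numbers are invented for illustration.
connections = [
    {"host": "node-1", "container": "batch-job", "dst_ip": "10.0.0.5", "dst_port": 53, "requests": 9000},
    {"host": "node-1", "container": "api", "dst_ip": "10.0.0.5", "dst_port": 53, "requests": 120},
    {"host": "node-2", "container": "api", "dst_ip": "10.0.0.9", "dst_port": 443, "requests": 700},
]


def top_talkers(records, dst_ip, dst_port, limit=3):
    """Rank (host, container) pairs by requests sent to one destination."""
    totals = Counter()
    for rec in records:
        if rec["dst_ip"] == dst_ip and rec["dst_port"] == dst_port:
            totals[(rec["host"], rec["container"])] += rec["requests"]
    return totals.most_common(limit)


offenders = top_talkers(connections, "10.0.0.5", 53)
```

The hard part in practice is not this loop but having the historical, per-container data to feed it two days after the fact.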

java.lang.OutOfMemoryError: unable to create new native thread

You get clients complaining that they are getting out-of-memory errors in their JVMs. You think it’s their app, but first you must rule out a platform issue. You wonder why these apps cannot create threads. It must be host thread contention caused by someone’s container. Blast radius in effect.
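If thread counts are available per container, the noisy neighbor usually stands out once you rank containers within each host. Here is a minimal Python sketch of that ranking, assuming thread-count samples have been exported as records; the field names, the numbers, and the helper `top_thread_consumer` are hypothetical.

```python
# Hypothetical per-container thread counts sampled on shared hosts.
samples = [
    {"host": "node-1", "container": "leaky-app", "threads": 28000},
    {"host": "node-1", "container": "api", "threads": 400},
    {"host": "node-2", "container": "api", "threads": 350},
]


def top_thread_consumer(records):
    """Return {host: (container, threads)} for the biggest thread user per host."""
    top = {}
    for rec in records:
        current = top.get(rec["host"])
        if current is None or rec["threads"] > current[1]:
            top[rec["host"]] = (rec["container"], rec["threads"])
    return top


hogs = top_thread_consumer(samples)
```

A container that holds most of the threads on a host where other tenants cannot create new ones is the first place to look.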

Cluster Capacity and Container Rightsizing

The non-prod clusters are full again. So we add more “worker” nodes. Three weeks later, we are full again. Success is not without its issues.
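Rightsizing comes down to comparing what each container reserves with what it actually uses. A small Python sketch of that comparison follows, assuming you can export a memory reservation and a peak usage figure per container. The field names, the 25% headroom, and the helper `oversized` are my assumptions for illustration, not a recommendation.

```python
import math

# Hypothetical reservations and observed peak usage, in MB.
usage = [
    {"name": "api", "reserved_mb": 4096, "peak_used_mb": 900},
    {"name": "worker", "reserved_mb": 1024, "peak_used_mb": 950},
]


def oversized(records, headroom=1.25):
    """List containers whose reservation exceeds peak usage plus headroom.

    Returns (name, reserved_mb, suggested_mb, reclaimable_mb) tuples.
    """
    findings = []
    for rec in records:
        suggested = math.ceil(rec["peak_used_mb"] * headroom)
        if rec["reserved_mb"] > suggested:
            findings.append(
                (rec["name"], rec["reserved_mb"], suggested, rec["reserved_mb"] - suggested)
            )
    return findings


candidates = oversized(usage)
```

Summing the reclaimable column across a cluster gives a rough idea of how much capacity is reserved but idle before anyone adds more worker nodes.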

Summary

Good tools matter, and great tools make all the difference in the world. Use them wisely and reap the benefits (which, in this space, usually means you can sleep well).

PS

Unfortunately, I can’t publish screenshots and the like, which doesn’t do this blog post justice, because what I could show is impressive. Also, the metric names are approximate; I am not looking at Sysdig while typing this.

PPS

I’m not being paid here. :) This tool has improved my life and that of my team and I think the world should know about goodness, wherever and whatever that might be.