The Docker age: Monitoring market survey

amit bezalel
Jan 14, 2017


For the past year or so, I have been developing a large-scale cloud product that grew from a two-person operation with three components and a JSON file as a DB into a big department running a micro-service, Docker-deployed, Kubernetes and AWS operation. After coding various demo versions for increasingly demanding spotlights, we are finally nearing the deadline of the beta release!

…But as we all know, exposing a project to customers requires some confidence that it will stay up for a good period of time. System longevity and performance under load are never guaranteed until you exercise the system to iron out all the kinks, so a monitoring system was required.

To get a feel for the monitoring market, I set out to find the names of popular monitoring tools. Some footwork and Google scouring led me to compose the feature comparison table below:

I have added coloring by my own standards, and your system’s requirements may differ, but as you can see, the monitoring market veterans don’t seem to make the cut. It isn’t that you can’t find a way to use them: the community has some workaround or another for each non-green item on my list. But if you are looking for guaranteed out-of-the-box flows, you had better move along and test some newer tools.

Looking at the left part of my feature table, one might ask: “What about SaaS offerings like DataDog?” And what about other options, like Heapster, which is built into Kubernetes, or CloudWatch, which you get with your AWS account and which already monitors half of what you need, such as DB instances and EC2 machines?

The answer to the first question is that I am not currently in the market for a SaaS solution; I would rather build one for myself in a short while (corporate bureaucracy would take too long to sign off on a SaaS solution). As for the second set of questions, Kubernetes Heapster is not a complete monitoring system, so DB and other infrastructure metrics will not be collected by it, and Prometheus is what gets endorsed for when you need a real monitoring system. As for CloudWatch, I have tried working with it, and the information collected there is very shallow. For example, DB statistics are provided per DB server instance rather than per database, and EC2 statistics lack memory stats unless you install a Perl-written agent on your machine (which carries a non-production-code warning label).

After doing away with those, we are left with our final trinity: Prometheus, the TICK stack (InfluxDB) and Sensu.

Each has its pros and cons:

Sensu looks highly scalable, but that comes at the cost of many dependencies on infrastructure components, each of which requires installing, clustering, tending and tuning (hard to set up and maintain). On top of that, you will most probably install InfluxDB anyway, since it is required for history retention. So choosing Sensu over TICK only makes sense if you doubt that the DB can hold the load on its own. And we are not talking about some MySQL / Oracle with 300 max connections here: Influx should be more than capable of consuming large amounts of data and handling the concurrent connections.

Prometheus has solid backing from Google and the Kubernetes community, is completely free and, according to reports (and my personal experience), has one of the fastest and most efficient DBs (if not the fastest) for consuming and storing metrics. But since this is a dedicated, open-source DB, it doesn’t have DB clustering, so scaling is limited to one server, and HA is achieved by running two active servers with alert de-duplication. It also has a pull-based architecture (the server connects to agents to collect metrics), which can be considered a pro or a con depending on your view, but it helps with multi-server high availability.
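To make the pull model a bit more concrete, here is a minimal sketch of the idea in Python, using only the standard library. It is not Prometheus code: the metric names, the `METRICS` dict and the helper functions are made up for illustration, and the output is only the simplest "name value" subset of the Prometheus text format. The agent just exposes its current values on `/metrics`, and the monitoring side decides when to connect and collect them.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical counters an agent might track (names are illustrative only).
METRICS = {"demo_requests_total": 42.0, "demo_queue_depth": 3.0}


def render_metrics(metrics):
    """Render name/value pairs as one 'name value' line each, sorted by name."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


def parse_metrics(text):
    """Parse 'name value' lines back into a dict, skipping comments and blanks."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.rsplit(" ", 1)
        parsed[name] = float(value)
    return parsed


class MetricsHandler(BaseHTTPRequestHandler):
    """Agent side: it never pushes, it only answers when somebody asks."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def scrape(url):
    """Monitoring side: connect to the agent and pull its current values."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))


def run_demo():
    """Start a throwaway agent on a free local port and scrape it once."""
    agent = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 = any free port
    threading.Thread(target=agent.serve_forever, daemon=True).start()
    try:
        port = agent.server_address[1]
        return scrape(f"http://127.0.0.1:{port}/metrics")
    finally:
        agent.shutdown()
        agent.server_close()


if __name__ == "__main__":
    print(run_demo())
```

Because the agent keeps no record of who scraped it, two independent monitoring servers can call `scrape` against the same endpoint without any coordination, which is roughly why the pull model plays well with the two-active-servers HA setup.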

The TICK stack looks like a very good choice: InfluxDB is the #1 time series DB in most popularity charts, and it has a good set of collectors and an agile company backing it. However, it requires a license for enterprise features and DB clustering, and it is an extremely young project, about a year old with first commits in Aug 2015, while InfluxDB itself is 2 years old. For comparison, Prometheus is 2 years old, and Sensu is about 5 years old, having started in November 2011.

I have continued to delve into this project by running an in-depth POC for these 3 tools, which will be detailed in my next post.


amit bezalel

After 17 years of developing software, coding is still a passion of mine. I am a hands-on SaaS cloud architect working in TS and Go and learning something new every day.