Hammering nails into Kapacitor's coffin

TICK is dead

George Shuklin
OpsOps
7 min read · Oct 27, 2019

--

Sometimes there are sad moments when you realize you’ve chosen the wrong tech. It takes time to internalize a shiny new technology and to build some expertise in it. Sometimes that expertise later yields only one meaningful insight: abandon this.

So, my expertise in Kapacitor yields one conclusion: don’t.

I won’t tell the whole story of how we got into TICK (TICK is Telegraf -> InfluxDB -> Kapacitor, with an underperforming dashboard named ‘Chronograf’). The thing is that we started to use it. A few iterations later I can summarize why Kapacitor is so bad. The very short answer: it critically under-performs, and there are too many architectural flaws to tolerate.

The long answer

Reason #1: Broken schema and visible partial transactions

Influx (the database underneath Kapacitor) has a schema. You have some fields which are mandatory to fill. You can’t just use a ‘document-oriented’ approach, throwing random data fields into each next record. Each record has some field set. That is InfluxQL’s promise and its interface. If you have a bool field, it’s either true or false.

The first part of the problem is that this rule is not enforced for the ‘line protocol’ which is used between Influx and Telegraf. Telegraf can insert data into Influx in a way that violates the schema, producing a partial insertion. Normally, Telegraf would fill in the missing fields almost instantly (the next line in the line protocol would fill the gap), but there is a brief moment when the schema is broken: you have a boolean field which is neither true nor false. For a normal query into Influx this is not a big deal. Maybe it causes some super-rare minor glitches for some queries, but, gosh, this is time-series data. One data point is not a big deal. Except when the second part of the issue comes into play: Kapacitor’s stream processing.

When a partial update reaches Influx via the line protocol, it is propagated to all subscribers, including Kapacitor, ‘as is’. This creates a data entry in a Kapacitor pipeline. Moreover, the next partial update (Telegraf finishing the write) creates another data entry in Kapacitor. Both entries are incomplete.
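To make this concrete, here is roughly what such a split write looks like in the line protocol (measurement, tag and field names here are invented for illustration): a single logical point arrives as two lines sharing one timestamp:

```
disk,host=web1 used_percent=93.4 1572180000000000000
disk,host=web1 is_full=false 1572180000000000000
```

Between the first and the second line, a subscriber such as Kapacitor sees a point in which is_full simply does not exist yet.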

Incomplete entries cause one of two crazy things:

  • You have an expression lambda x: x == True or x == False, which is False for a boolean field in some data entries (logic dies somewhere here). Good luck debugging this.
  • Or you get errors because your entry lacks some fields. You expect to have both foo and bar, but because some entries are not fully processed, there is no foo or there is no bar, and, sadly, this error may be somewhere inside an | alert node (which means you get no alert in exactly the situation where you wanted one).

The solution is to use | window() so that data entries with the same timestamp get merged. This is a big gotcha and an excellent example of defensive programming in action.
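A minimal sketch of that workaround (measurement and field names are hypothetical): buffer the stream briefly so entries sharing a timestamp can be merged before you evaluate them:

```tickscript
stream
    |from()
        .measurement('disk')
    // defensive buffering: give partial writes with the same
    // timestamp a chance to merge before evaluation
    |window()
        .period(10s)
        .every(10s)
    |alert()
        .crit(lambda: "is_full" == TRUE)
```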

Reason #2: Hidden state you can’t inspect

Each data entry may create a new group, and a group can have state. Deadman is a good example: we create a group with state that reacts to a ‘lack of data’ after some time.

When you write a complicated pipeline, you may want to see what that state is. You can’t. All you have is a measly ‘| log’ node which can print to the log, but only the things passing through it (data entries). You can’t print the state. You can’t see the state. But it is there (maybe), and its presence defines the behavior of your code. Something defines the behavior of your code, yet you can’t see it. Good luck with debugging. No defensive programming strategy is available to shield against this.
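For illustration, a minimal deadman pipeline (measurement and tag names are made up). The | log node shows you the entries flowing past it, but nothing about the per-group state the deadman keeps:

```tickscript
stream
    |from()
        .measurement('heartbeat')
        .groupBy('host')
    // prints the data entries passing through -- and only those
    |log()
    // keeps hidden per-group state you can never print or inspect
    |deadman(0.0, 10m)
        .message('no data from {{ index .Tags "host" }} for 10m')
```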

Reason #3: Time as a global side effect you can’t control

Kapacitor relies heavily on wall-clock time. Half of the nodes are related to time functions: window, derivative, shift, stateDuration, etc. Moreover, time is used for joining groups.

Yet you have no control over time, neither for tests nor for debugging. If you have a ‘window’ of 30 minutes, your result will be there in 30 minutes.

… I once debugged a piece of code with a window of 24 hours. It took six iterations to fix, and one week of wall-clock time (including some of my personal weekend time), because, um… the window is 24 hours… And there is hidden state for an unknown number of groups you can’t inspect, and you can’t move wall-clock time as you want, so: 24 hours per iteration.

Reason #4: Programs with no versions

You absolutely need to acknowledge that a tick script is a program. We have best practices… no, basic rules of sanity for working with programs. A program should have a version, and Kapacitor provides no easy way to have one. There is no ‘version’ field in a tickscript. Moreover, you can’t use a hash of the program code as a version, because the code you load into Kapacitor and the code in the output of Kapacitor’s ‘show’ are different (pretty-printing and so on).

Defensive programming gives you a hand: put a special var VERSION = ‘version’ in the program and get its version by grepping the code for VERSION=. /facepalm.
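For the record, the hack looks roughly like this (TICKscript declares variables with var; the version string is whatever you put there):

```tickscript
// poor man's versioning: a variable that exists only to be
// grepped out of the `kapacitor show <task>` output
var VERSION = '1.4.2'
```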

Reason #5: tick templates are incompatible with any configuration management software

Kapacitor provides some sanity in managing scripts: you can have a task which is based on a template plus a set of variables. A few tasks can share the same template. You ship the template as code (no templating in Ansible) and you write YAML files with variables as configuration files for that code.
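A sketch of the mechanism, with entirely hypothetical names: the template declares typed variables, and each task supplies values for them from a separate vars file:

```tickscript
// cpu_alert.tick -- the template, shipped as code
var measurement string
var crit_level float

stream
    |from()
        .measurement(measurement)
    |alert()
        .crit(lambda: "usage_user" > crit_level)
```

A task is then defined from this template with something like kapacitor define cpu_high -template-id cpu_alert -vars cpu_high.json, where the vars file assigns a type and a value to each declared variable.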

Gotcha? There is no proper way to update this thing. When you upload new code, you have two options:

  1. Put code (template) first, variables after
  2. Put variables first, template after

Neither works. If you add a variable to your code, your attempt to upload the new code fails with ‘undefined variable’ (there is no such variable among the task’s variables). You need to set up variables before the template.

If you remove a variable from your code, you need to upload the template first (or the new variables would fail, because the old version of the code still expects the variable you removed).

So you can’t manage this damn thing with any automation tool. Your only approach would be ‘drop everything, re-upload’. Which would have been a reasonable solution if your application didn’t have state which is lost when you delete a task. And some state takes a very long time to refill (remember the 24-hour window!). That means you can’t do this on every run of your automation tool. You may fall back to ‘compare versions and reload if changed’, but… reason #4 — there is no way to know what code was loaded (except for the manual VERSION hack). Moreover, there is the problem of handling dependencies between tasks and templates. Your Ansible playbook becomes a thousand-line monstrosity it was never designed to be. Even if you write a Python module for it, it’s still a monstrosity full of hacks and quirks, and you spend more time maintaining this thing than on the original tick scripts, which are undebuggable (see reasons #2 and #3).

Reason #6: It’s a programming language with no libraries or modules

If you ever try (you shouldn’t) to create something viable with Kapacitor (e.g. some mild version of Shinken’s ‘business rules’, which let you raise different alerts based on the totality of the observed disaster), you find yourself with an almost assembler-like language where you can’t use modules or functions (you can reuse lambda expressions, and that’s all). Everything needs to be written from scratch in every next task or template.
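The entire extent of reuse is binding a lambda to a var and referencing it later in the same script (names here are hypothetical); there is no way to package a sub-pipeline as a function or import it into another task:

```tickscript
// a named lambda: the only reusable unit of code
var is_overloaded = lambda: "load1" > 10.0

stream
    |from()
        .measurement('system')
    |alert()
        .warn(is_overloaded)
```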

You may be tempted to use loopback, or some other hacks, and yes, to some extent you get some reasonable code reuse. But…

Reason #7: Complete lack of types for interfaces

I said it looks like assembler, and here is why: you have some type protection for elementary types (str vs. list of floats), but you have zero protection for the calling conventions between nodes and pieces of code (those pieces don’t even have a name). If one expression forgets to put something into a data entry, and another expression relies on it, you will learn about it by having no alerts when things break. Not before. Given that time is glued to the wall-clock, and you have time-dependent state which affects your code, good luck testing all branches and cases.
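A sketch of how this bites (all fields hypothetical): one node produces a field, a later node silently depends on it, and nothing checks the contract between them:

```tickscript
stream
    |from()
        .measurement('requests')
    // produces the field 'error_rate'; rename it here and nothing complains...
    |eval(lambda: "errors" / "total")
        .as('error_rate')
    |alert()
        // ...this just silently stops matching, and you get no alerts
        .crit(lambda: "error_rate" > 0.05)
```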

Reason #8: NO TESTING FRAMEWORK

You have complicated, time-based, tree-like data processing pipelines with hidden state, and you can’t test them. There is not a single working testing framework for Kapacitor.

kapacitor-unit is a good attempt to create one, but it fails to reset that nasty, hidden, unobservable state between tests. That means your tests are doing something, and one test is affected by state left over from a previous test (worse, it can be affected by old code you’ve already rewritten). Good luck dealing with this.

And did I forget to mention that time is glued to the wall-clock? Your unit test for a deadman with a threshold of 24 hours would take about… 24 hours to complete.

A litmus test for applicability

I realized I needed to get away from Kapacitor the moment I thought about what to use for monitoring in my next project: I absolutely didn’t want to see it in a new, promising project. This is a litmus test for any software. You may continue to use it (because replacing it causes more trouble than maintaining it), but if you don’t want to see it in a new project, then this software is dead.

Maybe each of these problems can be addressed and fixed. But I suspect some of them would require a big rewrite. Taken together, I’d argue it’s easier to rewrite the thing from scratch (with a better DSL and a better data model) than to fix them.

Kapacitor is dead. What’s next?

I won’t say I know the solution. Currently I’m trying to stitch together Prometheus and Alertmanager, but I don’t know if they can deliver the value I want. Old-school monitoring (circa Nagios) was annoying, but it delivered amazingly operational monitoring systems (check dependencies, flexible scheduled downtimes, great visibility). It has too much rigidity for modern systems, and $HOST$-style interfaces into scripts were too much trouble (think of the escaping needed for $ARGUMENT$ to pass through ssh into a remote script). I hope Prometheus will help, but for now I don’t know.

--

I work at Servers.com; most of my stories are about Ansible, Ceph, Python, OpenStack and Linux. My hobby is Rust.