Avoiding combinatorial explosion in continuous queries in Influx

How not to do CQs, and how to do them right

George Shuklin
OpsOps
Feb 15, 2019

My small development environment suddenly stopped producing performance data. I looked in a few places and quickly realized that Influx and its continuous queries were lagging behind. Influx was consuming 130% CPU and had a backlog of about 5 minutes (with an ‘every 1 minute’ CQ, that means at least three missed CQs).

My initial suspicion fell on a newly connected application: ‘oh no, they send too much’. But the issue persisted even after I killed that application and dropped its measurements.

So I had to dig deeper.

I’ll omit my attempts and guesses. Here is the answer: NEVER use regular expressions for measurement names in CQs.

Long explanation

The official Influx documentation provides the following example of a CQ:

Example 3: Automatically downsampling a database with backreferencing

Use a function with a wildcard (*) and INTO query’s backreferencing syntax to automatically downsample data from all measurements and numerical fields in a database.

CREATE CONTINUOUS QUERY "cq_basic_br" ON "transportation"
BEGIN
SELECT mean(*) INTO "downsampled_transportation"."autogen".:MEASUREMENT FROM /.*/ GROUP BY time(30m),*
END

cq_basic_br calculates the 30-minute average of passengers and complaints from every measurement in the transportation database (in this case, there’s only the bus_data measurement). It stores the results in the downsampled_transportation database.

cq_basic_br executes at 30 minutes intervals, the same interval as the GROUP BY time() interval. Every 30 minutes, cq_basic_br runs a single query that covers the time range between now() and now() minus the GROUP BY time() interval, that is, the time range between now() and 30 minutes prior to now().

This is bullshit. Do not trust them.

If you have more than one measurement in a database, and those measurements have different fields/tags, you will have a combinatorial explosion during CQ execution.

You can see this for yourself by running this query:

SELECT mean(*) FROM /.*/ GROUP BY time(30m),*

If you have many measurements, you will see crazy output: lines and lines of field names, a very long list of tags (all tags combined!), and screens and screens of empty space with occasional values.

The reason is that Influx combines all fields (from all measurements) in the SELECT part, and groups by all tags (from all measurements).
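
To get a feel for why the output is mostly empty space, here is a toy model in Python. It is only a sketch of the effect described above, not how Influx works internally, and the measurement, field, and tag names are made up for illustration. It builds the union of fields and tags across measurements and counts how many cells of the merged result could actually hold a value.

```python
# Toy model only: hypothetical measurements, NOT real Influx internals.
measurements = {
    "cpu":  {"fields": {"usage_user", "usage_system"}, "tags": {"host", "cpu"}},
    "disk": {"fields": {"used", "free"},               "tags": {"host", "device"}},
    "mem":  {"fields": {"used", "available"},          "tags": {"host"}},
}


def merged_shape(measurements):
    """Return the union of all field names and all tag names (sorted)."""
    all_fields = set().union(*(m["fields"] for m in measurements.values()))
    all_tags = set().union(*(m["tags"] for m in measurements.values()))
    return sorted(all_fields), sorted(all_tags)


def fill_ratio(measurements):
    """Fraction of (measurement, field) cells that can hold a value
    when every measurement is shown with the combined field list."""
    all_fields, _ = merged_shape(measurements)
    filled = sum(len(m["fields"]) for m in measurements.values())
    total = len(all_fields) * len(measurements)
    return filled / total


if __name__ == "__main__":
    fields, tags = merged_shape(measurements)
    print("combined fields:", fields)
    print("combined tags:  ", tags)
    print("fill ratio:      {:.0%}".format(fill_ratio(measurements)))
```

In this model each new measurement with its own fields widens every row, so the share of non-empty cells keeps shrinking as measurements are added.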

Somehow the INTO part of the CQ deals with this and puts everything where it belongs. Nevertheless, there is a combinatorial explosion (a Cartesian product?) over all possible values. Even on very fast machines it takes an eternity to crunch this madness. And it gets exponentially worse with each measurement you add (a few more items in the telegraf plugin list). That’s why I hadn’t noticed this earlier. I added just four new plugins, and my test installation dropped from a respectable ‘everything is OK with CPU < 10%’ to a miserable ‘lagging behind and hogging CPU’. Just four simple plugins! I’m glad it happened in my lab and not in production.

What to do?

The answer is a bit sad, but it lets you tame CQs. Instead of a few CQs handling all measurements in a database, create per-measurement CQs. With a few retention policies this causes another combinatorial explosion, but a much more contained one: the number of CQs is the product of a (usually fixed) number of retention policies and the number of measurements, that is, O(n) in the number of measurements.
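
Since nobody wants to write those CQs by hand, a small generator helps. Below is a minimal Python sketch that prints one CREATE CONTINUOUS QUERY statement per measurement and per target retention policy. The database name, the retention policy names (`rp_30m`, `rp_1h`), the CQ naming scheme, and the helper functions are all my assumptions for illustration; adjust them to your setup and review the output before feeding it to Influx.

```python
def build_cq(database, measurement, source_rp, target_rp, interval):
    """Build one per-measurement CQ statement (no regex in FROM)."""
    name = f"cq_{measurement}_{target_rp}"  # hypothetical naming scheme
    return (
        f'CREATE CONTINUOUS QUERY "{name}" ON "{database}" BEGIN '
        f'SELECT mean(*) INTO "{database}"."{target_rp}"."{measurement}" '
        f'FROM "{database}"."{source_rp}"."{measurement}" '
        f'GROUP BY time({interval}), * END'
    )


def build_all(database, measurements, source_rp, targets):
    """One CQ per (measurement, target retention policy) pair."""
    return [
        build_cq(database, measurement, source_rp, target_rp, interval)
        for measurement in measurements
        for target_rp, interval in targets
    ]


if __name__ == "__main__":
    # Assumed example values: two measurements, two downsampling targets.
    statements = build_all(
        database="telegraf",
        measurements=["cpu", "disk"],
        source_rp="autogen",
        targets=[("rp_30m", "30m"), ("rp_1h", "1h")],
    )
    for statement in statements:
        print(statement)
```

Two measurements and two retention policies give four statements; every new measurement adds one more per retention policy.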

The main downside of all this is that you lose the ability to automatically process new measurements with a single CQ. Each new measurement needs a few more CQs.

I work at Servers.com, most of my stories are about Ansible, Ceph, Python, Openstack and Linux. My hobby is Rust.