VictoriaMetrics: PromQL compliance

Roman Khavronenko
8 min read · Oct 8, 2021


MetricsQL is a query language inspired by PromQL. It is the primary query language in VictoriaMetrics, a time series database and monitoring solution. MetricsQL claims to be backward-compatible with PromQL, so Grafana dashboards backed by a Prometheus datasource should work the same after switching from Prometheus to VictoriaMetrics.

Photo by Bernard Spragg

However, VictoriaMetrics is not 100% compatible with PromQL and never will be. Please read on and we will discuss why that is.

For a long time, there was no way to measure compatibility with PromQL. There was not even a fully defined PromQL specification. But some time ago the Prometheus Conformance Program was announced with the aim of certifying software with a mark of compatibility with Prometheus: "Upon reaching 100%, the mark will be granted". The open-source tool prometheus/compliance was created to check for compatibility.

Compatibility is measured in quite a simple way: the tool requires a configuration file with a list of PromQL queries to run, a Prometheus server to use as a reference, and the software meant to be tested. The tool sends each PromQL query to both Prometheus and the tested software, and if their responses don't match, it marks the query as failed.
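The idea behind the tool can be shown with a minimal Go sketch. This is not the actual prometheus/compliance code: it simply sends the same instant query to the standard /api/v1/query endpoint of a reference Prometheus and of the tested system and compares the raw responses, while the real tool parses and normalizes the results and accounts for known query tweaks. The endpoints and the query below are assumptions for illustration only.

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// queryInstant sends an instant query to a Prometheus-compatible
// /api/v1/query endpoint and returns the raw JSON response body.
func queryInstant(baseURL, query, ts string) (string, error) {
	u := baseURL + "/api/v1/query?" + url.Values{
		"query": {query},
		"time":  {ts},
	}.Encode()
	resp, err := http.Get(u)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// Assumed local endpoints: Prometheus on :9090, VictoriaMetrics on :8428.
	const q = `rate(demo_cpu_usage_seconds_total[5m])`
	const ts = "1633504838"

	ref, _ := queryInstant("http://localhost:9090", q, ts)
	test, _ := queryInstant("http://localhost:8428", q, ts)

	// A byte-for-byte comparison is enough to show the principle.
	if ref == test {
		fmt.Println("PASS:", q)
	} else {
		fmt.Println("FAIL:", q)
	}
}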

Compliance testing

We ran compatibility testing between Prometheus v2.30.0 and VictoriaMetrics v1.67.0 and got the following result:

====================================================================
General query tweaks:
* VictoriaMetrics aligns incoming query timestamps to a multiple of the query resolution step.
====================================================================
Total: 385 / 529 (72.78%) passed, 0 unsupported

According to the test, VictoriaMetrics failed 144 of the 529 queries and matched Prometheus 72.78% of the time. Let’s take a closer look at the queries that failed.

Keeping metric name

According to PromQL, functions that transform a metric's data should drop the metric name from the result, since the meaning of the initial metric has changed. However, this approach has some drawbacks. For example, the max_over_time function calculates the max value of a series without changing its physical meaning, therefore MetricsQL keeps the metric name for such functions. It also enables queries over multiple metric names: max_over_time({__name__=~"process_(resident|virtual)_memory_bytes"}[1h]), while in PromQL such a query fails with a vector cannot contain metrics with the same labelset error.
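To see why PromQL has to reject that query, here is a toy Go sketch (with hypothetical label values, not the actual Prometheus code): once the metric name is dropped, the two memory metrics collapse into the same label set, so the result vector would contain duplicates.

package main

import (
	"fmt"
	"sort"
	"strings"
)

// keyWithoutName builds a deterministic string for a label set with __name__
// removed, roughly what happens when PromQL drops the metric name after
// applying a transforming function.
func keyWithoutName(labels map[string]string) string {
	parts := make([]string, 0, len(labels))
	for name, value := range labels {
		if name == "__name__" {
			continue
		}
		parts = append(parts, name+"="+value)
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

func main() {
	series := []map[string]string{
		{"__name__": "process_resident_memory_bytes", "instance": "demo:10000", "job": "demo"},
		{"__name__": "process_virtual_memory_bytes", "instance": "demo:10000", "job": "demo"},
	}
	seen := map[string]bool{}
	for _, labels := range series {
		k := keyWithoutName(labels)
		if seen[k] {
			fmt.Println("vector cannot contain metrics with the same labelset")
			return
		}
		seen[k] = true
	}
	fmt.Println("ok")
}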

Hence, test-suite functions like *_over_time, ceil, floor, round, clamp_*, holt_winters and predict_linear in VictoriaMetrics intentionally keep the metric name in the results:

QUERY: avg_over_time(demo_memory_usage_bytes[1s])
- Metric: s`{instance="demo.promlabs.com:10002", job="demo", type="buffers"}`,
+ Metric: s`demo_memory_usage_bytes{instance="demo.promlabs.com:10002", job="demo", type="buffers"}`,

There were 92 (~17% of 529 tests total) such queries in the test suite which failed only because the metric name is present in the response from VictoriaMetrics, while the values in the responses are identical. VictoriaMetrics isn't going to change this behavior, since its users find it more logical and rely on it.

Better rate()

rate and increase are among the most frequently used functions in PromQL. While the logic behind these two is relatively simple and clear, the devil is in the details.

MetricsQL intentionally has a slightly different implementation of rate and increase . It takes into account the last sample on the previous interval which allows capturing all the information from the time series when calculating rate or increase:

VictoriaMetrics captures the last data point from the previous interval when calculating increase()

Prometheus, in this case, loses the metric increase between the last sample in the previous interval and the first sample of the current interval:

Prometheus accounts only for points captured by the interval when calculating increase(), losing the data that was before it.
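The difference can be illustrated with a toy Go sketch. This is not the code of either project: counter resets and Prometheus' extrapolation are left out, and the sample values are made up. A window-only calculation loses the growth between the last sample before the window and the first sample inside it, while taking the previous sample into account keeps it.

package main

import "fmt"

// sample is a single (timestamp, value) pair of a counter time series.
type sample struct {
	ts  int64   // unix timestamp, seconds
	val float64 // counter value
}

// increaseWindowOnly computes counter growth using only the samples inside
// the lookbehind window (start, end], roughly how Prometheus selects data
// (its extrapolation logic is left out to keep the sketch short).
func increaseWindowOnly(series []sample, start, end int64) float64 {
	var in []sample
	for _, s := range series {
		if s.ts > start && s.ts <= end {
			in = append(in, s)
		}
	}
	if len(in) < 2 {
		return 0
	}
	return in[len(in)-1].val - in[0].val
}

// increaseWithPrevSample also takes into account the last sample just before
// the window, as MetricsQL does, so the growth between that sample and the
// first sample inside the window is not lost.
func increaseWithPrevSample(series []sample, start, end int64) float64 {
	var prev *sample
	var in []sample
	for i, s := range series {
		if s.ts <= start {
			prev = &series[i]
		} else if s.ts <= end {
			in = append(in, s)
		}
	}
	if len(in) == 0 {
		return 0
	}
	first := in[0].val
	if prev != nil {
		first = prev.val
	}
	return in[len(in)-1].val - first
}

func main() {
	// A hypothetical counter scraped every 15 seconds.
	series := []sample{{0, 10}, {15, 12}, {30, 15}, {45, 19}, {60, 24}}

	fmt.Println(increaseWindowOnly(series, 15, 60))     // 9: the 12 -> 15 growth is lost
	fmt.Println(increaseWithPrevSample(series, 15, 60)) // 12: the full growth is captured
}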

Additionally, MetricsQL automatically increases the interval in square brackets (aka lookbehind window) if there aren't enough samples in the interval for calculating rate and increase. This solves the issue of unexpected "No Data" errors when zooming in.

MetricsQL doesn't apply extrapolation when calculating rate and increase. This solves the issue of fractional increase() results over integer counters:

increase() query over time series generated by integer counter results in decimal values for Prometheus due to extrapolation.

It is quite important to choose the correct lookbehind window for rate and increase in Prometheus. Otherwise, incorrect or no data may be returned. Grafana even introduced a special variable $__rate_interval to address this issue, but it may cause more problems than it solves:

  • Users need to configure the scrape interval value in datasource settings to get it to work;
  • Users still need to add $__rate_interval manually to every query that uses rate;
  • It won't work if the datasource stores metrics with different scrape intervals (e.g. global view across multiple datasources);
  • It only works in Grafana.

In MetricsQL, the lookbehind window in square brackets may be omitted. VictoriaMetrics automatically selects the lookbehind window depending on the current step, so rate(node_network_receive_bytes_total) works just like rate(node_network_receive_bytes_total[$__interval]). And even if the interval is too small to capture enough data points, MetricsQL automatically expands it. That's why queries like deriv(demo_disk_usage_bytes[1s]) return no data in Prometheus, while VictoriaMetrics expands the lookbehind window before making the calculation.

There are 39 (~7% of 529 tests total) queries (rate, increase, deriv, changes, irate, idelta, resets, etc.) exercising this logic, which causes the difference in results between VictoriaMetrics and Prometheus:

QUERY: rate(demo_cpu_usage_seconds_total[5m])
- Value: Inverse(TranslateFloat64, float64(1.9953032056421414)),
+ Value: Inverse(TranslateFloat64, float64(1.993400981075324)),

For more details about how rate/increase works in MetricsQL, please check the docs and the example on GitHub.

NaNs

NaNs are unexpectedly complicated. Let's begin with the fact that in Prometheus there are two types of NaNs: normal NaNs and stale NaNs. Stale NaNs are used as "staleness markers": special values that identify a time series which has become stale. VictoriaMetrics didn't initially support this, because it needs to integrate with many systems beyond just Prometheus and had to detect staleness uniformly for series ingested via Graphite, Influx, OpenTSDB and other supported data ingestion protocols. Support for Prometheus staleness markers was added recently.

Normal NaNs are results of mathematical operations, e.g. 0/0=NaN. However, in OpenMetrics there is no special meaning or use case for NaNs.

While NaNs are expected when evaluating mathematical expressions, it is not clear how useful they are for users, or whether there is any benefit in returning NaNs in the result. It looks like the opposite is true, because users are often confused by the results they receive.

MetricsQL consistently removes NaNs from query responses. This behavior is intentional, because there is no meaningful way to use such results. That's why test queries such as demo_num_cpus * NaN or sqrt(-demo_num_cpus) return an empty response in MetricsQL and NaNs in PromQL.
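As a rough Go sketch of the idea (not the actual MetricsQL code), dropping NaNs simply means filtering them out of the result values:

package main

import (
	"fmt"
	"math"
)

// dropNaNs removes NaN values from a slice of samples, mirroring the idea of
// MetricsQL deleting NaNs from query responses.
func dropNaNs(values []float64) []float64 {
	out := make([]float64, 0, len(values))
	for _, v := range values {
		if !math.IsNaN(v) {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	// sqrt of a negative number yields NaN, as in sqrt(-demo_num_cpus).
	values := []float64{math.Sqrt(-4), 2, 8}
	fmt.Println(dropNaNs(values)) // [2 8]
}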

There were 6 (~1% of 529 tests total) queries in the test suite expecting NaNs in responses: sqrt(-metric), ln(-metric), log2(-metric), log10(-metric) and metric * NaN.

Negative offsets

VictoriaMetrics supports negative offsets, and so does Prometheus starting with version 2.26 if a specific feature flag is enabled. However, query results differ even with the feature flag enabled, because Prometheus continues the last value of the metric for an additional 5 minutes:

VictoriaMetrics vs Prometheus negative offset query. The VictoriaMetrics response is shifted by 1e7 to show the difference between the lines visually. Without this shift, they are identical except for the last 5 minutes.
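That 5-minute tail is consistent with Prometheus' lookback delta: if there is no sample exactly at the requested timestamp, the most recent sample within the last 5 minutes (the default lookback) is reused. The following simplified Go model of that rule (not the actual Prometheus code, with made-up timestamps) shows why a query shifted past the freshest data still returns the last value for up to 5 minutes.

package main

import "fmt"

// valueAt returns the value of the most recent sample at or before ts that is
// no older than lookback seconds, and whether such a sample exists. This is a
// simplified model of Prometheus' default 5-minute lookback delta.
func valueAt(tss []int64, vals []float64, ts, lookback int64) (float64, bool) {
	for i := len(tss) - 1; i >= 0; i-- {
		if tss[i] <= ts {
			if ts-tss[i] <= lookback {
				return vals[i], true
			}
			return 0, false
		}
	}
	return 0, false
}

func main() {
	// A hypothetical series scraped every 15 seconds, last sample at t=600.
	tss := []int64{570, 585, 600}
	vals := []float64{7, 8, 9}

	// 100s after the last sample: the last value is still "continued".
	fmt.Println(valueAt(tss, vals, 700, 300)) // 9 true

	// 400s after the last sample: outside the 5m lookback, no data.
	fmt.Println(valueAt(tss, vals, 1000, 300)) // 0 false
}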

Such behavior was unexpected to us. To get more details about it, please check the related discussion:

VictoriaMetrics isn't going to change the logic of negative offsets, because this feature was released two years before Prometheus added it and users rely on it.

There were 3 (~0.5% of 529 tests total) queries for -1m, -5m, -10m offsets in the test suite:

QUERY: demo_memory_usage_bytes offset -1m
RESULT: FAILED: Query succeeded, but should have failed.

Precision loss

VictoriaMetrics fails the following test case:

QUERY: demo_memory_usage_bytes % 1.2345
Timestamp: s"1633073960",
- Value: Inverse(TranslateFloat64, float64(0.038788650870683394)),
+ Value: Inverse(TranslateFloat64, float64(0.038790081382158004)),

The result is indeed different: it is off at the 5th digit after the decimal point, and the reason for this lies not in MetricsQL but in VictoriaMetrics itself. The query result differs because the raw data point value for this specific metric doesn't match between Prometheus and VictoriaMetrics:

curl http://localhost:9090/api/v1/query --data-urlencode 'query=demo_memory_usage_bytes{instance="demo.promlabs.com:10000", type="buffers"}' --data-urlencode 'time=1633504838' 
..."value":[1633504838,"148164507.40843752"]}]}}%
curl http://localhost:8428/api/v1/query --data-urlencode 'query=demo_memory_usage_bytes{instance="demo.promlabs.com:10000", type="buffers"}' --data-urlencode 'time=1633504838'
..."value":[1633504838,"148164507.4084375"]}]}}%

VictoriaMetrics may reduce the precision of values with more than 15 decimal digits due to the compression algorithm used. If you want more details about how and why this happens, please read the "Precision loss" section in Evaluating Performance and Correctness. In fact, any solution that works with floating-point values has precision loss issues because of the nature of floating-point arithmetic.
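As a quick Go illustration of the observed difference (a rough model only, not how the VictoriaMetrics encoder actually works), keeping 16 significant decimal digits of the raw Prometheus value reproduces the value returned by VictoriaMetrics above:

package main

import (
	"fmt"
	"strconv"
)

func main() {
	// Raw value returned by Prometheus for demo_memory_usage_bytes.
	v := 148164507.40843752

	// Keep only 16 significant decimal digits; this merely illustrates the
	// effect of reduced precision, not the actual compression algorithm.
	rounded, _ := strconv.ParseFloat(strconv.FormatFloat(v, 'g', 16, 64), 64)

	fmt.Println(v)       // 1.4816450740843752e+08
	fmt.Println(rounded) // 1.481645074084375e+08
}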

While such precision loss may be important in rare cases, it doesn't matter in most practical cases because the measurement error is usually much larger than the precision loss.

While VictoriaMetrics does have higher precision loss than Prometheus, we believe it is completely justified by the compression gains our solution generates. Moreover, only 3 (~0.5% of 529 tests total) queries from the test suite fail due to precision loss.

Query succeeded, but should have failed

The following query fails for PromQL but works in MetricsQL:

QUERY: {__name__=~".*"}
RESULT: FAILED: Query succeeded, but should have failed.

PromQL rejects such a query to prevent database overload, because the query selects all the metrics from the database. At the same time, PromQL does not prevent a user from running the almost identical query {__name__=~".+"}, which serves the same purpose.

Another example of a failing query is the following:

QUERY: label_replace(demo_num_cpus, "~invalid", "", "src", "(.*)")
RESULT: FAILED: Query succeeded, but should have failed.

The query fails in PromQL because the ~ character isn't allowed in label names. VictoriaMetrics accepts data ingestion from various protocols and systems where such a character is allowed, so it has to support a wider set of allowed characters.

There were 2 (~0.3% of 529 tests total) queries that failed because of this incompatibility, but we can’t imagine a situation where it would harm a user’s experience.

Summary

There are differences between MetricsQL and PromQL. MetricsQL was created long after PromQL, with the goal of improving the user experience and making the language easier to use and understand.

The way compatibility is measured in the Prometheus Conformance Program isn't ideal, because it mostly shows whether the tested software uses the Prometheus PromQL library under the hood. This is particularly complicated for solutions written in programming languages other than Go.

By the way, the percentage of failing tests is easy to increase or decrease by changing the number of range intervals (e.g. 1m, 5m, etc.) in the tests. In the case of VictoriaMetrics, about 90 tests failed not because of wrong calculations, but because the metric name is present in the response. Of course, there is no ideal way to be fair to everyone. That's why this post exists: to explain the differences.

We also want to say a big thank you to Julius Volz, the author of these compliance tests. Thanks to his work and patience we were able to fix most of the real incompatibility issues in MetricsQL.


Roman Khavronenko

Distributed systems engineer. Co-founded VictoriaMetrics.