Collecting job metrics using Prometheus PushGateway

Published in Peaksys Engineering · 4 min read · Mar 21, 2023

Peaksys, Cdiscount’s tech subsidiary, offers comprehensive observability platforms that give IT teams full control over increasingly complex and distributed environments.

With more than 2,000 micro-services and several hundred cron jobs, addressing all our observability needs is essential, whether for pure applications or batch jobs.

Our metrics management platform collects more than 1,300,000 samples (metric values) per second and retains around 550,000 samples.

How Peaksys’s metrics platform works

Our metrics platform combines Prometheus with Thanos to scale the infrastructure.

We only work in pull mode, meaning the Prometheus instance scrapes the metrics from our hosts’ exporters and applications at regular intervals.

This method applies in 99% of cases, but there are always special cases for which it is not suitable.

When is pull-mode infrastructure not appropriate?

Pull mode requires the Prometheus instance to be able to reach the target at all times in order to collect its metrics. This is not a problem for most applications, but it does become one for a batch or a cron job, for example.

When running a job, we have no guarantee that the metrics will be collected at the right time, since this depends as much on how long the job runs as on the moment the collection takes place. If the collection interval is longer than the job's run time, the job can start and finish between two collections without ever being seen.
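To put a rough number on this, here is a minimal sketch under a simplifying assumption: scrapes fire at a fixed interval and the job starts at a uniformly random moment within that cycle. The function name and the figures are hypothetical, chosen only for illustration.

```python
def scrape_hit_probability(job_duration_s: float, scrape_interval_s: float) -> float:
    """Chance that at least one scrape lands while the job is running.

    Assumes scrapes fire every `scrape_interval_s` seconds and that the job
    starts at a uniformly random offset within that cycle.
    """
    if job_duration_s < 0 or scrape_interval_s <= 0:
        raise ValueError("duration must be >= 0 and interval must be > 0")
    return min(1.0, job_duration_s / scrape_interval_s)


# Hypothetical figures: a 30 s scrape interval and jobs of varying length.
for duration in (3, 15, 45):
    probability = scrape_hit_probability(duration, 30)
    print(f"{duration:>2} s job: {probability:.0%} chance of being scraped")
```

Under these assumptions, a 3-second job is seen only about one time in ten, and only jobs that run at least as long as the interval are guaranteed to be scraped.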

We can artificially extend the run time at the end of a job, but this is not a desirable solution since it would artificially change the production’s execution plan. Increasing the collection frequency is not an acceptable solution either since it would increase the load on the CPU and the network while unnecessarily increasing the storage of metrics in most cases.

An alternative push-based solution

Our observation was simple: if our application cannot be scraped in pull mode, then it must send its metrics in push mode on its own. This way we can retrieve useful data at the right time without increasing infrastructure costs.

Prometheus offers a gateway called PushGateway that can receive metrics and expose them without changing the metric’s content.

By placing this gateway at the same level as an application in our architecture, we can collect it just like we would for any other target.
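To make the push side concrete, the sketch below builds the kind of HTTP request a job sends to the gateway: a PUT of plain-text samples to the `/metrics/job/<job>` path. It uses only the standard library and does not send anything. The helper name, gateway address, job name and metric are hypothetical, and the sketch assumes job names without slashes.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway: str, job: str, samples: dict[str, float]) -> urllib.request.Request:
    """Build (without sending) a PUT that replaces the metrics of one job group."""
    if "/" in job:
        raise ValueError("this sketch only handles job names without '/'")
    # One "name value" line per sample, in the Prometheus text exposition format.
    body = "".join(f"{name} {value}\n" for name, value in samples.items())
    url = f"http://{gateway}/metrics/job/{quote(job, safe='')}"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical gateway address, job name and metric, for illustration only.
request = build_push_request("localhost:9091", "nightly_export", {"rows_exported": 1250})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"), end="")
# To actually send it: urllib.request.urlopen(request)
```

Prometheus then scrapes the gateway's own `/metrics` endpoint like any other target and finds the pushed samples there.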

The gateway should be a solution of last resort since it comes with certain drawbacks:

- Metrics sent to the gateway never expire. In practice, they remain exposed until their value is overwritten or the metric is deleted, either manually or when the gateway is restarted. A stale value therefore looks exactly like a fresh one, so we cannot, for example, monitor the execution of a job through the metrics it pushed.
- Normally, we would check a target’s collection status to quickly see whether there was a problem. With the gateway in between, Prometheus only checks the gateway itself, so we cannot verify whether a job’s metrics were pushed correctly.

As a consequence, push mode should be limited to jobs, since this is the only case where pull-based collection is not suitable.
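One possible mitigation for the first drawback, not described as part of our setup, relies on the `push_time_seconds` metric that the Pushgateway exposes for each pushed group. The sketch below reads the age of the last push from scraped text. The function name and the sample text are hypothetical, and the parsing is deliberately naive (it assumes label values contain no commas).

```python
import time

PREFIX = "push_time_seconds{"


def seconds_since_last_push(exposition_text, job, now=None):
    """Return the age in seconds of the latest push for `job`, or None if absent."""
    now = time.time() if now is None else now
    newest = None
    for line in exposition_text.splitlines():
        if not line.startswith(PREFIX):
            continue
        # Split 'label="x",job="y"} 1.6e+09' into the label list and the value.
        label_text, separator, value = line[len(PREFIX):].rpartition("} ")
        if not separator:
            continue
        if f'job="{job}"' in label_text.split(","):
            pushed_at = float(value)
            newest = pushed_at if newest is None else max(newest, pushed_at)
    return None if newest is None else now - newest


# Hypothetical excerpt of what a scrape of the gateway could return.
SAMPLE = '''# illustrative sample, not real output
push_time_seconds{instance="",job="MyApp_Pushgateway"} 1.6794e+09
'''

age = seconds_since_last_push(SAMPLE, "MyApp_Pushgateway", now=1679400600.0)
print(f"last push was {age:.0f} s ago")
```

An alert on this age, compared with the job's expected schedule, can flag a job that has silently stopped pushing.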

How did we implement the solution?

First, the solution was deployed on virtual machines. We have since implemented it in Kubernetes to be more flexible in assigning resources and are currently migrating the applications.

On the application side, the job uses the Prometheus client library to push its metrics to the Pushgateway.

# Example of implementation in Python
import os
import time

from prometheus_client import CollectorRegistry, Counter, Gauge, push_to_gateway

GATEWAY = 'localhost:9091'
JOB = 'MyApp_Pushgateway'

# Dedicated registry: only the metrics declared below are pushed
registry = CollectorRegistry()

# Metrics initialization
g = Gauge('random_per_second', 'Random number to simulate gauge', ['application'], registry=registry)
c = Counter('basic_count', 'Basic counter simulation', ['application'], registry=registry)

while True:
    # Update both metrics, then push the whole registry once per iteration
    random_per_second = int.from_bytes(os.urandom(3), 'little')
    g.labels('MyApp').set(random_per_second)
    c.labels('MyApp').inc()
    push_to_gateway(GATEWAY, job=JOB, registry=registry)

    time.sleep(10)

We must also remember to disable the usual exposure of the “/metrics” path so that Prometheus does not collect our metrics twice: first through the pushgateways and then again by pulling directly from the application.

To give an idea of scale, we collect 35,000 samples per second through the pushgateways, which represents about 2.5% of our total volume.

To conclude, this hybrid push-pull architecture via the gateway meets our needs without adding complexity to our metrics platform overall.

Reference:
- https://prometheus.io/docs/practices/pushing/

Written by Benjamin Ameztoy