How we managed to monitor scheduled events in our system

Tomer Guttman · Published in payu-engineering · 11 min read · Oct 13, 2022

This blog post shares a problem we encountered while implementing SLOs to measure scheduled events in our reporting system, and how we managed to solve it.

Introduction

Not long ago, terms such as Prometheus, Grafana, SLIs/SLOs, metrics, and alerts were just buzzwords to me. Then I received a task that made them all clear.

I had been working on different components of our reporting system for a while, and I started noticing a few reliability issues:

  1. We wouldn’t know if a scheduled report was created late, or not created at all, for a customer.
  2. We experienced report failures in production due to frequent changes to our queries.
  3. We lacked metrics measuring the success or failure of reports, so our visibility into them was limited.

I quickly raised a red flag, and in the next quarter I was assigned multiple tasks to solve these issues. A taste of my own medicine, you could call it.

Surprised the tasks were prioritized so quickly, I was extremely motivated to start digging into these issues.

Motivation

Reports play a crucial role in the reconciliation process for our customers. By using reports we give our customers the ability to review their funds and transactions over a specified period of time.

Furthermore, it enables our customers to analyze unauthorized transactions/frauds, accounting errors, payment status/amount mismatches, and more.

These reliability issues had to be addressed due to the importance of the reports to our customers.

We decided to implement thorough end-to-end tests and create multiple alerts for better visibility in our production environment.

Tests were the easy part, but in order to implement alerts we had to gather some metrics first. Metrics are usually pretty straightforward, but measuring scheduled events (which can occur weeks or months from now) is where it gets a bit tricky. We were up for the challenge.

Despite spending countless hours searching the internet for a lead on how to measure metrics for scheduled events, I came up short. That seemed strange: big companies must have scheduled events in their systems, so how do they monitor them? I never found out, so we had to design our own solution.

Terminology

Before diving in, let’s start with a few terms we need to understand.

Reports

We have two ways to generate reports in our system:

  1. Ad-hoc (on-demand) reports — Can be generated with a click of a button and are received within a few minutes.
  2. Scheduled reports — Created via a scheduled async flow in which you specify the frequency of the report.

The frequency of a scheduled report can be daily, weekly, or monthly, and the report is delivered to the user at 1 AM in the timezone specified.

We allow users to specify exactly which day they would like to receive the report: for monthly or weekly reports they pick the day of delivery, and for daily reports (you guessed it) the report is generated automatically every day.

Scheduled report creation through the control center
Ad-hoc (on-demand) report creation through the control center

SLOs — Service Level Objectives

SLOs are individual promises we make to a customer. SLOs are what set customer expectations and tell us what goals we need to achieve and what we need to measure ourselves against when providing a service.

Reporting-API

Reporting-API is the service in charge of generating ad-hoc and scheduled reports.

Cronus

Named after the Greek god of time ⌛ — this service is used to create recurring schedules, such as scheduled reports.

The issue

We decided that our SLO for scheduled reports would be as follows -

99.9% of scheduled reports will be generated at the time they were scheduled for, over a period of one month

How can we calculate such a thing?

We had to separate this into two sub-tasks:

  1. Identifying how many scheduled reports have started in the previous hour
  2. Identifying how many scheduled reports were expected to be generated in the previous hour

By extracting this data, we will be able to determine if we “hit” our SLO with a simple calculation -

( 1 - ( absolute( SRS - SRE ) / SRE ) ) * 100
  • SRS — Scheduled Reports Started
  • SRE — Scheduled Reports Expected

In other words, from 100% we deduct the percentage of scheduled reports that were expected but were not generated on time. This gives us exactly the SLO we wanted, since the complement is the percentage of scheduled reports that were generated at the expected time.
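As a quick illustration of the formula (a minimal sketch, not our production code), the calculation can be expressed like this:

```typescript
// Compute the SLO attainment percentage for a given window.
// srs: scheduled reports that actually started (SRS)
// sre: scheduled reports that were expected (SRE)
function scheduledReportsSlo(srs: number, sre: number): number {
  return (1 - Math.abs(srs - sre) / sre) * 100;
}

// Example: 23 reports expected but only 22 started => ~95.65%, below a 99.9% target.
console.log(scheduledReportsSlo(22, 23).toFixed(2)); // "95.65"
```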

Initial Metrics

Our first step in calculating our target SLO was to export common report metrics in Reporting-API. By adding counter metrics using Prometheus at the beginning and end of the report generation flows, we were able to measure how many ad-hoc/scheduled reports started and in which status they finished.

In other words, every time we have an incoming request to create a report (ad-hoc/ scheduled) we raise the counter by one with the relevant identifiers.

Thanks to the counter metrics implemented at the start of the report flow, we can now determine, from incoming requests to Reporting-API, how many scheduled reports started in a specific hour - which is the SRS parameter.
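For illustration, here is a minimal sketch of such a counter using the prom-client npm package (the actual Reporting-API code and label set may differ):

```typescript
import { Counter } from 'prom-client';

// Counter incremented at the start of every report generation flow.
const reportsStartedCounter = new Counter({
  name: 'reports_started_flow_counter',
  help: 'Number of report generation flows that have started',
  labelNames: ['report_type', 'is_scheduled'],
});

// Called for every incoming create-report request.
function onReportRequested(reportType: string, isScheduled: boolean): void {
  reportsStartedCounter.inc({
    report_type: reportType,
    is_scheduled: isScheduled ? '1' : '0',
  });
}
```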

The following is an example of the Reporting-API metrics when querying the /metrics endpoint exposed by the Prometheus client.

We care about reports_started_flow_counter entries whose is_scheduled label equals “1”; “0” means ad-hoc reports
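The screenshot itself is not reproduced here, but in the Prometheus exposition format the relevant lines would look roughly like this (label names other than is_scheduled are assumptions):

```
# HELP reports_started_flow_counter Number of report generation flows that have started
# TYPE reports_started_flow_counter counter
reports_started_flow_counter{report_type="payment",is_scheduled="1"} 284
reports_started_flow_counter{report_type="settlement",is_scheduled="1"} 224
```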

In this example, we can see that there are 284 scheduled payment reports and 224 scheduled settlement reports that have started in Reporting-API.

Scheduled Reports Expected

The next step was to measure SLOs specifically for our scheduled reports.

Identifying how many of our scheduled reports needed to be sent in the previous hour and comparing that to how many scheduled reports have actually started in the previous hour (in Reporting-API) was essential.

By doing that we could improve our system reliability, knowing exactly what happens to our scheduled reports and how many mismatches we have.

Let’s discuss how we’re going to calculate the SRE parameter.

Wait, what?

Identifying how many scheduled reports were expected to be generated in the previous hour

How can we do that?

The first thought we had was that we could get that information from Cronus itself since it already has all of the scheduled reports’ data.

On the other hand, we didn’t want the component in charge of sending the requests to generate scheduled reports to also manage the expectations, for a few reasons:

  1. What happens if a scheduled report request never occurs? Cronus pulls requests from a queue and sends them to a target service; if a scheduled report request never reaches the queue, Cronus won’t be aware of it.
  2. It’s not Cronus’ purpose to manage the scheduled reports’ expectations.

Eventually, we decided to implement a separate service, named SRMS, whose sole purpose is to manage expectations for scheduled events (including reports).

SRMS — Scheduled Resources Monitoring Service

The service’s main purpose is to export metrics, once an hour, of which scheduled events were expected to occur, based on the event specifications in our system: reports and anything else that runs on a schedule.

SRMS architecture

Let’s take a look at how SRMS is implemented. It has a dedicated topic which it consumes messages from and writes the data into its database.

Once an hour, a k8s cron-job triggers SRMS via a dedicated endpoint. SRMS then gathers all the scheduled events that were expected to occur in the previous hour in UTC (all schedule expressions are saved in UTC), groups them by type, and publishes the metrics using Prometheus.

How does the magic happen?

Cron expressions play a significant role in solving our issue.

cron-expression supported format

By using a cron expression we can determine when a scheduled event needed to occur.

As explained earlier, a scheduled report is sent at 1 AM in the client’s timezone and can be daily, weekly, or monthly. With a cron expression, we can describe exactly when a scheduled report is expected to be sent to the client.

💡 When we save a scheduled event in the SRMS database, we save it in UTC

weekly schedule on the left, and monthly schedule on the right

SRMS topic: how does a message arrive?

service-x publishes scheduled events messages to its topic.

In our case, an event is a scheduled report resource and its details. A dedicated kafka-stream-x consumes from that topic, transforms the message into the desired form, and publishes the new message to the SRMS topic.

A scheduled report message, for example, is mapped as follows:

The original message on the left is transformed into the message on the right, which is sent to the SRMS topic

How was this schedule_expression created?

Well, we know that all our reports are scheduled for 1 AM in the client’s timezone. We create a cron expression corresponding to that time in UTC (taking the timezone offset into consideration), which describes when the scheduled report needs to be sent.

In Africa/Abidjan the UTC offset is zero, so we don’t need to shift the time and it stays 1 AM. For a daily scheduled report we therefore get 0 1 * * * — “At 01:00” (every day).
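To illustrate the idea, here is a simplified sketch of that conversion using date-fns-tz (mentioned in the conclusion below). It ignores day-boundary shifts and fractional-hour offsets, which a real implementation has to handle:

```typescript
import { getTimezoneOffset } from 'date-fns-tz';

// Build a UTC cron expression for a daily report scheduled at 1 AM local time.
// Simplified: fractional-hour offsets (e.g. +05:30) and the day shift that
// happens when the conversion crosses midnight are not handled here.
function dailyReportCron(timezone: string, referenceDate: Date = new Date()): string {
  const offsetHours = getTimezoneOffset(timezone, referenceDate) / (60 * 60 * 1000);
  const utcHour = (((1 - offsetHours) % 24) + 24) % 24;
  return `0 ${utcHour} * * *`;
}

console.log(dailyReportCron('Africa/Abidjan')); // "0 1 * * *" (UTC offset is zero)
```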

💡 The TTL of a scheduled resource will also be calculated if a message contains an end_date field

How does SRMS look with an integrated service that uses it?

SRMS with an integrated service-x

💡 service-x can be considered as Reporting-API

As we said, service-x receives a create/delete scheduled resource request and publishes a Kafka message to its events topic. That message is consumed by service-x-kafka-stream, which parses it and produces it into the srms-topic, and then the SRMS service does its magic.

The SRMS metrics — Expectations

As described earlier, when SRMS is triggered by the k8s cron-job, we parse each record and decide whether it was supposed to occur in the previous hour, based on its schedule expression. This is done using an npm package named cron-parser, which helps us extract the previous timestamp at which a record was supposed to occur.

Once we have extracted the timestamp of the last time a record was supposed to be triggered, we can determine whether to include it in the expectation results.
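As a rough sketch of that check using cron-parser (the function name and exact window handling here are illustrative):

```typescript
import { parseExpression } from 'cron-parser';

// Was this schedule expected to fire in the hour ending at windowEnd (UTC)?
function wasExpectedInPreviousHour(scheduleExpression: string, windowEnd: Date): boolean {
  const interval = parseExpression(scheduleExpression, { currentDate: windowEnd, utc: true });
  const previousOccurrence = interval.prev().toDate();
  const windowStart = new Date(windowEnd.getTime() - 60 * 60 * 1000);
  return previousOccurrence >= windowStart && previousOccurrence < windowEnd;
}

// Example: "0 1 * * *" was expected in the 01:00-02:00 UTC window.
console.log(wasExpectedInPreviousHour('0 1 * * *', new Date('2022-10-13T02:00:00Z'))); // true
```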

Now that we have all the records that were supposed to be triggered by Cronus in the previous hour, we group and count them by resource type and label, and export the metrics to Prometheus.

Eventually, we receive a response like this:
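The screenshot is not reproduced here; as a rough illustration (the metric and label names are assumptions), the exported expectation metrics could look like this:

```
# HELP expected_scheduled_resources Scheduled resources expected to occur in the previous hour
# TYPE expected_scheduled_resources gauge
expected_scheduled_resources{resource_type="settlement_report"} 14
expected_scheduled_resources{resource_type="payment_report"} 9
```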

This means there are 14 settlement reports and 9 payment reports that were supposed to be triggered in the previous hour. In total for scheduled reports, the SRE (Scheduled Reports Expected) parameter equals 23.

The only thing left is to compare those results with how many scheduled reports have started in Reporting-API (SRS - Scheduled Reports Started).

Putting the pieces together

We have identified how many of our scheduled reports were expected to be sent in the previous hour (SRE), and we know how many scheduled reports actually started in the previous hour (SRS). Now we combine the results using a query in Grafana (the query goes to Thanos).

The query we used in Grafana (PromQL)

💡 We decided that our source of truth would be the SRMS expectations (SRE), which is why we divide the absolute difference by the SRMS expectations

The query on top gave us the following graph:

In this graph, we can see that during the last day/ week 100% of our scheduled reports were sent on time to our clients
If we dive into reports, we will see that 44/44 scheduled reports were created as expected during the last day — while 1/1 were sent as expected in the last hour

The next step was to implement alerts, so that every time the following expression drops below 99.9%, we receive an alert in our Slack channel.

( 1 - ( absolute( SRS - SRE ) / SRE ) ) * 100 < 99.9

That’s basically it. It took time and effort, but eventually we managed to gather solid, stable data and create alerts based on it.

After one month of monitoring, we saw that the SLO we defined for scheduled reports was solid. At the time of writing this blog, 100% of our scheduled reports were sent right on time, and if for some reason it ever drops below 99.9%, we’ll be sure to know about it.

Conclusion

A couple of things that I would like you to take away from this post:

  1. The process of monitoring scheduled events can be complex and time-consuming, and the numbers may differ slightly at the end of the journey due to metric time windows and the way the Prometheus query language (PromQL) behaves when querying the data.
  2. Working with timezones and DST (Daylight Saving Time) is a living nightmare 💀 — We had many bugs due to timezone calculations when generating the cron expressions for scheduled reports. One of the packages that saved our lives was date-fns-tz, which let us extract the UTC offset of a timezone, including DST, for a timestamp even a few years in the past!
  3. When it comes to metrics for scheduled events we need 100% accuracy, and Prometheus is not the right tool for that job. If I had to choose a different technology, Elasticsearch would be my choice - it’s fast and lets us query the data in almost any way we desire, so our data would be accurate. Had we used it, we could have queried the metrics by the exact time window we wanted, which would have helped us solve the adjustment issues we ran into when trying to find the “sweet spot” for when to send the query to Thanos to extract metrics from Reporting-API.

So there we have it! Hopefully, now you’ll have the tools to monitor your own scheduled events ✌️
