The road to world-class monitoring at Azimo
When you think about monitoring IT systems, you can distinguish three phases of maturity:
- Your customers are your monitoring — they are the only source of knowledge if something goes wrong.
- Your customers are faster than your monitoring — they tell you when something goes wrong, and later you can see it clearly in your logs.
- Your monitoring is faster than your customers — you know that something doesn’t work properly before anyone notices it.
If you want to be serious about sending money worldwide, as we do, staying at the first or even the second level is not an option. When thousands of people send their money home, day and night, relying on the quality of your service, you need to keep your finger on the pulse at all times.
To do that at Azimo, we introduced on-call shifts a few years ago. Since then, dedicated engineers have taken care of any problems in our production systems, including outside working hours. The solution was simple: integrating our existing monitoring alerts with PagerDuty, which helped us structure the whole process. And while the initial idea was great in theory, in practice we opened Pandora’s box. It took us multiple weeks and many sleepless nights to stabilize things, but in the end we built monitoring that was good enough for our needs.
This year we decided to move on, as “good enough” was no longer satisfactory for us. We kept discovering blind spots in our alerting system, we dealt with a large share of false positives, and we found flaws stopping us from scaling up. With all of that in mind, we outlined the desired monitoring system, which brought us to one of this year’s objectives for the platform engineering team: to build world-class monitoring at Azimo. This article will walk you through the pyramid that reflects our priorities for that goal.
Automated monitoring dashboards
We have multiple teams working on different domains, each having its own monitoring dashboards with different panels, error descriptions, and monitoring rules.
To lay the groundwork for future improvements, we decided to standardize all dashboards. Our SRE & DevOps team implemented a solution that keeps Grafana monitoring configuration as code in the repository of each microservice. Thanks to that, deploying a dashboard became part of our CI/CD process.
This unlocked many other benefits. All microservices now share a single standard of monitoring panels, layouts, and alert descriptions. We can still fine-tune each of them, for example, the specific alert thresholds. But thanks to standardization, we can now roll out improvements to the dashboards of all microservices with a single click.
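To make the idea concrete, here is a minimal Python sketch of how a standardized dashboard template with per-service threshold overrides might be generated. The panel names, default thresholds, and structure are illustrative assumptions, not Azimo’s actual configuration; in a real setup, the resulting JSON would be pushed to Grafana’s HTTP API as a CI/CD step.

```python
import json

# Hypothetical defaults shared by every microservice dashboard.
DEFAULT_THRESHOLDS = {"error_rate_pct": 1.0, "p99_latency_ms": 500}


def build_dashboard(service_name, overrides=None):
    """Build a Grafana-style dashboard dict for one microservice.

    `overrides` lets a team fine-tune alert thresholds for its own
    service while keeping the standard panel layout.
    """
    thresholds = {**DEFAULT_THRESHOLDS, **(overrides or {})}
    return {
        "title": f"{service_name} - standard dashboard",
        "panels": [
            {"title": "Error rate (%)",
             "alert_threshold": thresholds["error_rate_pct"]},
            {"title": "p99 latency (ms)",
             "alert_threshold": thresholds["p99_latency_ms"]},
        ],
    }


# A payments service that tolerates higher latency overrides one threshold.
dashboard = build_dashboard("payments-service",
                            overrides={"p99_latency_ms": 800})
print(json.dumps(dashboard, indent=2))
```

Because the template lives in code, adding a panel to every microservice is a one-line change in the generator rather than dozens of manual dashboard edits.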
Stability of alerts
After migrating to standardized dashboards, we started facing many false-positive alerts. This was expected: some alerts required fine-tuning of thresholds, depending on the nature of particular domains or services.
To stabilize our alerts, we introduced a definition of an unactionable incident, which for us is one resolved in less than two minutes. If an incident takes so little time to resolve, it almost certainly didn’t require human intervention.
After integrating our on-call system with the data warehouse, we started tracking unactionable incidents, which led us to one of our key results:
“Unactionable incidents should be less than 20% of all incidents.”
After achieving this goal, we reduced the noise and prevented alert fatigue among our engineers. It also helped us decrease the cost of on-call shifts: fewer unactionable incidents outside working hours mean less overtime. In the end, we reduced the overtime booked by our engineers by 8x!
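Once incident data lands in a warehouse, tracking this key result is a simple aggregation. The sketch below assumes each incident record carries `triggered_at` and `resolved_at` timestamps (as a PagerDuty export typically would); the field names are assumptions for illustration.

```python
from datetime import timedelta

# Incidents resolved faster than this almost certainly needed no human action.
UNACTIONABLE_CUTOFF = timedelta(minutes=2)


def unactionable_ratio(incidents):
    """Fraction of incidents resolved in under two minutes.

    Each incident is a dict with 'triggered_at' and 'resolved_at'
    datetime values, e.g. rows exported from an on-call tool
    into the data warehouse.
    """
    if not incidents:
        return 0.0
    quick = sum(
        1 for i in incidents
        if i["resolved_at"] - i["triggered_at"] < UNACTIONABLE_CUTOFF
    )
    return quick / len(incidents)
```

The 20% key result then becomes a single check, e.g. `unactionable_ratio(last_month) < 0.2`, that can be charted or alerted on like any other metric.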
Fine-tune severity levels
“If everything is important, then nothing is.” Our next step was to outline the list of critical microservices so that we could stop waking up our engineers at night when something less relevant broke. To do that, we shaped a clear SLA definition, pointing out the business-critical features of our platform and the microservices responsible for them.
With these clear rules, we introduced a feature in our automation to mark certain microservice dashboards as non-critical and treat them as such in our on-call system.
After these changes, our engineers could handle less critical alerts on the next working day instead of being woken up in the middle of the night.
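The routing rule behind this can be expressed very compactly. The sketch below is an assumption about how such a rule might look: the service names are made up, and the `"high"`/`"low"` values mirror PagerDuty’s urgency concept, where high-urgency incidents page immediately and low-urgency ones wait for working hours.

```python
# Hypothetical list of SLA-critical services; everything else is non-critical.
CRITICAL_SERVICES = {"payments-service", "payout-service", "auth-service"}


def pagerduty_urgency(service, severity):
    """Map an alert to an on-call urgency level.

    Only SLA-critical services firing serious alerts may wake an
    engineer at night; everything else becomes a low-urgency incident
    handled on the next working day.
    """
    if service in CRITICAL_SERVICES and severity in {"critical", "error"}:
        return "high"  # pages the on-call engineer immediately
    return "low"       # queued as a notification for working hours
```

In practice, a flag like this would live in the dashboard-as-code configuration described earlier, so marking a service as non-critical is a one-line change reviewed like any other commit.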
Business/success metrics and alerts
A lack of errors doesn’t always mean everything is fine. We once learned this lesson the hard way. For some reason, one of our external integrations stopped accepting card payments. Yet there were no errors in our infrastructure, because the partner wasn’t returning any.
Most monitoring dashboards for our microservices consist of technical indicators: error counts, request/response times, Kafka lag, memory, CPU, database metrics, and so on. This incident showed that it is also critical to monitor business (or success) metrics to spot anomalies. These metrics are driven by entire domains rather than by specific deployment units.
So we did just that, grouping some dashboards differently. They are still maintained as code, but they don’t belong to any specific microservice repository. Instead, we created a separate repository dedicated to those domain-driven dashboards, with its own structure and CI/CD process, separate from the microservices.
These metrics are driven by the business and should be understandable to non-technical people. Grouped per domain, they include:
- number of accepted payments,
- number of paid-out transactions,
- number of registered users.
An anomaly in such metrics can reveal a problem even when the infrastructure shows no errors.
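As a deliberately simple illustration of what “spotting an anomaly” can mean, the sketch below flags a metric value that deviates too far from its recent history. This is an assumption about one possible approach, not Azimo’s actual detection logic; real business metrics usually also need seasonality handling (weekday vs. weekend, time of day).

```python
from statistics import mean, stdev


def is_anomalous(history, current, z_threshold=3.0):
    """Flag `current` if it sits more than `z_threshold` standard
    deviations from the mean of `history`.

    `history` could be, e.g., accepted-payment counts for the same
    time window over recent days.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is suspicious.
        return current != mu
    return abs(current - mu) / sigma > z_threshold
```

With history like `[100, 102, 98, 101, ...]` accepted payments per window, a sudden drop to zero fires immediately, which is exactly the card-payments outage scenario above: no errors anywhere, yet the business metric collapses.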
Business metrics dashboards are still a work in progress. Still, once we finish this significant step, we will sleep even easier at night, knowing that we can spot anomalies as early as possible.
Monitoring dashboards for internal teams
There are situations where non-engineering teams need to solve a problem or be aware of an issue. That’s why we built dedicated monitoring dashboards tailored to our Customer Support and Operations divisions. With customizable alerts connected to their notification channels, we could support these teams with:
- Faster reactions to failures of our partners and getting into contact with them to resolve the issue,
- More efficient communication with our customers about potential failures,
- Indirectly improving time to market for our engineers, as they no longer get pulled away to assist those teams in such problematic scenarios.
Azimo is structured so that different teams are responsible for different domains. Yet during on-call shifts, our engineers are responsible for overseeing the whole platform. This led us to a question: do engineers know what to do when they are woken up in the middle of the night?
The answer is: not always. To address that, we plan to enrich our alerts with data and relevant links. The goal is a step-by-step guide that even engineers unfamiliar with the affected domain can follow.
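Such enrichment can be as simple as attaching a runbook and dashboard link to the alert payload before it reaches the on-call engineer. The sketch below is hypothetical: the field names, rule names, and `example.com` URLs are placeholders, not real Azimo resources.

```python
# Hypothetical mapping from alert rule to runbook page.
RUNBOOKS = {
    "kafka_lag_high": "https://wiki.example.com/runbooks/kafka-lag",
    "error_rate_high": "https://wiki.example.com/runbooks/error-rate",
}
DEFAULT_RUNBOOK = "https://wiki.example.com/runbooks/default"


def enrich_alert(alert):
    """Return a copy of the alert with a runbook link and a direct
    link to the service's dashboard attached.

    Expects 'rule' and 'service' keys on the incoming alert dict.
    """
    enriched = dict(alert)
    enriched["runbook_url"] = RUNBOOKS.get(alert["rule"], DEFAULT_RUNBOOK)
    enriched["dashboard_url"] = (
        f"https://grafana.example.com/d/{alert['service']}"
    )
    return enriched
```

The point of the fallback runbook is that even an alert nobody anticipated still ships the engineer somewhere useful, rather than leaving them staring at a bare error message at 3 a.m.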
Automated SLA tracking
Azimo’s SLA says that our platform must be available at least 99.95% of the time throughout the year. Currently, these metrics are calculated in a semi-manual process based on the incident reports we aggregate.
As the icing on the cake, we plan to fully automate this process based on a few components we have already created along the way:
- Automation and standardization of the dashboards,
- An ETL process between our on-call tool and the data warehouse,
- The correct granularity of configuration in our on-call tool (PagerDuty).
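The calculation itself is straightforward once incident durations flow automatically from the on-call tool. A minimal sketch, assuming non-overlapping downtime windows aggregated from incident records:

```python
from datetime import timedelta


def availability_pct(downtime_windows, period):
    """Availability (%) over `period`, given a list of non-overlapping
    downtime windows as timedeltas, e.g. derived from incident
    start/end times in the data warehouse."""
    total_down = sum(downtime_windows, timedelta())
    return 100.0 * (1 - total_down / period)


# The 99.95% yearly target leaves a downtime budget of roughly
# 4 hours 23 minutes per year:
year = timedelta(days=365)
budget = year * (1 - 0.9995)
```

This is also why the granularity of on-call configuration matters: only incidents on SLA-critical services should count toward the downtime windows, which the earlier severity work makes possible.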
Where are we now, and where are we heading?
It has been quite a journey to get to the current stage, where:
- We have engineers assigned to on-call shifts, ready to fix potential problems 24/7,
- We keep 100% of our microservice dashboards automated and stored as code,
- We keep unactionable incidents below 20% of all incidents,
- We distinguished handling of SLA-critical vs. non-critical incidents,
- We decreased overtime hours during on-call shifts by 8x,
- We implemented and automated monitoring of success/business metrics,
- We empowered internal teams with dedicated monitoring dashboards,
- Most importantly, all of the above significantly contributed to the stability and availability of our platform.
We are delighted with where we have arrived, yet we know there is still work to be done. We are also aware that each further, more detailed improvement yields diminishing returns. That is why we prioritize this work carefully against other platform improvements not related to monitoring.
No matter where you are in your journey to world-class monitoring, I hope that our experience will be of some value and make the bumps smoother.
Towards financial services available to all
We’re working throughout the company to create faster, cheaper, and more available financial services all over the world, and here are some of the techniques that we’re utilizing. There’s still a long way ahead of us, and if you’d like to be part of that journey, check out our careers page.