Microservice monitoring through Kubernetes custom resource definitions
On the journey towards a microservice architecture, Personio has been quite successful in adopting tools for building and deploying our software. One aspect that remained challenging until recently was monitoring these services once they were running in production. As the pace of developing new services picked up, keeping alerts and dashboards up to date became harder, mostly because they were defined through a UI.
We realized over time that defining monitoring in this manner simply did not scale with the growth of the organization. Alerts often suffered from configuration drift and lacked collaboration and peer review, and system-wide changes were hard to make. If we wanted to solve the problem, services had to have their monitoring defined as close to the code as possible, just as they have a Dockerfile and a build pipeline.
After looking at existing solutions, we did not find a tool that fully solved our problem. However, most of these tools served as inspiration for us to develop an open-source tool that integrates two crucial components of our infrastructure: Kubernetes and New Relic. In this article we explain how monitoring became an integral part of every piece of software that now runs in production.
A quick glimpse at existing solutions
While most vendors in the monitoring space provide some means of defining dashboards and alerts as code, most of these methods tie alert management to one particular piece of technology.
One set of solutions revolves around the various terraform modules available for vendors such as Datadog or New Relic. Unfortunately, a major problem with terraform is the fact that it runs client-side and therefore makes secret management a complicated business. For instance, at Personio we use New Relic as our monitoring platform, and it requires an Admin API Key for defining alerts and dashboards. Having API keys with such a high level of access spread across tens or hundreds of repositories just for defining alerts is a major limitation of using terraform for this purpose.
A better fit for these constraints would have been Alertmanager: it removes the need for end users to manage secrets and integrates natively with Kubernetes. Unfortunately, it is tightly coupled to Prometheus. We did, however, use it as inspiration for what a good alert management tool should look like from an API perspective.
Alerts as code in Personio
With the rapid increase in the number of microservices being developed at the company, we saw the need for reusable and scalable monitoring across the board. What we set out to achieve was a baseline of alerts and a dashboard that would come out of the box with every new service added to the system. One part of such a baseline would be the key metrics we always want to keep an eye on, such as error rate, latency and resource consumption. In addition to this pre-defined baseline, we also wanted each service to be able to define new alerts or modify the baseline ones if they were not adequate for the service. Finally, we set a hard requirement: teams had to be fully responsible for defining how their service is monitored, and had to be the first ones notified when something goes wrong.
At Personio we rely heavily on Kubernetes to run almost all of our workloads. As we are also fully in AWS, we leverage EKS to a large extent to operate multiple Kubernetes clusters and manage them through terraform. Each microservice defines the necessary Kubernetes manifests through a helm chart. Finally, once in production, services are monitored through New Relic alerts and dashboards specific to that service. Since we use Kubernetes extensively and our developers are experienced at deploying their services through Kubernetes manifests, Kubernetes Custom Resources seemed like a natural integration point between the workloads we run and the alerts and dashboards that monitor them.
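To make this concrete, a service can declare its alerts as a Kubernetes custom resource that lives alongside its other manifests. The resource below is an illustrative sketch only; the API group, kind, and field names are assumptions for illustration, not necessarily the operator's exact schema:

```yaml
# Illustrative alert policy expressed as a Kubernetes custom resource.
# The apiVersion, kind, and field names here are hypothetical.
apiVersion: alerts.example.com/v1
kind: AlertPolicy
metadata:
  name: payments-service-alerts
  namespace: payments
spec:
  # Notify the owning team's channel when a condition fires
  notificationChannels:
    - payments-team-slack
  conditions:
    # Fire when the error rate stays above 5% for 5 minutes
    - name: High error rate
      query: >-
        SELECT percentage(count(*), WHERE error IS true)
        FROM Transaction WHERE appName = 'payments-service'
      threshold: 5
      durationMinutes: 5
```

A resource like this is applied with the same tooling as any other manifest; an in-cluster operator can then watch these resources and reconcile them against the New Relic API, so highly privileged API keys never need to leave the cluster.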
Out of these constraints, the New Relic alert manager, a Kubernetes operator that manages New Relic alerts and dashboards, was born. An in-cluster operator allowed all the alerts and dashboards for a service to live in the same repository as the service itself, placing ownership of service monitoring squarely with the team developing the service. A new service is initialized at Personio by cloning a template repository and filling in parameters like the service name and the team which owns it. This allows us to package premade monitoring with sensible defaults as part of the helm chart of the service template. As a result, when a new service is created, it comes with basic monitoring already set up for key metrics and wired to trigger notifications to the appropriate channels.
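In practice, this can boil down to a handful of values passed to the shared chart. The snippet below is a hypothetical sketch of such a helm values file; the keys and defaults are illustrative, not our actual chart's schema:

```yaml
# Hypothetical helm values for a service created from the template
# repository. Keys and thresholds are illustrative only.
serviceName: payments-service
team: payments
monitoring:
  # Baseline alerts come enabled by default and can be tuned per service
  baseline:
    errorRate:
      enabled: true
      threshold: 5          # percent; override if the default does not fit
    latency:
      enabled: true
      p95ThresholdMs: 500
  notificationChannel: payments-team-slack
```

The chart then renders these values into the custom resources the operator consumes, so a new service gets error-rate and latency alerts routed to its owning team without anyone writing monitoring configuration by hand.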
In the end, we managed to empower developers to fully manage the monitoring for their services, as they are the ones who best understand how a service works and where it can break. At the same time, when they bring a new service into the system, they do not need to start from scratch or deal with technicalities like naming conventions and notification channels.
We believe we are not the only company facing these challenges, and have therefore also decided to open source our alert and dashboard templates together with the project. Our monitoring baseline is heavily inspired by the Google SRE Book and we are more than happy to take feedback or contributions in any form.