Observability is a very popular term right now, and a big part of it is monitoring: a concept many teams do not yet take full advantage of.
Scenario: From Logs to Metrics
Many teams find themselves at the crossroads between having a logging stack in place and integrating a (new) monitoring stack. Responsibility for monitoring, and subsequently alerting, is no longer purely an operations topic.
In short, you do a lot of logging in your application, because it is a very familiar technique for increasing transparency. You have already set up a powerful central logging stack, probably with Elasticsearch in the background. You then realise that it is not enough to fulfil the observability needs the business has right now.
So you dip your toes into the world of monitoring. Here as well, you let yourself be guided by CNCF-backed technologies: you set up Prometheus with Grafana as your frontend. You are amazed how easy it was to integrate into your current tool stack. However, you know that a lot of effort went into gathering extensive insights via your logging stack, which was not made for the advanced visualisations and alerting you desperately need now. You have a tough decision to make:
Do I invest a considerable amount of resources to migrate insights into our system from logs to metrics?
I believe many teams find themselves in that or a similar setting. We did, at least. And we simply could not invest heavily in metrics “just for the sake of having them in a different tool stack”.
Use Case: Exception Monitoring
Our initial objective was to monitor exceptions thrown over time. In an extensive microservice architecture with several hundred very diverse service instances, synchronous and asynchronous communication, and eventual consistency, we have a special relationship with exceptions. Some exceptions are “acceptable” in the sense that they can be handled gracefully (e.g. DuplicateKeyException). Others result in endless loops between the message broker and the respective service. These do not always impact our business cases immediately, because instances are scaled, but they definitely need to be found and fixed ASAP.
We could not afford to adapt every single service to expose “exception thrown” as a metric. Additionally, we already had that information accessible in our log database (Elasticsearch). We figured:
Why not query that database and expose the result as a metric our Prometheus can scrape?
That is when we developed ELCEP, which you can find on GitHub. ELCEP stands for
Elasticsearch Log Counter Exporter for Prometheus
Let’s take the monitoring of exceptions as an example. When an exception is thrown in your “notificationservice”, it is logged and ends up in your Elasticsearch database with a message containing, for example, NullPointerException, probably in the first line. To find such a log in Kibana you would use the following Lucene query:
message:*Exception AND service:notificationservice
This is exactly how you would configure an additional counter in ELCEP via queries.cfg:
all_notificationservice_exceptions=message:*Exception AND service:notificationservice
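To make the mapping from a configuration line to a metric name concrete, here is a minimal Python sketch. It is not ELCEP's actual parsing code: the function names `parse_queries` and `metric_name` are hypothetical, and the choice to split on the first “=” only is an assumption made so that a Lucene query may itself contain “=”. The naming pattern follows the metric shown in this post.

```python
def parse_queries(text: str) -> dict[str, str]:
    """Parse 'key=lucene query' lines into a {key: query} mapping."""
    queries: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the first '=' only; the rest of the line is the query.
        key, separator, query = line.partition("=")
        if not separator or not key.strip():
            raise ValueError(f"expected 'key=query', got: {raw_line!r}")
        queries[key.strip()] = query.strip()
    return queries


def metric_name(key: str) -> str:
    """Build the metric name from a config key, following the pattern in this post."""
    return f"elcep_logs_matched_{key}_total"


config = (
    "all_notificationservice_exceptions="
    "message:*Exception AND service:notificationservice\n"
)
queries = parse_queries(config)
for key, query in queries.items():
    print(metric_name(key), "<-", query)
```

Each additional line in the file would produce one more counter, so adding a new metric does not require touching any service.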
The key to the left of the “=” is used in the metric name, which results in the following being exposed on the :8080/metrics HTTP endpoint:
# HELP elcep_logs_matched_all_notificationservice_exceptions_total Counts number of matched logs for all_notificationservice_exceptions
# TYPE elcep_logs_matched_all_notificationservice_exceptions_total counter
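The general idea of turning query results into a counter can be sketched as follows. This is an illustration under assumptions, not ELCEP's implementation: the class `LogCounter`, its `poll` method, and the `count_new_hits` callable (standing in for a real Elasticsearch query that returns the number of matching logs since the previous poll) are all hypothetical, and the hit counts are made up. The rendered output adds the sample line in the Prometheus text format, which follows the HELP and TYPE lines.

```python
from typing import Callable


class LogCounter:
    """A monotonically increasing counter fed by periodic log queries."""

    def __init__(self, key: str, query: str) -> None:
        self.key = key
        self.query = query
        self.total = 0

    def poll(self, count_new_hits: Callable[[str], int]) -> None:
        # count_new_hits is a hypothetical stand-in for the log database:
        # it returns how many logs matched the query since the last poll.
        self.total += count_new_hits(self.query)

    def render(self) -> str:
        # Prometheus text exposition format: HELP, TYPE, then the sample.
        name = f"elcep_logs_matched_{self.key}_total"
        return (
            f"# HELP {name} Counts number of matched logs for {self.key}\n"
            f"# TYPE {name} counter\n"
            f"{name} {self.total}\n"
        )


counter = LogCounter(
    "all_notificationservice_exceptions",
    "message:*Exception AND service:notificationservice",
)

# Invented hit counts for three consecutive polls.
fake_hits = iter([3, 0, 2])
for _ in range(3):
    counter.poll(lambda query: next(fake_hits))

print(counter.render(), end="")
```

Because the value only ever grows, Prometheus functions such as `rate()` or `increase()` can be applied to it to chart or alert on exceptions over time.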
We already support a plugin feature, which we plan to introduce in a separate blog post. Another upcoming feature is bucket aggregation, which will make use of multi-dimensional metrics. We are continuously working on improving the tool and look forward to feedback and contributions.