Microservices monitoring with Envoy service mesh, Prometheus & Grafana

Arvind Thangamani
HackerNoon.com
6 min read · Nov 3, 2018


If you are new to “Service Mesh” and “Envoy”, I have a post explaining both of them here.

This is the second post in the Observability with Envoy service mesh series; you can read the first post, about Distributed Tracing, here.

With microservices you cannot afford to be in the dark when it comes to monitoring; you need to at least know when something is going wrong.

Let us look into how Envoy can help us understand what is going on with our services. With a service mesh, all the traffic goes through the mesh: no service talks to another service directly. Instead, a service makes its call to Envoy, and Envoy routes the call to the destination service, so Envoy has context about both the incoming and the outgoing traffic. Envoy provides metrics about incoming requests, outgoing requests and the state of the Envoy instance itself.

Setup

Here is an overview of what we are trying to build

overall setup

Statsd

Envoy can publish metrics through several kinds of sinks (statsd among them), but for this post we will use the statsd format.

So with that said, the flow will be: Envoy pushes the metrics to statsd; from statsd we pull the metrics with Prometheus (a time series database); and then we visualise the metrics with Grafana.

In the setup diagram I have written statsd exporter instead of statsd for a reason: we are not going to run statsd as such. Instead we will run a converter (service) which accepts data in statsd format and exposes it in Prometheus format. That gets the job done for us.

Envoy’s metrics can be broadly classified into two types:

  1. Counter: an ever-increasing metric. E.g.: total number of requests
  2. Gauge: a metric that can go up or down, an instantaneous value. E.g.: current CPU utilisation (a sample of each follows)
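To make that concrete, here is roughly what each looks like on the statsd wire, using Envoy-style prefix.metric names (the exact names are illustrative): :1|c is a counter increment, :4|g a gauge value.

```
front-envoy.cluster.service_a.upstream_rq_total:1|c
front-envoy.cluster.service_a.upstream_cx_active:4|g
```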

Let us look at an Envoy configuration with a stats sink:

Envoy configuration with stats sink

Lines 8–13 tell Envoy that we want metrics in statsd format, what the prefix for our stats is (usually your service name) and where our statsd sink lives.

Lines 55–63 configure the statsd sink in our environment.

That is all the configuration needed to get stats out of Envoy. Now if you look at lines 2–7, there are two things happening:

  1. Envoy exposes an admin endpoint on port 9901, which you can use to dynamically change the log level, view the current configuration, stats, etc.
  2. Envoy can also generate access logs similar to nginx, which you can use to understand your traffic. The format of the access log is also configurable; lines 29–33 do exactly that.
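Since the full config file is only linked (not reproduced) in this post, here is a minimal sketch of just the admin and stats-sink stanzas, assuming the v2 config format of the time; the prefix, cluster name and addresses follow this post's setup and are otherwise assumptions:

```yaml
admin:
  access_log_path: "/tmp/admin_access.log"
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }  # the admin endpoint

stats_sinks:
  - name: envoy.statsd
    config:
      prefix: front-envoy               # usually your service name
      tcp_cluster_name: statsd-exporter

static_resources:
  clusters:
    - name: statsd-exporter            # where our statsd sink lives
      connect_timeout: 0.25s
      type: strict_dns
      lb_policy: round_robin
      hosts:
        - socket_address: { address: statsd_exporter, port_value: 9125 }
```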

You need to add the same stats configuration to the other sidecar Envoys of the services in our system (yes, every service has its own Envoy sidecar).

The services themselves are written in Go, and they do not do much except call other services through Envoy. You can look at the service and Envoy configurations here.

So right now we only have the statsd exporter in the picture. With this, if we run the docker containers (docker-compose build and docker-compose up) and send some traffic to the front Envoy (localhost:8080), Envoy will start sending metrics about the traffic to our statsd exporter, which converts the metrics to Prometheus format and exposes them on port 9102.
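For context, the statsd exporter's service entry in the docker-compose file would look roughly like this (prom/statsd-exporter is the official image and the ports are its defaults; the service name is an assumption):

```yaml
statsd_exporter:
  image: prom/statsd-exporter
  ports:
    - "9125:9125"   # receives statsd metrics from the Envoy sidecars
    - "9102:9102"   # exposes them in prometheus format for scraping
```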

This is how the stats look in the statsd exporter:

metrics from statsd exporter in prometheus format

There will be hundreds of stats; in the screenshot above we are looking at the latency metrics for communication between Service A and Service B. The metrics in the image are in Prometheus format.

You can read more about it here.
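For reference, a few lines in that format would look roughly like this; the metric name follows the statsd exporter's default mapping of an Envoy latency timer into a summary, and the values are made up for illustration:

```
# TYPE servicea_cluster_serviceb_upstream_rq_time summary
servicea_cluster_serviceb_upstream_rq_time{quantile="0.5"} 2
servicea_cluster_serviceb_upstream_rq_time{quantile="0.99"} 11
servicea_cluster_serviceb_upstream_rq_time_sum 3542
servicea_cluster_serviceb_upstream_rq_time_count 486
```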

Prometheus

We are going to use Prometheus as our time series database to store our metrics. Prometheus is not just a time series database, it is a monitoring system in itself, but in our setup we will use it as a datastore for our metrics. An important thing to note is that Prometheus is a pull-based system: you have to tell Prometheus where to scrape the metrics from, which in our case is the statsd exporter.

Adding Prometheus to the equation is very straightforward; we just need to pass the scrape targets (the statsd exporter) as a configuration file to Prometheus. Here is what the configuration will look like:

prometheus configuration to scrape from statsd exporter
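A minimal sketch of that file, assuming the statsd exporter is reachable as statsd_exporter:9102 inside our docker network:

```yaml
global:
  scrape_interval: 5s        # how often to pull metrics from each target
scrape_configs:
  - job_name: statsd-exporter
    static_configs:
      - targets: ['statsd_exporter:9102']
```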

scrape_interval is the frequency at which Prometheus will pull metrics from the target.

So now we should have Prometheus up, and some data in Prometheus as well. Let's fire up localhost:9090 and see what it has:

prometheus query page

As we can see, our metrics are available. You can do a lot more than just select existing metrics; you can read about the Prometheus query language here. It can also plot graphs based on our queries, and it has an alerting system.
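For instance, a query like the following would plot the per-second rate of requests from Service A to Service B (the metric name is illustrative, following the statsd exporter's naming):

```
rate(servicea_cluster_serviceb_upstream_rq_total[1m])
```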

If we load up the targets page in Prometheus, we see all the scrape targets and the health of those targets:

prometheus targets

Grafana

Grafana is an awesome visualisation and monitoring solution which supports a lot of backends like Prometheus, Graphite, InfluxDB, Elasticsearch, etc.

Grafana has two major components that we need to configure:

  1. Datasource: the backend from which Grafana will get the metrics. You can configure the datasource using a configuration file, which will look like this (a sketch follows this list)
configuring prometheus as a datasource in grafana

2. Dashboard: this is where you visualise the metrics from your datasource. Grafana supports a wide variety of visual elements like graphs, single stats, heatmaps, etc., and you can extend these and build your own using plugins.
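Here is a sketch of the datasource configuration mentioned in point 1, assuming Grafana's (5+) provisioning format and that prometheus is reachable under that hostname in our docker network:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy               # grafana's backend proxies the queries
    url: http://prometheus:9090
    isDefault: true
```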

The only problem I have with Grafana is that there is no standard way of developing these dashboards as code. There are some third-party libraries which support this; we will use the one from Weaveworks called grafanalib.

Here is the dashboard that we are trying to build, expressed as Python code:

Grafana dashboard as code using grafanalib

We are building graphs for 2xx, 5xx and latency. Lines 5–22 are important: they extract the service names available in our setup as Grafana variables, which makes our dashboard dynamic, meaning we will be able to select the source and destination services for which we want to view these statistics. More about variables here.
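To give a feel for what such a file contains, here is a condensed sketch; the metric names, the variable queries and the "prometheus" datasource name are assumptions for illustration, not the post's exact code:

```python
# a condensed grafanalib sketch: one dashboard, template variables, one row
from grafanalib.core import (
    Dashboard, Graph, Row, Target, Template, Templating,
)

# dropdowns for source and destination service (the variable queries are assumed)
templating = Templating(list=[
    Template(name="source", dataSource="prometheus",
             query="metrics(.*_cluster_.*_upstream_rq_total)"),
    Template(name="destination", dataSource="prometheus",
             query="metrics(.*_cluster_.*_upstream_rq_total)"),
])

dashboard = Dashboard(
    title="Service Dashboard",
    templating=templating,
    rows=[
        Row(panels=[
            Graph(
                title="2xx",
                dataSource="prometheus",
                targets=[Target(
                    # hypothetical metric name in the statsd exporter's style
                    expr="rate([[source]]_cluster_[[destination]]_upstream_rq_2xx[1m])",
                    legendFormat="2xx rate",
                )],
            ),
            Graph(
                title="latency (p99)",
                dataSource="prometheus",
                targets=[Target(
                    expr='[[source]]_cluster_[[destination]]_upstream_rq_time{quantile="0.99"}',
                    legendFormat="p99",
                )],
            ),
        ]),
    ],
).auto_panel_ids()
```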

You then use grafanalib's generate-dashboard command to generate the dashboard JSON from the above Python file, something like generate-dashboard -o dashboard.json dashboard.py.

Beware: the generated dashboard.json is not easy to read.

So we just need to pass the dashboard and the datasource while starting up Grafana. When you visit http://localhost:3000, you will be greeted with:

grafana dashboard

There you go: you have your 2xx, 5xx and latency charts, and you can also see the dropdowns where you select the source and destination services. There is more to Grafana than what we have discussed: there is a powerful query editor and an alerting system, and, more importantly, everything is extensible using plugins and applications (check out an example here). If you are visualising metrics of common services like Redis, RabbitMQ, etc., Grafana has a repository of public dashboards from which you can just import them and use them. One more good thing about Grafana is that you can create and manage everything with configuration files and code, without dabbling much with the UI.

I would urge you to play with Prometheus and Grafana to figure out more. Thanks for your time. Please leave your feedback as comments.

You can find all the code, configuration files here.
