Scrape to Graph: Tracing a variable across the OSS Monitoring stack with Prometheus and Grafana in Kubernetes
Open Source Metrics Stack
Metrics are a critical aspect of any system to understand its health and operational state. Design of any system requires collection, storage and reporting of metrics to provide a pulse of the system.
Data is stored over a series of time intervals and needs an efficient database to store and retrieve this data.
OpenTSDB Time Series Database is one such time series database that can serve that need.
While data is stored in a time series database, an standard system to scrape such metrics and store it in the database has emerged in form of Prometheus . When a data source exports metrics in prometheus exposition format, it can be scraped by prometheus. While time series database stores metrics, Prometheus collects the metrics and pushes them in the database.
Any database also needs an efficient and programmer friendy way to query information, eg: SQL for popular transactional databases like Postgres and MySQL. Prometheus defines a rich query language in form of PromQL to query data from this time series database.
Any form of reporting solution isn’t complete without a graphical component to plot data in graphs, bar charts, pie charts, time series and other mechanisms to visualize data. Grafana serves this need where it can take a data source (like Prometheus) and provides the programmability and flexibility to display data in a form that is useful to the user.
We broadly cover how the different systems work in tandem, what makes them stick together, some detail of each sub-system and tracing a variable from end-to-end to gain a clear understanding.
Enroute is built using Envoy proxy and both sub-systems provide a prometheus endpoint to scrape/export the operational metrics in the system.
Discovery, Storage and Querying of Metrics Data
Finding Instances to Scrape using Service Discovery
Prometheus needs a port and path to scrape data from. How does it find that? This problem is addressed by Prometheus service discovery.
You can set the instance port and path to scrape in a prometheus config file. With dynamic environments like kubernetes where containers, endpoints, pods, services, ip-addresses are transient, a static configuration won’t work. A more dynamic approach is using service discovery and providing instructions about how to work with discovered services. This is where prometheus service discovery can be configured to discover instances (or job which is a collection of instances).
Prometheus service discovery supports several environments like Kubernetes, Azure, Digital Ocean and several other options. A complete set of service discovery options can be found on prometheus configuration reference . The
<*_sd_config> are all different environments in which prometheus has built-in support and can be configured for service discovery.
Service discovery returns a list of instances to scrape metrics from. The discovery process may also discover additional metadata about the discovered instances. This additional metatdata may be used by the user to ignore, filter, customize or to add attributes to the collected data. Prometheus uses a relablelling mechanism to achieve that.
In the example above from prometheus github , the configuration file for prometheus specifies kubernetes service discovery using
kubernetes_sd_config directive. Note how relabelling can be used to ignore certain instances and replace variable names.
Prometheus also shows the list of targets that it has discovered on the
Reporting of variables
The first requirement for the system to work is how variable state or metric is reported. When a system that has metric to report, it is broadly classified as the type of data which is either a counter that monotonically increases or a gauge that can increase or decrease over time.
Such a distinction at the source of where the metric is generated is required to facilitate the metric stack to correctly query, store and retrieve data.
Prometheus has several libraries in different languages that help facilitate defining the variable type, operating on it and exporting it.
Scraping and storage of variables
OpenTSDB supports a multi-dimensonal data model where a data point can be identified using a name and key/value pairs. As an example, let us consider the data model with an example.
When a request traverses through Envoy, it is proxied and connects to an upstream to serve the request. One such variable for upstream stats is
The figure below shows prometheus exposition format that is used as an export format for the metric flows over the wire:
GET the information about this variable in Envoy from
/stats/prometheus url of envoy stats endpoint:
TYPE that tells prometheus about the type of variable.
counter is one type of variable supported by prometheus.
envoy_cluster_upstream_rq is stored in OpenTSDB , which in addition to the value of the variable stores a timestamp and key/value pairs. These key/value pairs facilitate querying of data. The following are all valid queries -
- get a total count of upstream requests where response code is 200.
- get value of all upstream requests for cluster hello-enroute_9091 where reponse code is 429 (rate-limited)
Querying of variables
Next we look at how data can be queried once it is stored in TSDB. Querying is how a tool like Grafana pulls data from Prometheus/TSDB
Note that data variables hold not only time-series value but also an associated key/value. These attributes can be used for querying (eg:
envoy_response_code key). This querying is what makes it powerful. There are several ways to look at the same data. While envoy instance from which data was scraped may report a couple of dimensions (or additional key/value tags), additional tags may be added by Prometheus. For instance Prometheus adds kubernetes metadata like service (eg:
service key), namespace (eg:
kubernetes_namespace key), pod-name, endpoint info etc. when it scrapes and stores information from an instance. All this additional tags can be used to query the data.
The query language to support rich set of use-cases is critical, and PromQL provides that flexibility. We quickly take a look at how to use PromQL to form queries. More details about PromQL can be found on the prometheus PromQL
Going back to the original variable
envoy_cluster_upstream_rq, it counts the number of upstream requests for clusters. Here are several ways to get this counter:
Query for service enroute
Query for service enroute AND response code 200
Query for service enroute AND response code 200 and namespace enroutedemo
Note that the above value is time series data. The above is a fairly simple example. While querying a time series the offset, duration, subquery, aggregation, function and other qualifiers can be used to extract more fine-grained data. For example, you can dump the [30m] rate of this time-series:
More detailed examples can be found on prometheus documentation reference
Types of Metrics
Prometheus system offers a few core metric types that are supported by the system.
- Counter: monotonically increasing value, eg: request count, connection count etc.
- Gauge: can increase/decrease in value, eg: upstream count, certificate count, cluster count
- Histogram: Bucketed counters
- Summary: Is just like histogram but also calculates quantiles.
A detailed explaination of these variable types can be found on prometheus section on variable types
Statistics in Envoy and Enroute
Enroute is a lightweight shim that provides a cache for Envoy config. It exports metrics that are relevant to the internals of the cache it maintains to serve Envoy. Additionally, it has has metrics about other subsystems that work in tandem to provide API security.
A lot of interesting state and metrics data is provided by Envoy. A proxy like Envoy has several sub-systems that exports a lot data. We won’t look at all the metrics but high level components and the data they export. The Envoy documentation reference provides a complete list of stats exported.
Grafana Dashboards and Panels
Grafana is an open source tool to build an observability dashboard. The primary abstractions used to build a UI in grafana is a Dashboard that is made up of several Panels. A panel embeds a query that pulls the data to be displayed and various styling and formatting options. An intermediate step between querying and visualizing the results of a query is transformation. As the name suggests, transformation provides a mechanism to work on the data before it is passed to visualiztion.
Grafana supports several data sources one of which is Prometheus.
In Grafana, you build graphs using Visualizations. Visualizations roughly map to displaying data types defined earlier in Prometheus eg: counter, gauge, histogram and summary. There also is flexibility in how they are displayed as graphs, bar charts as tables or heatmaps.
A more detailed explaination of Dashboard , Panels , Querying, Transformation and Visualizations can be accessed from the Grafana website.
Here is an example Grafana Dashboard that is made up of panels -
The dashboard is made up of 12 panels across 3 rows and 4 columns. Each panel has a query to pull data from a data source. The panel also has a JSON representation in grafana. Here is one such representation for Downstream 2xx Responses
The representation above has a lot of fields removed just to highlight the interesting ones to build an understanding of how a variable reported in enroute/envoy is displayed. The interesting parts are
datasource and the
targets.expr that captures the query. Note the PromQL syntax and how the cumulative value of 1m rate is plotted over time.
fetches the value of
envoy_http_downstream_rq_xx for a reponse code
Here is another example that plots a histogram of latency to upstream server after importing swagger spec using
Other tools like Prometheus and Grafana
There are use-cases where: either more variety of data types and metrics push is needed (InfluxDB), a unified solution for storage and graphing (Graphite), efficient indexed text based queries (ELK/Lucene), network service monitoring like Nagios. There are a wide variety of open source and paid tools depending on the needs and use-cases.
Open Source Metrics with/without Kubernetes
Enroute Universal API Gateway with its API both for Standalone and Kubernetes works with Prometheus and Grafana based open source telemetry. Both Enroute and Envoy can export operational data in Prometheus Exposition Format. Telemetry is critical for operations and using an open source format like Prometheus and Grafana ensures that programmable insights can be derived from data and the API gateway adheres to the overall logging architecture and choices.
When working with a complex system, any downtime is unacceptable. Understanding the root cause requires metrics to understand and troubleshoot. Logging and monitoring are critical to ensure the DevOps team has the necessary tool to keep the system running.