Monitoring Geckoboard using Prometheus
It’s been a while since we’ve written an update on what the Infrastructure team has been up to behind the scenes. Last time, we told you about how we rolled Envoy out to improve the way our gRPC apps talk to one another.
Huge thanks to Matt Button and Leo Cassarani for helping me draft this piece and for providing amazing feedback on so much of it.
Geckoboard is a service that helps teams pull their important metrics from the tools they use, and share them prominently on TV dashboards.
Behind the scenes there are many services working together to ensure our customers’ dashboards are showing the most up-to-date information from all of the integrations we support.
Understanding how these distributed services interact in production is challenging. In the beginning we leaned heavily on dashboards of time series metrics and logs to instrument individual systems, but as we added more services it became harder to understand how the behaviour of one system affected downstream and upstream services. Ben Sigelman wrote a great post about this topic, which reflects our experience working on complex systems.
Over the past couple of years we’ve adopted distributed tracing (via Honeycomb) as a tool for better understanding the work our systems are doing. We’ve found that Honeycomb’s approach to instrumentation and querying allows us to quickly answer questions about our system and customers that would’ve been impossible to answer using time series metrics. It’s now the first tool we reach for when trying to understand production.
That said, there are some tools in our stack that don’t have event-based/distributed tracing instrumentation. Critical parts of our infrastructure such as Consul, Vault, and Envoy only expose their internal state as time series metrics, so we still need to track this data.
We’ve previously written about how we adopted Envoy, and the problems it solves for us.
One thing we didn’t really cover in that article was the financial cost of adopting Envoy. Every time we ported a service to Envoy, our metrics monitoring bill shot up by several hundred dollars. By the time we finished the migration, our bill had increased 3x compared to the previous year, and we were paying more for these metrics than we were for our primary debugging tool, Honeycomb.
Like many teams, we have a limited budget for monitoring/observability, so we need to make sure the majority of this budget is going towards tools that will help us do better work. We still need to collect the telemetry these tools emit so that we’re not flying blind, but we need to do it in a cheaper way so that we can invest in tools that help us the most.
We’ve recently finished building out a new monitoring stack to help us save money on our Librato bill while still being able to store and analyze the metrics we used to send there. This post is a deep dive into our motivations, the components of the new stack, and how we put them all together.
We’ll also be covering some of the issues we ran into while setting all of this up and how we dealt with them.
Context
Insights into how your services and infrastructure are performing are important. They allow you to identify issues and bottlenecks, and to make informed decisions about how to scale your infrastructure and tune your applications:
- How long do requests take to complete?
- Is an application constantly using 100% CPU?
- Are your instances running out of memory?
When your focus as a business is on developing your product and growing your user base, it’s important to consider how much time (and therefore money) you invest in being able to see and analyze those metrics. If you’re not careful, it’s possible to invest more time in monitoring your systems than building the systems themselves. This problem never really goes away — it’s a constant balancing act.
This is a significant reason that many businesses opt to offload those challenges to one of the many SaaS platforms that allow them to collect their data and extract meaning from it without having to invest in the infrastructure to do it themselves.
Geckoboard’s journey was no different — in order to focus on building out our product, we used a platform called Librato to collect the metrics that were important to us. Librato also enabled us to build a comprehensive suite of dashboards that our engineering teams could look at to understand the impact of shipping their changes.
For us, the fact that we could offload the work required to store, query and alert on the metrics we were sending was a major factor in choosing Librato to monitor our apps and infrastructure.
As time went on, though, we found that the value we were getting from using Librato was beginning to diminish. The cost to store and query our metrics was climbing and our definition of what was important to us was beginning to change.
We were finding that just seeing changes in the overall trends of the high-level metrics we were sending to Librato wasn’t enough. If the rate of HTTP 5xx errors was climbing, was that a coincidence? Was it related to a recent change? If so, what exactly would cause this?
Answering these questions took time, which introduced delays between identifying a problem and shipping a fix.
To help us with this, we gradually introduced distributed tracing to our apps, which meant we could tie events and errors directly to lines of source code and spot precisely where something might be going wrong.
We still needed a tool to store and query all this data, which is where Honeycomb came in. Honeycomb is designed specifically for capturing, sampling & storing your traces. It also provides a powerful interface for querying your data, without needing to learn yet another query language.
We found that the more detail we added to our Honeycomb traces, the more we came to favor it over Librato as our source of truth for how our apps were performing.
In addition, we constantly monitor our application using a combination of GhostInspector and Runscope. These tools allow us to monitor our application and key user flows from the customer’s perspective. They help us quickly identify any customer-facing impacts of changes.
Having rolled out Envoy across our stack, we next looked at how we could monitor its performance and catch any issues before they grew into more serious problems.
But, as we discovered, Envoy generates a lot of metrics. Thousands and thousands of data points, it turned out. Many of Envoy’s metrics are repeated on a per-cluster basis, which resulted in a huge number of events being sent to Librato every minute. This number continued to grow as we reconfigured more of our services to send their requests via Envoy.
While it was possible to turn off specific sets of metrics, doing so was time-consuming, hacky, and not really sustainable.
Even at a 60-second interval, the volume of metrics we were generating was driving up our Librato bill. Another issue was that, at 60-second resolution, it wasn’t always easy to spot changes in trends quickly after a new deployment; we might have to wait up to 5 minutes to see whether a change to an app or to Envoy’s configuration was breaking everything.
The challenge
As we saw it, there were a few key goals in the project to move our Envoy metrics off Librato.
- We needed to find a cheaper alternative to Librato for storing our Envoy metrics.
- Because Honeycomb had become our main source for instrumentation, the savings from our Librato bill could be used to pay for additional capacity in Honeycomb.
- The way that metrics were collected needed to both fit in with our existing setup, and make it easy to monitor more things in the future without requiring a lot of changes.
Gathering metrics
Before we could start collecting metrics with Prometheus, we had to work out exactly which ones we needed to switch over from Librato.
Given that our main aim was reducing our Librato bill, we had a look at our highest-cost metrics:
- Envoy
- Consul
- NSQ
Together, those three sources were costing us approximately $2,000 a month.
App metrics
Since our apps also generate metrics of their own, we were keen to ingest those into Prometheus as well (and reduce our Librato bill further). Unfortunately, because of the way our apps tag their metrics before sending them to Statsite (a C implementation of StatsD), we couldn’t ingest them in the way we wanted.
The issue was that our metrics were tagged using a custom wire format, and we had no straightforward way to parse that format and translate it into the label-based format Prometheus uses.
After some work to see whether anything could be done about this, we decided to drop this piece of work from the scope. We were confident in doing this because:
- The Librato costs to collect app metrics weren’t particularly high
- We already send traces from our apps to Honeycomb which gives us excellent insight into our apps’ performance.
Getting data in
The first step in being able to switch away from Librato was working out how to get the metrics we care about out of our systems and into Prometheus. Because Envoy was our single biggest producer of metrics (by volume), we wanted an approach that didn’t require many changes to Envoy itself.
Thankfully, Envoy already presents its metrics in the Prometheus data format, but there was a catch: Envoy’s metrics are exposed over its admin API. Prometheus scrapes metrics from hosts and services rather than having those services push metrics to it, as is the way with StatsD. That would have meant opening up the entirety of the admin API on all of our Envoy instances to Prometheus, which we weren’t comfortable doing.
We needed to be able to expose Envoy’s metrics endpoint to Prometheus without opening the whole admin API up at the same time.
Telegraf
Telegraf is a tool written by InfluxData as part of their TICK stack for ingesting, processing and forwarding metrics to a time series database. It’s designed to be used with InfluxDB natively, but it can output metrics to around 20 other services, including, ironically, cloud-based monitoring platforms like Librato.
What makes Telegraf so appealing is that it’s packaged as a single Go binary, with all of its default plugins baked in. It’s configured via a series of TOML config files and supports conf.d style configuration. This allowed us to split all of our inputs and outputs into logical self-contained config files, making it easy to make incremental changes without needing to wade through lines and lines of configuration code.
Because each input plugin exists in isolation, it can be instantiated several times. This means that if you want to scrape multiple Prometheus-compatible metrics APIs on your host machine, you simply declare several prometheus inputs and the rest is taken care of, as in the sketch below.
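To give a flavour of what this looks like, here’s a minimal sketch of a Telegraf config that scrapes Envoy and re-exposes everything for Prometheus. The file name and the Envoy admin address and port are placeholders rather than our exact values; /stats/prometheus is Envoy’s Prometheus-formatted stats endpoint, and 9273 is the default port of the prometheus_client output that Prometheus later scrapes.

# telegraf.d/envoy.conf (placeholder file name)
# Scrape Envoy's Prometheus-formatted stats from its admin API.
# The admin address and port below are illustrative.
[[inputs.prometheus]]
  urls = ["http://127.0.0.1:9901/stats/prometheus"]

# Re-expose everything Telegraf has gathered on a single endpoint
# for Prometheus to scrape (9273 is the plugin's default port).
[[outputs.prometheus_client]]
  listen = ":9273"

Adding another Prometheus-compatible endpoint is then just another [[inputs.prometheus]] block in its own file.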
Telegraf’s plugin-driven architecture also meant that we could gather some extra data about our EC2 estate that was unavailable to us through CloudWatch (see the sketch after this list):
- OS load
- CPU usage (general and per-process)
- Disk space
- Memory consumption
- Network interface stats
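The host-level metrics above come from Telegraf’s bundled system plugins, so enabling them is just a matter of declaring the inputs. A minimal sketch (the procstat pattern is only an illustrative example):

# telegraf.d/system.conf (placeholder file name)
[[inputs.system]]     # OS load averages and uptime

[[inputs.cpu]]        # overall and per-CPU usage
  percpu = true
  totalcpu = true

[[inputs.procstat]]   # per-process CPU/memory, matched with a pgrep-style pattern
  pattern = "envoy"

[[inputs.disk]]       # disk space per mount point

[[inputs.mem]]        # memory consumption

[[inputs.net]]        # network interface stats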
While monitoring the performance of Envoy and other services with Prometheus was the main goal of this project, having the EC2 instance metrics in the same place added a lot of value. It gave us insight into resource utilization at a level of granularity that CloudWatch and Librato simply couldn’t provide.
Storing metrics
Prometheus
We settled on Prometheus as our preferred alternative to Librato pretty early on for a number of reasons.
The first reason was its relative maturity and wide industry adoption — many companies monitor their services using Prometheus. Not only does this inspire confidence, but it also means that we aren’t operating in a vacuum. Specific implementation challenges and questions have likely been solved by other folks trying to do the same thing.
Because Prometheus is something of an industry standard when it comes to collecting and storing metrics, its data format is already supported by many existing tools. Envoy, Consul, and Vault, all of which we run inside our infrastructure, expose their metrics in a Prometheus-compatible format.
Prometheus is also flexible in what it can monitor. It can be used to monitor a static set of hosts, a bunch of Kubernetes pods and anything in between. As we’re considering how we can fit schedulers and containers into our infrastructure, not having to re-tool our monitoring stack to make it scheduler-friendly is very useful.
Another consideration in rolling Prometheus out was how to query it. It comes with its own query language (PromQL), but compared to other options like InfluxDB and Wavefront, it’s reasonably straightforward to learn. InfluxDB and Wavefront’s query languages are powerful and flexible but that power presents a barrier to entry for folks who just want to get information out.
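As a taste of PromQL, here’s the kind of query a dashboard panel might use; the metric and label names are illustrative of Envoy’s Prometheus output rather than an exact query from our dashboards:

# Per-cluster rate of upstream 5xx responses over the last five minutes
sum by (envoy_cluster_name) (
  rate(envoy_cluster_upstream_rq_xx{envoy_response_code_class="5"}[5m])
)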
Target discovery
Because Prometheus is a pull-based system, it needs to know about its “scrape targets” before it can begin to ingest any metrics. There are a handful of ways to do this:
Static config
The most basic way to configure Prometheus to scrape a set of targets is to explicitly define them.
scrape_configs:
  - job_name: prometheus
    scrape_interval: 10s
    static_configs:
      - targets:
          - 10.0.0.1:9090
          - 10.0.0.3:9090
This would not work for us, given that we have a set of EC2 instances, any of which may be destroyed and recreated (rotated) at any time, whether because an instance has become unresponsive or because we need to roll out a new AMI after a kernel upgrade.
Each of our environments has its own Consul cluster for service discovery, so we could have used Consul to discover targets. However, all of our EC2 machines already have instance profiles (docs link) that give them access to the APIs for listing machines in their environment. Using the EC2 APIs means we can detect machines that aren’t running Consul, or that are having issues connecting to the cluster.
Cloud service discovery
Prometheus also supports target discovery through cloud provider APIs, including AWS, Azure, GCP, and OpenStack. This meant we could configure Prometheus to poll the EC2 API, list all the running instances, and filter them to build the list of targets to scrape.
scrape_configs:
  - job_name: infrastructure
    scrape_interval: 10s
    ec2_sd_configs:
      - role_arn: arn:aws:iam::1234567890:role/prometheus-role
        port: 9273
        filters:
          - name: tag:Environment
            values:
              - staging
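One optional refinement, sketched here rather than required by the job above: the EC2 discovery mechanism exposes instance metadata (tags, availability zone and so on) as __meta_ec2_* labels, and a relabel_configs block under the same job can copy the useful ones onto the scraped series.

    relabel_configs:
      # __meta_ec2_* labels only exist at discovery time, so copy the
      # interesting ones onto the series before they are stored.
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name
      - source_labels: [__meta_ec2_availability_zone]
        target_label: availability_zone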
There are a few ways to grant Prometheus access to the EC2 API, but the easiest way that we found was to use an AWS IAM Role for the instance and grant it access to do what was necessary. This avoids having to store AWS keys on the instance and then having to worry about rotating them and keeping them secure.
At a minimum, the IAM role must have the ec2:DescribeInstances permission so that Prometheus can list running instances and get their tags.
Scaling and performance
One aspect of running Prometheus that’s worth discussing is how it scales, as there are some gotchas there. Because our main source of instrumentation is Honeycomb, we can accept some loss in durability and availability for Prometheus, so high availability wasn’t as much of a concern.
The usual way to run Prometheus in a highly available configuration is to run two or more identical instances scraping the same targets. In practice, though, this means a lot of duplicated data, as the second Prometheus instance is configured to scrape exactly the same data as the first.
Prometheus also supports federation, where one Prometheus instance scrapes selected series from other Prometheus instances, giving you an aggregated view of the data those instances hold.
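We didn’t end up needing federation, but for reference it’s just another scrape job pointed at the /federate endpoint of the other Prometheus servers. A minimal sketch, with a hypothetical target address and match expression:

scrape_configs:
  - job_name: federate
    scrape_interval: 30s
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job="infrastructure"}'    # only pull series belonging to this job
    static_configs:
      - targets:
          - prometheus-staging.internal:9090    # hypothetical address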
Storage
Because Prometheus stores all of its metrics on disk as compacted TSDB blocks, the instance would need a considerable amount of storage that also persisted across instance reboots and replacements.
Because we rotate our EC2 instances frequently, this was a potential blocker. We wanted to avoid creating pet instances at all costs, because over time they would require additional considerations whenever we made changes to our configuration.
Ultimately, we solved this by attaching a large persistent EBS volume to the instance and using scripts passed in via User Data to mount the volume before the instance was provisioned with Chef. This meant we could terminate a Prometheus EC2 instance, the persistent volume would be detached, and the replacement instance could mount the same volume and pick up where the old Prometheus box left off.
Remote write storage was also considered, but out of all of the options for remote write targets, PostgreSQL was the only one that we could make use of. The cost of it, however, would have been prohibitive — the point of this project was to spend less money, after all.
Thanos
While researching how other folks solved the problem of not wanting to run federated Prometheus clusters and wrangling remote storage, we came across a blog post from Monzo, talking about their experience.
One of the most interesting things that they talked about in their post was how they solved that problem of observing multiple Prometheus instances from a single location.
Thanos is a tool written and open-sourced by Improbable, designed to add scalability to Prometheus without the need to run a lot of complex federation.
Their post talking about the release of Thanos goes into a lot more detail and we’d recommend reading it.
Thanos ships as a single binary with several discrete functions:
- Sidecar — responsible for uploading compacted TSDB blocks to cloud storage (Amazon S3)
- Store — for retrieving blocks from S3 and allowing them to be queried via the Store API
- Compactor — downsamples data that is older than a fixed timespan, which results in smaller query results, improving performance.
- Query — for handling PromQL queries and returning results. This is where the magic happens.
Thanos Query is what makes Thanos so flexible. Query exposes a Prometheus-compatible API for clients like Grafana to connect to and issue queries as though they were talking to a vanilla Prometheus service.
On the backend, Query needs to be configured to forward those queries to one or more destinations. The real genius of Query is that those destinations can be any mix of:
- Prometheus services (Prometheus API)
- Thanos Store instances (Store API)
- Thanos Sidecar instances (Store API)
- Other Thanos Query instances (Store API)
Because it’s possible to point one Query service at another Query service, there is no need to build complex logic for aggregating different datasets — it’s all done for you in Query.
The other challenge Thanos helped us overcome was how we’d store all the time series data Prometheus gathered. Rather than worrying about whether the persistent disk we’d provisioned for Prometheus was big enough, we’re able to offload all the compacted TSDB blocks to S3 and query them from there.
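Pointing Thanos at S3 comes down to a small object storage config file that the Sidecar, Store and Compactor components all share (passed via the --objstore.config-file flag). A minimal sketch, with a placeholder bucket name and region:

type: S3
config:
  bucket: thanos-tsdb-blocks              # placeholder bucket name
  endpoint: s3.eu-west-1.amazonaws.com    # placeholder region endpoint
  region: eu-west-1
  # Credentials can come from the instance's IAM role, so no keys need
  # to be stored in this file.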
Thanos performance
Despite all of this, we were still seeing performance issues when querying Envoy metrics over long time periods (greater than about two days). The sheer volume of measurements that Envoy generates means that such queries return an enormous amount of data.
In practical terms, every time a large query like that was run, we’d see a multi-gigabyte spike in RAM usage from Thanos Store and, to a lesser extent, Query, because they had to fetch blocks from S3 and load them into memory in order to serve the Grafana query.
Because Thanos Store was colocated with Prometheus, this happened every time the query was run: the instance ran out of memory, the Prometheus process was killed and restarted, and data collection stopped for anywhere between two and five minutes. Dashboards that auto-refreshed were especially problematic in this regard.
Adding more resources to Prometheus and splitting Thanos Query out onto its own EC2 instance helped, but it was more of a sticking plaster than a fix.
Trickster
After doing some research, we found Trickster. Trickster is a tool designed to reduce the load on Prometheus by caching query responses. It supports a number of caching backends, including Redis, Memcached and in-memory caching.
Trickster sits between Grafana and Thanos Query and caches query results, so subsequent queries for the same data are much faster and our Grafana dashboards stay responsive. We configured it to store the cached data on disk, as the performance was more than adequate for our needs.
Another benefit of the way that Trickster caches data is that when a dashboard with an automatic refresh timer issues its next query, only the new data needs to be fetched from Thanos, while the bulk of the response is served from the cache.
Issues and gotchas
Trickster was an excellent addition to our monitoring stack, but there was an edge case which we hadn’t anticipated. In our setup, it’s co-located with Thanos Query, but Prometheus and the Thanos Store components run on a separate box so that we could separate the query and storage functionality.
As it turns out, when the Prometheus instance is being rotated and Thanos Store disappears temporarily, Thanos Query returns empty results (as you would expect). Trickster then caches this empty result, and when Thanos Store comes back up, the empty result persists in the cache until we flush it.
Visualizing metrics with Grafana
Having worked out how to ingest, store and query our metrics, we needed a way to visualize them. Grafana is an open-source dashboard platform designed to work with several time series databases, including Prometheus, Graphite, and InfluxDB. It also comes with an excellent set of plugins out of the box that extend it to query your cloud provider’s metrics natively.
Another great feature of Grafana is that dashboard configs are stored as plain JSON, which makes importing third-party dashboards quick and easy. Being able to use pre-existing dashboards as a jumping-off point for building our own meant that we could see how other organisations had designed their queries to return the data that was valuable to them.
One particular aspect that we had to consider when rolling out a frontend for our new monitoring stack was how we went about securing access to it. Usernames and passwords were an obvious choice, but it would also mean that our engineers would need to maintain yet another set of credentials.
Fortunately, Grafana natively supports logins using OAuth2 via an SSO provider such as Google, which let us get around this. It vastly simplifies user onboarding, as users can be logged in automatically with their Google account, provided their login session is valid.
We’ll be writing up the process of setting up OAuth2 for Grafana and other services via Amazon Cognito in a follow-up post because that alone deserves its own detailed run-down.
Final thoughts
While setting up something as exciting as a new monitoring stack, it can be easy to get carried away and let the scope of the project creep endlessly. To some extent, we’ve been guilty of this because we’d never done anything like this in-house before, so at times the project felt more like an exploratory “spike”. There were times where, even though we had a functioning system, it felt like the system needed to be improved before it could be considered production-ready.
Adding query caching could be considered one of those unplanned improvements. It only became apparent that querying our Envoy metrics generated a huge spike in resource use once we began to ingest metrics properly and query them across longer and longer time periods.
One thing in particular that we have not made use of in Prometheus or Grafana is alerting. There are two places where alerts can be generated: Grafana itself, or Prometheus alerting rules paired with its Alertmanager component.
The metrics which we switched over from Librato were those we did not need to alert on, meaning we could trim the scope of the project to just collecting and displaying the metrics on dashboards, without needing to worry about also alerting on them.
It’s possible that at some point in the future, this will become a pain point and we will need to review whether we alert on any of the metrics that Prometheus collects and stores.
Thanks for reading!