Horizontally Scaling Prometheus at Wish

Tom Wiedenbein
Wish Engineering And Data Science
Dec 2, 2019

Introduction and Motivation

We’ve been running Prometheus at Wish for two and a half years now. One of the things we’ve learned during this time is that scaling Prometheus to fit the needs of your environment isn’t always straightforward. Vertically scaling Prometheus can only get you so far: trying to ingest too many metrics on one box caused availability issues, and therefore lost metrics, no matter how many resources we gave the system. Once we hit the limits of instance size, we had no headroom left, and we had to quickly figure out a solution.

Traditional methods of scaling Prometheus were not working well for us. We needed a consistent view of all of our metrics across all availability zones and all regions, and found that federation couldn’t give us that due to the sheer volume of metrics. Since federation requires that the federating server store and compute data from all of its targets, we’d run into the same vertical scaling problem described above. As Wish is largely a monolith at this point in time, we couldn’t split metrics based on the job or the owning team; there was no way to differentiate who was responsible for what. We did attempt a ‘tier-based’ splitting of jobs, where each ‘tier’ of the application would have its own metrics collection and storage, meaning that a couple of scrape jobs unique to each application tier would be combined in one Prometheus. We found that this resulted in an imbalance of metrics between tiers, because some tiers had many more jobs than others. Additionally, the delineation of where a job should live was sometimes fuzzy, so we had to compromise on where a job’s metrics were stored, and keeping track of all of that was difficult. We also investigated remote storage, but found that it wouldn’t work for us at the time due to a number of technical and non-technical factors.

We were already using Promxy to give us a consistent view without requiring a ton of extra infrastructure. Promxy is a project developed by Thomas Jackson that scatter-gathers queries across multiple Prometheus nodes and stitches the results together into one view. But Promxy wasn’t a silver bullet, and it couldn’t solve the scaling issues that we were seeing on Prometheus itself. We had to come up with a way to reduce the load and distribute those metrics better.
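To make the scatter-gather layer concrete, here is a minimal sketch of what a Promxy server-group configuration can look like. The hostnames are hypothetical; a real setup would list whichever Prometheus nodes exist in each zone.

promxy:
  server_groups:
    # Each server group is a set of Prometheus hosts holding the same data;
    # different shards go in different groups, and Promxy fans queries out
    # to all groups and merges the results into one view.
    - static_configs:
        - targets:
            - prometheus-shard-0.example.internal:9090
      # Tolerance when merging datapoints from hosts within a group.
      anti_affinity: 10s
    - static_configs:
        - targets:
            - prometheus-shard-1.example.internal:9090
      anti_affinity: 10s

Dashboards and alerting rules can then be pointed at Promxy rather than at any individual node, which is what gives us the consistent top-level view.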

Design and Implementation

Based on a previous blog post from one of the creators of Prometheus, we decided to attempt a sharded Prometheus architecture. We realized that by using the hashmod concept in conjunction with Promxy, we could build a sharded system that would let us split jobs across many Prometheus nodes while still ensuring a consistent top-level view of metrics for graphing and alerting purposes. Some small changes to Promxy would be required, but in the grand scheme of things, the additional work on that end was minimal. We built out an entirely separate Prometheus infrastructure that would scrape all jobs from all application tiers, with each job using hashmod so that any one Prometheus node would only have a small subset of targets to scrape for a particular job.

For example:

- job_name: "node_exporter"
metrics_path: "/metrics"
consul_sd_configs:
- server: "127.0.0.1:8500"
services: ["prometheus-node-exporter"]
relabel_configs:
- source_labels: [__address__]
modulus: 64
target_label: __tmp_hash
action: hashmod
- source_labels: [__tmp_hash]
regex: ^(24|25|26|27|28|29|30|31)$
action: keep
- source_labels: ["__meta_consul_service"]
regex: ".*prometheus-node-exporter"
action: "keep"

In this instance, we’ve set the modulus to 64, splitting the address space into 64 distinct shards, and this node (like all of its sister nodes in the same availability zone) is currently responsible for 1/8th of the total shard map via the __tmp_hash label. The regex lets us select multiple shards for scraping, in this case shards 24 through 31. We used Consul KV and consul-template in tandem to inject the modulus and regex values for each particular shard node. This allowed us to have a shard mapping that could easily grow alongside the number of timeseries in our infrastructure.
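As a rough sketch of how that injection can work with consul-template, the relabel rules above can be rendered from Consul KV. The key paths below (prometheus/shards/...) are illustrative rather than our actual layout.

# Illustrative consul-template fragment for the hashmod relabel rules;
# each shard node renders its own prometheus.yml from keys like these.
    - source_labels: [__address__]
      modulus: {{ key "prometheus/shards/modulus" }}
      target_label: __tmp_hash
      action: hashmod
    - source_labels: [__tmp_hash]
      regex: {{ key (printf "prometheus/shards/%s/regex" (env "HOSTNAME")) }}
      action: keep

consul-template can then re-render the config and trigger a Prometheus reload whenever those keys change.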

Results

The first thing we noticed over our old tier-based split was that query performance was significantly faster with the sharded setup. In certain cases we saw up to a 3× improvement in query time, with 2× being more common. Since Promxy is a scatter-gather system, it can dispatch queries to multiple shards at once and combine their results faster than a single tier-based Prometheus could return them. That is partially because the data load on each individual shard node is much lower, and partially because Promxy allows for query parallelization. In fact, our scrape load went from roughly 1.2M samples/sec/node in the tier-based system to less than 200K samples/sec/node in the sharded system, a dramatic decrease in timeseries storage and scraping responsibility for each shard node. This allowed us to replace gigantic nodes with boatloads of CPU and memory with significantly less powerful ones, which actually ended up saving us money. We were also able to use cheaper, less performant storage volumes, which saved us even more.

We also have the ability to expand this system further by changing the shard mapping, and the modulus itself if necessary. Promxy is able to stitch together metrics even if the shard mapping changes. If we want to increase the number of Prometheus nodes in the sharded system, we simply remove responsibility for scraping a particular shard from one Prometheus and give that responsibility to the new node. Since we’re using Consul to orchestrate all of this, it’s a straightforward change that can be done in less than ten minutes; a rough sketch of what that looks like follows below. Perhaps most importantly, this design bought us quite a bit of additional time to investigate and implement a remote storage solution. Many of those projects have matured quite a bit since this shard design was put into place, and we believe that remote storage is the future of timeseries at Wish.
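As a hypothetical illustration of that handoff, using the illustrative KV layout sketched earlier, moving a block of shards to a new node could be as simple as a couple of Consul KV writes, after which consul-template re-renders each node’s config:

# Hypothetical keys; shrink the old node's shard range...
consul kv put prometheus/shards/prom-shard-03/regex '^(24|25|26|27)$'
# ...and assign the freed shards to the new node.
consul kv put prometheus/shards/prom-shard-09/regex '^(28|29|30|31)$'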

Conclusion

Distributed scraping is a key part of the Prometheus scalability puzzle for Wish, and we’ve gotten a lot of benefit from putting this system in place. We’re now scraping a total of 17M samples/sec across the entire infrastructure, and our Prometheus infrastructure has been much more stable. Issues where queries could overload downstream Prometheus nodes have almost entirely disappeared thanks to the change in metric distribution, and we no longer encounter random Prometheus unavailability. Most of the hiccups we do have center around cross-region/cross-AZ communication; since this system requires coordination between many nodes, network issues and the like can cause problems. However, they have been minor so far, and we have extensive metamonitoring that lets the Prometheus nodes watch each other, so we’d know if a network partition or similar event occurred. Distributed scraping in conjunction with remote storage is probably the next evolution of this architecture at Wish, and we believe it should scale well into the future.

Thanks to Thomas Jackson, Vitaliy Sakhartchouk, Raine Medeiros, Kevin Long, and the Wish Infrastructure Tools & Automation team for their feedback and contributions to this post.
