What is Cortex?
My colleague Bryan recently pointed out to me that there’s nothing out there on the internet that explains what Cortex is. The best thing we have is the original, somewhat outdated design document.
Here, I’d like to remedy the situation, and provide a gentle but technical introduction to Cortex.
If your question is “What is Cortex?”, the answer is really very simple: Cortex is a horizontally-scalable, multi-tenant, Prometheus-as-a-service.
We’ll unpack what that means later. First, it’s time for a story.
At Weaveworks, we’ve been running Kubernetes in production for over three years. For almost all of that time, we have been using Prometheus as our primary monitoring and alerting tool. We love Prometheus, and particularly appreciate the way it works so well with Kubernetes.
In the early days, our Prometheus deployment was quite naive: we built our own image containing Prometheus and our configuration files and deployed it as a ReplicaSet. We made no provision for long-term storage. If Prometheus crashed, we’d lose our time-series data. If we upgraded the version of Prometheus, we’d lose our time-series data. If we changed our alerting rules, we’d lose our time-series data.
Writing it out like that, it sounds awful. Actually, the situation was surprisingly bearable. Most of the time when you look at dashboards or run a query, you are interested in only the most recent data. This is perhaps especially true in the early days of a new project, where so much is changing each day, making it pointless to look for trends or patterns over time.
Nevertheless, we did tire of constantly losing our metrics, so we decided to come up with some way to store them.
Now, the sensible thing to do would have been to investigate Kubernetes' many options for persistent storage, try a couple, pick one, and then run with it.
We decided to do something different.
If we were having these issues, we reasoned, perhaps others might be too. We already had a service that helped people understand what was going on in their clusters, so surely it made sense to expand that to include Prometheus time-series data.
That would mean we would be providing Prometheus-as-a-service.
From there, the rest is easy.
We did not want to be in the position of having one Prometheus instance per customer. Our team had tried that before, and knew from bitter experience that such systems are a pain to operate. Hence, we needed something that was multi-tenant.
We also wanted to be in the position where we could readily scale the service to match customer demand. Prometheus is very scalable, but it’s best scaled by sharding. Running one instance per customer is a special case of this, and we didn’t want to do that. We wanted something that was horizontally scalable.
There you have it: a horizontally-scalable, multi-tenant, Prometheus-as-a-service.
But what does it do?
How Cortex works
It all starts with a Prometheus instance running on your cluster. This instance is responsible for scraping all of your services.
Prometheus then forwards all the samples it scrapes to a Cortex deployment, which stores them in a database.
As a user, that’s really all you need to know. Things get a lot more complicated from here.
What Cortex is
Cortex is made up of several parts.
At its core are the ingesters. These continuously receive a stream of samples, group them together in chunks, and then store these chunks in a backend database, such as DynamoDB, BigTable, or Cassandra. A chunk is a fundamental data structure in Prometheus 1, and is an efficient way of storing a series of samples for a particular time-series. The point of the ingesters is to allow this chunking process to happen so that Cortex isn’t constantly writing to its backend database, which can be prohibitively expensive.
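To make the chunking idea concrete, here is a minimal sketch in Python. It is not Cortex's actual code: the `Ingester` class, the `CHUNK_SIZE` of four samples, and the plain list standing in for the backend database are all assumptions made for illustration. The point it shows is that many incoming samples turn into far fewer backend writes.

```python
from collections import defaultdict

CHUNK_SIZE = 4  # hypothetical: samples per chunk, chosen small for the demo


class Ingester:
    """Buffers samples per series in memory and writes whole chunks to a store."""

    def __init__(self, store, chunk_size=CHUNK_SIZE):
        self.store = store  # stand-in for DynamoDB, BigTable, or Cassandra
        self.chunk_size = chunk_size
        self.open_chunks = defaultdict(list)  # series -> samples not yet flushed

    def append(self, series, timestamp, value):
        """Add one sample to the open chunk, flushing when the chunk is full."""
        chunk = self.open_chunks[series]
        chunk.append((timestamp, value))
        if len(chunk) >= self.chunk_size:
            self.flush(series)

    def flush(self, series):
        """Write the open chunk for a series to the store as a single record."""
        chunk = self.open_chunks.pop(series, [])
        if chunk:
            self.store.append((series, chunk))  # one backend write per chunk


series = 'http_requests_total{job="api"}'
store = []
ingester = Ingester(store)
for t in range(10):
    ingester.append(series, t, float(t))

print(len(store), "chunks written for 10 samples")
```

In this sketch ten samples cost two backend writes, with two samples still waiting in memory. A real ingester would also flush on a timer, so that a slow series does not sit in memory forever.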
A Cortex deployment can have as many ingesters as its operators please, but generally no fewer than three. The ingesters are arranged in a consistent hash ring, keyed on the fingerprint of the time series. The ring itself is stored in a consistent data store, such as Consul.
If you are running five ingesters, each ingester will “own” a fifth of the ring. However, this fifth is not contiguous. It’s not like slicing a pie into five pieces. It’s more like slicing a pie into a thousand pieces, and then each ingester claiming every fifth piece.
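The pie analogy can be sketched in a few lines of Python. This is an illustration under assumptions, not the real ring: the function names, the 200 tokens per ingester, and the use of SHA-256 to place tokens are all mine. Because the tokens are placed by hashing, the slices are scattered pseudo-randomly rather than strictly "every fifth piece", but the effect is the same: no ingester owns one big contiguous wedge.

```python
import bisect
import hashlib


def token_for(key):
    """Map a string to a position on a ring of 2**32 slots."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


def build_ring(ingesters, tokens_per_ingester=200):
    """Give every ingester many small, scattered slices of the ring."""
    return sorted(
        (token_for(f"{name}-{i}"), name)
        for name in ingesters
        for i in range(tokens_per_ingester)
    )


def owner(ring, fingerprint):
    """Find the first token at or after the fingerprint, wrapping around."""
    tokens = [token for token, _ in ring]
    index = bisect.bisect_left(tokens, token_for(fingerprint)) % len(ring)
    return ring[index][1]


ingesters = [f"ingester-{n}" for n in range(5)]
ring = build_ring(ingesters)

# Count how often ownership changes hands as we walk once around the ring.
switches = sum(
    1 for i in range(len(ring)) if ring[i][1] != ring[(i + 1) % len(ring)][1]
)
print(len(ring), "slices,", switches, "changes of owner")
print(owner(ring, 'up{job="api"}'))
```

If each ingester owned one contiguous fifth, ownership would change hands only five times on a trip around the ring. Here it changes hands hundreds of times.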
Why does this matter?
The part of Cortex that receives samples from its users’ clusters is the distributor. The distributor is a stateless service that consults the ring to figure out which ingesters should ingest the sample. It finds the ingester that “owns” that particular fingerprint, and then sends it to that ingester, as well as the two “after” it in the ring.
This means if an ingester goes down, we have two others that have its data. It also means that if an ingester cannot receive writes, we can still accept writes for the metrics it might have been responsible for. This aspect of Cortex’s design is heavily inspired by Amazon’s Dynamo paper, which is well worth a read.
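Here is a rough Python sketch of that write path. The `Distributor` class, its method names, and the decision to skip tokens belonging to an ingester that has already been chosen are my assumptions for the sake of a small example, not a description of the real implementation. The replication factor of three comes from "that ingester, as well as the two after it".

```python
import bisect
import hashlib

REPLICATION_FACTOR = 3  # the owner plus the two "after" it in the ring


def token_for(key):
    """Map a string to a position on a ring of 2**32 slots."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


class Distributor:
    """Stateless: everything it needs is derived from the ring."""

    def __init__(self, ingesters, tokens_per_ingester=64):
        self.ring = sorted(
            (token_for(f"{name}-{i}"), name)
            for name in ingesters
            for i in range(tokens_per_ingester)
        )
        self.tokens = [token for token, _ in self.ring]

    def replicas_for(self, fingerprint):
        """Walk clockwise from the fingerprint, collecting distinct ingesters."""
        start = bisect.bisect_left(self.tokens, token_for(fingerprint))
        replicas = []
        for offset in range(len(self.ring)):
            name = self.ring[(start + offset) % len(self.ring)][1]
            if name not in replicas:
                replicas.append(name)
            if len(replicas) == REPLICATION_FACTOR:
                break
        return replicas

    def push(self, fingerprint, healthy):
        """Return the replicas that would accept a write right now."""
        return [r for r in self.replicas_for(fingerprint) if r in healthy]


names = [f"ingester-{n}" for n in range(5)]
distributor = Distributor(names)
fingerprint = 'http_requests_total{job="api"}'

replicas = distributor.replicas_for(fingerprint)
healthy = set(names) - {replicas[0]}  # pretend the owner has gone down
accepted = distributor.push(fingerprint, healthy)
print(replicas, "->", accepted)
```

Even with the owning ingester down, two replicas still take the write, which is the property described above.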
To recap: you run a Prometheus on your cluster, and it sends samples to Cortex. A distributor receives them, consults the ring, and forwards each sample to a number of ingesters. These append the sample to a chunk until it's time to flush that chunk to a backend database. What could be simpler?
Just as there is more to Prometheus than scraping metrics, there is more to Cortex. The project also contains a horizontally-scalable Alertmanager-as-a-service, a service for querying metrics (vitally important), and another service for running alert and recording rules.
Despite our intentions, Cortex is still not the easiest thing to run. However, many people and companies other than Weaveworks are running Cortex in production today, and contributing back to the code base. It’s gone from a capability within our product to a genuinely community-run open source project.
Personally, I have found it very exciting to see Cortex improve over time, becoming more performant and more reliable as we learn more about how to run it at scale, in production. I can’t wait to see what happens next.