I’ll skip the background about the Twitter migration and the space we’ve created at hachyderm.io, because Kris Nóva tells it better.
As we started to see the need to scale, we also realised that we needed to know what was going on in order to make good decisions. Enter Prometheus and Grafana!
Knowing what tools we had available was only the first step. In this post I’ll go through our current observability setup and describe how it has helped us understand issues and successfully scale to 10k+ active users.
Prometheus Exporters
There are a huge number of exporters available, each of which bridges some data source (logs, HTTP endpoints, etc.) into a time-series database (the core of Prometheus). Each exporter exposes an HTTP endpoint on its own port, which the Prometheus service scrapes at regular intervals.
The set of exporters we currently use on all of our servers, regardless of type (POPs, compute nodes, and backends) are:
- prometheus-node-exporter
- prometheus-nginxlog-exporter
- nginx-prometheus-exporter
- prometheus-statsd-exporter
- prometheus-postgresql-exporter
The backend node that doubles as our central Prometheus node also runs the Prometheus service itself, which scrapes all of the others, along with Grafana. One downside to this configuration is that if that backend goes down, we lose visibility into the whole system. There are a couple of mitigation options: run the Prometheus service on a node of its own (which would have the nice side-effect of removing the Prometheus and Grafana overhead from the backend node), or run scrapers on multiple nodes. The latter adds some complexity to the Grafana setup, which I’ll come to later, so I think we’ll end up implementing the former at some point.
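To make the scraping side concrete, the central config is just a list of jobs and targets. The sketch below is illustrative rather than our exact setup: the hostnames are placeholders, and the ports are each exporter’s usual defaults (4040 being the port we give the nginxlog exporter below).

# Illustrative prometheus.yml; hostnames are placeholders, ports are exporter defaults
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['pop-1:9100', 'compute-1:9100', 'backend-1:9100']
  - job_name: nginxlog
    static_configs:
      - targets: ['pop-1:4040', 'compute-1:4040', 'backend-1:4040']
  - job_name: nginx
    static_configs:
      - targets: ['pop-1:9113', 'compute-1:9113', 'backend-1:9113']
  - job_name: statsd
    static_configs:
      - targets: ['pop-1:9102', 'compute-1:9102', 'backend-1:9102']
  - job_name: postgres
    static_configs:
      - targets: ['pop-1:9187', 'compute-1:9187', 'backend-1:9187']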
Many of the exporters work with default, or well-documented configurations, but a couple did require some custom work.
prometheus-nginxlog-exporter
This exporter extracts information from the nginx access log; out of the box you get status code rates and traffic sizes. However, with a bit of work it will also give you response times.
The first thing to do is make sure that you’re logging the response times in the nginx access log. Add the following to the http block:
log_format with_response_time '$remote_addr - $remote_user [$time_local] '
                              '"$request" $status $body_bytes_sent '
                              '"$http_referer" "$http_user_agent" '
                              'rt=$request_time urt=$upstream_response_time';
This defines a new log format, with_response_time, which is almost identical to the default but with the addition of the request and upstream response times.
Then, in the server stanza that defines your service, you can add:
access_log /var/log/nginx/<server>-access.log with_response_time;
Of course, you can just add this at the top level and have a single access log with the response times in it, but we serve multiple domains from the same nginx service and I wanted to make sure we could isolate the hachyderm.io traffic.
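For illustration, this is roughly how that access_log line sits inside a per-domain server block (the listen and proxy details here are placeholders, not our real config):

server {
    listen 443 ssl http2;
    server_name hachyderm.io;

    # only this domain's requests land in this log, in the new format
    access_log /var/log/nginx/<server>-access.log with_response_time;

    # ... TLS settings, proxying to Mastodon, etc. ...
}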
Once both of these are in place, you need to tweak the nginxlog exporter config to handle these fields. Our config looks like:
listen {
  port             = 4040
  metrics_endpoint = "/metrics"
}

namespace "<service-name>" {
  source = {
    files = [
      "/var/log/nginx/<server>-access.log"
    ]
  }

  format = "$remote_addr - $remote_user [$time_local] \"$request\" $status $body_bytes_sent \"$http_referer\" \"$http_user_agent\" rt=$request_time urt=$upstream_response_time"

  labels {
    app = "<service-name>"
  }

  metrics_override = { prefix = "nginxlog" }
  namespace_label  = "server"
}
Two things to note: the format, which matches the one in the nginx config, and the addition of the app label and the namespace_label. This allows us to have the same metric name across namespaces, as per the documentation, which is important for our setup, where we have multiple services running and want to be able to reuse our Grafana dashboards.
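As an example of why that matters: the query below (assuming the exporter’s default metric names with the nginxlog prefix from the config above) backs a single Grafana panel that works for any of our services just by switching the server label.

# requests per second, broken down by status code, for one service
sum by (status) (
  rate(nginxlog_http_response_count_total{server="<service-name>"}[5m])
)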
nginx-prometheus-exporter
This exporter provides insight into HTTP request rates and the state of nginx connections. It also reports whether the nginx service is up, which is what we mainly use it for. To integrate it, your nginx server stanza needs to include:
location /metrics {
    allow 127.0.0.1;
    deny all;
    stub_status on;
}
and the exporter config:
NGINX_EXPORTER_ARGS="--nginx.scrape-uri=http://<your-domain-name>/metrics"
You want the exporter to resolve <your-domain-name> to the local machine, so make sure your /etc/hosts file includes:
127.0.0.1 <your-domain-name>
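A quick sanity check, assuming the exporter is listening on its default port of 9113, is to hit both endpoints from the machine itself:

# stub_status output, only reachable locally thanks to the allow/deny rules
curl http://<your-domain-name>/metrics

# the exporter's own metrics; nginx_up should be 1
curl -s http://127.0.0.1:9113/metrics | grep nginx_up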
prometheus-statsd-exporter
The only bit of configuration needed here is to ensure that Mastodon is exporting the statsd metrics. In your .env.production file:
STATSD_ADDR=127.0.0.1:9125
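For context, 9125 is the statsd exporter’s default listen port, and it re-exposes whatever Mastodon sends as Prometheus metrics on port 9102. The exporter will emit the raw dotted statsd keys as-is, but you can give it a mapping file for tidier names and labels; the key pattern below is a hypothetical illustration, not necessarily Mastodon’s exact naming.

mappings:
  # hypothetical statsd key; check what your Mastodon instance actually emits
  - match: "Mastodon.production.sidekiq.*.processing_time"
    name: "mastodon_sidekiq_processing_time"
    labels:
      queue: "$1"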
Grafana
How you set up your Grafana dashboards is partly a matter of personal preference, but I’ll take you on a quick tour of our setup in case it’s useful.
We have four private dashboards and one public one; the private dashboards cover NGINX, Mastodon, Node, and PostgreSQL.
NGINX
We have a few global charts and then a set of charts that repeat by server. The global charts give us a sense of how the service is performing and if we need to dig in deeper.
This first chart is our go-to for understanding how incoming traffic is balanced across our global servers.
This one allows us to understand if we’re having any latency issues globally.
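Behind the scenes this is driven by the response time fields we added to the access log. Assuming the exporter’s default summary metric names (these can vary between versions), the panel query looks something like:

# 90th percentile request time reported by each node, per service
nginxlog_http_response_time_seconds{quantile="0.9", server="<service-name>"}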
One of our charts in particular came in handy when we started having reports of 429s from EU users. Digging into our EU POP’s status code chart:
It was clear that we had a ~5m repeating pattern of rate-limit (429) responses from this node, which is not surprising if you read the Mastodon documentation. We didn’t see the same pattern on our other servers, and it was a short jump from seeing this to realising what was happening and resolving it (which you can see at the right-hand side of the chart).
The final chart that we use to keep an eye on things is our response rate by instance.
Mastodon
This is a custom dashboard that we built based on the statsd export which gives us insight into the sidekiq queues. This is (we found out!) the most important indicator of how users are experiencing the site, and it was the signal we used — and continue to use — to tune the number of workers and threads assigned to sidekiq queue processing.
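The exact metric names here depend entirely on how the statsd keys are mapped (the name below follows the hypothetical mapping sketched earlier, so treat it as a placeholder), but the key panel boils down to a query of this shape:

# 90th percentile sidekiq job processing time per queue; a sustained rise here is what users feel
max by (queue) (mastodon_sidekiq_processing_time{quantile="0.9"})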
Node
For this, we just use the out-of-the-box Node Exporter Full dashboard from Grafana.
PostgreSQL
We use an off-the-shelf dashboard from Grafana.
What’s next?
At this point, we have a pretty good view into what’s going on with the various machines and services we’re managing, which allows us to spot areas of improvement for hachyderm.io. These charts also let us dig in when we get reports of issues, and as we continue to scale out and fine-tune our architecture, they will let us tune the various bits of the stack.
As we bring up new machines, we only need to add them to the central scraping config to populate all of these dashboards, which means we get visibility from day one.
As mentioned above, it would be good to move this stack off the central backend node to make things more resilient in the face of outages. Getting some insight from our logs using something like Loki would also be a good next step.
I hope some of this is useful to anyone starting to scale out their Mastodon instances (or any other services!). If you have any questions about it please feel free to reach out to me on hachyderm.io.