Configure Consul for performance at scale

Pierre Souchay
Published in Criteo Tech Blog · Mar 17, 2020 · 8 min read

Consul has been used at scale at Criteo for more than 4 years and is one of the most important parts of our infrastructure: any downtime can damage the business significantly. With 12 datacenters (9 in production, 3 in preproduction) and more than 42,000 agents, configuring Consul to work properly at such a scale requires some expertise, analysis of incidents, and lots of patience. In this article, we’ll see how we improved performance and which settings matter most to ensure the best possible experience for our users.

The fall

A few days before 2018’s Black Friday (the most important business event for Criteo), Consul broke in our 2 biggest datacenters at the same time. We spent 12 hours stabilizing the situation and 6 hours in a degraded but operable state. You can get the full story in “Anatomy of a bug: When Consul has too much to deliver for the big day”. That was our most spectacular incident with Consul in production, and our last: since then, we have had no measurable outage. Downtime matters enormously because all discovery relies on Consul. Without it working properly, load-balancers don’t get notified of changes (see “Discovery with Consul at scale”), observability and alerting might miss some information (see “Mixing Observability with Service Discovery”), and it is impossible to launch new applications or connect to databases, caches, and microservices. Thus, we spend a lot of time ensuring our settings deliver the maximum performance and reliability to avoid such a disaster.

Zero measurable downtime for 15+ months with Consul

In a 2-year window, we went from 2 incidents per week to 0 in 15 months. Most of the changes were made in the Consul code itself (more than 80 merged pull requests), so you can benefit from them today, but many settings also affect the behavior of your cluster.

Our main Grafana dashboards checking the health of our 12 Consul clusters (42k+ nodes)

Scale Horizontally

On a large infrastructure, the most important thing is the ability to scale horizontally; here is how we do it. I recommend reading the “Server Performance” article first.

The key point to understand is that, by default, Consul tries to give you the most consistent result, meaning all your calls end up on the Consul server leader, hitting 1 single machine and preventing your infrastructure from scaling horizontally. So my advice is: request data from any server instead, and ensure your Consul servers are all functioning properly and not lagging behind the elected leader.
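As an illustration, a stale read can be sent to any agent over the HTTP API simply by adding the stale query parameter; the response headers tell you how stale the answer actually is (“web” is a placeholder service name, not one from our infrastructure):

    # Stale read: any server may answer, not only the leader; -D - dumps the headers.
    curl -s -D - -o /dev/null 'http://127.0.0.1:8500/v1/health/service/web?stale'
    # In the dumped headers, X-Consul-KnownLeader tells whether a leader is known, and
    # X-Consul-LastContact says how many milliseconds ago the answering server last
    # heard from the leader — a direct measure of the staleness of the result.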

Fine-tune your SDKs

See “Be a good Consul client”: the most important point is to ensure all your clients’ reads use “stale” and, with recent versions of Consul (1.4+), cached queries. Our C# SDK is not open source (too many ties to various internal projects), but for the JVM we use https://github.com/rickfast/consul-client, which supports stale requests as well as cached queries.

Not all SDKs support the stale parameter and some apps might be missing it, so I strongly encourage you to use the “discovery_max_stale” parameter we added (Consul 1.0.7+). In our infrastructure we use 5min, but it is safe to use greater values as long as you monitor the cluster’s staleness.
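This is a single agent-level setting; a minimal sketch with the 5min value we use (tune it to your own staleness tolerance):

    {
      "discovery_max_stale": "5m"
    }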

Fine-tune your DNS

By default, the DNS performance of Consul is poor because each DNS query triggers an RPC from the agent handling DNS to the Consul server. To get good performance, you have to ensure only stale requests are performed. The important settings are listed below, with a combined configuration sketch after the list:

  • Negative cache configuration with SOA; we use "soa": {"expire": 86400, "min_ttl": 30, "refresh": 3600, "retry": 600} to avoid clients frequently requesting non-existent services (Consul 1.3.0+).
  • "enable_additional_node_meta_txt": false: does not embed node metadata in DNS responses; very useful if you have some large services (we have more than 500 instances for very heavy services, and this allows packing more results into a single response).
  • "allow_stale": true: allows scaling the load horizontally by increasing the number of Consul servers. This is VERY IMPORTANT, as it is the only way to ensure all DNS queries do not end up on the same unique Consul server.
  • "node_ttl": "1m": we don’t need updates more often than every 1min for node changes (such as IP changes; we run bare-metal and Mesos on our own infrastructure, so you might end up with different values if you are using the cloud with many on-demand instances. But still, configure it).
  • With #5300, we added cache support for DNS. This is a game-changer, as requests for the most frequent services are answered immediately instead of performing an RPC to a server; we use "use_cache": true, "cache_max_age": "10s" (Consul 1.4.3+).
  • Ensure the service_ttl settings are configured properly. Since we added support for prefix-based TTLs (Consul 1.4.0+), you can for instance use “db-*” to easily specify a TTL for all your databases. Starting with Consul 1.5+, we added hot-reload support for those entries: updating the values no longer requires a restart, you can just reload the agent and keep a fine-tuned list without degrading your uptime.
DNS query latency, 99th percentile, probed with use_cache: true
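Putting these options together, the dns_config block of an agent looks roughly like the sketch below; the “db-*” prefix and the TTL values are illustrative, not a recommendation, so tune them to your services:

    {
      "dns_config": {
        "allow_stale": true,
        "node_ttl": "1m",
        "soa": {
          "expire": 86400,
          "min_ttl": 30,
          "refresh": 3600,
          "retry": 600
        },
        "enable_additional_node_meta_txt": false,
        "use_cache": true,
        "cache_max_age": "10s",
        "service_ttl": {
          "db-*": "30s",
          "*": "5s"
        }
      }
    }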

Async refresh of ACLs in WAN setups

In Consul, if you are using legacy ACLs, the ACLs are stored in a single DC, and other DCs sync periodically based on a TTL.

For a long time, we had issues when the links between our datacenters were still up but degraded. The solution was a patch ensuring that when an ACL was outdated, a single asynchronous refresh was performed, without blocking the RPC call if the outdated ACL allowed it. With any setting other than "acl_down_policy": "async-cache", when the link quality is poor and several hundred identical requests hit your server, the server asks the remote ACL datacenter for the new value of the ACL. Since that lookup was blocking, lots of ACL requests were performed over the WAN and performance degraded quite a lot. Using the value “async-cache” (Consul 1.2.1+) ensures the cost of looking up ACLs stays constant even when the links between DCs are degraded.
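With legacy ACLs, enabling this behavior is a one-line change on the agents (a minimal sketch; the rest of the ACL configuration is omitted):

    {
      "acl_down_policy": "async-cache"
    }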

Configure ARP Cache for all your agents

Consul agents use gossip, so all agents discover and talk to all other agents. In the early days, we had incidents where our network gear pushed more packets for discovering IP addresses than actual useful traffic. This happened because the ARP cache (limited to 1024 entries on many systems) was constantly exhausted, so all Consul agents kept asking for the MAC addresses of all IPs.
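On Linux, this means raising the kernel’s neighbor-table garbage-collection thresholds via sysctl. The file path and values below are an illustrative sketch, not our exact production settings; size gc_thresh3 well above the number of hosts an agent talks to:

    # /etc/sysctl.d/50-consul-arp.conf (illustrative path and values)
    # gc_thresh1: below this number of cached entries, the kernel never collects
    net.ipv4.neigh.default.gc_thresh1 = 4096
    # gc_thresh2: soft maximum, collection starts above this
    net.ipv4.neigh.default.gc_thresh2 = 8192
    # gc_thresh3: hard maximum number of ARP entries
    net.ipv4.neigh.default.gc_thresh3 = 16384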

Configure security properly

Consul health checks can register scripts or programs to execute locally on agents. If you allow anyone to register such health checks, it opens the door to Remote Code Execution (RCE). Scripts are, however, very useful for performing advanced checks, so we added support for enable_local_script_checks, which lets you use script checks while forbidding their registration through the HTTP API. If you still want to register script checks through the HTTP API, you can use the support we added (Consul 1.3.0+) to forbid write access to the HTTP API from outside localhost with "http_config.allow_write_from": ["127.0.0.0/8", "::1/128"], restricting changes to an agent’s local configuration to local addresses. So prefer enable_local_script_checks to enable_script_checks.
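A sketch of the corresponding agent settings, to be merged with your existing configuration:

    {
      "enable_local_script_checks": true,
      "http_config": {
        "allow_write_from": ["127.0.0.0/8", "::1/128"]
      }
    }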

Use ACLs everywhere for writes: it will avoid many issues (including fat-fingered changes) and will limit the growth of entropy in your datacenters.

Other things to check

Monitor everything

We monitor our Consul clusters extensively with tooling from HashiCorp, using Prometheus, for which we added support in 2018. But we also perform some blackbox monitoring: probes running randomly in our DCs try to do the following:

  • Create a new service instance in the Consul cluster and measure the time until the updated value can be read (service registration latency). Every 5 seconds, we also compare the answer of the server that replied (which might not be the leader) with the leader’s to compute the staleness between servers. This tells us the maximum staleness the cluster can have. We report it to Prometheus and alert when the value exceeds 1.5 seconds (with a call at night if it exceeds 3s).
  • Create/update a key in the K/V store and measure how long it takes to become visible through a stale read.
  • Create alerts about duplicated nodes (duplicate node IDs) and a few other possible errors by parsing the Consul servers’ logs.
  • Watch the changes/sec in the cluster per service. This gives a good indication of how your services behave (we have a consul-templaterb template that provides this information).
Per-service QPS during a quiet period, showing the most requested services and health queries per DC

Fix everything weird in server’s logs

Every warning in the Consul logs should be understood by your operational team and fixed where possible (if you don’t understand an error, create a bug report). Don’t let entropy break your clusters; ensure everything is green.

Investigate the reasons behind any drop in performance or reliability, and be patient (it took us a few months to fix several issues; for instance, #3217, found in early 2017 and reported mid-2017, was actually fixed by our patches in memberlist #178 and #5313 in 2019!).

Limits

Watch limits carefully on the OS side. On our side, in systemd, we use the following line: LimitNOFILE=65536

If you are using Consul 1.6.3+ with large templates or heavy API usage, beware of http_max_conns_per_client (100 by default).
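This setting lives in the limits block of the agent configuration; a sketch raising it, where 8192 is only an illustrative value for template-heavy hosts, not a recommendation:

    {
      "limits": {
        "http_max_conns_per_client": 8192
      }
    }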

Videos to configure Consul Clusters at scale

Operating Consul at scale (HashiTalks ‘19)
A Consul Story: to 2000 Nodes and Beyond (Bloomberg)

Conclusion

While Consul is internally a complex piece of software, it is pretty easy to administer and scales quite well on a significant infrastructure. At Criteo, while it is one of the most critical pieces of our infrastructure, it works at scale with 40,000+ agents in 12 datacenters, allowing us to discover more than 260,000 services at any time without any downtime. With care regarding stale requests, it is possible to scale to thousands of requests/second on a few servers, 24/7, for months.
Be sure to read our other articles on how to achieve good performance in your applications and get the best out of your Consul clusters for your business.
