Varnish Sharding with Istio in Kubernetes
How to use Istio to transparently implement consistent hash-based load balancing across multiple Varnish instances — sharding based on the HTTP request URI.
Varnish and Istio
Varnish is a popular HTTP cache used by many companies for their service backends. It serves as the technological core of some of the world’s largest Content Delivery Networks (CDNs). Varnish can drastically reduce the time-to-first-byte metric and improve user experience for cacheable content, even when used in addition to edge delivery over a CDN. This is due to its ability to collapse (as RFC 9111 calls it) or coalesce (as Varnish often calls it) multiple equivalent requests from CDN edge servers into one request for the service backend.
Istio is a popular service mesh often used in Kubernetes deployments for its traffic shaping, observability and access control capabilities. The core technology on the “data plane” used by Istio is the Envoy proxy, a highly (runtime-)configurable HTTP and TCP proxy.
The Problem
At our project, we were facing the issue of how to effectively implement a caching strategy using the free in-memory Varnish HTTP cache. In essence, there are some operational challenges when it comes to implementing HTTP caching:
- How to handle ever-increasing load for the Varnish instances? — Varnish itself, being extremely CPU- and memory-efficient, can handle large amounts of traffic before breaking a sweat. However, at some point there is going to be the need to provide it with more resources (CPU, memory, network bandwidth). This can be solved (to some extent) by vertical scaling (simply increasing the resources available to one Varnish instance) or by horizontal scaling (distributing the load across multiple Varnish instances — on multiple VMs/hosts/nodes).
- How to achieve high availability (HA) for Varnish (or any service, for that matter)? The outage/unavailability of some of the instances — maybe due to the outage of a Cloud provider’s availability zone — should never render the whole content delivery unavailable. So, we effectively need multiple horizontally scaled Varnish instances in a “cluster”, spread across multiple VMs/nodes in multiple availability zones.
- How to keep the effectiveness of the caches high? All of the Varnish instances should effectively act as one instance. Equivalent requests must not be scattered across all of the instances, with each instance then making its own (potentially high-latency) backend request.
This is also important when using a CDN in front of our Varnishes. Many CDN edge servers will send the same request at the same time when geographically distributed users request the same content. If we were to simply round-robin (or better, least-request) load-balance these requests across all of our Varnish instances, a bigger fraction of them would incur higher latencies due to backend requests being made. Compare this to the scenario of employing only one Varnish instance (or routing equivalent requests to the same Varnish instance): The Varnish instance either has the response for this particular request cached and can serve it right away, or it will collapse/coalesce all incoming equivalent requests and serve the response to all of them after the backend server’s response.
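The collapsing/coalescing behaviour described above can be illustrated with a small sketch. This is our own toy illustration, not Varnish code — the `CoalescingCache` class and `fetch` callback are made-up names. The key idea: concurrent requests for the same cache key share one backend fetch.

```python
import threading

class CoalescingCache:
    """Toy illustration of request collapsing/coalescing (not Varnish code):
    concurrent requests for the same key share a single backend fetch."""

    def __init__(self, fetch):
        self._fetch = fetch            # backend call, e.g. an HTTP request
        self._lock = threading.Lock()
        self._cache = {}               # key -> cached response
        self._inflight = {}            # key -> Event set when the fetch is done

    def get(self, key):
        with self._lock:
            if key in self._cache:            # cache hit: serve right away
                return self._cache[key]
            done = self._inflight.get(key)
            leader = done is None
            if leader:                        # first requester becomes the leader
                done = self._inflight[key] = threading.Event()
        if leader:
            value = self._fetch(key)          # exactly one backend request
            with self._lock:
                self._cache[key] = value
                del self._inflight[key]
            done.set()
            return value
        done.wait()                           # followers wait for the leader
        with self._lock:
            return self._cache[key]
```

Five concurrent `get("/content/a")` calls would trigger only one `fetch`. Varnish applies the same idea per cache key — and additionally handles uncacheable responses, timeouts, and request serialization, all of which this sketch ignores.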
Initial “Naive” Solution
If we consider the first two of the points above, we can come up with a first simple design for our Kubernetes deployment:
Here, we have user requests coming from arbitrary publicly routed IP addresses (generally referred to as “The Internet”). Each arrow in this diagram shows one potential HTTP request between two endpoints.
The CDN serves as our first caching and edge-acceleration layer. This CDN has multiple edge locations/servers, all of which fetch content from our (Cloud) Load Balancer. This Cloud Load Balancer will round-robin the requests to any of our healthy ingress controller pods. These ingress controller pods will then round-robin the requests (by means of kube-proxy iptables/netfilter) to any of the (healthy and ready) Varnish instances/pods, which in turn will request fresh content from any of the healthy and ready backend instances.
Let’s now consider three different kinds of requests, for example: /content/a , /content/b , and /content/c. In our case, requests are of a different kind when their request URIs differ (in the path and/or the query string). The response to any one of these three kinds of requests cannot be used as a cached response for the other kinds.
In the below diagram, a red arrow will be one kind of request (all of which can be served from a cache from the response of either of those red requests), and a green arrow as well as a blue arrow will be other kinds of request:
In the current (naive) design, the flow of requests can look like above. The CDN will collapse/coalesce some of the requests for geographically close users, due to them being DNS-resolved to the same CDN edge server IP address (or the CDN using one single IP address with Anycast routing). In this case, of the three red requests, our load balancer will still receive two. In total, this means our infrastructure has to handle four requests — two of which are of the same kind (i.e. have the same request URI).
When going further, our cloud load balancer will distribute these four requests (which we also assume to be four distinct TCP/IP connections) to all of our ingress controller pods.
Problems of the Naive Approach
Here, we see the first “violation” of our third bullet point from above: Two red requests are routed to two different Varnish instances. Ideally, we wanted requests of the same kind (or equivalent requests) to be routed to the same Varnish instance. In this scenario, the Varnish instance already holds a cached response for this request and is able to serve it to the caller right away. Or, if that Varnish instance does not yet have a cached response, it will only ask one backend once.
But in the case above, we ask two different Varnish instances, each of which will then ask one backend instance.
Another issue that we see in the above setup is that load was not optimally distributed across the backend instances. One backend instance got two requests at the same time, while another backend did not get any requests. It would be preferable to have an equal distribution of requests across all backend instances.
Desired Behaviour
Ideally, the situation should look more like this:
Here, while requests from the Cloud Load Balancer are still round-robin distributed to our ingress controller pods, each of the two red requests is served by the same Varnish instance. Coincidentally, green requests are also distributed to the left-most Varnish instance. So, in order for the Varnish instances to receive an equal portion of the total requests, there have to be enough different kinds of requests.
There are downsides, though, of sharding Varnish in this way: Whenever one or more of the Varnish instances fail/crash, we lose those Varnishes’ shares of cached content. A big fraction of content would then have to be fetched from the backend services by the remaining healthy instances.
Ways of sharding Varnish “natively”
As outlined above, the desired solution requires routing same requests to the same Varnish instance.
In the blog article Creating a self-routing Varnish cluster, author Reza Naghibi shows a way to implement this kind of sharding by means of VCL code, without yet making use of Istio.
Naghibi approaches the problem by hashing the request using the “hash” director natively supported by Varnish and having each Varnish instance know all other Varnish instances as additional backends. Here, requests that happen to hit “the wrong” Varnish instance would then simply be passed through to the correct Varnish instance. This latter instance then computes the same hash of the request and identifies itself as the one responsible for caching this request.
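As a rough sketch of the idea — not Naghibi’s exact code, and with placeholder host names — the hash-director part of such a VCL configuration could look roughly like this:

```vcl
vcl 4.1;

import directors;

# Placeholder host names; in a real cluster these would be
# the other Varnish instances.
backend varnish_0 { .host = "varnish-0.example.internal"; .port = "80"; }
backend varnish_1 { .host = "varnish-1.example.internal"; .port = "80"; }

sub vcl_init {
    # The hash director deterministically maps a string to one backend.
    new cluster = directors.hash();
    cluster.add_backend(varnish_0, 1.0);
    cluster.add_backend(varnish_1, 1.0);
}

sub vcl_recv {
    # Hash on the request URL: equivalent requests resolve
    # to the same Varnish instance.
    set req.backend_hint = cluster.backend(req.url);
    # The full self-routing setup additionally detects whether *this*
    # instance is the selected one and, if so, performs a normal cache
    # lookup instead of proxying onwards.
}
```

The essential property is that every instance computes the same hash over the same input (the request URL) and therefore agrees on which instance owns which share of the keyspace.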
If we employed this strategy in our design as an improvement to the “naive” solution, then the diagram changes as follows:
The red request coming in to the Varnish instance in the middle would be proxied/passed to the Varnish instance on the left-hand side. And, due to hashing, the left-most Varnish instance just happens to be responsible for both the red and the green requests, in this case.
Making this Kubernetes-aware
This VCL-based solution requires each Varnish to know all (other) Varnish instances by their host names in their VCL configuration. That can be a problem in a dynamic Kubernetes environment with horizontal pod autoscaling and when instances may fail (become unready/unhealthy) for various reasons.
Shouldn’t there be a way to make Varnish “kubernetes-aware”? That’s precisely the question the authors of the kube-httpcache project set out to answer.
kube-httpcache itself is a Kubernetes API client that enumerates the Pods selected by a Kubernetes Service (by enumerating the <IP, port> pairs on the respective Endpoints/EndpointSlice Kubernetes object) in order to always know about all the Varnish instances in the cluster. Whenever this configuration changes, it will generate a new VCL file and reconfigure all Varnish instances in the cluster at runtime accordingly.
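Conceptually, the VCL-generation step boils down to a templating function over the current endpoint list. The following is our own simplified Python sketch of that idea — kube-httpcache actually uses Go templates, and the function and backend names below are made up:

```python
def render_vcl_backends(endpoints):
    """Render VCL backend definitions plus hash-director registrations
    from a list of (ip, port) pairs, as obtained from an EndpointSlice.
    Simplified illustration; kube-httpcache's real template differs."""
    backends = "\n".join(
        f'backend varnish_{i} {{ .host = "{ip}"; .port = "{port}"; }}'
        for i, (ip, port) in enumerate(endpoints)
    )
    registrations = "\n".join(
        f"    cluster.add_backend(varnish_{i}, 1.0);"
        for i, _ in enumerate(endpoints)
    )
    return (
        f"{backends}\n\n"
        "sub vcl_init {\n"
        "    new cluster = directors.hash();\n"
        f"{registrations}\n"
        "}\n"
    )

print(render_vcl_backends([("10.0.0.5", 80), ("10.0.0.6", 80)]))
```

Every time the watch on the Endpoints/EndpointSlice object fires (pod added, removed, or becoming unready), the template is re-rendered and the new VCL is loaded into each running Varnish instance.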
Downsides of this Approach
When looking very closely at both the previous diagram and the one in the Desired Behaviour section above, one difference becomes apparent: In the Varnish “native sharding” solution, Varnish instances need to pass requests to another Varnish instance when they receive requests that they should not be handling. This can add a little bit of extra latency.
But more importantly, it adds to the complexity of the Varnish installation and its configuration:
- We need to make Varnish itself Kubernetes-aware so that it knows about all the instances in our current cluster.
- We need to write VCL code (as a template file) to implement the routing/proxying logic in Varnish itself.
Additionally, what if we wanted the same sharding/routing logic not only for this particular Varnish cluster, but for other kinds of services as well?
Oh, and by the way: We still did not solve the problem of unequal load to our backend service instances. In the previous diagram, the left-most backend instance received two requests (and also two separate TCP/IP connections) from the same Varnish instance by means of iptables routing in Kubernetes.
Introducing Istio
After having seen all of the problems above and the service-specific solutions to some of them, it is time to introduce Istio as a Service Mesh into the mix and see how it can help us improve the situation by solving all of the above mentioned problems in an elegant and service-agnostic way.
Istio as a Layer 7 Load Balancer
Unlike traffic routing in a “normal” Kubernetes cluster with the kube-proxy component configuring Linux’s iptables rules, which can only implement Layer 4 proxying and load balancing, Istio can act as a Layer 7 load balancer. That means Istio knows what an HTTP request is and can perform routing based on HTTP request information such as the Host/Authority header.
This is an important and necessary property in order to solve the “unequal load” problem towards our service backend instances. It means that Istio is aware of what an HTTP request is and how many outstanding HTTP requests a particular backend (or “upstream” — as Istio/Envoy is calling it) is currently handling.
With classic kube-proxy iptables load balancing, whenever a TCP/IP connection is established from a client to a Kubernetes Service IP address, this connection will first resolve to one random Pod selected by the Kubernetes Service and then stay bound to this same Pod IP until the actual TCP/IP connection is closed. So, multiple HTTP requests going over this connection will always hit the same pod.
But where exactly is the problem with this “TCP connection” load balancing? If we consider HTTP/1.1, each currently active request basically blocks its “carrier” TCP/IP connection, and no two HTTP/1.1 requests can be issued concurrently over the same TCP/IP connection. So, if we only ever opened one connection to a backend server instance, we could only ever issue one HTTP/1.1 request at a time, and that server’s utilization would be capped at one concurrent request.
The problem starts with more than one concurrent connection needing to be opened to a backend server, maybe because we want more than one request at a time. Here, iptables will basically always select a random Pod IP address for each new connection to the Service’s IP. It will not keep track of the number of currently active connections to any of the possible Pod IP candidates. So, when creating two or three concurrent connections, it is possible for all of them to resolve to the same Pod IP address.
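This effect is easy to see in a small simulation — our own illustration of kube-proxy’s random selection, not its actual code:

```python
import random

def random_assign(num_pods, num_connections, seed):
    """Model iptables-style load balancing: every new TCP connection is
    sent to a (roughly) uniformly random pod, with no knowledge of how
    many connections each pod already handles."""
    rng = random.Random(seed)
    return [rng.randrange(num_pods) for _ in range(num_connections)]

# How often do 3 concurrent connections to a 3-pod Service all land on
# the same pod? Analytically: 3 * (1/3)^3 = 1/9, roughly 11% of the time.
trials = 10_000
all_on_one = sum(
    1 for seed in range(trials)
    if len(set(random_assign(3, 3, seed))) == 1
)
print(f"all three connections on one pod: {all_on_one / trials:.3f}")
```

In other words, roughly one in nine times all three concurrent connections hit a single pod, leaving the other two pods idle — which is exactly the imbalance we observed in the diagrams above.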
Istio, on the other hand, “sniffs” the Host header of outgoing HTTP requests and then routes the actual HTTP request via a configurable load balancing scheme, regardless of which IP address the caller/client thought they connected to. The default scheme in Istio/Envoy is “least request”. That means that each Istio/Envoy proxy keeps track of the number of pending HTTP requests for each of the possible backend/upstream hosts behind each Kubernetes Service.
Then, for each request to be load-balanced, Istio/Envoy randomly selects two possible backend/upstream instances and determines which of these two is currently serving the least amount of requests originating from this Envoy proxy (the “power of two choices” algorithm). So, in total, load is not globally optimally distributed, since not all Istio/Envoy proxies know about the current requests of all other Envoy proxies; but it is still an improvement over a completely random distribution.
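The “power of two choices” selection can be sketched as follows. This is our own simulation, not Envoy code, and it simplifies by counting total assigned requests per host rather than currently outstanding ones — but it shows the balancing effect well:

```python
import random

def pick_random(load, rng):
    """Pick a host uniformly at random (iptables-style)."""
    return rng.randrange(len(load))

def pick_p2c(load, rng):
    """'Least request' via the power of two choices: sample two hosts,
    take the one with the smaller locally observed load."""
    a, b = rng.sample(range(len(load)), 2)
    return a if load[a] <= load[b] else b

def spread(picker, hosts=5, requests=100_000, seed=7):
    """Assign requests with the given picker; return max minus min load."""
    rng = random.Random(seed)
    load = [0] * hosts
    for _ in range(requests):
        load[picker(load, rng)] += 1
    return max(load) - min(load)

print("random spread:", spread(pick_random))
print("p2c spread:   ", spread(pick_p2c))  # consistently much smaller
```

Sampling just two candidates is enough to keep the hosts’ loads very close together, without any proxy needing a global view of all in-flight requests.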
Other Load Balancing Schemes
The “least request” load balancing scheme helps us with distributing load somewhat equally across all our backend service instances, but does not solve the problem of routing requests optimally towards our Varnish instances.
For this, we need another kind of load balancing supported by Istio/Envoy: Consistent Hashing.
What we want is basically a hashing of the request URI, like we would have done in the Varnish-native VCL configuration approach above, but this time implemented in Istio/Envoy. Envoy basically supports four different sources of information for this request hashing:
- Hashing based on a named HTTP header in the HTTP request
- Hashing based on a particular query parameter in the query string of the request URI
- Hashing based on a particular cookie in the Cookie HTTP request header
- Hashing based on the source IP of the request’s IP packet
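A minimal ring-hash sketch — our own illustration of the consistent-hashing idea, not Envoy’s implementation — shows why hashing makes the routing sticky per key while still spreading distinct keys across instances:

```python
import hashlib
from bisect import bisect

def build_ring(instances, vnodes=100):
    """Place `vnodes` points per instance on a hash ring.
    Virtual nodes smooth out each instance's share of the keyspace."""
    return sorted(
        (int(hashlib.md5(f"{inst}#{v}".encode()).hexdigest(), 16), inst)
        for inst in instances
        for v in range(vnodes)
    )

def route(ring, path):
    """Route a request path to the first ring point at or after its hash,
    wrapping around at the end of the ring."""
    h = int(hashlib.md5(path.encode()).hexdigest(), 16)
    points = [point for point, _ in ring]
    return ring[bisect(points, h) % len(ring)][1]

ring = build_ring(["varnish-0", "varnish-1", "varnish-2"])
# The same URI always resolves to the same instance...
assert route(ring, "/content/a") == route(ring, "/content/a")
# ...while distinct URIs are spread across the instances.
targets = {route(ring, f"/content/{i}") for i in range(50)}
print(sorted(targets))
```

A further property of consistent hashing (not shown here) is that removing one instance only remaps the keys that instance owned, instead of reshuffling the entire keyspace — which limits the cache loss on scaling events.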
How to use the request URI?
Sadly, none of these four sources of information directly reflects the request URI in the request line of the HTTP request.
At the time of writing, there are GitHub issues on the Istio repository that ask the same question. One proposed workaround is to introduce an intermediate HTTP proxy (such as nginx) which will add an artificial HTTP request header (such as x-url) containing the request URI of the original request. Then, Istio would be configured to use this header for hashing.
But luckily, Envoy also supports so-called “pseudo-headers”, which were initially introduced as part of the HTTP/2 specification. In particular, HTTP/2 introduced the “:path” pseudo-header. Envoy supports this :path pseudo-header also for HTTP/1.1 requests. That is great, because most often HTTP/1.1 is still the application transport protocol of choice between services in the backend.
A solution using this pseudo-header was then proposed in the previously mentioned GitHub issue: We can treat the :path pseudo-header like any regular HTTP request header and instruct Istio/Envoy to hash based on this “header”.
Istio supports the DestinationRule CRD to configure load balancing for a particular upstream. In our case, the relevant part of the DestinationRule spec should look like this:
```yaml
spec:
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpHeaderName: ':path'
```

Conclusion
If we include the Istio/Envoy proxy components into the Desired Behaviour diagram from above, the final solution becomes:
The Istio/Envoy proxies now have two responsibilities:
- They distribute the requests from the Ingress instances to the Varnish instances via consistent hash-based load balancing
- They distribute the requests from the Varnish instances to our backend service instances using “least request” load balancing
In our concrete production setup, implementing this Istio-based Varnish sharding resulted in a visible reduction of the p80 (dark orange) and p95 (light orange) latencies through the ingress controller pods as shown in the below chart:
Should you really use Istio for this?
But let’s be real: If sharding a Varnish cluster and doing least-request load balancing were your only concerns, then adding Istio to your infrastructure is likely not the right choice. There is always a way to solve isolated problems like these without adding the burden of having to maintain a service mesh like Istio.
- You want Varnish sharding? Implement it in VCL as shown above.
- You want least-request load balancing? Add an Nginx proxy between your caller and the load-balanced backend service. Nginx calls this “least connections” load balancing, but it is basically the same as Envoy’s “least request” load balancing, since Nginx monitors the activity status of each connection. So, both Nginx and Envoy effectively track the number of active HTTP requests.
However, you should now have realized that you would also have to make this Nginx instance Kubernetes-aware, so that it knows about each (possibly dynamically changing) backend service instance and their IP addresses.
But there are obvious benefits of Istio/Envoy, too: You can solve many aspects at once with the same tool! And, in addition to the traffic shaping features that we were using here in the context of this article, Istio (or any other service mesh, for that matter) brings a whole lot more features to the table that you really don’t want to miss once you’ve become aware of them and have seen them in action, like:
- observability via service-agnostic metrics and access logging
- improved security via authentication, authorization and mutual TLS
- and, of course, other “traffic shaping” features like configurable automatic retries (for errors or timeouts), failure injection and circuit breaking
So, if you already use Istio for these features, then you might as well also want to stay with a single tool and implement hash-based load balancing with it. 🙂
