I have been involved with building a middleware platform (wso2.com) for last 6 years. Most our customer deployment are clusters they always need high availability and sometimes scalability.
Most our servers are stateless ( that is they keep their state in a data base). Two notable exceptions are our CEP server and message broker, but let’s ignore them for now.
Setting up a cluster of stateless servers is easy ( or so we believe). We setup the servers and put a load balancer (F5, HA Proxy, mod_proxy, Nginx) to distribute the load. Unfortunately, we need little bit more. We need to handle security, sessions, throttling, and artifact deployments across the servers in the clusters.
The Problem: Token Invalidation
Let’s consider security. User’s login, get a token (e.g. OAuth), start a session, and keep invoking servers. It is too expensive to hit the database for token per each call, hence we have to cache the tokens. Now if the token is revoked, we have to remove the token from the data base and then invalidate the cache in all the nodes. This is the token invalidation problem.
There is a magic answer to this problem. A distributed cache. When you invalidate an entry, it will take care of talking to all the nodes and invalidating the entry.
Distributed Caching Woes
All these is good. Everything worked well for some time. Eventually, nightmare is coming. Distributed cache systems are brittle. We see the system unstable, frozen, or dead.
Let me explain why. When I invalidate the cache, distributed cache has to know what nodes have the entry ( or send a message to all the nodes). To do this, it needs to know what are the nodes in the system. Trick is that all nodes in the system have to agree on what nodes are in the system. This is distributed consensus. To solve the above problem, you need to maintain a group communication system. ( read about Ken Birman and Virtual Synchrony).
Maintaining a membership group means lot of trouble. Let me list few.
First problem that that they do not scale very well. Running this with more than five nodes is tricky.
Second problem is that when a node has failed, other nodes will detect that and remove the node from the group. ( there are several ways to do this, and they usually involve heartbeat and a consensus algorithm). Problem is that there are lot of false positives.
- When node is loaded ( say load average is twice number of cores), it does not respond to heartbeats fast enough and other nodes think that the give node has failed. They will rearrange groups and reshuffle data.
- In the world of VMs and Docker, networks are not very reliable. Network itself can have latency spikes that exceed heartbeat limits. Nodes also die more frequently.
Now there are few problems. When a node is presumed dead, other node reshuffle the data. The presumed failed node does not know it has failed and continue to work. After few minutes, it will join back. When it joins back, system need to reshuffle the data again. If you are unlucky, then you can run into the situation where nodes keep coming and going where the system will continue to repair itself.
If you want to recreate this problem, pick one of your favourite distributed cache, load about 1G data, and then run some cpu hogging processes in a those servers.
You will tell me that I have to make sure my server is not loaded. Well that is bullshit. This throws “stability” and “graceful degradation” down the drain. Problem is that even a short term load spike lead to system level reset and repair. Also, last thing I need when my server is loaded is to start repair and data shuffling.
It is possible to run a distributed cache without trouble if you do lot of monitoring and hand holding to make sure nodes are not loaded. However, it is hard to justify that kind of attention for small and medium size deployments.
Furthermore, the serious version of the above problem is a split brain. That is network partition the nodes to two clusters. Then system has to make sure only one half is working ( otherwise, you might end up with conflicting updates and stale data).
Note that the key issue is not the consensus algorithm, but the data shuffling after detecting a failure. When nodes are loaded, there will be false failures detected, which will lead to data shuffling, which will put more load ion the server. This kind of positive feedback loops are a recipe for disaster.
I do not see a solution to high sensitivity of the distributed cache to load spikes and network latency spikes. In the world of VMs and docker instances, and SDNs ( Software Defined Networks), performance and latency variability will get worse, not better.
At this point, you feel like the hermit from that story who brought a kitten to chase out mice and end up getting a wife and having to go for a job. Remember, all these started because we wanted to revoke the tokens.
I believe distributed cache is the wrong solution to this problem. I do not tell this lightly. In span of last five years, we have used EhCache, Infinispan, Zookeeper, then Hazelcast. Nothing could hold off in real conditions. They can pass a week long release testing, but fails at customer sites after running for months.
Same problems happen if you try to do throttling, session replication, or artifact deployments using distributed cache. I will not dig in this post to keep it shot.
There are other ways to solve the problem.
Solution 1: Use Local Caches and Cache Timeouts
Simplest solution is to use a local cache, set a short a cache timeout, and do nothing. If the token is revoked, it will get flushed after the timeout.
I agree this does not work for all the cases. There are crucial use cases where you will need immediate revokes. However, it is worth asking does your use case worth spending additional $100k/year. Chances are you will spend more than that solving the above problem.
Solution 2 :Explicitly invalidate the Cache Entry in each Local Cache
Idea is that, when a token has revoked, we talk to each server and explicitly invalidate the cache. We need to add to each server an service API to invalidate the cache. However, adding an API is relatively straight forward.
Unfortunately, we are not done. How did you get the list of current servers? How about getting it from a distributed coordination framework such as Hazelcast or Zookeeper. It is possible. However, if you did, then you will have earlier troubles minus data shuffling. Actually, data shuffling is one of the hardest problems here and without that you might be good.
On the other hand, there is a simpler solution. We can give each node a list of other nodes in the cluster as configuration. This is much simpler. However, then you will have to restart the cluster if you want to add a new node, or have some way to refresh the node list at the run time (e.g. put it to a config file, get server to check it every 15min, and use rysnc to update it in all nodes when required).
Alas, our difficulties are not yet over. Nodes in the list may have failed, or not responsive, or network has partitioned. Let’s discuss solutions in the next section.
Solution 3 :Explicit invalidation with reliable delivery
When we try to revoke the cache entry and if a node has failed, then we have several options.
- We can retry, until the cache timeout has reached with some wait time in between. This should be OK as long as the number of invalidation are relatively small.
- We can use a persistant messaging system ( like WSO2 MB or ActiveMQ) to deliver the invalidation messages. Then, even if node is not available, the node will pick up the invalidation message when it come back. This solution is very stable. However, this means you need highly available deployment of the messaging system. Given that load on messaging system is small such deployments are well understood.
Solution 4:Use Session affinity and Explicit Invalidation
If you already use session affinity, then there is a simpler solution. On that case, the cache entry will only reside in the node where that client is bound by session affinity. To invalidate, send a invalidation request with client token set as the session, and the invalidation request will be routed to the only node
Distributed Caching is often used as a solution for cache/token invalidation. Although it provide solution that looks easy, we have seen lot of instability. Resulting system is brittle and very sensitive to performance spikes. This kind of complexity can hardly be justified for small and medium size highly available deployments. I am proposing that we should use a simpler alternative methods instead of distributed cache for this problem.
If you enjoyed this post you might also find following interesting.
- Mastering the 4 Balancing Acts in Micro services Architecture
- Multi-tenancy after 10 years of Cloud Computing