Stay up to date: the cache flush queue processor system
Nowadays caching is an immensely effective strategy: it provides a smooth user experience while greatly reducing resource consumption on the backend, which saves the business money. The ad server at GumGum relies heavily on caching to serve ads at the low latency that ad exchanges require. Cached data can go stale when the data in the database is updated. Some of our data can tolerate being outdated for a while, but some of it cannot tolerate long periods of staleness. To help minimize such periods for our ad server, we built a cache flush queue processor system.
What is cache flush queue processor in the first place?
Cache flush queue processor, or CFQP as we like to call it, is a microservice we created to help our ad server maintain the latest configuration data, such as ad delivery goals and ad restrictions. As the name indicates, it pulls cache flush messages from a queue and uses them to refresh the configuration data cache in the ad server. It is essentially a Groovy script executed in a containerized environment. CFQP is the main player of the system; the other components are a queue and a service discovery platform.
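At its core, CFQP is a pull-parse-dispatch loop. Here is a minimal sketch of that loop in Python (the real CFQP is a Groovy script, and the queue client, message format, and `flush` call below are hypothetical stand-ins, not our actual API):

```python
import json

def drain_queue(queue, get_servers):
    """Process all pending cache flush messages once.

    `queue` is a hypothetical queue client; `get_servers` returns the
    current list of ad server instances (in our system, from Consul).
    """
    processed = 0
    while True:
        messages = queue.receive(max_messages=10)   # poll the queue
        if not messages:
            break                                   # nothing left to do
        for msg in messages:
            # Parse the flush instruction, e.g. {"table": "ads", "id": 42}
            instruction = json.loads(msg.body)
            # Fan the instruction out to every ad server instance
            for server in get_servers():
                server.flush(instruction)
            # Acknowledge only after the fan-out succeeded
            queue.delete(msg)
            processed += 1
    return processed
```

In production this would run forever on a polling interval; the single-pass version above just makes the pull → parse → fan-out → acknowledge order explicit.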
How does the CFQP system work?
The CFQP system is depicted in the diagram below. The whole process starts with ad operation specialists, a.k.a. ad ops, making configuration changes. The changes result in data updates in the database and trigger cache flush messages, which are pushed into a queue. A cache flush message is essentially an instruction telling the ad server what data needs to be updated. When CFQP sees messages in the queue, it retrieves them, parses them, and sends more specific instructions to the group of ad server instances whose data needs updating.
How does CFQP know which ad server instances to bother? With the help of Consul. Consul is a service discovery platform we use to keep track of ad server instances, so CFQP simply asks Consul for the list of instances whose cache needs refreshing. Some servers may be overloaded, or experiencing some other glitch, when the cache flush requests arrive, and may fail to respond. In that case, CFQP retries the request up to a certain number of times.
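These two pieces, asking Consul for healthy instances and retrying glitchy ones, can be sketched as follows. The Consul query uses Consul's real `/v1/health/service` HTTP endpoint, but the host, service name, and retry count are illustrative values, and `send` is a hypothetical callable standing in for the actual HTTP request to an ad server:

```python
import json
import urllib.request

def healthy_ad_servers(consul="localhost:8500", service="ad-server"):
    """Ask Consul for ad server instances whose health checks pass."""
    url = f"http://{consul}/v1/health/service/{service}?passing"
    with urllib.request.urlopen(url) as resp:
        entries = json.load(resp)
    return [f'{e["Service"]["Address"]}:{e["Service"]["Port"]}' for e in entries]

def flush_with_retry(send, instruction, max_attempts=3):
    """Send one flush request, retrying a failing instance a few times.

    `send` performs the HTTP request and raises on failure; after
    max_attempts we give up rather than block the rest of the flush.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send(instruction)
        except OSError:
            if attempt == max_attempts:
                raise
```

The `?passing` filter is what keeps dead instances out of the fan-out list without CFQP having to track them itself.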
History before Consul
We did not incorporate Consul into the CFQP system from the beginning. Instead, we wrote a script to keep track of our ad server instances. At a fixed interval, this script invoked the AWS API to get the list of servers currently attached to the load balancer, and cached the result for use when receiving requests from CFQP.
There were two immediate issues with this approach. First, we had to keep calling the AWS API, which is rate limited; in fact, rate limiting errors occurred quite frequently. Second, the cached instance list, even with a small TTL, sometimes went stale, so servers that started running during the interval missed cache flushes. Both issues left some ad server instances with stale data. Even worse, we couldn't improve on both at once because it's a catch-22: if we call the AWS API less often, we have to increase the result cache TTL; if we reduce the TTL, we have to call the AWS API more often. As a result, we received a ton of alert emails. With the help of Consul, the situation has completely changed.
How does Consul work for CFQP?
As described above, the AWS API solution had two significant problems, which gave us many headaches and plenty of alert fatigue. That is why we turned to Consul, which always gives us the latest list of ad server instances that need refreshing. We also don't have to worry about sending Consul too many requests, because we manage the Consul servers ourselves.
So how does Consul keep track of ad server instances for CFQP? Well, maintaining a list of servers involves adding new servers and removing dead ones, and Consul is only proactive about the removal part. Ad server instances need to register themselves with Consul at start-up so Consul has a record of them. This removes both the AWS API calls and the instance cache of the previous solution, so both problems are solved!
Now Consul has a list of servers, but how does it toss out dead ones so the list doesn't grow indefinitely? Our approach was to configure Consul to monitor the health of the servers at a set interval and kick out the unhealthy ones. Note that the health of a server here is not the same as the health of the web application, which the load balancer uses to decide whether to route traffic to it. If we had Consul monitor the health of the application instead, we would risk losing cache flushes while the application is temporarily unhealthy.
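A registration with that kind of server-level check might look like the sketch below. The field names follow Consul's `/v1/agent/service/register` API, but the service name, address, port, and intervals are illustrative, not our real configuration; note the check is a TCP reachability check rather than the application's HTTP health endpoint, for the reason just described:

```python
import json

def registration_payload(instance_id, host, port):
    """Build a Consul service registration for an ad server instance.

    The TCP check verifies the server is reachable, not that the web
    application is healthy, so a briefly unhealthy application does not
    drop the instance from Consul and cost it cache flushes.
    """
    return {
        "Name": "ad-server",
        "ID": instance_id,
        "Address": host,
        "Port": port,
        "Check": {
            "TCP": f"{host}:{port}",
            "Interval": "10s",
            # Clean up instances that stay unreachable for a while
            "DeregisterCriticalServiceAfter": "1m",
        },
    }

# The instance would PUT this JSON to its local Consul agent at
# http://localhost:8500/v1/agent/service/register
payload = registration_payload("ad-server-i-0abc123", "10.0.0.5", 8080)
print(json.dumps(payload, indent=2))
```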
The time at which an ad server instance registers itself with Consul actually matters. The registration should happen before the application starts up. When the application bootstraps, it loads all the data from the database. If the server registers itself during or after the application bootstrap, it may miss some cache flushes. Suppose at time t1 all active ads are loaded; at time t2 a new ad called WasabiTaco (you can tell I'm a foodie and have a weird appetite :-D) is enabled; and at time t3 the server registers itself. Because Consul has no record of the server when CFQP asks it for servers that should add the WasabiTaco ad, the server ends up without the ad in its cache, and we won't know until problems happen, which is a pain in the butt. Therefore, the registration has to happen before the application starts up.
A little note on the queue
The queue that CFQP polls for cache flush instructions is an unordered queue as opposed to a first-in, first-out (FIFO) queue. This works because the cache flush instructions are order insensitive. If the same data is modified in the database twice in quick succession, two cache flushes are triggered, but they carry exactly the same instruction, which looks something like: go retrieve data D in table T. Which cache flush message is processed first therefore does not matter, because the latest data is always pulled. Since order doesn't matter, both a FIFO queue and an unordered queue would work, but a FIFO queue would incur a performance penalty, so we went with an unordered queue.
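The order insensitivity comes from the fact that a flush re-reads the current database value instead of carrying the value itself. A tiny sketch (the message shape and table layout are made up for illustration):

```python
def apply_flush(db, cache, instruction):
    """Apply one cache flush: re-read the named row from the database.

    Because every flush re-reads the *current* value, two identical
    flushes for the same row converge to the same cache state in
    either processing order.
    """
    table, key = instruction["table"], instruction["id"]
    cache[(table, key)] = db[table][key]

db = {"ads": {42: "budget=200"}}   # latest state after two back-to-back edits
m1 = {"table": "ads", "id": 42}    # flush triggered by the first edit
m2 = {"table": "ads", "id": 42}    # flush from the second edit: identical

for order in ([m1, m2], [m2, m1]):
    cache = {}
    for msg in order:
        apply_flush(db, cache, msg)
    print(cache[("ads", 42)])      # "budget=200" in both orders
```

If the messages carried the values themselves, ordering would suddenly matter, which is exactly the trade-off revisited in the platform 2.0 plans at the end of this post.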
Issues with the system
The CFQP system is by no means a perfect system. There are some issues we have encountered.
First, there is more than one CFQP container, and all of them poll the same queue. That means if two or more containers process cache flush messages that touch the same data structure in the application, there is a potential race condition that can corrupt caches in the ad server. Luckily this didn't happen often, and we added proper synchronization as a fix.
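On the ad server side, that synchronization amounts to serializing updates to the shared in-memory structure. A minimal sketch with a lock (the cache shape and worker count are illustrative; the source doesn't specify the exact mechanism used):

```python
import threading

cache = {}
lock = threading.Lock()   # serializes updates to the shared in-memory cache

def apply_flush(key, value):
    # Without the lock, two workers rebuilding the same structure could
    # interleave their read-modify-write and leave it half-updated.
    with lock:
        current = cache.get(key, [])
        cache[key] = current + [value]

# Eight concurrent flush handlers hitting the same cache entry
threads = [threading.Thread(target=apply_flush, args=("ads", i)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(cache["ads"]))   # [0, 1, 2, 3, 4, 5, 6, 7] -- no lost updates
```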
Another issue arises when many cache flushes, especially big ones such as reading a whole table with millions of rows, happen at the same time: database connections and CPU utilization spike. When this happens, new ad server instances take longer to launch or even error out. Again, this doesn't happen often because, despite the CFQP system being critical, its request volume is low, since it's our ad operation specialists who initiate the cache flushes. But just to be safe, we try to avoid big flushes (refreshing only what is needed instead of everything) and create appropriate indexes on big database tables.
The cache flush queue processor (CFQP) system is a small yet critical system that ensures our ad server has the most up-to-date data. The system involves a containerized CFQP script, a queue that delivers cache flush instructions to CFQP, and Consul, which keeps track of the ad server instances and provides the list to CFQP on demand so their cached data can be updated.
As a next step, we’re looking to improve CFQP for our ad server platform 2.0, an initiative to modernize and enhance the entire ad server system (which obviously includes the CFQP system). One of the changes we are going to make is that instead of sending instructions telling ad server instances to query the database, we will send the updated data to them directly. This will greatly reduce the number of concurrent database connections. Of course there will be challenges, such as what to do with complex cache flush instructions that involve more than data retrieval and a simple update, but I am confident that we will figure them out.