Making Our Backend Resilient

Using Flask-Caching and k6

Saif Uddin Mahmud
The Klinify Blog
11 min read · Sep 16, 2021


The Battle Of Refreshes

I knew something was wrong in production as soon as my phone started buzzing like crazy at 10:55 AM. The AppGateway had missed a few HealthChecks and thought our AKS Backend was down. We’d seen this a few times but never figured out the root cause. The outage would last for a few seconds and recover automagically. We designed Klinify’s app with offline capabilities — to gracefully handle network snafus — so this was usually not a big deal. Usually…

AppGateway took 9 minutes to recover that day. The stack would synchronize to a consistent state, but we hadn’t accounted for client behavior. Over the past few months, we had unintentionally conditioned the users to refresh whenever the offline banner appeared. And that’s exactly what all our users did at 11:04 AM.

The Frontend made a few heavy API calls when the user first logged in — or when they hit refresh. These endpoints pulled a lot of data from CouchDB (and Postgres), and they became our bottleneck when the number of requests spiked.

Services would return a 5xx if the database took too long to respond, and the Frontend would retry later. Since the database couldn’t tell that the result was no longer needed, it kept chugging away. Contention for resources increased and responses slowed down as more requests piled on. That made users impatient, so they refreshed even more. A positive feedback loop created the perfect storm.

The next 90 minutes was an adrenaline-filled, all-hands-on-deck fight to extinguish the fire. It was the first major incident since I joined the team. It was not a fun day.

The Aftermath

The postmortem found shortcomings in our Disaster Recovery Playbook and our Stack. We spawned multiple initiatives to minimize the risk (and impact) of disruptions. These included:

  1. Projects for better Observability during crises.
  2. Plans for regular Disaster Drills.
  3. Scalability improvements in the Frontend/Backend/Infra.

I was in charge of Backend scalability. The rest of the article is a summary of how I achieved that over the next month.

Poking Around

The first order of business was to analyze the logs and understand exactly what had happened. We easily found some areas for incremental improvement — but that wasn’t enough. We needed a big-ticket item to exploit the Pareto Principle.

Luckily, we had a testbed ready. We had been dealing with long load times before the incident, and we knew that the largest clinics strained our stack. We had prepared a dummy clinic with a large dataset a week earlier, hoping to reproduce the problems in the sandbox.

Analyzing the system dynamics after spamming refresh pointed to three egregious endpoints. Two of them pulled a lot of data at startup, and the third polled every 2 minutes. All three spent significant time waiting for CouchDB views to respond, and none of the underlying data changed often. All the engineers screamed “CACHE!”

Caching CouchDB Views

So how do we cache CouchDB views? Simple, in theory. We knew which CouchDB views were the main culprits. All we had to do was maintain a {View+Etag: response} map somewhere and serve from it when the ETag was unchanged. Unfortunately, CouchDB seems to have dropped support for ETags on views.

My colleague, Adnaan, dove into the docs and figured out that we could ping a view with update_seq:true, limit:0, and stable:true to get a sequence. This sequence would only change if the result of the view changed! The ping took a fraction of a second, so it was feasible.

Caveat: You might still see the update_seq change without any change in the data. How often that happens depends on how many nodes you have and how the cluster is set up.
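To make that concrete, here’s a minimal sketch of such a ping using the requests library. The host, database, design document, and view names are placeholders, not our actual schema:

import requests

# Hypothetical view URL — substitute your own host, database, design doc, and view.
VIEW_URL = "http://localhost:5984/clinic_db/_design/records/_view/by_patient"

def fetch_view_update_seq(view_url=VIEW_URL):
    # limit=0 skips the rows entirely; update_seq=true asks CouchDB to include
    # the view's current update sequence; stable=true keeps the answer pinned
    # to a stable set of shard replicas.
    resp = requests.get(
        view_url,
        params={"update_seq": "true", "limit": 0, "stable": "true"},
        timeout=5,
    )
    resp.raise_for_status()
    # This sequence only moves when the view's underlying data changes,
    # so it can stand in for the missing ETag in a cache key.
    return resp.json()["update_seq"]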

Flask-Caching

Having encountered subtle caching bugs that were very hard to track down, we were wary of such a solution. You’d want to:

  1. Control what you cache. This is to deter Devs from writing non-performant code and using the cache as a crutch.
  2. Control the activation/deactivation of the cache in production. This may be due to bugs or outages on different parts of the stack. This is especially valuable during the first few days in production.
  3. Make cache entries produced by buggy code go stale immediately after a new rollout. This has to be localized to the appropriate granularity (i.e. code that doesn’t touch the cache shouldn’t invalidate entries when changed).
  4. Ensure the keys did not collide. A collision would be catastrophic, since one clinic’s data would leak into another’s!
  5. Ensure the Frontend could hint that it wanted recomputed results — without the Backend honouring that hint unconditionally, because that would create a DDoS risk.
  6. Ensure the cache-miss penalty isn’t too high.

I wasn’t going to write a lot of clever code and sacrifice maintainability, only to have data-consistency bugs come back to haunt me. I held out for a week. Then I found Flask-Caching.

It was simple to use and an MVP took a few lines of code (and an extra cup of coffee…okay, maybe two). I carefully documented the architectural changes, got rid of the rough edges in code, and had a production-ready version soon.

The end result was something like this:

@cache.cached(
    make_cache_key=cache_key_constructor,
    unless=cache_filter,
    source_check=True,
    response_filter=response_filter
)
def function_i_want_to_cache(*args):
    ...
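For context, the cache object above is a Flask-Caching Cache instance. A minimal setup might look like the following — the Redis URL and timeout are placeholders, and we later swapped the stock Redis backend for our own wrapper (more on that below):

from flask import Flask
from flask_caching import Cache

app = Flask(__name__)

# Placeholder configuration — point CACHE_REDIS_URL at your own Redis instance.
cache = Cache(app, config={
    "CACHE_TYPE": "RedisCache",           # Flask-Caching's built-in Redis backend
    "CACHE_REDIS_URL": "redis://localhost:6379/0",
    "CACHE_DEFAULT_TIMEOUT": 300,         # seconds
})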

The workflow is simple (and mostly handled by the library):

  1. When a request comes in, check the cache only if it passes cache_filter(). This lets us define the criteria on *args and the pod's environment. The Frontend can ask for non-cached responses via *args if needed, and we can easily enable or disable the caching layer using a ConfigMap entry.
  2. We use cache_key_constructor to... construct a key. We set source_check to ensure relevant code changes would make the cache entry stale.
  3. Check Redis to see if the key exists.
  4. If true, send the value back to the Frontend.
  5. Else, get a response from CouchDB, store it in Redis, and send the response back to the client. The cache-miss penalty was well under 100ms in our infrastructure.

Note: You lose the ability to stream the result back immediately.
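To make the moving parts concrete, here is a hedged sketch of what those three helpers might look like. The names, the X-Clinic-Id header, and the environment flag are illustrative rather than Klinify's actual code; fetch_view_update_seq is the cheap CouchDB ping sketched earlier.

import os
from flask import request

def cache_key_constructor(*args, **kwargs):
    # Bind the key to the endpoint, the clinic, and the view's current update
    # sequence: entries can never collide across clinics, and they go stale
    # as soon as the underlying data changes.
    clinic_id = request.headers.get("X-Clinic-Id", "unknown")  # illustrative
    update_seq = fetch_view_update_seq()  # the ping from the earlier sketch
    return f"{request.path}:{clinic_id}:{update_seq}"

def cache_filter():
    # Returning True tells Flask-Caching to bypass the cache for this request.
    if os.environ.get("CACHING_DISABLED") == "1":       # flipped via ConfigMap
        return True
    return request.args.get("skip_cache") == "true"     # Frontend hint

def response_filter(response):
    # Only successful responses are worth keeping.
    return getattr(response, "status_code", 200) == 200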

The rest of the custom code abstracted the caching layer away from the Devs. They could import KlinifyRedisCache and not worry about things like setup. But that wasn't the only reason...

Tracking Performance

Flask-Caching comes with several classes for initializing a cache — FileSystemCache, RedisCache, MemcachedCache, NullCache, and so on. As you’ve already guessed, KlinifyRedisCache is a wrapper around RedisCache. Now, how did we know our cache was useful? We added a few lines of code so hits and misses were visible in the logs. The "key" metadata would also help us gather statistics about the nature of hits and misses.

from flask_caching.backends import RedisCache  # base class shipped with Flask-Caching

class KlinifyRedisCache(RedisCache):
    def get(self, key):
        result = super(KlinifyRedisCache, self).get(key)
        if result is None:
            LOGGER("info", {"message": "Cache Miss", "key": key})
        else:
            LOGGER("info", {"message": "Cache Hit", "key": key})
        return result

Moment of Truth

After the test cluster ran without anomalies for some time, we went live (using a canary just to be safe). The hit-to-miss ratio was wonderful, and we were already seeing better Backend performance due to reduced load on CouchDB. App load times for big clinics were slashed by 90% on a hot cache. We saw the biggest gains when a clinic opened up all their computers at the same time — a common scenario in the morning and after lunch.

We tuned the timeout to be less conservative to get a higher hit rate. We had to ensure that Redis was not overflowing during the worst-case spikes. So we pulled out our napkins, did some Maths (see what I did there? ;), and provisioned extra capacity. A healthy system could absorb spikes much more effectively now. An unavailable/slow system, at its worst, would only degrade performance to previous levels. Win!
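The napkin math was roughly of this shape — every number below is illustrative, none of them are Klinify's:

# Illustrative capacity estimate only — none of these figures are real.
clinics = 300                # assumed number of active clinics
cached_endpoints = 3         # endpoints behind the cache
avg_response_kb = 500        # assumed average size of a cached response
overhead = 1.5               # assumed serialization + key/metadata overhead

worst_case_mb = clinics * cached_endpoints * avg_response_kb * overhead / 1024
print(f"Worst-case cache footprint ≈ {worst_case_mb:.0f} MB")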

If you would like higher hit rates, you could modify this approach to refresh the key expiry on hits (like an LRU scheme). You'd need some custom code in get() — a rough sketch follows.
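Here’s a standalone illustration of that idea using redis-py directly, rather than Flask-Caching internals; the connection details and TTL are placeholders:

import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder connection

def get_and_touch(key, ttl_seconds=600):
    # On a hit, re-arm the key's expiry so frequently read entries stay warm,
    # giving roughly LRU-like behaviour without a custom eviction policy.
    value = r.get(key)
    if value is not None:
        r.expire(key, ttl_seconds)
    return value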

A word of caution here: make sure overflows behave the way you want them to.

  1. In some cases, overflows are fine because they would only degrade performance for some users. In others (depending on your eviction policy), a small overflow may degrade performance for everyone and miss the point completely!
  2. If you’re using your Redis for multiple purposes and one of the workflows causes an overflow, it might affect other services too. Make sure you segregate your mission-critical datastores from your good-to-have caches — spend a little extra money and spin up separate instances if need be.
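A quick way to check what you’d actually get at the memory limit (connection details are placeholders):

import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder connection

# Inspect how this instance behaves when it hits its memory limit, before a
# traffic spike finds out for you.
print(r.config_get("maxmemory"))          # 0 means no limit is set
print(r.config_get("maxmemory-policy"))   # e.g. noeviction, allkeys-lru, volatile-ttl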

Don’t Forget the Problem Statement

The dopamine hit of an icky release going well was very welcome after weeks of struggle. But the work was only half done. We had to prove that this solved the original problem statement: get better spike protection on the Backend. We had to simulate a surge.

Our testing Infra was nicely proportioned to X% of production limits. A crude way of measuring the outcome of the changes (without causing a production outage, of course) was to simulate an X% spike here.

We’d heard of Chaos Engineering, but we weren’t familiar with the tooling. In any case, we wanted the current set of improvements in production before we started experimenting.

Simulate A Surge

First, we had to set up Y dummy clinics. You can check out “Aside 1” at the end of the article if you’re interested in that. Second, we had to hit multiple clinics from multiple agents. We thought Selenium was the answer. The following made that infeasible:

  1. My laptop didn’t have enough RAM.
  2. Requests were coming from the same computer, so it wasn’t really a distributed test.
  3. We needed to write a lot of code to make it distributed and gather the necessary metrics.

While procrastinating for a couple of days, I came across https://k6.io/ and tried it out for fun. I recorded a user session using their browser plugin and was able to ramp up to a stress test — with hundreds of virtual users — in MINUTES! Needless to say, we subscribed to their premium offering.

Over the next few days, we parametrized the k6 scripts and imitated the Frontend’s behavior using parallel requests and barriers. When we hit limits, we reached out to the k6 support team; they were very helpful and pointed us to workarounds.

k6 allowed us to iterate fast. We could roll out a small change, run a stress test with preconfigured scenarios, and have hard figures measuring the difference the modification made — all within 10 minutes. It was granular too — we could see how much time each endpoint took, which ones errored out, and so on. k6 also plots key metrics over time as a nice summary. Here’s an example screenshot:

The experiments showed that the caching improvements had indeed improved spike absorption by 80%. But that wasn’t enough to meet our internal goals. Over the next week, we followed up on some of the incremental improvements we’d discovered earlier. We closed the project after they took us to 95%. You can check “Aside 2” for a summary of all the little gotchas.

Takeaways

I hope this article served as a simple demonstration of Flask-Caching in practice. I hope it also convinced you that measuring the effectiveness of your solution with stress-testing tools like k6 helps you iterate quickly and focus your attention.

There are a few other points I’d like to highlight:

  1. Persistence works. On most days, I was working hard without finding any cracks I could hack through. The insights came in bursts, and the bulk of the work took 3 days; the rest was grit. Stick with it and you’ll get it done.
  2. Procrastination works. We joke about how “laziness is a prerequisite” for joining Klinify, but taking your mind off things and goofing around a bit works wonders for creativity. I discovered 2 terrific-but-simple tools while I was “exploring the space” instead of writing code. Writing thousands of LoC to “get shit done” may sound glorious, but it’s not always the smart choice.
  3. It’s important to tackle the problems with the largest ROI first. However, as “Aside 2” will show you, lots of small changes can snowball too.

Aside 1: Replicating Y Clinics

This was the most frustrating part of the project, and the bit I learned the most from.

Here’s what I was trying to do. Since we had a dummy clinic with a lot of data, I’d use CouchDB’s replication API to set up clones. Sounds easy: write and run a few scripts for the replication and necessary infrastructure tweaks (adding DNS records, for example). But it wasn’t:

  1. Replicating Y clinics took a very, very, very, very long time.
  2. The CouchDB nodes kept going down for some reason. My colleague, Lip, showed me around some CouchDB admin stuff and let me figure out the rest (even though he knew the answer). I used journalctl for clues in the logs. Apparently, I was overwhelming the nodes by poking too many not-yet-generated views too fast. This took disk usage to 100% and crashed the CouchDB service. I hadn’t given it enough time for view compaction, AND we had global-changes enabled unnecessarily. Thankfully, global-changes is no longer enabled by default. If you’re on an older version of CouchDB and don’t use global-changes, disable it to save a lot of space.

The wonderful thing about CouchDB is that I could du -a /var/lib/couchdb | sort -n the files, and remove a few large files to get the node up again. It would resync with the other nodes automatically! Having made the mistake 3 times, I paced the replication, poked the views, and triggered view compaction on the nodes. I cobbled together some scripts and babysat the process for 2 days.
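For the curious, the cobbled-together scripts boiled down to something like this — a paced sketch against CouchDB's HTTP API, with placeholder credentials, database, design-document, and view names:

import time
import requests

COUCH = "http://admin:password@localhost:5984"   # placeholder cluster address

def clone_clinic(source_db, target_db):
    # Kick off a replication by writing a document into the _replicator database.
    requests.post(f"{COUCH}/_replicator", json={
        "source": f"{COUCH}/{source_db}",
        "target": f"{COUCH}/{target_db}",
        "create_target": True,
    }).raise_for_status()

    # Pace yourself: give the cluster breathing room instead of hammering every
    # clone at once. (In practice, poll the replication status before moving on.)
    time.sleep(60)

    # Warm one view at a time, then trigger view compaction for its design doc
    # so disk usage doesn't run away.
    requests.get(
        f"{COUCH}/{target_db}/_design/records/_view/by_patient",
        params={"limit": 0},
    ).raise_for_status()
    requests.post(
        f"{COUCH}/{target_db}/_compact/records",
        headers={"Content-Type": "application/json"},
    ).raise_for_status()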

Aside 2: Secondary Improvements

Here’s a summary of gotchas that were fun to dig up and satisfying to fix!

CouchDB

  1. You should know the limits of your systems and understand behaviour near those limits. It’ll help you avoid fires and hours of stress. We followed the recommendations at https://docs.couchdb.org/en/latest/maintenance/performance.html.
  2. If your timeouts aren’t coherent, you’ll be left scratching your head. And don’t forget your fuzz factor! Your upstream timeouts should always be higher than your downstream timeouts. We changed the http-timeout in our uwsgi configs to match the AppGateway timeout we had configured.
  3. You have to hit the sweet spot in granularity. We parallelized a few big Database requests by breaking them down into smaller chunks.
  4. Ask whether small things are building up pressure on your system. We increased our changes-feed longpoll timeout significantly after analyzing the logs to see how often it expired without new changes. Fewer requests to the database meant less load, and as a nice side effect it also cut our log volume significantly — making log analysis easier and saving us some $$$ too (see the sketch after this list).
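As an illustration of point 4, this is roughly what a longer longpoll looks like against CouchDB’s _changes endpoint. The host, database, and values are placeholders, and the server may cap long timeouts unless its configuration is raised too:

import requests

resp = requests.get(
    "http://localhost:5984/clinic_db/_changes",
    params={"feed": "longpoll", "since": "now", "timeout": 300000},  # milliseconds
    timeout=320,  # keep the client-side timeout above the server-side one
)
changes = resp.json()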

Backend

  1. Is your stack stuck in the past? We moved to HTTP/2. This allowed multiplexing and avoided the max-connections limit issue that our users faced when opening many tabs.
  2. Do you fight your tools during the worst times? We set up a comprehensive set of dashboards to monitor the Backend system dynamics. This enhanced monitoring capability helps us during normal operations, crises, and rollouts.

Aside 3: From a Frontend Engineer’s Perspective

  1. We could cache results in the Frontend to reduce the number of Backend calls made. The Frontend could track the changes-feed for clues of a stale view and make an API call when needed.
  2. We could tweak the backoff strategy of retries for heavier endpoints. It’s important to make sure your retry strategy doesn’t make the problem worse — which can happen if retries aren’t spaced out properly or if they go on ad infinitum (see the sketch after this list).
  3. We tweaked UX issues in the Frontend to reduce the perceived app load time, and thus the temptation to spam refresh.
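A minimal sketch of the retry shape point 2 argues for — exponential backoff with full jitter and a hard cap on attempts (shown in Python for brevity, even though the real logic would live in the Frontend):

import random
import time

def retry_with_backoff(call, max_attempts=5, base_delay=1.0, cap=60.0):
    # Exponential backoff with full jitter: spreads retries out so a fleet of
    # clients doesn't hammer a struggling Backend in lockstep, and gives up
    # after a bounded number of attempts instead of retrying forever.
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(cap, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))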

Story originally published at: https://medium.com/dabbler-in-de-stress/making-our-backend-resilient-37fbc115bad
