Apollo South

Saar Yahalom
Remitly Israel (formerly Rewire)
Jul 18, 2018 · 6 min read

Apollo is a nice framework for consuming GraphQL in the browser, but it is still relatively young and a bit rough around the edges. When we tried to use it in a system with a few hundred updates per minute, things started to go south. This article covers the current problems with Apollo's design and how we addressed them in order to reach a working solution.

A bit of a background

Our system handles thousands of international transactions per day, where each transaction is built from a set of microtransactions. Our operations team uses a homegrown back office system that monitors and controls the different aspects of these transactions, from handling outliers to fixing simple wrong details.

The back office is built on React, and recently we started porting our REST backend to a GraphQL one. The port served mainly as a research project that allowed us to experiment with the technology in-house before committing.

System outline

This is a simplified outline of the back office system: a simple GraphQL endpoint that interacts directly with a database and a live Pub/Sub system. The Apollo client running in the browser sends the requested queries and subscriptions.

As mentioned before, we are talking about a 5–10 QPS rate, where most of the traffic comes in from the subscriptions.

In the beginning

Everything went smoothly when we tested the system in our sandbox environment. We basically replaced our REST + Socket.IO architecture with Apollo. The update was relatively trivial, and we enjoyed the additional benefits of having an explicit schema, having the client define the requested data, and sending over the wire exactly what was needed. We replaced our custom cache implementation with Apollo's built-in InMemoryCache and persistence support.
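For reference, here is a minimal sketch of this kind of setup using the Apollo 2.x era packages (apollo-client, apollo-cache-inmemory, apollo-cache-persist, and the link packages). The endpoint URLs are placeholders, and this is an illustration rather than our actual code:

```typescript
import { ApolloClient } from 'apollo-client';
import { InMemoryCache } from 'apollo-cache-inmemory';
import { HttpLink } from 'apollo-link-http';
import { WebSocketLink } from 'apollo-link-ws';
import { split } from 'apollo-link';
import { getMainDefinition } from 'apollo-utilities';
import { persistCache } from 'apollo-cache-persist';

const cache = new InMemoryCache();

// Persist the normalized store to browser storage (localStorage by default).
// persistCache returns a promise; a real app would await it before rendering.
persistCache({ cache, storage: window.localStorage as any });

// Route subscriptions over WebSocket and queries/mutations over HTTP.
const wsLink = new WebSocketLink({
  uri: 'wss://example.internal/graphql', // placeholder endpoint
  options: { reconnect: true },
});
const httpLink = new HttpLink({ uri: 'https://example.internal/graphql' });

const link = split(
  ({ query }) => {
    const def = getMainDefinition(query);
    return def.kind === 'OperationDefinition' && def.operation === 'subscription';
  },
  wsLink,
  httpLink
);

export const client = new ApolloClient({ link, cache });
```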

CPU usage was the same and overall memory usage was comparable.

Going live and … south

Things started to misbehave pretty much from the get-go. Initially, with a fresh session, everything worked as expected, but after a few minutes the UI would freeze. When the app was opened in multiple tabs, the cache updates suffered from race conditions between the tabs, which in some cases also caused data loss and unnecessary querying.

What’s going on? What are we doing wrong?

Looking under the hood

Let’s first categorize our problems:

  • UI Freeze / High CPU — Hints at a long synchronous operation or very high congestion on the event loop.
  • High memory — Hints that too much data is being held directly in main memory or is not being released properly.
  • Data loss and race conditions — Hints at classic multi-threaded / multi-process problems.
  • Out of space — Hints that the browser's allotted storage space has been completely consumed, and similar errors.
  • Redundant work — A second-tier problem: the lack of communication across thread and process boundaries forces each thread/process to fetch and process the same data.

UI Freeze / High CPU

We started off by profiling our app and searching for what was clogging the CPU. We quickly came up with a few suspects:

  • writeResultToStore - Normalizes the query data and stores it in memory. Synchronous.
  • persist - Stores the cache data in browser storage (defaults to localStorage). Asynchronous in principle, but made effectively synchronous by the use of localStorage.

So basically we were making a lot of updates to the Apollo store and requiring it to serialize its state on every update in order to persist it. This was simple enough to fix: we moved to persisting the data at 10-minute intervals and switched the localStorage option to localForage, so storing the serialized store into IndexedDB becomes a truly asynchronous operation.
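A minimal sketch of this approach, assuming the CachePersistor API from apollo-cache-persist and localForage; the interval and option values here are illustrative:

```typescript
import { InMemoryCache } from 'apollo-cache-inmemory';
import { CachePersistor } from 'apollo-cache-persist';
import localForage from 'localforage';

const cache = new InMemoryCache();

// CachePersistor gives manual control over when the store is written out.
const persistor = new CachePersistor({
  cache,
  storage: localForage as any, // localForage writes to IndexedDB asynchronously
  trigger: false,              // do not persist on every cache write
});

// Flush the store roughly every ten minutes instead of on each update.
setInterval(() => {
  persistor.persist();
}, 10 * 60 * 1000);
```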

The UI freeze problem was gone. The CPU was still higher than we would like, but manageable.

We thought about making an async version of writeResultToStore, but Apollo's design is synchronous all the way down through the different layers, until the data is actually normalized and saved. This is a classic design problem, and it is hard to address because it requires massive refactoring to allow an async alternative.

For example, consider performing the normalization in a web worker. You could then enjoy a parallel normalization process that is more efficient and does not affect the main event loop. It is impossible to model this with the current design.

The best you can hope for is to move the complete Apollo client into a web worker and remove its operations from the main thread.
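A rough sketch of what that could look like; this is an assumption about how one might wire it, not something we shipped, and the endpoint and message shape are made up:

```typescript
// worker.ts — the Apollo client lives entirely inside a web worker,
// so query parsing, normalization, and cache writes stay off the main thread.
import { ApolloClient } from 'apollo-client';
import { InMemoryCache } from 'apollo-cache-inmemory';
import { HttpLink } from 'apollo-link-http';
import gql from 'graphql-tag';

const client = new ApolloClient({
  link: new HttpLink({ uri: 'https://example.internal/graphql' }), // placeholder endpoint
  cache: new InMemoryCache(),
});

// The main thread posts { id, query, variables } and receives { id, data } or { id, error }.
self.onmessage = async (event: MessageEvent) => {
  const { id, query, variables } = event.data;
  try {
    const result = await client.query({ query: gql(query), variables });
    (self as any).postMessage({ id, data: result.data });
  } catch (error) {
    (self as any).postMessage({ id, error: String(error) });
  }
};
```

The UI thread would then talk to the worker with postMessage and receive plain, already-fetched data; the trade-off is that every result is copied across the worker boundary.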

High Memory

Our web app works with about two weeks' worth of data, which in our case amounts to about 30K objects consuming ~90MB. That is not a real problem for a desktop browser, and not even for a strong smartphone. Note that we are not displaying all of these objects on the screen at once; we manipulate them in different ways to show active tickets, the necessary user info, and transaction states.

Our memory consumption more than doubled. We did not delve fully into this issue because we had bigger problems to solve, but the main direction we planned to research was the normalization process Apollo performs to represent the data internally. Normalization shards objects by type and then reassembles them into full objects, so you essentially have the same objects stored twice in memory, in two different representations.
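To illustrate the two representations, here is a simplified, hypothetical example of a query result next to roughly what InMemoryCache keeps internally (the field names and the internal shape are approximations):

```typescript
// The raw result, as it arrives over the wire and as the UI consumes it:
const queryResult = {
  transaction: {
    __typename: 'Transaction',
    id: 'tx-1',
    status: 'PENDING',
    user: { __typename: 'User', id: 'u-7', name: 'Dana' },
  },
};

// Roughly what InMemoryCache keeps internally: objects sharded by type and id,
// with references between them. Full objects are reassembled on every read.
const normalizedStore = {
  'Transaction:tx-1': {
    __typename: 'Transaction',
    id: 'tx-1',
    status: 'PENDING',
    user: { type: 'id', id: 'User:u-7', generated: false },
  },
  'User:u-7': { __typename: 'User', id: 'u-7', name: 'Dana' },
};
```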

Data loss and race conditions

When opened in multiple tabs, the website effectively runs in separate processes that share the browser's caching mechanisms, such as localStorage and IndexedDB.

If you remember, we want to persist the in-memory store every ten minutes or so, but the in-memory store is not synced in any way between the tabs. In fact, because the different tabs run on disconnected timelines, their respective stores can end up in very different states. In essence, we had multiple sources of truth.

For example, if an operator queried a user in one tab and looked at the transaction monitor in another tab, we now have two stores with different data: the first store only has the user data and the second only has the transactions data. When the store data is persisted, it effectively runs over the last version that was saved, and only one of these versions wins. We do not get the benefit of caching the data from both stores. Things get even more complicated when similar data from different tabs gets cached, making it appear as if data was lost in the process.

This is actually a big problem. When you design a framework, you have to take into consideration the different ways your customers will use it, and opening a website in multiple tabs is not an edge case. We expected to find some kind of synchronization/locking mechanism in Apollo. Unfortunately, we did not find any.

Usually, in a data-centric system, you want a single source of truth: a place you can rely on to tell you the exact state of the data. In most systems, the truth resides in a database. When working with multiple processes, you have to establish a local source of truth you can rely on, so that when things are unclear or go bad you can throw away the current version and fall back to the last known local truth.

Our initial solution was to use TabElect to elect a leader among the open tabs and have only that leader persist the data. This helped keep the persisted data consistent and pick a single truth. In addition, we added a few extra steps to make sure that every tab issues the most common queries, so the most used data is cached.
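A hypothetical sketch of the idea (this is not TabElect's actual API, and it ignores the small write race between tabs): a heartbeat lease in localStorage decides which tab is the leader, and only the leader flushes the store.

```typescript
import { InMemoryCache } from 'apollo-cache-inmemory';
import { CachePersistor } from 'apollo-cache-persist';
import localForage from 'localforage';

// Persistor as in the earlier sketch.
const persistor = new CachePersistor({
  cache: new InMemoryCache(),
  storage: localForage as any,
  trigger: false,
});

const TAB_ID = Math.random().toString(36).slice(2); // random id for this tab
const LEADER_KEY = 'apollo-persist-leader';
const LEASE_MS = 15000; // how long a leadership lease stays valid

// Renew or take over the leadership lease; returns true if this tab is the leader.
function heartbeat(): boolean {
  const raw = localStorage.getItem(LEADER_KEY);
  const now = Date.now();
  if (raw) {
    const { id, ts } = JSON.parse(raw);
    if (id !== TAB_ID && now - ts < LEASE_MS) return false; // another tab holds a fresh lease
  }
  localStorage.setItem(LEADER_KEY, JSON.stringify({ id: TAB_ID, ts: now }));
  return true;
}

let isLeader = false;
setInterval(() => { isLeader = heartbeat(); }, LEASE_MS / 3); // keep the lease fresh

// Only the leader persists, so tabs no longer overwrite each other's snapshots.
setInterval(() => {
  if (isLeader) persistor.persist();
}, 10 * 60 * 1000);
```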

This helped mitigate most of the problems but introduced a whole set of redundant queries that we had to address as well.

Out of space

As it turns out, the Apollo store is serialized into a single JSON document and stored as one huge string. With our data, that string got quite big.

How big? Bigger than the limit on a single (key, value) pair in localStorage and, to our surprise, bigger than a single value that can be stored in an IndexedDB field.

Now, this was a show stopper. If we cannot cache the data, we are far worse off than with our previous custom design, which managed to cache the data without a problem.

This single problem actually required us to stop using Apollo as a caching mechanism and to use it only as a smart network fetcher. Basically, this means we have to write a robust custom caching mechanism around Apollo.
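One way around the single-value limit, sketched here under the assumption that you persist manually: split the serialized store across several IndexedDB records. The key names and chunk size are made up; cache.extract() and cache.restore() are the InMemoryCache snapshot APIs.

```typescript
import localForage from 'localforage';
import { InMemoryCache } from 'apollo-cache-inmemory';

const CHUNK_SIZE = 1 * 1024 * 1024; // 1 MB per record; an assumed, tunable value

// Serialize the store and write it out as fixed-size chunks plus a chunk count.
export async function saveStore(cache: InMemoryCache): Promise<void> {
  const serialized = JSON.stringify(cache.extract());
  const chunkCount = Math.ceil(serialized.length / CHUNK_SIZE);
  for (let i = 0; i < chunkCount; i++) {
    await localForage.setItem(
      `apollo-store-${i}`,
      serialized.slice(i * CHUNK_SIZE, (i + 1) * CHUNK_SIZE)
    );
  }
  await localForage.setItem('apollo-store-chunks', chunkCount);
}

// Reassemble the chunks and load them back into the cache.
export async function restoreStore(cache: InMemoryCache): Promise<void> {
  const chunkCount = (await localForage.getItem<number>('apollo-store-chunks')) || 0;
  let serialized = '';
  for (let i = 0; i < chunkCount; i++) {
    serialized += (await localForage.getItem<string>(`apollo-store-${i}`)) || '';
  }
  if (serialized) cache.restore(JSON.parse(serialized));
}
```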

Redundant work

Multiple tabs also introduced a lot of redundant work. This is not directly a problem with Apollo, but rather a problem of running multiple symmetric processes: each process needs to go through the same actions in order to display the data. Apollo increased the severity of this set of problems by consuming more CPU than our original version and by causing more redundant queries to be sent out.
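One possible direction for reducing the duplicate fetches, sketched here as an assumption rather than something we built: have the tab that fetched a result broadcast it, so other tabs can write it into their own store instead of re-querying. The channel name and message shape are made up.

```typescript
import { InMemoryCache } from 'apollo-cache-inmemory';
import gql from 'graphql-tag';

const channel = new BroadcastChannel('apollo-shared-results');

// Called by the tab that actually hit the network.
export function shareResult(query: string, variables: object, data: unknown): void {
  channel.postMessage({ query, variables, data });
}

// Other tabs write the broadcast result into their own cache; Apollo normalizes it as usual.
export function listenForSharedResults(cache: InMemoryCache): void {
  channel.onmessage = (event) => {
    const { query, variables, data } = event.data;
    cache.writeQuery({ query: gql(query), variables, data });
  };
}
```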
