The Curious Case of the Missing Cell

Shrey Banga
The Airtable Engineering Blog
13 min read · May 13, 2022


We recently rolled out a performance optimization we call View Projection that significantly improves load times for large tables. Here’s a graph showing how the load time percentiles for our largest bases dropped after it was rolled out:

Table load times for large enterprise bases

This, however, is a story about an elusive bug that was holding back the rollout of View Projection and how it was found. In this post, we’ll cover the:

  • Characters: a brief introduction to our architecture, services and abstractions
  • Investigation: how the bug manifested and our attempts to pin it down
  • Epilogue: some of the learnings we gained by the end

Characters

Worker

The “worker” is the process that executes all operations on a given Airtable base on the backend. So if multiple users collaborate on a base, all of their requests will be sent to a single process which executes them one at a time.
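As a rough mental model, the sketch below shows per-base operation serialization; the names are assumptions for illustration, not our actual code:

```typescript
// Illustrative sketch of per-base operation serialization; all names are assumptions.
type Operation = (base: BaseModel) => void;

interface BaseModel {
  applicationId: string;
  // ...tables, views, cell data, subscriptions, etc.
}

class BaseWorker {
  private queue: Operation[] = [];
  private draining = false;

  constructor(private base: BaseModel) {}

  // All collaborators' requests for this base funnel through enqueue(),
  // so every operation observes a single, consistent order.
  enqueue(op: Operation): void {
    this.queue.push(op);
    if (!this.draining) {
      this.drain();
    }
  }

  private drain(): void {
    this.draining = true;
    while (this.queue.length > 0) {
      this.queue.shift()!(this.base);
    }
    this.draining = false;
  }
}
```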

Realtime

The contract between Airtable’s backend and the client is that the client loads some base data at a given time and subscribes to updates from the realtime service to keep that data up-to-date. The realtime service uses a websocket connection to push updates to all clients collaborating on the same base.

Sometimes a client can fall behind after getting disconnected from realtime. This is fixed by a reconnection process called “recovery,” in which the backend sends, in a loop, all the updates the client needs to apply to catch up with the worker.
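A minimal sketch of what such a catch-up loop could look like, with hypothetical names standing in for the real protocol:

```typescript
// Illustrative sketch of a recovery loop; the real protocol and names differ.
interface Update {
  version: number;
  apply(): void;
}

interface RealtimeBackend {
  // Returns the updates the client is missing, newer than `sinceVersion`.
  fetchUpdatesSince(sinceVersion: number): Promise<Update[]>;
}

async function recover(backend: RealtimeBackend, clientVersion: number): Promise<number> {
  // Keep pulling batches of missed updates and applying them until no
  // more are returned, i.e. the client has caught up with the worker.
  while (true) {
    const updates = await backend.fetchUpdatesSince(clientVersion);
    if (updates.length === 0) {
      return clientVersion;
    }
    for (const update of updates) {
      update.apply();
      clientVersion = update.version;
    }
  }
}
```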

View Projection

Before View Projection, the above contract for cell data was at the table level. That is, a client would load an entire table at a time and when one user made a change like updating a cell, the realtime service would send that update to all other clients actively collaborating on that table, thereby making them consistent with each other.

With View Projection, the contract is changed to be at the view level instead. The view defines the subset of cells (a “slice”) that the client needs to render everything that the user can see. When the view is first loaded, only cells in that slice are loaded, which makes the view render faster.

Operations that change the dimensions of this slice by adding more cells (such as unhiding a field) will now need those extra cells to be “backfilled” — i.e. before an update involving the new cells can be executed, they must be sent to the client.

The backend knows which cells to backfill because clients subscribe to updates for specific views when they load them. Clients unsubscribe when they unload views, and the backend will expire a subscription if its client hasn’t accessed it in a while.
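Expressed as a rough interface, the new contract looks something like the sketch below; the names are assumptions, not our actual API:

```typescript
// Illustrative sketch of the View Projection contract; all names are assumptions.
interface CellSlice {
  // Cell values keyed by row id, then field id.
  cellValuesByRowThenField: Record<string, Record<string, unknown>>;
}

interface ViewProjectionBackend {
  // Load only the cells in the view's visible slice.
  loadViewSlice(viewId: string): Promise<CellSlice>;

  // Tell the backend which view this client is rendering, so it knows
  // which cells may need to be backfilled later.
  subscribeToView(viewId: string): void;

  // Called when the client unloads the view; the backend may also expire
  // a subscription on its own if it hasn't been accessed in a while.
  unsubscribeFromView(viewId: string): void;

  // Fired before any update that widens the slice (e.g. unhiding a field),
  // so the client has the new cells before it needs to render them.
  onBackfill(handler: (cells: CellSlice) => void): void;
}
```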

Cell data tracker

Since View Projection changes the contract between the backend and the client, we wanted to detect any bugs where we failed to honor that contract, either by failing to backfill cells when we should have, or by accessing cells for which we had no valid subscription.

To address this, we implemented a “cell data tracker,” inspired by tools like AddressSanitizer, which keeps track of which cells are loaded and warns if a cell that has not been loaded has been accessed.
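Here’s a minimal sketch of the idea, with assumed names; the real tracker records far more context with each warning:

```typescript
// Illustrative sketch of a cell data tracker; names and shape are assumptions.
class CellDataTracker {
  private loadedCells = new Set<string>();

  private key(rowId: string, fieldId: string): string {
    return `${rowId}/${fieldId}`;
  }

  markLoaded(rowId: string, fieldId: string): void {
    this.loadedCells.add(this.key(rowId, fieldId));
  }

  markUnloaded(rowId: string, fieldId: string): void {
    this.loadedCells.delete(this.key(rowId, fieldId));
  }

  // Called on every cell read: warn if the cell was never loaded (or has
  // been unloaded), i.e. the View Projection contract was violated.
  onCellAccess(rowId: string, fieldId: string): void {
    if (!this.loadedCells.has(this.key(rowId, fieldId))) {
      console.warn('Accessed cell that was not loaded', { rowId, fieldId });
    }
  }
}
```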

The investigation

In the early stages, we used cell data tracker warnings to catch and fix a variety of bugs. We gradually rolled out View Projection to our most frequently used internal base and to a handful of big customers, who were usually very happy with the performance improvements. We still saw some warnings, but we remained optimistic that we would track all of them down.

But as we added more bases, the warnings kept climbing and it became increasingly difficult to reproduce them. We didn’t hear much from our customers, so we enabled View Projection across many more internal bases in the hope that Airtable employees would encounter these errors and produce more detailed bug reports. Apart from one instance where an employee saw blank cells, we heard nothing at all.

Desperately seeking closure

View Projection was proving highly effective on large bases, so we continued enabling it for more bases, but the warnings did not abate. We were now seeing tens of thousands of warnings a day. There was a prevailing sense that perhaps these warnings were safe to ignore because we were not receiving any bug reports from our users.

Still, the sheer quantity of warnings made us nervous about rolling out View Projection more widely — what if there was an actual issue that we were just not seeing? So we decided to give it one more college try.

The warnings

Looking at the warnings, we saw a mix of stack traces coming from various visual components like grid view cells, detail view record cards, etc. The common theme was that views were rendering cell data before that cell data had been backfilled.

Unfortunately, this was too little information to figure out what was going on. So we added more logging — field and view type, whether View Projection was enabled, etc. This didn’t help much, but it did lead us to discover and fix an unrelated issue in applying realtime changes on clients.

What we really needed to understand was what was happening on the client and the backend before these warnings occurred that could cause missing backfills.

Analysis mode

To figure out what happens before a warning, we decided to query and visualize our logs and usage data (using a tool called Mode). We looked at which actions and events occurred more often before a warning vs. during normal usage of Airtable. This yielded graphs like these:

Visualization of action frequency

The actual operations here change from day to day. It was pretty unclear how some of them could even trigger a warning. Nevertheless, we tested them out by triggering these actions in the product, but could not elicit a warning. It turns out that comparing these rates to normal usage made these graphs pretty noisy — we should’ve just looked at the raw counts of actions before warnings. More on that later.

Fuzzy logic

The next possibility was that we just needed a complicated sequence of operations to trigger these warnings. While the high volume suggested otherwise, it’s possible that our tests and internal bases were just not emulating common workflows used by our customers. We were aware of one tool that finds such issues — the Stateful Model-Based Database Fuzzer.

The fuzzer was designed to generate sequences of operations that trigger a critical error in the backend. For these warnings, we needed something that could emulate the client by accessing cell data much as an Airtable user would while interacting with views and performing client-side model changes. So we refactored the fuzzer into a generic abstraction that can 1) generate and evolve a random sequence of operations, and 2) notify us when a stopping condition is detected.
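The resulting abstraction looked roughly like the sketch below; the names and exact shape are assumptions rather than our actual interface:

```typescript
// Illustrative sketch of the generic fuzzing abstraction; names are assumptions.
interface FuzzTarget<Op> {
  // Generate the next random operation, possibly informed by what ran before.
  generateOperation(history: Op[], rng: () => number): Op;

  // Apply the operation to the system under test (emulated client and backend).
  applyOperation(op: Op): Promise<void>;

  // True once the condition we're fuzzing for has occurred,
  // e.g. a cell data tracker warning was emitted.
  stoppingConditionReached(): boolean;
}

async function runFuzzer<Op>(target: FuzzTarget<Op>, maxSteps: number): Promise<Op[] | null> {
  const history: Op[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const op = target.generateOperation(history, Math.random);
    history.push(op);
    await target.applyOperation(op);
    if (target.stoppingConditionReached()) {
      return history; // the sequence that triggered the stopping condition
    }
  }
  return null; // nothing found within the step budget
}
```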

We then used this new abstraction in a special kind of test where we can emulate both the client and the backend in the same process and thus control the sequence of operations more precisely. We wrote a fuzz test that emulates two users making random mutations to a base and stops when a cell data tracker warning is detected.

Sadly, this didn’t yield anything that interesting. It found one issue that seemed tangential and likely not one that could be triggered manually, but we were unsuccessful in triggering the warnings we were chasing. We could probably extend the fuzzer further to get over this hump by emulating more real-world scenarios (like optimistic updates and rendering a real UI), but it seemed worth moving on to other strategies before making a bigger investment here.

Understanding recovery

When a client undergoes recovery, it applies backfills and updates, which re-render the UI in rapid succession. We don’t prevent users from making their own changes while this is happening. So recovery is naturally an area of complexity and unsoundness that drew my suspicion.

We wanted to determine if warnings were primarily occurring during recovery. Up to that point, the typical workflow for investigating a warning was to sift through logs in another tool, ELK, which allows us to query and visualize logs from all of our services in realtime. We would examine the logs for the pageLoadId (an identifier unique to every page load) or applicationId (an identifier unique to every base) where the warning occurred and consider the events preceding it. While we do log messages before and after recovery, it was proving pretty difficult to quickly tell whether a warning had occurred during recovery.

We wondered if there was a better way to visualize what was happening on the client and on the backend quickly to figure this out. We remembered an internal tool a colleague had made using ELK’s vega-lite integration for visualizing the backend’s request queue.

After a few hours of hasty ramp up, we managed to cobble together this visualization of recoveries and warnings:

Visualization of recoveries and warnings

In the above graph, each row is a page load for a particular base, with marks denoting interesting events (ordered by timestamp). The green and blue notches demarcate the beginning and end of recovery for that page load respectively. The yellow notches are warnings. You may have to squint but you can see that recovery tends to be short and warnings essentially never occur during it.
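For the curious, the vega-lite spec behind a chart like this might look roughly as follows; the field names are assumptions about our log schema rather than the actual spec:

```typescript
// Illustrative sketch of a vega-lite spec for this kind of timeline;
// the field names are assumptions about our log schema.
const recoveryTimelineSpec = {
  $schema: 'https://vega.github.io/schema/vega-lite/v5.json',
  data: { name: 'events' }, // rows shaped like { timestamp, pageLoadId, eventType }
  mark: 'tick',
  encoding: {
    x: { field: 'timestamp', type: 'temporal' },
    y: { field: 'pageLoadId', type: 'nominal' }, // one row per page load
    color: { field: 'eventType', type: 'nominal' }, // recoveryStart / recoveryEnd / warning
  },
};
```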

This was yet another dead end that exonerated recovery, but at least we now had a fancy new tool to play with.

The smoking gun

We played around with the above visualization to look at warnings in various ways until we landed on something like this:

In this graph, operations executed by a page load are represented as colored notches, where each color corresponds to a particular kind of operation. Warnings show up as green circles. Armed with this, we could piece together what was going on in a particular base just as the warnings occurred.

As we looked at more and more warnings, a pattern emerged: most warnings were preceded by a view/setAsLastUsed operation:

To confirm, we went back and looked at the raw counts of operations before warnings, which showed view/setAsLastUsed near the top:

This suggested that during view switching, we were somehow rendering the new view before backfilling any missing cells. Why would that be? Remember the new View Projection contract: the client loads data for a view and subscribes to it before rendering it. We were violating that in certain conditions.

From the early days of View Projection, the client kept track of views for which it had loaded all data in a set called viewsWithDataLoaded, so that when we switched to a view that was not in this set, we would load that view’s cell data from the backend. Without View Projection all table data was loaded, so this extra tracking was just not necessary.

The initial implementation of View Projection did not have realtime backfill support so it didn’t consider whether the view had a realtime subscription. While the initial load does yield all data for a view, it is the realtime subscription that keeps the view up-to-date. So ideally, once we unsubscribe a view from realtime updates, we should also remove it from viewsWithDataLoaded, but we weren’t doing that.

This meant that loading a view and then switching away from it produced a ticking time bomb: if there were any changes that would require backfills for the view, the client would not receive those but would still assume that it could render the view. The next time the client switched back to the view, it would render these missing cells and trigger warnings.

This was enough information to reproduce the issue and put up a fairly trivial fix. The number of warnings dropped from thousands a day to a couple hundred a day, which was low enough to comfortably unblock the rest of the View Projection rollout.
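The shape of the fix, sketched with assumed names (only viewsWithDataLoaded comes from the real client code):

```typescript
// Illustrative sketch; only `viewsWithDataLoaded` is a name from the real client code.
const viewsWithDataLoaded = new Set<string>();

function onViewDataLoaded(viewId: string): void {
  viewsWithDataLoaded.add(viewId);
}

// The missing step: once a view loses its realtime subscription it can no
// longer receive backfills, so its data must no longer count as "loaded".
function onViewUnsubscribed(viewId: string): void {
  viewsWithDataLoaded.delete(viewId);
}

// A view may only be rendered without reloading if its data is loaded AND a
// subscription has been keeping that data up to date.
function canRenderWithoutReload(viewId: string): boolean {
  return viewsWithDataLoaded.has(viewId);
}
```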

Epilogue

Hindsight is 2020

The obvious question is, could we have caught this missing cell bug earlier? The change that introduced viewsWithDataLoaded in Feb 2021 called out the lack of realtime backfill support, but the change that introduced view subscriptions was actually made in April 2020 in the context of a different project. When we started using view subscriptions for backfills, we never fixed this discrepancy. We should have formalized the contract with the client and asserted somewhere that any view that is rendered must have its data loaded and have an active subscription.

Airtable’s flexibility can sometimes be a challenge

Like most tech companies, we rely pretty heavily on internal testing to gather feedback and discover bugs. We use Airtable ourselves in many different, complex setups across all our teams, which provides constant stress testing, but we cannot cover every possible piece of software that has been, or will be, built on it (in some sense, our product can be considered Turing-complete). We rolled out View Projection pretty aggressively to internal bases in the hope that an employee would provide a reproducible sequence of operations, but we never got one. Even members of our QA team, who are particularly adept at finding such multi-step bugs, could not find this one. We’re still looking for ways to exhaustively test a good, representative set of production schemas, data, and workloads in pre-production environments.

Observability is half the battle

The success of this investigation and the subsequent fixes hinged on several log lines and fields that had been introduced for completely different reasons: logging of process ids, log lines for eviction, CRUD request log lines, the various fields included in cell data tracker warnings, etc. We also relied on observability tools like ELK and Mode, as mentioned earlier. Combing through the codebase might eventually have led us to the same issues, but these tools made it much, much easier. We’ve built up a dedicated Observability team within our infrastructure organization, and are investing more in other tools such as Sentry and OpenSearch.

Distributed systems are hard

A common feature of the bugs we found was that they relied on multiple systems interacting poorly with one another. Bugs of a distributed nature are hard to reproduce in tests, hard for a single developer to conceptualize, and hard for testers to reproduce consistently. We will likely increase the number of such systems as we continue to optimize for performance and scale, so simplification is not on the table, but perhaps we can invest in more formalism here by describing the contracts between each of these systems and maybe even using formal methods to prove their correctness.

This concludes the Airtable story about the Curious Case of the Missing Cell. We hope you enjoyed this program. If you’d like to come join us to work on interesting and challenging problems like this one — we are hiring.

Addendum: extended edition

For fans of this program, the new visualization actually helped us solve a couple of other mysteries, not seen in the director’s cut:

Evictions

Even after fixing the view switching bug, we were still seeing warnings similar to the earlier ones in some cases, except that this time we knew view switching was not the culprit. We already knew that subscription state management on the worker and the realtime service needed an overhaul (which we have since deployed, in parallel with the writing of this post). So we started looking at what the worker and the realtime service were up to before warnings and noticed a pattern. A teammate narrowed this down to how we were managing view subscription state during base eviction, which is the process of unloading a base and shutting down its worker after a period of inactivity on the base.

We wanted to see how often base evictions were causing these warnings, so we modified the previous visualization to also show evictions:

Near the bottom of the graph, the downward triangle indicates the base being loaded and the upward triangle shows it being evicted. The row here is the processId of the backend worker process. An upward triangle followed by a downward triangle in quick succession means that we began eviction proceedings but then aborted the eviction because an operation arrived before eviction finished, leading to a desynchronized subscription state between the worker and the realtime service. Page loads with aborted evictions as well as cell data warnings also included interleaved updateFilters operations, which likely required backfilling (because they were unhiding rows) that never came. A teammate fixed this by also shutting down the realtime service during eviction so that if it is aborted and the base is reloaded, both the worker and the realtime service have freshly initialized subscription state and there’s no longer any desynchronization.
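Sketched with assumed names, the eviction path after this fix looks roughly like the following:

```typescript
// Illustrative sketch of the eviction fix; all names are assumptions.
interface WorkerProcess {
  clearSubscriptionsForBase(applicationId: string): void;
  unloadBase(applicationId: string): Promise<void>;
}

interface RealtimeService {
  clearSubscriptionsForBase(applicationId: string): Promise<void>;
}

async function evictBase(
  applicationId: string,
  worker: WorkerProcess,
  realtime: RealtimeService
): Promise<void> {
  // Tear down subscription state on BOTH sides, so that if eviction is later
  // aborted and the base is reloaded, the worker and the realtime service both
  // start from freshly initialized (and therefore consistent) state.
  await realtime.clearSubscriptionsForBase(applicationId);
  worker.clearSubscriptionsForBase(applicationId);

  await worker.unloadBase(applicationId);
}
```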

Force refreshes

We’ve also seen an increase in how often we force clients to refresh the page with View Projection enabled. While inconvenient for users, this is sometimes necessary: we force the client to refresh if we determine that we cannot provide backfills for it. For example, we might force a refresh when a user closes their laptop on Friday and returns to it on Monday, by which time we have shut down the backend process for their base due to inactivity. But any increase in error metrics merits a close look, and this was no exception. We extended the vega-lite visualization to show force refreshes, and once again a pattern emerged:

You can see that the base was evicted and then loaded in another process, indicating a successful worker restart. This was followed by an old client performing an operation, going through recovery, and being forced to refresh. Why did this happen? Because the new worker doesn’t have the necessary backfill data for the old client; it knows nothing about that client or its subscriptions. We mitigated this to bring the rate of force refreshes back down to acceptable levels, and we have long-term plans to bring them down even further.
