Rethinking Offline First sync for Service Workers

PouchDB-CouchDB sync via Service Workers, as goofily illustrated on the author’s Surface

At Offline Camp California in November 2016, John Kleinschmidt and I tried to tackle the seemingly clear-cut problem of how to get PouchDB replication to run correctly in a Service Worker. However, even though John and I were familiar with both PouchDB and Service Workers, and even though I work for a browser vendor that’s currently implementing Service Worker, we ran into impedance mismatches between what we thought a Service Worker was capable of and what we were trying to accomplish with offline replication.

The goal of this article is to explain where we went wrong in our understanding of Service Workers, and how we began adapting PouchDB replication to a new and unfamiliar context. I’ll also go over some of the trickier parts of the Service Worker API, which may prove surprising to the uninitiated. John wrote up his own thoughts on the subject, and in this post I’d like to expand on what we’ve learned since then.

Scoping the problem

First off, what was the problem we were trying to solve? Well, one issue John identified was the performance impact of IndexedDB on the main thread, which can only be alleviated by moving PouchDB either to a Web Worker or to a Service Worker. John’s project, HospitalRun.io, was already making use of Service Worker, and so it seemed to be a natural fit — no need to allocate additional workers just to run PouchDB.

However, the first thing we realized, and which I summarized in a talk I later gave at DotJS, was that a Service Worker can’t really be treated the same as a Web Worker. They share some superficial similarities, but at a core level they have very different architectures, which lead to different strategies when designing for one versus the other.

On the surface, Service Workers look quite similar to Web Workers. They both run on separate threads from the main UI thread, they have a global self object, and they tend to support the same APIs. However, while Web Workers offer a large degree of control over their lifecycle (you can create and terminate them at will) and are able to execute long-running JavaScript tasks (in fact, they're designed for this), Service Workers explicitly don't allow either of these things. Instead, a Service Worker is best thought of as “fire-and-forget”: it responds to events in an ephemeral way, and the browser is free to terminate any Service Worker that takes too long to fulfill a request or makes too much use of shared resources.

This led us to our first real hurdle with Service Worker. Our goal, as we originally conceived it, was to use PouchDB’s existing replication APIs to enable bi-directional sync between the client and the server, with the client code isolated entirely to the Service Worker. So as a first pass, we simply loaded PouchDB into our serviceWorker.js, waited for the 'activate' event, and then used the standard PouchDB “live” sync API:
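The original snippet is not reproduced in this text, so what follows is only a hedged reconstruction of what such a first pass might look like, not the exact code from that session. The function name, the database name, the file name, and the URL are all placeholders. The only PouchDB call it relies on is sync() with the live and retry options.

```javascript
// Hypothetical reconstruction of the naive first attempt.
// `scope` stands in for the Service Worker global (`self`), `localDB` for a
// PouchDB instance, and `remoteUrl` for the URL of a CouchDB database.
function startLiveSyncOnActivate(scope, localDB, remoteUrl) {
  scope.addEventListener('activate', () => {
    // `live: true` keeps a long-lived changes-feed request open, and
    // `retry: true` reconnects whenever it drops. Nothing here ever tells
    // the browser that the work is finished, because by design it never is.
    localDB.sync(remoteUrl, { live: true, retry: true });
  });
}

// In serviceWorker.js this would be wired up roughly as follows
// (file name, database name, and URL are assumptions):
//   importScripts('pouchdb.min.js');
//   startLiveSyncOnActivate(self, new PouchDB('app'), 'https://example.com/db');
```

Passing the global scope and the database in as arguments is purely a convenience for the sketch, since it lets the function be exercised outside a browser.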

This resulted in a silent error, which took quite a while to debug. The culprit? Well, PouchDB’s “live” sync depends on HTTP longpolling. In other words, it maintains an ongoing HTTP connection with the CouchDB server, which is used to send real-time updates from the server to the client. As it turns out, this is a big no-no in Service Worker Land, and the browser will unceremoniously drop your Service Worker if it tries to maintain any ongoing HTTP connections. The same applies to WebSockets, Server-Sent Events, WebRTC, and any other network APIs where you may be tempted to keep a constant connection with your server.

What we realized is that “the Zen of Service Worker” is all about embracing events. The Service Worker receives events, it responds to events, and it (ideally) does so in a timely manner — if not, the browser may preemptively terminate it. And this is actually a good design decision in the spec, since it prevents malicious websites from installing rogue Service Workers that abuse the user’s battery, memory, or CPU.

So, keeping in mind that we’re trying to accomplish bi-directional replication (i.e. data flow from server to client and vice versa), let’s review what events the Service Worker can actually receive and respond to.

The ‘fetch’ event

This is the most famous feature in the Service Worker arsenal — by capturing this event, the Service Worker can intercept network requests and respond with its own content.

Because the CouchDB API is entirely REST-based, it would be technically possible to implement a caching layer via the 'fetch' event. However, this would be working against the grain of CouchDB's bi-directional replication model, and would do little to answer the question of how to handle conflicts and merges, as the existing CouchDB replication protocol does.

Such a technique may make sense for lightweight caching of a read-only REST API, but it doesn’t make much sense for CouchDB. In the same way that it would be silly to take the Git protocol and map every HTTPS request to an object that can be independently cached and invalidated, we decided it would be silly to route our PouchDB replication system through 'fetch' events. That led us to the other Service Worker APIs.
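For contrast, here is a minimal sketch of the kind of lightweight, read-only caching where the 'fetch' event does fit. It is a generic cache-first pattern, not something we used for PouchDB. The function names and the cache name are assumptions, and the cache and fetch function are passed in as arguments so that the sketch can run outside a browser.

```javascript
// Cache-first lookup. `cache` is expected to expose the Cache API's
// match/put methods, and `fetchFn` to behave like the global fetch().
async function cacheFirst(request, cache, fetchFn) {
  const cached = await cache.match(request);
  if (cached) {
    return cached;
  }
  const response = await fetchFn(request);
  if (response.ok) {
    // A Response body can only be read once, so store a clone.
    await cache.put(request, response.clone());
  }
  return response;
}

// Registers the 'fetch' listener. Only GET requests are intercepted.
function registerReadOnlyCache(scope, openCache, fetchFn) {
  scope.addEventListener('fetch', (event) => {
    if (event.request.method !== 'GET') {
      return; // let writes go to the network untouched
    }
    event.respondWith(
      openCache().then((cache) => cacheFirst(event.request, cache, fetchFn))
    );
  });
}

// In a Service Worker (the cache name 'api-v1' is an assumption):
//   registerReadOnlyCache(self, () => caches.open('api-v1'), (req) => fetch(req));
```

Note that nothing in this pattern knows about revisions, conflicts, or checkpoints, which is exactly why it is a poor match for CouchDB replication.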

The ‘sync’ event

This event is defined by the Background Sync API, and is probably the most confusingly-named of the Service Worker events. Rather than performing any “sync” itself, it is merely a signal: once a page has registered a sync request, the browser fires the event as soon as the device has connectivity, which means right away if it is already online, or when it next goes from an offline state to an online state. The goal is to allow the Service Worker to use this “just went online” event as an opportunity to push unsynced changes from the client to the server.

In the case of PouchDB, this would be as straightforward as waiting for the 'sync' event and then firing a single-shot replication from the client to the server. There would be no need to keep a persistent HTTP connection open – we can merely wait for the 'sync' event to notify us that the device has come online, and then we allow PouchDB to replicate from the last checkpoint it may have stored from any previous replications. Once that’s complete, we would just patiently wait for the next 'sync' event before doing the same single-shot replication all over again.
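A minimal sketch of that flow might look like the following. It assumes the page first registers a sync tag through the Background Sync API's registration.sync.register(), which is what causes the browser to deliver a 'sync' event once it has connectivity. The function names, the tag, and the remote URL are placeholders.

```javascript
// Page side: ask the browser to fire a 'sync' event for this tag once the
// device has connectivity. The tag is an arbitrary label.
function requestSync(registration, tag) {
  return registration.sync.register(tag);
}

// Worker side: run one single-shot, client-to-server replication per event.
function registerSyncHandler(scope, localDB, remoteUrl, tag) {
  scope.addEventListener('sync', (event) => {
    if (event.tag !== tag) {
      return; // some other feature registered this sync
    }
    // Without `live: true`, replicate.to() resolves once the server has
    // caught up from the last stored checkpoint. waitUntil() keeps the worker
    // alive until then, and a rejection tells the browser the attempt failed.
    event.waitUntil(localDB.replicate.to(remoteUrl));
  });
}

// Wiring (tag name and URL are assumptions):
//   page:   navigator.serviceWorker.ready.then((reg) => requestSync(reg, 'pouchdb-push'));
//   worker: registerSyncHandler(self, new PouchDB('app'), 'https://example.com/db', 'pouchdb-push');
```

Because the replication is not live, the HTTP requests end as soon as the client has nothing more to send, and the browser is free to shut the worker down again.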

When thinking about this design, the first question you might ask is why the 'sync' API even exists, given that there is already navigator.onLine and similar online/offline events available in the browser. The answer is that since this event is available to a Service Worker rather than the main UI thread, you can actually capture it even when the user doesn't currently have your website open. So in this sense the 'sync' event is much more capable than classic online/offline events.

The second question you may ask, if you’ve been working with Offline First architectures for a while, is how a simple indication that the browser believes it is online can handle the trickier cases of captive portals, lie-fi, and other situations where the device appears to be online, but the network request still fails for one reason or another. The answer is that the 'sync' API doesn't answer this question, but the Periodic Sync API does.

The 'periodicsync' event, as defined in the current Background Sync API working draft, allows you to schedule repeated events that the Service Worker can intercept. For instance, you might register an event that fires once every 30 minutes, which is guaranteed to some rough degree of precision. You could also specify that the event should only fire if the device isn't currently on battery, is connected to WiFi rather than a cell network, etc.

Unfortunately this API doesn't exist in any browser yet, but it does contain the building blocks for a robust client-to-server sync mechanism. And in the meantime, it’s somewhat polyfillable via setInterval in the main UI thread. The downside, of course, is that such a polyfill wouldn't work if the user didn't currently have the page open, but it’s a reasonable workaround.
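As a rough illustration of such a workaround, the sketch below runs a sync function on a timer in the page. It is not an implementation of the 'periodicsync' draft, only a stand-in. The names startPeriodicSync and syncOnce are hypothetical, and syncOnce can be any function that returns a promise.

```javascript
// Hypothetical page-side stand-in for 'periodicsync'.
// Returns a function that stops the timer.
function startPeriodicSync(syncOnce, intervalMs) {
  let running = false;
  const timer = setInterval(() => {
    if (running) {
      return; // skip this tick while the previous run is still in flight
    }
    running = true;
    Promise.resolve()
      .then(syncOnce)
      .catch((err) => console.warn('periodic sync failed', err))
      .then(() => { running = false; });
  }, intervalMs);
  return () => clearInterval(timer);
}

// Usage in the page, every 30 minutes. Here the page just pokes the
// Service Worker, which would handle the (assumed) message type itself:
//   const stop = startPeriodicSync(
//     () => navigator.serviceWorker.ready.then((reg) =>
//       reg.active.postMessage({ type: 'periodic-sync' })),
//     30 * 60 * 1000
//   );
```

The running flag is there so that a slow replication on a bad connection does not pile up overlapping attempts.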

In any case, we now had a basic idea of how we might implement client-to-server replication using Service Worker. But what about server-to-client?

The ‘push’ event

Again, this is an event that can be easily misunderstood based on the name. From reading much of the documentation on the Web Push API, you may be led to believe that it’s only useful for push notifications, and indeed, this is the headliner feature of the 'push' event.

However, it turns out that the Service Worker can do more than just display a notification when receiving a 'push' event: as the spec says, you may “write the data to IndexedDB, send it to any open windows, display a notification, etc.” This means we can use the 'push' event as a signal to push data from the server to the client, and then store it in PouchDB.

As a basic example, you might imagine that your server contains a process that is listening to CouchDB’s changes feed, via standard HTTP longpolling. When it sees an update in the changes feed, it sends a notification request to the relevant Push Messaging Service for that device, as defined in the IETF HTTP Push specification. In practice, this could be Firebase Cloud Messaging (FCM, formerly GCM) for Chrome, Mozilla Push Service for Firefox, or Windows Push Notification Services (WNS), although since it’s an established standard, neither the client nor the server needs to know the details of which service it happens to be using.
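On the server, that process could be sketched roughly as follows. This is an outline under assumptions rather than a finished design: changesFeed is any object with an on('change', fn) method (the object PouchDB returns from db.changes({ since: 'now', live: true }) has that shape), getSubscriptions is a hypothetical lookup of stored push subscriptions, and sendPush is assumed to deliver one message to one subscription and return a promise, for example a thin wrapper around the web-push package's sendNotification().

```javascript
// Server-side sketch (Node.js). All three arguments are injected so that
// the choice of CouchDB client and push library stays open.
function notifyOnChanges(changesFeed, getSubscriptions, sendPush) {
  changesFeed.on('change', () => {
    for (const subscription of getSubscriptions()) {
      // No payload is sent: the message is only a wake-up signal, and the
      // client fetches the actual changes through replication.
      sendPush(subscription).catch((err) => console.warn('push failed', err));
    }
  });
}

// Possible wiring (package names are real, variable names are assumptions):
//   const feed = new PouchDB('https://example.com/db').changes({ since: 'now', live: true });
//   notifyOnChanges(feed, () => storedSubscriptions, (sub) => webpush.sendNotification(sub));
```

A production version would presumably also batch rapid bursts of changes and prune subscriptions that the push service reports as expired.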

Because push messaging is implemented at the level of the operating system itself, this has a huge advantage over longpolling-based sync, in that the device only ever needs to keep one connection open at a time, shared by the entire operating system. And once the 'push' event is received by the Service Worker, the push message itself doesn't even need to contain any data – we can simply do a single-shot replication using the regular PouchDB replicate() API, and then wait for the next push message after that.
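On the client, the corresponding handler can be very small. The sketch below assumes a PouchDB instance and a remote URL are available in the worker. The function name is hypothetical, and the only PouchDB call used is replicate.from() without the live option.

```javascript
// Worker side: treat every push message as a "something changed" signal
// and pull from the server once.
function registerPushHandler(scope, localDB, remoteUrl) {
  scope.addEventListener('push', (event) => {
    // event.data is ignored on purpose. Replication resumes from the last
    // checkpoint, so it works out by itself which documents are missing.
    event.waitUntil(localDB.replicate.from(remoteUrl));
  });
}

// Wiring in the worker (database name and URL are assumptions):
//   registerPushHandler(self, new PouchDB('app'), 'https://example.com/db');
```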

One small limitation of this approach is that Chrome currently requires setting the userVisibleOnly flag to true, which means that any push event must also display a notification to the user. (Firefox doesn’t have this restriction.) So in cases where you absolutely want to avoid notifications, you may have to use the 'periodicsync' event instead. This would mean forgoing any “real-time” updates from the server to the client, but the advantage is that you can avoid interrupting the user’s workflow with unwanted notifications. Another alternative is to rely on unspec’ed “silent push” budgets, but this may vary from browser to browser.
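To make the userVisibleOnly requirement concrete, here is a sketch of a subscription plus a handler that replicates and then shows a notification. The function names and the notification text are placeholders, and applicationServerKey is assumed to be the server's public VAPID key, obtained elsewhere.

```javascript
// Page side: subscribing with userVisibleOnly is a promise to the browser
// that every push will result in something the user can see.
function subscribeToPush(registration, applicationServerKey) {
  return registration.pushManager.subscribe({
    userVisibleOnly: true,
    applicationServerKey,
  });
}

// Worker side: pull once, then show the notification.
function registerVisiblePushHandler(scope, localDB, remoteUrl) {
  scope.addEventListener('push', (event) => {
    event.waitUntil(
      Promise.resolve(localDB.replicate.from(remoteUrl)).then(() =>
        scope.registration.showNotification('New data synced', {
          body: 'Your local copy is up to date.', // placeholder text
          tag: 'sync-complete', // a repeated tag replaces the earlier notification
        })
      )
    );
  });
}
```

Reusing a single notification tag is one way to keep frequent syncs from stacking up a pile of notifications, although it does not remove the interruption entirely.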

In any case, using the techniques described above, we’ve designed a bi-directional replication system based not on HTTP longpolling, but on the Service Worker event model itself. PouchDB, in this system, is only responsible for doing one-time, one-way replications triggered by those events. In this way, we avoid long-running HTTP connections and also allow the browser to spin up and terminate the Service Worker to do its work in short bursts, which is not only a more efficient architecture, but also more in keeping with the spirit of the Service Worker API.

Conclusion

As we’ve learned, there’s a lot more to Service Workers than initially meets the eye. Although sharing some cosmetic similarities with Web Workers, they are a different beast entirely, and often require a very different approach. Being aware of the Service Worker lifecycle and using a “fire-and-forget” model are the best paths to writing effective Service Worker code.

Furthermore, we’ve shown how it’s possible to leverage the Service Worker event model to build a bi-directional replication system, using the 'sync' event to send data from the client to the server, and the 'push' event to send data from the server to the client. In the future, the 'periodicsync' event may also be useful for handling 'sync'-triggered client-to-server requests that fail due to lie-fi or other intermittent network problems.

After our initial rocky attempts to enable full Service Worker-based replication to PouchDB, John and I brought our conversation to a larger group and started hammering out the finer-grained details of what such a system may look like. Gregor Martynus, Garren Smith, and I explored some of these ideas at the CouchDB Dev Summit in February 2017, and later we began collaborating on a shared design document on GitHub.

Whiteboarding with Gregor and Garren at CouchDB Dev Summit 2017

Hopefully this brainstorming will someday result in a new PouchDB plugin, which should offer a similar experience to PouchDB’s existing “live” replication APIs, but implemented in a fully “Service Worker-y” way. In the meantime, I hope that this exploration of Service Worker’s more beguiling concepts can serve as a useful lesson for those who, like us, thought they knew what Service Worker was all about, and found themselves more and more surprised the deeper they dug in.

Thanks to John Kleinschmidt, Jan Lehnardt, Gregor Martynus, and Garren Smith for feedback on a draft of this post.