Scalable CouchDB Replication and Change Listening with Spiegel
Editor’s Note: This article on Spiegel is intended for those who already have some familiarity with Apache CouchDB™, a JSON database known for its excellent offline sync capabilities. To learn more about CouchDB and Offline First, check out Geoff’s previous article on CouchDB, PouchDB, and Hoodie as a Stack for Progressive Web Apps.
Spiegel was originally designed to provide scalable replication and change listening for Quizster, a photo-based feedback and submission system. Spiegel is now an open source project available to everyone.
So, why do we need Spiegel? The short answer is that without it, you’ll have a hard time scaling your CouchDB replications and change listening as your user base grows. Let’s take a closer look.
Replication Challenges in CouchDB
Scalable Replication
The _replicator database in CouchDB is a powerful tool, but in many cases it does not scale well. Consider the example where we have users posting blog entries. Let’s assume that we want to use PouchDB to sync data between the client and CouchDB. Let’s also assume a design of a DB per user and an all_blog_posts database that stores the blog posts from all the users. Having a database per user will allow us to restrict access to the user databases so that only the owner of a post can edit her or his posts. In this design, we’d want to replicate all our user databases to the all_blog_posts database. At first glance, the obvious choice would be to use the _replicator database to perform these replications, but the big gotcha is that continuous replications via the _replicator database require a dedicated database connection. Therefore, if we had 10,000 users then we would need 10,000 concurrent database connections for these replications, even though at any given time there may be at most 100 users making changes to their posts simultaneously. With Spiegel, we can prevent this greedy use of resources by only replicating databases when a change occurs.
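To make the numbers concrete, here is a minimal sketch of what the _replicator approach implies for the blog example. It only builds the replication documents in memory and does not talk to CouchDB. The host, the user_N naming scheme, and the helper name are assumptions made for illustration; the source, target, and continuous fields are the standard ones for a CouchDB replication document.

```python
# Sketch: one continuous _replicator document per user database.
# COUCH_URL and the "user_<n>" naming scheme are hypothetical.
COUCH_URL = "http://localhost:5984"


def continuous_replication_doc(user_db, target="all_blog_posts"):
    """Build the document you would save in _replicator for one user DB."""
    return {
        "_id": f"{user_db}_to_{target}",
        "source": f"{COUCH_URL}/{user_db}",
        "target": f"{COUCH_URL}/{target}",
        "continuous": True,  # keeps the replication (and its connection) open
    }


user_dbs = [f"user_{n}" for n in range(10_000)]
replication_docs = [continuous_replication_doc(db) for db in user_dbs]

active_writers = 100  # users actually changing posts at any given time
print(f"{len(replication_docs)} continuous replications held open "
      f"for at most {active_writers} active writers")
```

Every one of these documents represents a replication that stays open whether or not its user is writing, which is the overhead that change-driven replication avoids.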
Real-Time Replication Between Clusters
While CouchDB 2 has built-in clustering, one limitation is that this clustering isn’t designed to be used across regions or data centers. Spiegel tracks changes in real-time and then only schedules replications for databases that have changed. You can therefore use Spiegel to efficiently keep clusters located in different regions of the world in sync.
Scalable Change Listening
Let’s assume that we have some routine that we want to run whenever there are changes, e.g. we want to calculate metrics using a series of views and then store these metrics in a database doc for quick retrieval later. We’d need to write a lot of boilerplate code to listen to _changes feeds for many databases, handle fault tolerance, and support true scalability. Instead, we can define a custom REST API endpoint that calculates these metrics and then a Spiegel on_change rule that will call this endpoint whenever there are applicable changes.
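As a rough illustration of the idea, the sketch below models an on_change rule as a plain dictionary and shows how a matching change could be turned into a call to the metrics endpoint. The field names are an assumption modeled on the rule shape described in Spiegel’s README and should be checked against the project documentation. The endpoint URL and both helper functions are hypothetical, and this is a simplified simulation, not Spiegel’s actual implementation.

```python
import re

# Assumed rule shape; verify field names against Spiegel's documentation.
on_change_rule = {
    "type": "on_change",
    "db_name": "^user_.*$",  # regular expression matched against the DB name
    "url": "https://api.example.com/metrics/recalculate",  # hypothetical endpoint
    "method": "POST",
    "params": {"db_name": "$db_name"},  # placeholder filled in per change
}


def rule_applies(rule, db_name):
    """Return True when the rule's db_name pattern matches this database."""
    return re.search(rule["db_name"], db_name) is not None


def build_request(rule, db_name):
    """Describe the HTTP call to make for a change in db_name (no I/O here)."""
    params = {
        key: (db_name if value == "$db_name" else value)
        for key, value in rule["params"].items()
    }
    return {"method": rule["method"], "url": rule["url"], "params": params}


for db in ["user_42", "all_blog_posts"]:
    if rule_applies(on_change_rule, db):
        print(build_request(on_change_rule, db))
```

The point of the design is that the rule is data, not code: the listening, retrying, and scaling stay in the change-listener processes, and your application only has to expose the endpoint.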
How Spiegel Scales
Spiegel consists of three types of processes: the update-listener, change-listener, and replicator. The update-listener listens to the _global_changes database and then schedules on_change rules and replications accordingly. The change-listener runs on_change rules for all matching changes. The replicator performs replications.
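The scheduling step can be pictured with a small simulation. The sketch below takes a batch of document IDs of the kind CouchDB 2 writes to _global_changes (an event name, a colon, then the database name, as I understand the format) and reduces it to the distinct databases that were updated. The function name and the sample batch are hypothetical, and Spiegel’s real update-listener does considerably more than this.

```python
def dirty_databases(global_change_ids):
    """Return each updated database once, in first-seen order."""
    dirty = []
    seen = set()
    for change_id in global_change_ids:
        # IDs are assumed to look like "updated:<db_name>".
        event, _, db_name = change_id.partition(":")
        if event != "updated" or not db_name or db_name in seen:
            continue
        seen.add(db_name)
        dirty.append(db_name)
    return dirty


# Hypothetical batch: user_1 changed three times, user_7 once.
batch = [
    "updated:user_1",
    "updated:user_1",
    "created:user_2",
    "updated:user_7",
    "updated:user_1",
]

print(dirty_databases(batch))  # ['user_1', 'user_7']
```

Five entries in the feed collapse into two units of work, so the number of replications and on_change runs follows the number of active databases rather than the total number of databases.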
You can run as many update-listeners, change-listeners, and replicators as your CouchDB cluster can handle. In addition, you can fine tune things like the concurrency and batch sizes so that you don’t exhaust your CouchDB resources. In most cases you’ll want to run at least two of each of these processes for redundancy. In general, if you need to listen to more changes or respond to these changes faster, add a change-listener. Similarly, if you need to perform more replications or replicate faster, add a replicator.
There is an official Docker image that you can use to run the different Spiegel processes, and you can use Docker Swarm or Kubernetes to easily scale your instances.
In a recent passion talk at Offline Camp Oregon, I shared more on Spiegel’s scalable replication and efficient change listening.
To get started using Spiegel, explore the repo and check out my step-by-step tutorial.
Happy replicating!
About the Author
Geoff Cox is the creator of MSON, a new declarative programming language that will allow anyone to develop software visually. He loves taking on ambitious, yet wife-maddening, projects like creating a database and distributed data syncing system. You can read more of his posts at redgeoff.com or reach him @CoxGeoffrey or on GitHub.
Editor’s Note: Participants at Offline Camp Oregon had diverse backgrounds and interests, ranging far beyond the Offline First approach that we came together to discuss. Through short passion talks, campers shared with us some of the hobbies, projects, and technologies that excite them. We’re sharing a taste of that passion with you here as a preview to our upcoming events.