Automated Backup and Restoration of Lucene Indices with Amazon S3

Léo Grimaldi · Keep It Up · Feb 18, 2014 · 4 min read

At FortyTwo, we rely on a main database hosted on RDS to store Kifi's critical user data. For instance, every keep (a user-page pair) is saved there after the Kifi browser extension sitting on your computer sends it to our service (via a “Keeper” machine).

Meanwhile, the Kifi search engine is built on top of Apache Lucene. Abstracting over sharding, each search machine (“Finder” machine) holds a local copy of several indices, which represent the information stored in the database in a form that the semantic engine can leverage.

As a consequence, we need to keep a number of Lucene indices in sync with the database, in real time, as the database gets updated with every user action. Each database update, processed by some Keeper machine, must be echoed to every Finder machine so that the change can be reflected in local index files.

On the one hand, each database model that requires indexing keeps track of a sequence number. Every time an instance of the model is created or updated, it is assigned the next number and the sequence is incremented.

On the other hand, on each Finder machine, each local index is maintained by an indexer (an actor). The indexer is interested in one or several models required to build its index, so it must periodically grab and process the latest updates (“events”) for each of these models. The indexer uses a model’s sequence number to remember how much of the update event stream has been consumed: it simply keeps track of the highest sequence number it has seen so far for each model, so that it can ask the Keepers for all instances modified since. As a new batch of events comes in, updated instances are processed into “indexable” entities that are committed to the index. On commit, the indexer also sets each sequence number to the highest seen in the batch. And repeat.
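Here is a minimal sketch of this catch-up loop; SequenceNumber, fetchUpdatesSince and the other names are illustrative rather than our actual API:

```scala
case class SequenceNumber(value: Long)

trait EventIndexer[T] {
  // Highest sequence number committed to the index so far (zero for a brand new index).
  protected var committedSeq: SequenceNumber

  // Asks the Keepers for all instances of T modified after `seq` (hypothetical RPC).
  protected def fetchUpdatesSince(seq: SequenceNumber): Seq[(SequenceNumber, T)]

  // Turns an updated instance into an "indexable" entity and buffers it.
  protected def buffer(instance: T): Unit

  // Commits buffered entities to the local Lucene index.
  protected def commitIndex(): Unit

  // One catch-up iteration, triggered periodically on the indexer's own thread.
  def update(): Unit = {
    val batch = fetchUpdatesSince(committedSeq)
    if (batch.nonEmpty) {
      batch.foreach { case (_, instance) => buffer(instance) }
      commitIndex()
      // Remember progress only once the commit has succeeded.
      committedSeq = SequenceNumber(batch.map(_._1.value).max)
    }
  }
}
```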

In this way, each Finder machine independently processes database updates, maintaining its local indices in a very robust way, which greatly benefits our continuous deployment process. Indeed, every time a Finder machine is deployed and its Play! application restarted, indexers pick up the index files on disk, look up the latest sequence numbers and automatically start bugging the Keepers in order to catch up with the latest updates.

All is for the best in the best of all possible worlds.

Well, almost. When a new machine spins up, its indices are built from scratch: all sequence numbers are initialized to zero and each indexer starts catching up. They will indeed catch up, but our corpus has been growing significantly along with our user base, so this now takes several days. If we were to lose all Finder machines, Kifi would be crippled for a while as indices were being rebuilt.

Short of such a disaster, we have also been eager to take advantage of Amazon EC2 “Spot” instances. Spot instances are bid on in real time and are not guaranteed in any way. While usually much cheaper than On-Demand or even Reserved instances, they can be reclaimed by Amazon without notice if prices spike above our maximum bid. Thus our indexing system needs to gain some resilience to market volatility.

The solution is again quite simple. We need to perform regular backups of each index so that indexers on new machines do not have to start from scratch, but can instead consume only the events that occurred after the latest backup. Thus, some of the Finder machines should periodically upload their index files to Amazon S3, where they can be picked up by new machines.

Integrity is our main concern here. While an indexer is writing to an index, Lucene does not guarantee that the files on disk are in a consistent state. If index files were uploaded at this point, the backup could end up corrupted. Thus we need to make sure that indexing pauses while the backup proceeds.

Our solution is to bake a backup mechanism directly into Lucene index directories and have the indexer in charge of each directory execute the backup procedure on commit. Since indexers are actors, they live on a single thread, which guarantees that indexing is blocked as long as the directory is being backed up. Thus we introduce a new trait:
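A minimal sketch of what this trait looks like (method names are illustrative):

```scala
// A Lucene directory that knows how to back itself up and restore itself.
trait BackedUpDirectory {
  def scheduleBackup(): Unit     // flag the directory as needing a backup
  def doBackup(): Boolean        // run the backup if one is pending; returns true if it ran
  def restoreFromBackup(): Unit  // fetch the latest archive and unpack it locally
}
```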

Here is a simple implementation with built-in compression (free IO helpers!):
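A sketch of such an implementation, using the JDK's zip streams for compression and leaving the storage primitives abstract (getLocalDirectory and backupPending are illustrative names):

```scala
import java.io._
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}
import java.util.concurrent.atomic.AtomicBoolean

trait ArchivedDirectory extends BackedUpDirectory {
  protected def getLocalDirectory(): File        // local directory holding the index files
  protected def getArchive(): File               // fetch the latest archive from storage
  protected def saveArchive(archive: File): Unit // push a new archive to storage

  private val backupPending = new AtomicBoolean(false)
  def scheduleBackup(): Unit = backupPending.set(true)

  def doBackup(): Boolean =
    if (backupPending.getAndSet(false)) {
      val dir = getLocalDirectory()
      val archive = File.createTempFile(dir.getName, ".zip")
      val out = new ZipOutputStream(new BufferedOutputStream(new FileOutputStream(archive)))
      try {
        dir.listFiles().filter(_.isFile).foreach { file =>
          out.putNextEntry(new ZipEntry(file.getName))
          val in = new BufferedInputStream(new FileInputStream(file))
          try Iterator.continually(in.read()).takeWhile(_ != -1).foreach(b => out.write(b))
          finally in.close()
          out.closeEntry()
        }
      } finally out.close()
      saveArchive(archive)
      archive.delete()
      true
    } else false

  def restoreFromBackup(): Unit = {
    val dir = getLocalDirectory()
    dir.mkdirs()
    val in = new ZipInputStream(new BufferedInputStream(new FileInputStream(getArchive())))
    try Iterator.continually(in.getNextEntry).takeWhile(_ != null).foreach { entry =>
      val out = new BufferedOutputStream(new FileOutputStream(new File(dir, entry.getName)))
      try Iterator.continually(in.read()).takeWhile(_ != -1).foreach(b => out.write(b))
      finally out.close()
    } finally in.close()
  }
}
```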

getArchive and saveArchive can be implemented using your storage of choice. We rely on an ObjectStore backed by S3. We can now extend Lucene's Directory abstraction and MMapDirectory class:
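A sketch against the Lucene 4.x API of the time, where MMapDirectory takes a java.io.File (IndexMMapDirectory is an illustrative name):

```scala
import java.io.File
import org.apache.lucene.store.{Directory, MMapDirectory}

// IndexDirectory marries Lucene's Directory with our backup capabilities.
trait IndexDirectory extends Directory with BackedUpDirectory

// Memory-mapped implementation; the concrete S3 plumbing (getArchive/saveArchive)
// is supplied by subclasses or a storage mixin, hence the class stays abstract.
abstract class IndexMMapDirectory(dir: File)
  extends MMapDirectory(dir) with IndexDirectory with ArchivedDirectory {
  protected def getLocalDirectory(): File = dir
}
```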

Each indexer is enriched with the following methods from our Indexer[T] trait:
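Something along these lines (onCommit is an illustrative name for our post-commit hook):

```scala
trait Indexer[T] {
  protected def indexDirectory: IndexDirectory

  // Called periodically by the scheduler; the actual work happens on the next commit.
  def backup(): Unit = indexDirectory.scheduleBackup()

  // Called at the end of every successful commit, on the indexer's single thread,
  // so no new documents can be written while the files are being archived.
  protected def onCommit(): Unit =
    if (indexDirectory.doBackup()) {
      // report the index size to analytics here (omitted)
    }
}
```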

We use Akka’s scheduler to call backup on every indexer periodically, and the backup is processed after the next successful commit. A different period can be set for every index, depending on how fast it is changing. On each backup, we actually report the size of each index to our analytics so that we can easily monitor its growth.
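The wiring might look like this, using Akka's schedule API with one period per indexer:

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import akka.actor.ActorSystem

def scheduleBackups(system: ActorSystem, indexers: Map[Indexer[_], FiniteDuration]): Unit =
  indexers.foreach { case (indexer, period) =>
    // Fires `backup` periodically; the backup itself runs after the next commit.
    system.scheduler.schedule(initialDelay = period, interval = period) {
      indexer.backup()
    }
  }
```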

Now, thanks to our BackedUpDirectory trait, restoring an index is very easy. When a Finder's Play! application starts and indexers are instantiated, we check for each index whether its directory is already present locally (the machine has simply been restarted and must catch up with the latest updates) or not (this is a new machine, so the directory should be restored from S3 before catching up from there). As mentioned in previous posts, we use Guice to manage dependencies at runtime, so each indexer provider relies on the following method to get its IndexDirectory:
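A sketch of that method (openDirectory stands in for the concrete IndexMMapDirectory factory):

```scala
import java.io.File

trait IndexDirectoryProvider {
  // Concrete factory for our MMapDirectory-based implementation (hypothetical name).
  protected def openDirectory(dir: File): IndexDirectory

  def getIndexDirectory(path: String): IndexDirectory = {
    val dir = new File(path)
    val alreadyOnDisk = dir.exists() // check before the directory gets created
    val indexDirectory = openDirectory(dir)
    if (!alreadyOnDisk) {
      // New machine: pull the latest backup from S3 before catching up from there.
      indexDirectory.restoreFromBackup()
    }
    // Otherwise the machine was simply restarted: index files are already on disk
    // and the indexer will catch up with the latest updates on its own.
    indexDirectory
  }
}
```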

That’s pretty much it. Within a few minutes, we can spin up new Finder machines that take care of themselves and quickly get Lucene fully up and running, just so that we can find your keeps!

We can index your stuff too, check out Kifi now!

Originally published at eng.kifi.com on February 18, 2014.
