Re-indexing and arbitrary ElasticSearch schema changes for Apache James!

Benoit Tellier
Linagora Engineering
Oct 28, 2017

Linagora, like most software companies, relies on Elastic Search as a search engine.

Useful as it is, Elastic Search brings additional complexity, namely around data consistency and indexing format changes.

We do not use Elastic Search as a primary database. Our primary database in OpenPaaS is MongoDB, and the Apache James server uses Cassandra. Of course, we should ensure the data is always the same in our search engine and in our database. But between the write to the main database and the indexing in Elastic Search, a handful of bad things can happen:

  • crash between both operations (server reboot for instance)
  • event reordering leading to potential lost updates
  • indexing failure in Elastic Search…
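To make the second point concrete, here is a toy sketch of a lost update caused by event reordering. It is not James code: the two dictionaries stand in for the primary database and the search index, and the event format is made up for the illustration.

```python
# Toy model (hypothetical, not James code): both stores apply events as
# blind overwrites, but the index receives them in a different order.
primary = {}
index = {}


def apply(store, event):
    """Apply a (doc_id, value) event by overwriting the stored value."""
    doc_id, value = event
    store[doc_id] = value


events = [("mail-1", "flags=unseen"), ("mail-1", "flags=seen")]

for event in events:            # the database sees the writes in order
    apply(primary, event)
for event in reversed(events):  # the indexer sees them reordered
    apply(index, event)

# Documents whose indexed value no longer matches the source of truth.
stale = {doc_id for doc_id in primary if index.get(doc_id) != primary[doc_id]}
print(stale)
```

The index ends up holding the older value, and nothing will correct it until the document is indexed again from the primary database.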

We thus need to be able to re-import our primary database, the source of truth, into Elastic Search, and to do so without downtime…

Our code needs to deal with several successive versions of the search logic ;-)

Our application is alive, so our code needs to evolve. To add new features, or to fix a bug, we might need to change the way we index our data in Elastic Search. Again, of course, with minimum downtime. This was needed, for instance, for the infamous “mail address search bug”, which has a dedicated article.

Both of these problems are similar and can be addressed with similar solutions. We therefore held a barcamp aimed at providing a proof of concept for re-indexing with the Apache James Server.

The first thing to do is to rely on aliases. An alias is a “meta-index” that points to one or more real indexes. Its target indexes can be changed very easily, which lets you switch the queried indexes without touching your application logic. Furthermore, it is advisable to maintain both a read alias and a write alias (this allows, for instance, indexing incoming emails into two indexes when the data schemas are the same).
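As an illustration, here is a minimal sketch of the request body that Elastic Search's `_aliases` endpoint accepts to move an alias from one index to another in a single call. The helper name is ours, and the alias and index names are only examples; James itself may drive this differently.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build the body of a `POST /_aliases` request that moves `alias`
    from `old_index` to `new_index` in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Example names only: point the read alias at a freshly built index.
body = alias_swap_actions("readAlias", "index1", "index2")
print(json.dumps(body, indent=2))
```

Because both actions travel in the same request, queries going through the alias never see a moment where it points to nothing.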

Each new schema version tends to be an improvement on the previous one, really!

Then, of course, we need version handling. For each schema version, we need to ship the code, provide tests, and make both the searching logic and the inserting logic easy to instantiate. We chose to make the schema version used by James configurable from configuration files. This aims at:

  • switching to the new index, and benefiting from the new schema, once re-indexing is done.
  • easily providing components for an upcoming schema version, which allows implementing migration steps.

A preliminary task is to create the new index corresponding to the new version. This can be done through the webadmin REST API.

That being done, one can implement the re-indexing task. This involves reading emails from Cassandra and indexing them again through a new alias, with a specific schema version. This can also be done through the webadmin REST API.
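To give an idea of what such a task boils down to, here is a hedged sketch that reads mails from a stand-in for the primary database and turns them, batch by batch, into payloads for Elastic Search's bulk API. The mail fields, the `to_v1` mapping and the batch size are assumptions made for the example; the real task reads from Cassandra and is triggered through webadmin.

```python
import json
from itertools import islice


def batches(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def bulk_payload(mails, index_name, to_document):
    """Build a newline-delimited bulk body: one action line and one
    document line per mail. Older Elastic Search versions also expect
    a document type, which is omitted here."""
    lines = []
    for mail in mails:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": mail["id"]}}))
        lines.append(json.dumps(to_document(mail)))
    return "\n".join(lines) + "\n"


# Stand-in for the rows read from the primary database (hypothetical fields).
primary = [
    {"id": "m1", "subject": "Hello"},
    {"id": "m2", "subject": "Re: Hello"},
    {"id": "m3", "subject": "Bye"},
]


def to_v1(mail):
    """Hypothetical schema version 1 mapping."""
    return {"subject": mail["subject"]}


payloads = [bulk_payload(chunk, "index2", to_v1) for chunk in batches(primary, 2)]
print(payloads[0])
```

Using the `index` action with an explicit `_id` means that running the task twice simply overwrites the same documents rather than duplicating them.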

Finally, we care about what happens during the re-indexing. We want to plug in our new index and keep it updated with incoming changes while the re-indexing runs. To do so, we can plug in a specific additional indexer, pointing to the new index and updating it according to the new schema version, while of course keeping the behavior on the previous index unchanged. This, too, can be done via webadmin.
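The idea of an additional indexer can be sketched as a simple fan-out. Everything below is hypothetical: the class names, the in-memory indexes and, above all, the two document shapes, which are invented for the example and are not the actual James schemas.

```python
class InMemoryIndex:
    """Stand-in for an Elastic Search index; stores documents by id."""

    def __init__(self, name, schema_version):
        self.name = name
        self.schema_version = schema_version
        self.docs = {}


def to_document(mail, schema_version):
    """Hypothetical mapping from a mail to an index document."""
    if schema_version == 1:
        return {"from": mail["from"]}
    if schema_version == 2:
        local, _, domain = mail["from"].partition("@")
        return {"from": {"address": mail["from"], "local": local, "domain": domain}}
    raise ValueError(f"unknown schema version: {schema_version}")


class FanOutIndexer:
    """Indexes each incoming mail into every registered index,
    using the schema version attached to that index."""

    def __init__(self, primary_index):
        self.targets = [primary_index]

    def register(self, extra_index):
        self.targets.append(extra_index)

    def index(self, mail):
        for target in self.targets:
            target.docs[mail["id"]] = to_document(mail, target.schema_version)


index1 = InMemoryIndex("index1", schema_version=1)
index2 = InMemoryIndex("index2", schema_version=2)

indexer = FanOutIndexer(index1)
indexer.index({"id": "m1", "from": "alice@example.org"})  # reaches index1 only
indexer.register(index2)                                  # plug in the new index
indexer.index({"id": "m2", "from": "bob@example.org"})    # reaches both indexes
```

In this sketch `m1` never reaches `index2`, which is exactly the gap the re-indexing task has to fill, while `m2` lands in both indexes with the document shape of each schema version.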

Here is a re-indexing scenario due to inconsistencies:

  • the admin uses a readAlias and a writeAlias on top of index1, which has version 1
  • the admin creates index2 with version 1
  • the admin registers index2 to writeAlias
  • the admin triggers a re-indexing from the Cassandra database to index2.
  • once done, the admin can switch readAlias and writeAlias to index2 only.
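The alias bookkeeping of this scenario can be checked with a small simulation. It only models alias membership, using the same `add` and `remove` action shape as Elastic Search's `_aliases` endpoint; how writes actually reach each index is left to the application, and the function name is ours.

```python
def apply_actions(aliases, actions):
    """Apply alias actions to a mapping of alias name -> set of index names
    and return the updated mapping, leaving the input untouched."""
    updated = {name: set(targets) for name, targets in aliases.items()}
    for action in actions:
        kind, spec = next(iter(action.items()))
        targets = updated.setdefault(spec["alias"], set())
        if kind == "add":
            targets.add(spec["index"])
        elif kind == "remove":
            targets.discard(spec["index"])
        else:
            raise ValueError(f"unsupported action: {kind}")
    return updated


# Starting point: both aliases sit on top of index1.
aliases = {"readAlias": {"index1"}, "writeAlias": {"index1"}}

# The admin registers index2 to writeAlias, then re-indexing runs.
during = apply_actions(aliases, [
    {"add": {"index": "index2", "alias": "writeAlias"}},
])

# Once re-indexing is done, both aliases move to index2 only.
after = apply_actions(during, [
    {"add": {"index": "index2", "alias": "readAlias"}},
    {"remove": {"index": "index1", "alias": "readAlias"}},
    {"remove": {"index": "index1", "alias": "writeAlias"}},
])
print(during, after)
```

At every step readAlias points to at least one index, which is what keeps search available during the whole operation.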

Here is a re-indexing scenario due to schema changes (similar document structure, but indexing changes):

  • the admin uses a readAlias and a writeAlias on top of index1, which has version 1
  • the admin creates index2 with version 2
  • the admin registers index2 to writeAlias
  • the admin can then trigger an Elastic Search based re-indexing from index1, as its documents have the same structure and are considered consistent.
  • once done, the admin can switch readAlias and writeAlias to index2 only. The admin can also update the James configuration, to reflect that it is now using a new schema version.
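For the index-to-index step, one candidate mechanism is Elastic Search's own `_reindex` endpoint; the article does not say which mechanism the proof of concept uses, so take this as a sketch. The base URL is an assumption, and the request is only built here, not sent.

```python
import json
import urllib.request


def reindex_request(base_url, source_index, dest_index):
    """Build (without sending) a `POST /_reindex` request that copies
    every document of `source_index` into `dest_index`."""
    body = {"source": {"index": source_index}, "dest": {"index": dest_index}}
    return urllib.request.Request(
        url=f"{base_url}/_reindex",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local node; pass the request to urllib.request.urlopen to run it.
request = reindex_request("http://localhost:9200", "index1", "index2")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Since the documents are copied as they are and analyzed with the mappings of the destination index, this only fits the case described above, where the document structure is unchanged and index1 is trusted to be consistent.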

Here is a re-indexing scenario for document structure changes:

  • the admin uses a readAlias and a writeAlias on top of index1, which has version 1
  • the admin creates index2 with version 2
  • the admin registers an additional indexer of version 2 pointing to index2.
  • the admin triggers a re-indexing from the Cassandra database to index2.
  • once done, the admin updates the configuration file so that James uses schema version 2, along with a readAlias and a writeAlias specific to index2.
  • the admin reboots James so that the new query logic is picked up.

As you can see, this barcamp provided a proof of concept for all the building blocks of a flexible re-indexing solution. Combined with our Cassandra migration solution, it now allows us to arbitrarily remodel James storage without losing our data.

You can retrieve this work on GitHub.

You can view our demo on YouTube (upcoming!).
