How We Boosted Our Elasticsearch Performance by Blasting Our Indexes Scope

Niv Rabin
CyberArk Engineering
7 min read · Nov 29, 2022

Elasticsearch is a distributed search and analytics engine built on Apache Lucene and designed to store huge amounts of data for near real-time search and analysis purposes.

An Elasticsearch index is a collection of documents that are related to each other, like a table in a relational database.

At CyberArk, we use Elasticsearch to store and query our tenants’ cloud accounts data, constructed from their latest cloud resources on each account.

The way we used to index our tenants’ data in Elasticsearch caused performance issues early in development (luckily) and turned out not to scale well.

So, in order to benefit from Elasticsearch’s strengths and scalability, we had to adopt a new approach.

Here’s how we did it…

Our (Old) Account Indexing Process

Our tenants’ data is processed in parallel batches several times a day.
Each batch gathers all of the tenant’s accounts and triggers another parallel batch to process each account’s data.

One of the processes is to index the account’s resources to Elasticsearch.

Here’s the flow:

  1. Gather and enrich the account’s latest resources.
  2. Index the account’s resources to a new index.
  3. Add a ”latest” alias to the new index.
  4. Delete the account’s old indexes.
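The sketch below shows roughly what one iteration of this flow could have looked like with the Python Elasticsearch client. The index naming scheme, endpoint and helper function are illustrative, not our actual production code:

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # illustrative endpoint

def index_account_snapshot(tenant_id: str, account_id: str, resources: list[dict]) -> None:
    # Steps 1-2: index the account's latest resources into a brand-new index.
    new_index = f"{tenant_id}-{account_id}-{int(datetime.now(timezone.utc).timestamp())}"
    helpers.bulk(es, ({"_index": new_index, "_source": r} for r in resources))

    # Step 3: point the "latest" alias at the new index.
    es.indices.put_alias(index=new_index, name=f"{tenant_id}-{account_id}-latest")

    # Step 4: delete the account's old indexes (deleting an index drops its aliases too).
    old_indexes = [i for i in es.indices.get(index=f"{tenant_id}-{account_id}-*") if i != new_index]
    if old_indexes:
        es.indices.delete(index=",".join(old_indexes))
```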

The Pros of This Indexing Process

  • The flow is fairly atomic and immutable.
  • We don’t need to track what changed in which resource.
  • The risk of the data nodes running out of disk is reduced.
  • It’s fairly easy to maintain (for us developers, not for Elasticsearch).

The Cons of This Indexing Process

  • This indexing process runs on all of our tenants’ accounts in parallel.
  • Because the number of tenant accounts is unbounded, this indexing process overwhelmed our single Elasticsearch cluster, which couldn’t handle the number of I/O operations required.
  • We reached 100% JVM memory pressure and hit parent circuit breaker exceptions.
Master node JVM memory pressure spike

The Challenge: Identifying the Root of the Problem

Elasticsearch cluster and index diagrams by Codingexplained.com

Elasticsearch is built for searching tons of data efficiently.
Swapping indexes in bursts might be convenient for us developers, but that’s not how Elasticsearch is designed to be worked with.

The Elasticsearch cluster and index diagrams above show that an index is meant to hold a lot of data.
Encountering such performance issues and overwhelming our cluster before we had even started to scale was a sign that we should reconsider how we handle our data in Elasticsearch, and how we manage our indexes in particular.

More specifically, since most of our tenants have only a few accounts, each with a small number of resources, we can widen the current index’s scope from one account per index to multiple accounts per index, or more generally, an index per tenant.

Multiple Accounts Index Lifecycle

So if creating and deleting indexes in bursts causes performance issues, how about this flow?

  1. Gather and enrich the account’s latest resources.
  2. Index the account’s resources to the tenant’s index.
  3. ~~Add a ”latest” alias to this index’s name.~~
  4. ~~Delete the account’s old indexes.~~

If only it was that easy…

What About Deltas Between Batches?

As mentioned earlier, our tenants’ data is processed in batches several times a day.
Between those batches, the account’s resources might change: some might be assigned a new role, new resources might be added, and others might be deleted.

Our Elasticsearch indexes should only contain the latest state of the tenants’ data.

With that challenge in mind and an index-per-tenant approach in our hearts, we set off to find the solution that would help us achieve this goal.

Here are the solutions we went through until we found the one that worked for us.

Solution #1 — Mark Resources for Deletion

The flow for indexing the account’s resources to Elasticsearch:

  1. Mark the current account’s resources for deletion.
  2. Gather and enrich the account’s latest resources.
  3. Index the account’s resources to the tenant’s index.
    — Potential risk: A failure after this step leaves resources that are marked for deletion; this can be handled when searching for resources, though.
  4. Delete marked resources from the tenant’s index.
    — Potential risk: A failure after this step leaves resources stored that should have been deleted.
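For illustration, here is a minimal sketch of how this mark-and-sweep flow could look with the Python Elasticsearch client, assuming a `marked_for_deletion` flag, an `account_id` field and a stable `id` per resource (all three names are made up for this example):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # illustrative endpoint

def reindex_account(tenant_index: str, account_id: str, latest_resources: list[dict]) -> None:
    # Step 1: mark the account's current documents for deletion.
    es.update_by_query(
        index=tenant_index,
        body={
            "script": {"source": "ctx._source.marked_for_deletion = true"},
            "query": {"term": {"account_id": account_id}},
        },
        conflicts="proceed",
    )

    # Steps 2-3: index the latest resources; re-indexed documents overwrite the
    # old ones (same _id) and clear the flag.
    helpers.bulk(
        es,
        (
            {"_index": tenant_index, "_id": r["id"], "_source": {**r, "marked_for_deletion": False}}
            for r in latest_resources
        ),
    )

    # Step 4: whatever is still marked no longer exists in the account, so delete it.
    es.delete_by_query(
        index=tenant_index,
        body={"query": {"bool": {"filter": [
            {"term": {"account_id": account_id}},
            {"term": {"marked_for_deletion": True}},
        ]}}},
        conflicts="proceed",
    )
```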

There are too many potential risks.

Solution #2 — Index Then Delete

Here we tried a different approach, splitting the process into two parts where the second part should run at the end of the batch after we’ve verified that nothing else failed.

  1. Index the account’s resources to Elasticsearch:
    — Gather and enrich the account’s latest resources.
    — Add the batch’s timestamp to each resource (this timestamp is also saved in the DB as a property of the account).
    — Index the account’s resources to the tenant’s index.
  2. Delete unnecessary resources.
    — Delete the account’s resources whose timestamp differs from the one stored in the DB for that account.
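A hedged sketch of that two-part flow, again with the Python Elasticsearch client (the `batch_ts`, `account_id` and `id` field names are illustrative):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # illustrative endpoint

def index_batch(tenant_index: str, resources: list[dict], batch_ts: str) -> None:
    # Part 1: stamp every resource with the batch timestamp and index it.
    helpers.bulk(
        es,
        (
            {"_index": tenant_index, "_id": r["id"], "_source": {**r, "batch_ts": batch_ts}}
            for r in resources
        ),
    )

def delete_stale(tenant_index: str, account_id: str, db_batch_ts: str) -> None:
    # Part 2 (end of the batch): delete the account's resources whose timestamp
    # differs from the one stored in the DB for that account.
    es.delete_by_query(
        index=tenant_index,
        body={"query": {"bool": {
            "filter": [{"term": {"account_id": account_id}}],
            "must_not": [{"term": {"batch_ts": db_batch_ts}}],
        }}},
        conflicts="proceed",
    )
```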

It felt quite atomic, but not hermetic.
There can also be a long gap between parts 1 and 2 during which a batch might fail.

We looked again for a better solution.

Solution #3 — The Chosen Solution: Forget Immutability, Upsert and Cleanup

The flow for indexing the account’s resources to Elasticsearch:

  1. Gather and enrich the account’s latest resources.
  2. Find resource deltas between the previous and current batch.
  3. Build and execute Elasticsearch bulk operations.

As opposed to the previous solutions, which index all of the resources every time, here the only time we index all of the resources is when an account is processed for the first time.
After that (unless all of the resources changed), we’re only dealing with the deltas.

But… How?

In order to identify what changed between batches, we need to compare the account’s resources state of the previous batch with the current batch.

After each resource is enriched, we hash the final resource object (thanks, Ariel Beck!)

Then we store a map with each resource Id as key and its corresponding hash as value (hashMap below), and an additional map with each resource Id as key and its corresponding Elasticsearch document Id as value (idsMap below):
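Something along these lines; the hashing scheme and the `resource_id` / `es_doc_id` field names are illustrative, the point is hashing the enriched resource and keeping the two maps:

```python
import hashlib
import json

def hash_resource(resource: dict) -> str:
    # Hash the fully enriched resource; sorting the keys keeps the hash deterministic.
    canonical = json.dumps(resource, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def build_batch_maps(resources: list[dict]) -> tuple[dict, dict]:
    # hashMap: resource Id -> hash of the enriched resource
    # idsMap:  resource Id -> Elasticsearch document Id (preserved for existing resources)
    hash_map = {r["resource_id"]: hash_resource(r) for r in resources}
    ids_map = {r["resource_id"]: r["es_doc_id"] for r in resources}
    return hash_map, ids_map
```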

Tip: Make sure to preserve the Elasticsearch doc Ids for existing resources.

Then we compare each resource Id in the current batch’s hashMap against the previous batch’s hashMap and determine the required action according to the following logic:
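A minimal sketch of that comparison (the function and variable names are ours for illustration):

```python
def diff_batches(prev_hash_map: dict, curr_hash_map: dict, prev_ids_map: dict):
    to_index, to_delete = [], []

    for resource_id, curr_hash in curr_hash_map.items():
        prev_hash = prev_hash_map.get(resource_id)
        if prev_hash is None:
            # Created: the resource didn't exist in the previous batch.
            to_index.append(resource_id)
        elif prev_hash != curr_hash:
            # Updated: same resource, different hash; the existing doc Id is reused.
            to_index.append(resource_id)
        # else: no action, the resource is unchanged.

    # Deleted: resources that existed in the previous batch but not in the current one.
    for resource_id in prev_hash_map.keys() - curr_hash_map.keys():
        to_delete.append((resource_id, prev_ids_map[resource_id]))

    return to_index, to_delete
```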

Here’s an example:

Comparing previous batch’s resources (left) with current batch’s resources (right)

No action: role/role1 and user/full-admin (unmarked) because their hashes are the same.
Updated: user/niv and group/Admins (notice that the doc id was preserved).
Deleted: user/Admin1, user/Admin2, user/Admin3.
Created: role/prod, group/group_to_remove_users_from.

Now that we figured out what operations need to be executed on our Elasticsearch index, it’s time to execute them…

Executing Elasticsearch Operations

We’re using the Elasticsearch _bulk API to execute our operations, which supports the following operation types:

  1. Index will always succeed (either create or replace the document).
  2. Create will only succeed if a document doesn’t exist.
  3. Update would only make sense if we knew which properties we were updating (we don’t; we always work with full documents).
  4. Delete will only be executed if we identify a document that existed on the previous map, but not on the current one.

We managed to narrow down those four to two operations:
Index and Delete.
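Building and executing those two operation types with the Python Elasticsearch bulk helper could look roughly like this (the action dictionaries follow the helpers.bulk format; the function and parameter names are illustrative):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # illustrative endpoint

def execute_bulk(tenant_index: str, to_index: list[tuple[str, dict]], to_delete: list[str]) -> None:
    actions = []

    # "index" creates the document if it's new and fully replaces it otherwise,
    # so it covers both the Created and Updated cases.
    for doc_id, source in to_index:
        actions.append({"_op_type": "index", "_index": tenant_index, "_id": doc_id, "_source": source})

    # "delete" removes resources that existed in the previous batch but not in this one.
    for doc_id in to_delete:
        actions.append({"_op_type": "delete", "_index": tenant_index, "_id": doc_id})

    helpers.bulk(es, actions)
```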

What We Achieved and Lessons Learned

So now that we are in harmony with our Elasticsearch cluster, there’s much more room to scale.

Our Master node spikes went from exponential to linear:

From exponential spikes graph to a linear one

In addition, thanks to our new hashing mechanism, we only perform I/O operations on an index when a resource actually changes, which reduced the number of I/O operations and the overall load on our cluster.

The solution that at first appeared the most difficult turned out to work the best and gave us the most control over the process.
So, whenever you encounter a challenge, don’t rule out an option only because it looks difficult or too complicated at first. It might end up being the one solution that’s the most beneficial.
