From Hackfest to Production: Integrating with Elasticsearch

Introduction

A little over a year ago, our Projects Ruby on Rails application provided search functionality that rebuilt its search index in a cron job every half hour. This meant that any creates and updates customers made to the database were not immediately available for search. At one of our semi-annual hackfests in 2017, a group of us decided to experiment with integrating near-real-time search, backed by Elasticsearch, into our platform.

After a successful hackfest prototype, we have since taken the extra step of integrating this into our Projects module. As a result, customers can now search records in Projects shortly after a record is updated. In this post, I will discuss some of the considerations and discoveries from integrating with Elasticsearch.

Integrating Projects with Elasticsearch

Elasticsearch is a real-time distributed search engine. Web applications like Projects communicate with a hosted Elasticsearch service via its web API. Every time a record is created or updated in the database, a corresponding POST call is made to the Elasticsearch endpoint in the Rails ActiveRecord after_save callback to update the corresponding search index.

Immediately after that POST updates the search index, the updated term is searchable via the Elasticsearch search API. Elasticsearch has a comprehensive Getting Started guide that covers everything from definitions and setup to configuration.

We evaluated the available Ruby gems for integrating with Elasticsearch, and decided to go with the Searchkick gem.

Searchkick provides some nice application-level features on top of the Elasticsearch APIs that Projects makes use of. To make a model's content searchable, one simply adds the searchkick method to the model class (e.g. Control), then calls

class Control < ActiveRecord::Base
  searchkick
...
end

Control.reindex

which sends off batches of Control records to be indexed by the Elasticsearch service. The endpoint of the Elasticsearch service can be configured in config/initializers/elasticsearch.rb with

ENV["ELASTICSEARCH_URL"] = "http://localhost:9200"

Multi Search

In Projects, you can perform a single search against all your models (e.g. objectives, risks, controls, findings) within an audit. The multi search feature in the Searchkick gem allows search requests against multiple targets to be batched into one call that returns one result set per query, eliminating extra network calls.
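As a sketch of the pattern (the query term here is illustrative): each search is built with execute: false, and the batch is then dispatched in a single round trip.

```ruby
# Build the queries without executing them, then send them together
objectives = Objective.search("revenue", execute: false)
controls   = Control.search("revenue", execute: false)

Searchkick.multi_search([objectives, controls])

# Each query object now holds its own results
objectives.results
```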

Reindex Without Downtime

Searchkick allows for reindexing without downtime. The set of fields we search on a model may change as new fields are added. Whether we rebuild all the indices with the Searchkick rake task or reindex a single model, it can be done without interruption to the search capability: Searchkick builds the new index alongside the old one and swaps an alias over when it finishes.
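For reference, this is what the per-model rake invocation looks like (the model name here is illustrative):

```shell
# Rebuilds the index alongside the old one, then swaps the alias over,
# so search stays available throughout
bundle exec rake searchkick:reindex CLASS=Control
```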

Bulk indexing

Searchkick allows the indexing operations within a block to be batched into bulk updates by wrapping the block as follows.

Searchkick.callbacks(:bulk) do
  # create or update many records here; index updates are sent
  # to Elasticsearch in one bulk call at the end of the block
end

This is useful in certain Projects features where we create a large number of records at once but would like to defer updating the Elasticsearch index until the end, with a single batch call, to minimize the number of network requests made.

Ruby Exception Handling

Since Elasticsearch indexing calls happen in an after_save callback, any exception thrown from the Elasticsearch API would fail the transaction and raise an error. This isn't ideal when the record itself can be saved and only the indexing call went wrong.

We decided to rescue any exception thrown from Searchkick during indexing and log the error in Airbrake, our production exception monitoring system. This way we can investigate any indexing failures that impact search functionality, while the customer can continue operating on the updated record.
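The pattern reduces to rescuing around the indexing call. Here is a self-contained sketch; IndexingGuard and the notifier lambda are illustrative names, not our production code:

```ruby
# Wraps an indexing call so failures are reported rather than raised:
# the surrounding save/transaction still succeeds.
class IndexingGuard
  def initialize(notifier)
    @notifier = notifier   # e.g. ->(e) { Airbrake.notify(e) }
  end

  # Returns true if the block ran cleanly, false if it raised.
  def safe_reindex
    yield
    true
  rescue StandardError => e
    @notifier.call(e)
    false
  end
end
```

In the real callback, the block body would be the Searchkick reindex call for the saved record.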

Production-Readiness

Getting Elasticsearch integrated with the application was just the first step; there were quite a few things we had to consider before taking Elasticsearch live on our system.

Reindexing Against Production Clones

We spun up clones of each of our databases and reindexed all the existing data into the Elasticsearch service using the Searchkick rake task, in order to identify performance issues and uncover invalid data.

Eager Loading the Associations

The reindexing operations happen in batches, by default 1,000 records at a time. In the reindexing log, we noticed that a model's associations were sometimes loaded multiple times (e.g. when reindexing the Control ActiveRecord model, the Objective each Control belongs to was loaded individually), so there were some N+1 query issues. We had to eager load the associations to prevent slow reindexing, using Searchkick's search_import scope:

scope :search_import, -> { includes(:objective) }

Reindexing Batch Size Adjustments

While reindexing certain models in production, we would occasionally see the error

Elasticsearch::Transport::Transport::Errors::RequestEntityTooLarge

It turns out that Amazon Elasticsearch Service, where our Elasticsearch clusters are hosted, imposes a limit on the size of each network request payload. The limit is determined by the EC2 instance type chosen to host the Elasticsearch service. We had to reduce the batch size for certain models in order to keep each bulk reindexing request within that limit.
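Searchkick lets the batch size be tuned per model. A minimal sketch, with an illustrative value:

```ruby
class Finding < ActiveRecord::Base
  # Smaller batches keep each bulk request under the instance's
  # network payload limit (the Searchkick default is 1000)
  searchkick batch_size: 200
end
```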

Invalid Text Data

One of the interesting things that came out of the dry run against production data was troubleshooting the error

Elasticsearch::Transport::Transport::Errors::GatewayTimeout

We had to dig through our data to find the offending records causing the reindexing exception. On examining a problematic record, we found that the test result being indexed contained an img tag with a base64-encoded PDF.

It turns out some customers had dropped PDF documents into the CKEditor plugin used in some of our text fields, which stores a long, unreadable base64 string in our backend. Elasticsearch returns an error response when asked to index such a string. We have since sanitized these strings out of the reindexing operations and now prevent such text from being entered by mistake.
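The sanitization step can be sketched as a simple substitution; the regex and method name here are assumptions for illustration, not our exact production code:

```ruby
# Strip embedded base64 data URIs (e.g. a PDF pasted into CKEditor)
# from text before it is sent to Elasticsearch for indexing.
BASE64_DATA_URI = %r{data:[\w/+.-]+;base64,[A-Za-z0-9+/=]+}

def strip_embedded_data(text)
  text.gsub(BASE64_DATA_URI, "")
end
```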

Cluster Sizing

Part of the production data reindexing work also gave us the benchmarks we needed to size the Elasticsearch instances properly. We used the Amazon Elasticsearch Service dashboard to see how much free memory and disk space remained after all the indices had been built in our test runs.

Encryption At Rest

At ACL, we take data security and privacy very seriously. When we first rolled out Elasticsearch back in October 2017, we self-hosted our Elasticsearch server instances on EC2 and added the proper encryption to satisfy our encryption-at-rest commitment to our customers.

Since then, Amazon Elasticsearch Service has announced support for encryption at rest, and we have migrated over.

On Amazon Elasticsearch Service, the underlying file systems on which all data is stored are encrypted, including the primary and replica indices and log files.

Data Availability

This is taken care of out of the box by Amazon Elasticsearch Service, which automatically detects node failures and replaces failed nodes. At a minimum, three nodes are allocated per region, allowing for two replicas in case one node goes down. When a node does go down, the service acquires new instances and redirects Elasticsearch requests and document updates to them.

Reindexing Operations in a Delayed Job

Recently, our monitoring and alerting system notified us of unusual slowness in record updates in one of our production regions. Upon investigation, our production support engineers found a temporary performance degradation of the Elasticsearch Service endpoint, which in turn slowed down the whole application, because Elasticsearch updates were made synchronously with customer browser requests.

With the initial rollout of Elasticsearch, our batch cloning operations already made their Elasticsearch calls in bulk and in a delayed job. But this was the first occurrence in almost a year in production where we observed that the indexing call for an individual record could face a slow response time.

As a further measure to make our system resilient to any degradation of the Elasticsearch Service, we have made the indexing call for individual records asynchronous as well, in a delayed job, so that any slowness in updating the search index does not impact the end-user experience.
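Searchkick supports asynchronous index callbacks directly; a minimal sketch, assuming a background job backend is configured for Active Job:

```ruby
class Control < ActiveRecord::Base
  # The after_save indexing call now enqueues a background job
  # instead of hitting Elasticsearch inline with the web request
  searchkick callbacks: :async
end
```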

Conclusion

Integrating Elasticsearch is a good example of how the ACL R&D team takes an impactful hackfest project right to production.

The project started with the team recognizing a customer pain point, not being able to search for content they had just added to our platform, and prototyping a good solution for it.

Taking it to production meant carefully planning out all the production-readiness tasks to make sure it becomes, and continues to be, a feature that stands up to our commitments to customers on confidentiality, data integrity, and availability.