How to manage indexing with dynamic time-series data in Elasticsearch?
Elasticsearch is a well-known database for searching large amounts of data. It performs blazingly well with search queries, such as full-text search and term-based queries, compared to regular databases. However, maintenance can be a nightmare if you don’t handle dynamic time-series data carefully.
Why?
First off, let’s talk about what we mean by dynamic time-series data. Simply put, if you update your data at any moment after creating it, it’s dynamic data.
As with other databases, you will eventually want a mechanism to delete old data, since resources are finite. This is problematic if you have dynamic data: if you use multiple indexes, data duplication can cause you headaches; if you use a single index, you have to set up a way to delete old documents yourself.
In this article, together with my colleague Fırat Feroğlu, we are going to talk about the problem dynamic data caused us in production, how we solved it, and how we automated the deletion of old documents without any duplication or performance problems.
Table of Contents
∘ Our Story
∘ The problem of Elastic Index Lifecycle Management
∘ The permanent solution: Delete by query
∘ Automated document deletion with CI/CD Integration
∘ Performance metrics
Our Story
To better explain the problem, let’s first talk about why and how we use Elasticsearch.
As the Delivery Core team, one of our responsibilities is to provide a way to visualize how many cargoes are in transit, delivered, canceled, and so on. The data also needs to be grouped by cargo firm, so that it is possible to see, for example, how many FedEx cargoes are delivered each day.
With the help of these dashboards, the warehouse team adjusts the cargo firms’ capacities. For example, if one cargo firm is not performing well, we don’t overload it with new cargoes; instead, we reduce its capacity and direct new cargoes to other cargo firms. This lets us manage cargo deliveries more efficiently.
Every day, more than a million cargoes are created, transferred, or delivered. To be able to visualize them quickly, we use Elasticsearch.
Below you can see a simplified cargo document we store in Elasticsearch.
{
  "shipmentId": 123,
  "status": "DELIVERED",
  "cargoFirm": "UPS"
}
We create Kibana dashboards using these documents, as simplified below.
As you can see, we group all the cargo documents by their statuses and cargo firms.
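Conceptually, such a dashboard boils down to a terms aggregation on status with a sub-aggregation on cargo firm. Below is a rough sketch of that kind of query, not the exact dashboard query; the index name cargo-index is a placeholder, and it assumes status and cargoFirm are mapped as keyword fields.

GET cargo-index/_search
{
  "size": 0,
  "aggs": {
    "by_status": {
      "terms": { "field": "status" },
      "aggs": {
        "by_cargo_firm": {
          "terms": { "field": "cargoFirm" }
        }
      }
    }
  }
}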
Notice that the status of a document gets updated over time. We update the status to TRANSFERRED when the cargo is shipped, or to CANCELLED when it is canceled by the user.
It’s also important to mention that we use a SQL database as the primary store and feed Elasticsearch whenever cargoes are updated. So Elasticsearch is not the main database.
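As a minimal sketch of what such a feed can look like, assuming a hypothetical index name cargo-index and the shipmentId used as the document ID (so every update overwrites the same document instead of creating a new one):

PUT cargo-index/_doc/123
{
  "shipmentId": 123,
  "status": "TRANSFERRED",
  "cargoFirm": "UPS"
}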
Alright, now that we’ve covered our need and usage of Elasticsearch, we can talk about how deleting old documents has caused us trouble.
The problem of Elastic Index Lifecycle Management
Index Lifecycle Management (ILM) is a feature of Elasticsearch that can manage index lifecycles. You can automate the creation, archiving, and deletion of indexes.
You can set document, size, or age limits for indexes. For example, once an index holds more than 10 million documents, grows beyond 1 GB, or becomes more than 30 days old, it changes phase and becomes read-only. At the same time, a new active (read/write) index is created.
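For reference, a rough sketch of what such a policy can look like (the policy name cargo-policy is hypothetical, and the exact rollover fields, for example max_size versus max_primary_shard_size, depend on your Elasticsearch version):

PUT _ilm/policy/cargo-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_docs": 10000000,
            "max_size": "1gb",
            "max_age": "30d"
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}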
Our requirement was to keep the last 90 days of data, so we set up ILM to change phase after that period. When the rollover happened, it turned out to be a point of no return for us. There were two main problems:
- The new index did not have much data yet, so the dashboards still needed to query both indexes. Therefore, we had to keep the old index until the new index had 3 months of data. Even though we only needed the last 3 months, we ended up storing up to 6 months of data.
- A bigger issue was that data duplication occurred: the same cargo was counted as both Transferred and Delivered.
Figuring out how the duplication occurred puzzled us a bit. Here is how it happens. Say we have a single index, A, which stores a cargo with the status Transferred. Then a new index, B, is created, and the cargo is delivered to the customer, so we update its status to Delivered. Since the active (write) index is now B, instead of updating the document in A, we put a Delivered document into B. We end up with a Transferred document in A and a Delivered document in B, so the cargo is counted as both Transferred and Delivered.
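To make the duplication concrete: a search across both indexes (the names index-a and index-b are placeholders) returns two documents for the same shipmentId, one Transferred from A and one Delivered from B.

GET index-a,index-b/_search
{
  "query": {
    "term": {
      "shipmentId": 123
    }
  }
}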
For a quick fix, we could have simply updated the dashboard queries, but there was no way in Kibana to express something like "search for this cargo, and if there are multiple documents, ignore the old ones and keep only the one with the latest status".
We solved it with the following steps:
- feeding Elasticsearch with the last 3 months of data (one way to do this is sketched after this list)
- updating the dashboards to query only the latest index
- disabling ILM
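One possible way to do the first step, if the old index still contains the documents you need, is a reindex with a range filter. A minimal sketch, with hypothetical index names old-cargo-index and cargo-index:

POST _reindex
{
  "source": {
    "index": "old-cargo-index",
    "query": {
      "range": {
        "@timestamp": { "gte": "now-90d/d" }
      }
    }
  },
  "dest": {
    "index": "cargo-index"
  }
}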
This solved our problem, but we needed a permanent solution. Our requirement was to keep the last 3 months of data, and it should be handled without regular monitoring or manual intervention. Our research started here.
Unfortunately, there was not much information on this issue. The first thing we came across was Elasticsearch Curator. It has extended features for index management, yet it could not delete data inside an index.
The permanent solution: Delete by query
When we came across the Delete By Query API, we got excited, though we questioned the potential performance problems of deleting more than a million documents in a single query. Fortunately, it turned out to be a good solution.
Here is how we use it.
Say we have an index named my-index. If we want to get the number of documents older than 90 days we run the query below.
GET my-index/_count
{
  "query": {
    "range": {
      "@timestamp": {
        "lt": "now-90d/d"
      }
    }
  }
}
The response,
{ "count" : 1260000, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 }}
Let’s delete the old documents with delete by query.
POST my-index/_delete_by_query?conflicts=proceed&slices=5&refresh
{
  "query": {
    "range": {
      "@timestamp": {
        "lt": "now-90d/d"
      }
    }
  }
}
We use three parameters: conflicts, slices, and refresh.
- conflicts: a conflict can occur if the same document is updated and deleted at the same time. The default is abort, which cancels the execution of the query; proceed lets the query continue, skipping the conflicting document instead of deleting it.
- slices: splits the job into parallel tasks. You can manually set it to the number of shards, or set it to auto and let Elasticsearch choose.
- refresh: refreshes the affected shards once the request completes, so the deleted documents no longer show up in searches run immediately after the query.
In addition, there is the wait_for_completion parameter. We don’t use it, but setting it to false is useful if you want to run the delete query asynchronously.
The simplified response,
{ "took" : 72246, "timed_out" : false, "total" : 1260000, "deleted" : 1260000}
It took 72.2 seconds to delete 1.26 million documents. timed_out shows whether the request timed out. Even if your request times out, Elasticsearch will keep deleting the documents in the background.
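If you run the query asynchronously with wait_for_completion=false as mentioned above, Elasticsearch returns a task ID instead of this response, and you can check or cancel the deletion through the task management API. A minimal sketch; <task_id> stands for the value returned in the task field of the first response:

POST my-index/_delete_by_query?conflicts=proceed&slices=5&wait_for_completion=false
{
  "query": {
    "range": {
      "@timestamp": {
        "lt": "now-90d/d"
      }
    }
  }
}

GET _tasks?detailed=true&actions=*/delete/byquery

GET _tasks/<task_id>

POST _tasks/<task_id>/_cancel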
For more on request and response parameters, you can check out the official documentation.
Automated document deletion with CI/CD Integration
So far, we can successfully delete the documents, but we don’t want to run the query manually. That’s why we schedule it using GitLab. You could also write a dedicated scheduler service, but since a GitLab pipeline schedule is quite effortless, we preferred that. You can adapt the code below to your own CI/CD tool.
We created a script that consists of the queries we’ve discussed, and made it reusable for other indexes, and even other Elasticsearch instances, through parameters.
The script prints the variables and the number of old documents, then deletes them and prints the result.
#!/usr/bin/env bash
set -e

echo "ELASTICSEARCH URL: $ELASTICSEARCH_URL"
echo "INDEX: $ELASTICSEARCH_INDEX"
echo "OLDER THAN: $DELETE_DAYS_OLDER_THAN"
echo "SLICES: $SLICES"

echo "Matching document count:"
curl -s -X GET "$ELASTICSEARCH_URL/$ELASTICSEARCH_INDEX/_count" -H 'Content-Type: application/json' -d '{
  "query": {
    "range": {
      "@timestamp": {
        "lte": "now-'${DELETE_DAYS_OLDER_THAN}'d/d"
      }
    }
  }
}'

echo "Running delete query:"
curl -s -X POST "$ELASTICSEARCH_URL/$ELASTICSEARCH_INDEX/_delete_by_query?pretty&conflicts=proceed&slices=$SLICES&refresh" -H 'Content-Type: application/json' -d '{
  "query": {
    "range": {
      "@timestamp": {
        "lte": "now-'${DELETE_DAYS_OLDER_THAN}'d/d"
      }
    }
  }
}'

echo "Script completed."
Here’s the pipeline definition,
stages:
  - delete

delete:
  stage: delete
  image:
    name: path-to-image/centos:centos7
  only:
    - schedules
  variables:
    ELASTICSEARCH_URL: http://localhost:9200
    ELASTICSEARCH_INDEX: test
    DELETE_DAYS_OLDER_THAN: 365
    SLICES: 1
  script:
    - bash delete.sh
It has a single stage, named delete. It takes parameters from the GitLab scheduler (otherwise it uses the defaults) and runs the script on a CentOS image.
On the GitLab scheduler page, we set the cron period to 16:00 and define the variables.
Thanks to this simple structure, we can adjust the age threshold, the index, and the Elasticsearch instance without any code changes. We can also define multiple scheduled jobs for multiple indexes.
Performance metrics
The main concern we had with this solution was whether deleting more than a million documents at a time would cause a performance hit. We could not find any resources on the web about this, but fortunately it went better than expected, and we’re happy to share the metrics.
Our scheduled job starts around 3:05 PM. You can see the effect in the metrics below. We use the monitoring pages of Kibana and Grafana for the metrics.
Throughout the article, we’ve talked about the potential indexing problems with dynamic data, how to solve them using delete by query, how to automate it with CI/CD, and the performance effects of the solution.
Special thanks to Ahmet Dikici and Bilal Çalışkan from the DevOps team and our team for collaboration and support at all stages of the process.
Thank you for reading.