Time-based indexing: give your data a lifespan

I have been using Elasticsearch for the past year in a number of projects. It got me thinking a lot about maintaining data, and about how I handle data now compared with the past.

In this post, I will talk about time-based indices, and how they can help keep your ES cluster in shape.

Common sense first:

We have different kinds of data, and we use them differently. We put data into different time scopes to make sense of it. We need sufficient data to be useful, but not all the data all the time.

Looking back on the past couple of years of my work, especially with logging and tracking, hardly any data created is immortal.

One recommended practice in microservices is giving each service its own isolated data store. We try very hard to keep services lean and light. Regular data housekeeping might be the easiest way to cut down response time.

These days data is exploding. We have many solutions for dealing with data, from Redshift to Redis. There is less and less reason to treat one data store as the single source of storage and build every service around it. Instead, we should fit the data store to the service for optimal performance, especially if throughput and response time are your key KPIs.

Shouldn’t we put a tailored time scope and TTL on top of our data store as well?

Time-based indices:

Giving your data a TTL is not as simple as it sounds in Elasticsearch; in fact, the TTL feature has been deprecated!

Delete operations are really expensive

The data we see in Elasticsearch is a logical view; the real search magic happens at the Lucene level. Every Elasticsearch shard is a Lucene index underneath, and the more complex your documents get, the more work Lucene has to do to index them.

So when deletion happens, documents are only marked as deleted, and the space is reclaimed when Lucene merges segments; a bulk delete (like TTL expiry) leaves a lot of dead documents behind. You could tune the merge settings to make Elasticsearch merge more frequently, but changing those defaults is not recommended and may come with bad side effects.

Here is a great blog post about TTL; understanding the reason behind the deprecation is helpful: https://www.elastic.co/blog/ttl-documents-shield-and-found.

Shards and data growth:

The benefits of time-based indices are not limited to data expiry and deletion.

Once your Elasticsearch cluster starts accepting data, it becomes LIVE.

Here comes the problem. Some or most of the time, we fail at estimating the scale of our data and at making sense of what we collected. It normally takes a few tries to get things right.

In Elasticsearch, you can decide the number of shards only when you create an index. As your cluster's data grows, this can lead to scalability problems. Reindexing can help, but we don't want to do it as a routine; moving tons of data is stressful.

You might have already heard of "over-sharding" to accommodate data growth. But this means you will pay the over-sharding tax from day one, even with very little data. And secondly, no index can take in data forever.

Let's first have a look at how Elasticsearch handles a query.

When a client makes a query, multiple threads query each shard of the index (or indices). The shard results are then merged and returned to the client as a single response. When you have a huge amount of data, you may want a beefy machine to vertically scale your ES. The number of shards decides how many threads or CPU cores the engine can utilise, and thus how much parallelism your query can achieve.

With time-based indices, a new index is created per time interval, so your shard count grows along with your data over time. You don't pay the over-sharding penalty up front, and as data volume grows you can move to a more powerful instance and enjoy more CPUs when needed.

Single index vs multiple indices

From the client's point of view, they don't need to know how your indices are structured; they only need to know an alias.

Once the indices are behind an alias, we can start creating them on whatever interval we want: weekly, monthly, or yearly, take your pick. We no longer need to assign a huge number of shards to each index, because queries go against the alias; the shard count grows as the data grows.
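Here is a minimal sketch of that setup in Python. The `logs` prefix, the `logs-read` alias name, and the monthly interval are illustrative assumptions, but the `{"actions": [{"add": ...}]}` request body is the standard shape of Elasticsearch's `_aliases` endpoint:

```python
from datetime import date

def monthly_index_name(prefix, day):
    """Index name for the month containing `day`, e.g. 'logs-2016.03'."""
    return f"{prefix}-{day:%Y.%m}"

def alias_actions(alias, index_names):
    """Build a request body for the Elasticsearch _aliases endpoint
    that puts every interval index behind one query alias."""
    return {"actions": [{"add": {"index": name, "alias": alias}}
                        for name in index_names]}

# Three monthly indices, all reachable through the 'logs-read' alias:
indices = [monthly_index_name("logs", date(2016, m, 1)) for m in (1, 2, 3)]
body = alias_actions("logs-read", indices)
# indices == ["logs-2016.01", "logs-2016.02", "logs-2016.03"]
```

You would POST `body` to `/_aliases`; clients then query `logs-read` and never need to know how many indices sit behind it.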

So how about writes?

Once you start looking at your data in a timely fashion, you can define some straightforward writing logic. Of course, it is very important to make sure the future index and its mapping are ready before indexing into it.

Take logging for example: new data comes in every second, and every document carries a timestamp that decides its target index. On a live service it is very unlikely we need to create a log entry for yesterday, so there is only one active write index at any given time, the one for NOW. As time goes by, the active write index moves forward, and past write indices become read-only, effectively immutable.
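The write-side routing is tiny. A sketch, assuming a monthly interval and an illustrative `logs-YYYY.MM` naming scheme:

```python
from datetime import datetime

def write_index(prefix, ts):
    """Pick the monthly index a document belongs to, from its timestamp."""
    return f"{prefix}-{ts:%Y.%m}"

# The single active write index is simply the one for "now";
# in production you would pass datetime.utcnow() instead.
active = write_index("logs", datetime(2016, 6, 15, 12, 30))
# active == "logs-2016.06"
```

Every indexing request goes to `write_index(prefix, now)`; the moment the month rolls over, writes move to the new index and the old one quietly becomes read-only.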

At the end of your data retention period, you can easily remove the expired indices from the alias and drop them.

Then what about restore and reindexing?

Bad things happen: say you lost your ES cluster as well as your backup… (ouch!) But luckily you have your Kinesis stream (or SQL Server) to replay all your data from the past X days/months. As long as you have the timestamp handy, you can index each document into the corresponding index, the same way you index based on datetime.now. The same technique also helps when you want to change your index configuration, e.g. switch the index interval from weekly to monthly.
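The key point is that replay routing uses the document's own timestamp, not the clock. A sketch, where the `@timestamp` field name and the monthly `logs-YYYY.MM` target scheme are illustrative assumptions:

```python
from datetime import datetime

def target_index(doc, prefix="logs"):
    """During a replay, route each document by its own timestamp,
    not by the current time."""
    ts = datetime.strptime(doc["@timestamp"], "%Y-%m-%dT%H:%M:%S")
    return f"{prefix}-{ts:%Y.%m}"

# Replaying a stream: documents from different weeks collapse into
# a monthly scheme, regardless of when the replay actually runs.
docs = [{"@timestamp": "2016-03-02T10:00:00"},
        {"@timestamp": "2016-03-28T09:15:00"},
        {"@timestamp": "2016-04-01T00:00:00"}]
targets = [target_index(d) for d in docs]
# targets == ["logs-2016.03", "logs-2016.03", "logs-2016.04"]
```

The same loop serves both disaster recovery and interval migration: point it at the old weekly indices (or the stream) and it rebuilds the monthly ones.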

As a side effect, here is another benefit: our data is sliced into smaller indices, and only one index is actively indexing data at a time. This makes moving large amounts of data more manageable and much less stressful in a production environment.

Similar logic can also fine-tune your query requests: you don't always need to query the alias. If you know the time period you are looking for, you can query a smaller set of indices.

Don't overuse it:

A large number of shards makes result merging expensive, which can mean poor performance if you are not careful. A large number of shards also means a larger number of shard-level queries, which could max out the search queue on your ES cluster.

So there are four factors we always need to consider:

1. volume of data

2. performance / response time

3. load of read and writes

4. cost of hosting

How much performance improvement can higher-spec hardware bring us, versus scaling out horizontally? (If you want to handle more requests, adding more nodes might be the easier way.)

The cost of merging responses from shards versus the gain in parallelism: depending on the volume of data and the shape of the query, the balance will vary.

So, there is no silver bullet; there is trial and error, as always.