Experience working with 600+TB ElasticSearch cluster

Guillaume Dauvin
Published in Botify Labs
Oct 21, 2020

One of our constant challenges at Botify is the size of the datasets we are analyzing. Our biggest customers ask us to analyze websites that sometimes have hundreds of millions of different pages! When analyzing relations between pages on websites of this magnitude, the graphs can get immensely large, to the tune of tens of billions of entries. In 2014, when we were exploring new ways to store and query our customers’ website metrics, we came upon ElasticSearch and its ability to reliably store and query huge datasets at scale.

Between 2015 and mid-2018 we invested heavily in our ElasticSearch cluster since it was our primary data store. Today, this cluster contains over 120 billion documents, each made up of more than 450 fields and representing a whopping total of 670TB of data (not counting replicas) on the disks of 69 physical machines. It has served us well and has allowed Botify to help happy customers from all over the globe improve the visibility of their websites on search engines.

As we are now sunsetting this ElasticSearch cluster in favor of a new data store, we wanted to take a look back at these last 5 years and share what we learned with the community. Here are our most important lessons from running a large ES cluster.

This article is not meant as a guide on how to run a large ES cluster; instead we want to share our experience with our particular uses and how we solved our very Botify-specific problems.

A glimpse of Botify

Historically, Botify started with logs ingestion and processing, analyzing our customers’ HTTP logs to trend visits from bots and humans over time.

We started using ElasticSearch in 2014 when we saw there was a clear market interest for our new product: SiteCrawler. It crawls every single page of our customers’ websites and computes various SEO metrics for each page. There are two facets to analyzing this data: the raw metrics (e.g. “how long did a certain page take to display?”) and the trends (e.g. “is my website slower today than yesterday?”). For the raw metrics, our customers are mostly interested in the latest metrics from the latest crawl. This means that our goal is to crawl often, with an active (or hot) lifespan for the data that stays relatively short (around 2 months). However, for trended data, our customers like to be able to look at data points ranging from days to years.

Botify’s ElasticSearch Cluster

Our cluster is composed of 69 servers:

  • 3 client/master nodes
  • 24 hot nodes
  • 42 cold nodes

It’s on version 5.3.1, running with OpenJDK on Ubuntu 16.04.

This cluster is queried to display a real-time dashboard accessible by our customers, so we expect responses within seconds at most.

ElasticSearch cluster architecture

Lesson #1: Make lots of indices (but not too many)

Like everyone building an ElasticSearch cluster, we started with only one index (think of an index as a ‘database’ in a relational database: it’s a logical namespace which maps to one or more primary shards and can have zero or more replica shards). After about a year, our single index had grown to around 3 TB so we created a new one. A few months later, that one got too big and we had to create a third index — and so on, and so forth. The company - and our customer base - continued to grow. Around index number 20, we were creating a new index manually twice per month.

That’s a lot of manual work, a little more than we at Botify like to do on a regular basis, so we started looking for ways to automate it.

We started brainstorming: what if we shard a lot? What if we create an index every day? Is there a smarter way to do it, knowing our ingestion volume and rates?

This trial and error allowed us to settle on a clear strategy for our indices: since our data is time-series in nature, we decided to create new indices every month. There are two constraints to this method:

  • making a lot of indices increases RAM usage, so there is a point of diminishing returns where it becomes counter-productive to create thousands of them
  • a big index is slow to move or operate on when you need to

We found our sweet spot to be at 6 new indices per month, and we really did not want to create indices manually all the time.

So we created a script that would, every month, create six indices, named index_YYMM_0[1–6]. When we need to allocate data for a customer crawl, we choose an index randomly and save the chosen index for that crawl in our database. Later, when querying the data, we can easily route the query to the right index directly. Using this method, we now have around 300 indices in our cluster and it works well for us.
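
For illustration, a minimal version of that monthly script could look like the sketch below (names follow the convention above; mappings, settings and error handling are omitted):

#!/bin/bash
# Create the six indices for the current month: index_YYMM_01 .. index_YYMM_06.
MONTH=$(date +%y%m)
for i in 01 02 03 04 05 06; do
  curl -XPUT "localhost:9200/index_${MONTH}_${i}"
done

# When a crawl is allocated, pick one of the six indices at random and persist
# the choice alongside the crawl so later queries can be routed straight to it.
CRAWL_INDEX="index_${MONTH}_0$(( (RANDOM % 6) + 1 ))"
echo "allocating crawl to ${CRAWL_INDEX}"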

Today this is a bit of a moot point. The correct way to implement this on a modern ElasticSearch cluster is using Index Lifecycle Management, which does what we described, but automatically. However this wasn’t available at the time, and we had to figure it out for ourselves.
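
For reference, on a recent cluster the equivalent rotation can be expressed as an ILM policy along these lines (the policy name and thresholds are illustrative, and rollover additionally requires a write alias on the index, which we leave out here):

curl -XPUT "localhost:9200/_ilm/policy/crawl-data" -H 'Content-Type: application/json' -d '{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "30d" }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}'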

Lesson #2: Shards, Shards, Shards

Since there’s no limit to the size of an index, you can easily end up with an outlier index that consumes all the space on a server, forcing you to buy extra servers with larger disks, often at a much higher cost. Or you can simply shard, since each index can be divided into shards.

Sharding lets us split indices horizontally, allowing us to keep buying the same hardware, making sure all our nodes have the same capacity (in terms of CPU/RAM/disk), and letting the cluster scale horizontally.

More importantly, shards allow operations to be distributed between nodes. About 90% of our page views pertain to the latest data, since customers mostly want to see their latest crawl. Sharding lets our latest indices live not on any single server but across multiple servers, load-balancing the most requested data between machines.
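
The _cat API makes it easy to check how this spreads out in practice; for one of our monthly indices (the name here is just an example), it lists every shard with its size and the node holding it:

# Where does each shard of this index live, and how big is it?
curl "localhost:9200/_cat/shards/index_2010_03?v&h=index,shard,prirep,store,node"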

Lesson #3: Plan your cluster depending on shard+replica numbers

When we started planning the cluster we are running today (our fifth iteration), we tried to incorporate these lessons we had learned along the way. Once we had figured out our rhythm for creating indices and shards, we knew we could design the hardware stack with our software’s specific features in mind.

We decided to create six indices per month, each index split into 12 shards, and we wanted to keep our hardware as simple as possible, while maintaining high availability and resiliency. Since we also wanted to optimize for data availability using replicas, that meant that each index was made of 24 shards (12 primary and 12 replica). So in our quest to keep things simple, we bought 24 servers:

  • one shard per server per index. This means that losing a single server does not make the cluster turn red (only yellow) thanks to redundancy, enabling high availability.
  • we avoid the pitfall of ending up with one super hot server holding half our indices just because the automatic shard placement of ES said: “hey, why not?”.
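
Concretely, the settings behind each index in this layout boil down to 12 primaries with one replica each, giving the 24 shard copies that map one-to-one onto our 24 servers (a sketch; the field mappings are omitted):

curl -XPUT "localhost:9200/index_2010_01" -d '{
  "settings": {
    "number_of_shards": 12,
    "number_of_replicas": 1
  }
}'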

Speaking of ElasticSearch’s automatic shard placement, here’s Lesson #4:

Lesson #4: Deactivate automatic placement of shards

By default, ES moves shards depending on the space left on a node. But since our indices are roughly the same size, we deactivated this and created a script that partitions shards for us, following the pattern we laid out in the previous paragraph. It is this script which forces each shard of each index to be located on one server only. Once it was in place, we just had to disable automatic placement with this request:

curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": "0",
    "cluster.routing.allocation.node_concurrent_recoveries": "0"
  }
}'
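
The placement script itself is specific to our setup; one standard building block for this kind of explicit placement is the reroute API, which moves a single shard onto a chosen node (the index, shard number and node names below are illustrative):

curl -XPOST localhost:9200/_cluster/reroute -d '{
  "commands": [
    {
      "move": {
        "index": "index_2010_01",
        "shard": 0,
        "from_node": "hot-node-03",
        "to_node": "hot-node-07"
      }
    }
  ]
}'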

Lesson #5: Bare-Metal vs Managed

We first built our initial cluster on AWS, but we quickly realized that the cost would be too high. Storing 630TB of data on EBS, even at the cheapest possible tier, would cost $16k/month — and that’s just in EBS costs, not including any machine costs! By our estimations at the time, running on bare-metal would cut that cost at least in half.
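
As a rough sanity check on that number: 630TB is about 645,000GB, and at roughly $0.025 per GB-month for the cheapest cold-HDD EBS tier at the time, that works out to around $16k/month for storage alone.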

We therefore decided to move the cluster to bare-metal servers. At the time, our benchmarks showed that the best price-to-performance ratio for bare-metal servers was at OVH. There are a few drawbacks to be aware of when making these kinds of build-or-buy decisions:

  • As discussed in Lesson #8 below, bare-metal means monitoring machines at a deeper level
  • Backups and data security have to be handled manually
  • If your customer-facing servers are not in the same cloud (like ours which are on AWS), you end up incurring extra costs: extra for the inter-cloud bandwidth, and extra milliseconds of latency for egress from OVH and ingress at AWS. Be prepared.

Lesson #6: Prevent shard rebalancing in case of failure

For whatever reason (network, power, etc.), it may happen that one of our servers loses connectivity for an indeterminate amount of time. Since all of our shards have replicas, losing a single server has no effect on our cluster.

However, if shard relocation is allowed, ES will start moving shards around to other nodes. This is particularly painful with our heavy shards, which may take hours to move. Once rebalancing is blocked, when a server comes back it discovers the data on its disk and just reingests it. This operation takes two hours tops for our largest nodes (with 25TB of data), versus a day to move everything back in place.

curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}'
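
The counterpart, once the node has rejoined the cluster, is to turn allocation back on so that its on-disk shard copies are recovered:

curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}'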

Lesson #7: Move cold data to less powerful servers

As our cold data is rarely accessed, we decided to move it to cheaper servers with smaller CPUs, but with larger capacities and less redundancy. These cold nodes allow for a drastic reduction in price for our cluster, while sacrificing very little speed and redundancy.

For our hot nodes, we use servers with 7.3TB of disk after RAID5 (so around 6TB usable, since we don’t want to go above 80% disk usage). For our cold nodes, we use servers with 29.6TB of disk after RAID5 (again, with the 80% limit, around 24TB usable).

This means that each cold node stores about the same amount of data as four hot nodes, but remains powerful enough to keep queries responsive. This optimization brings huge savings, since a cold node costs about as much as two hot nodes.
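
A hot/cold split like this one typically relies on standard shard allocation filtering: nodes are tagged with an attribute in elasticsearch.yml, and ageing indices are told to require the cold tier (the box_type attribute name is a common convention, and the index name below is illustrative):

# In elasticsearch.yml: node.attr.box_type: hot on hot nodes, cold on cold nodes.
# Then an ageing index can be told to relocate onto the cold tier:
curl -XPUT "localhost:9200/index_1910_02/_settings" -d '{
  "index.routing.allocation.require.box_type": "cold"
}'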

Lesson #8: Monitoring!

Since we just talked about moving data to less powerful servers, monitoring our cluster is paramount to ensure it stays in tip-top shape no matter the hardware or the customer load.

Our monitoring using Grafana (760TB is with replicas, 670TB without)

Here is our Grafana dashboard for monitoring our ES cluster. Our most important metrics are:

  • JVM heap: All our servers have 64GB of RAM, so we set the JVM heap to 32GB. We noted that our servers are far more prone to node failure when heap usage goes above 90%, so we add nodes when needed to scale horizontally.
  • Load: a classic move here, we throttle ingestion speed if we reach a load of 1.5, because we’ve seen that above this threshold query speed drops drastically
  • GET/mGET: to stay on top of response times served to customers
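
Most of these numbers can be pulled directly from the cluster itself, which is handy for quick checks outside of Grafana:

# Quick look at heap usage and load per node.
curl "localhost:9200/_cat/nodes?v&h=name,heap.percent,load_1m"

# Full JVM and OS stats (heap_used_percent, load averages) for dashboarding.
curl "localhost:9200/_nodes/stats/jvm,os?pretty"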

Conclusion

Before moving to this new cluster, we used to have between 2 and 3 failures per month. On the new cluster, the first six months were rough, but once we had worked through all the quirks exposed above, we stopped having issues, which allowed our infrastructure team to focus on other subjects.

Overall, we feel that taming the finer points of ElasticSearch was a hard battle, but once we had everything set up and rolling, it quickly stopped being a pain point in our stack.

We never felt the need to move to the newest ElasticSearch versions, but we’ve followed them closely, and it seems they have removed a lot of the pain of scaling ES. A lot of what we had to do on our cluster would not be necessary today, with newer versions of ElasticSearch accounting for these pitfalls.

Today we are sunsetting our ElasticSearch cluster. It’s had a great run and has allowed us to grow, but we’ve closely analyzed our data needs and decided that other datastores are a better choice for our data as it stands today and for Botify’s plans for 2021. We’re planning more blog posts on how we decommissioned 600TB+ of data on our main datastore, and what we are moving to and why. Stay tuned, more on that in a further article!

Interested in joining us? We’re hiring! Don’t hesitate to send us a resume if there are no open positions that match your skills, we are always on the lookout for passionate people.
