Published in Kambr

How to fix error 429 on Elasticsearch: Data too large. A Beginner’s Mistake (Kambr Tech Story)

My journey with Elasticsearch and the ELK Stack started two years ago, in 2020. I had just joined Kambr, and a new team had just been set up to figure out how we would monitor our applications. At the time we only had one application and one client in production. This application was a monolith deployed in a Kubernetes cluster, which was our only Kubernetes cluster at the time. Inside this cluster we had a couple of containers, but only one we really wanted to monitor.

We didn’t know how to use the ELK Stack, and to be honest, we also didn’t know how to extract value out of it yet. I started looking for ways of deploying the whole stack, the best strategy to deploy it in Kubernetes, how to ship the logs from the pods, how to parse the logs, how to create dashboards and how to extract meaning out of all this information. All of this happened in a short period of time, so of course it couldn’t go right the first time.

One of the strategies we decided to apply was the default one for the creation of indices. The example configuration you get when you first deploy Logstash creates a new index for each day of the year.

We deployed Logstash in our only Kubernetes cluster and let all the logs be shipped to our Elasticsearch. Logstash would create a new index every day to store all the logs coming from this cluster, which meant 365 indices in one year.

As we grew, we started deploying new Kubernetes clusters and installing Logstash in each of them, and we moved to a microservice strategy, which meant more applications and hence more logs. More and more indices were being created in our Elasticsearch cluster.

After one year, we started exploring other ways of monitoring our applications, and that was when we decided to build a Python script to fetch business metrics from our database and ship them to our Elasticsearch cluster. We were bypassing Logstash in this implementation, and in one or two months, the bomb dropped:

Monitoring App has stopped: TransportError(429, 'circuit_breaking_exception', '[parent] Data too large, data for [indices:data/write/bulk[s]] would be [3093097642/2.8gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3093096448/2.8gb], new bytes reserved: [1194/1.1kb], usages [request=0/0b, fielddata=1543483/1.4mb, in_flight_requests=1194/1.1kb, model_inference=0/0b, accounting=36585636/34.8mb]')

What the hell does it mean? Is our Python script creating messages that are too big to be stored? Are we shipping too many messages at once? Do we have too many Logstash instances running side by side, more than our Elasticsearch cluster can handle? What the hell does it mean?

It wasn’t the priority, so I had to research it mostly in my spare time. After spending a couple of weeks searching every corner of Google, I finally understood what was going on.

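What I found out was that the number of shards Elasticsearch can handle depends on the heap memory you set for your nodes. Every shard carries some heap overhead, so with too many of them the heap fills up and the parent circuit breaker starts rejecting requests, which is exactly what the error above reports: real usage of 2.8gb against a limit of 2.8gb.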

What are shards? That’s what I asked myself first. And I probably wasn’t the only person to ask, because Elastic has a blog post on exactly what my problem was:

https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

According to Elastic, in a nutshell:

Each index is made up of one or more shards. Each shard is an instance of a Lucene index, which you can think of as a self-contained search engine that indexes and handles queries for a subset of the data in an Elasticsearch cluster.

OK, so each index has at least one shard, and I already have more than 900 indices. And, well, Elastic says:

A node with a 30GB heap should therefore have a maximum of 600 shards, but the further below this limit you can keep it the better.

I certainly don’t have 30GB of heap available for my nodes.
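To put rough numbers on this, here is a small Python sketch of the arithmetic. It rests on two assumptions that are not stated in the story itself: that the parent circuit breaker limit in the error message is 95% of the heap (the usual default when real memory usage is tracked), and that Elastic's "30GB heap, 600 shards" guideline scales linearly to 20 shards per GB of heap. The function names are made up for illustration.

```python
GIB = 2 ** 30  # bytes in one GiB


def heap_gib_from_parent_limit(limit_bytes: int, limit_fraction: float = 0.95) -> float:
    """Estimate the configured heap from the parent breaker limit.

    Assumption: the limit is `limit_fraction` of the heap (95% is the
    common default; check indices.breaker.total.limit on your cluster).
    """
    return round(limit_bytes / limit_fraction / GIB, 2)


def shard_budget(heap_gib: float, shards_per_gib: int = 20) -> int:
    """Rule-of-thumb shard ceiling per node: 600 shards / 30GB = 20 per GB."""
    return int(heap_gib * shards_per_gib)


# The limit reported in the error message above, in bytes.
heap = heap_gib_from_parent_limit(3060164198)
budget = shard_budget(heap)
print(f"estimated heap: {heap} GiB, suggested max shards per node: {budget}")
```

Under those assumptions the error message points to a heap of roughly 3GB and a budget of about 60 shards per node, which is a long way from the 900+ indices I had.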

I had to get rid of hundreds of indices, which meant that I had to get rid of thousands and thousands of logs I had been keeping. And a few of these logs I certainly wanted to keep for a long time.

I can just merge a couple of indices. I have been separating them by day; I can probably start separating them by month instead. But wait, another tip by Elastic:

For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size.

Damn, I had been using Elasticsearch wrongly for more than a year. What should I do now?

Deleting the Logs I Didn’t Need

My first idea was to start by deleting the logs I didn’t need. Elasticsearch has an API for that: Delete By Query (_delete_by_query). You can read about it in the Elasticsearch documentation.

Good. I know all the logs I need to keep contain a specific term, so I can match the logs in which this term cannot be found and get rid of them.

If you have a heavy index, as I did, you will probably want to use slicing to parallelize this task and make it more efficient. Otherwise, it might take days or months to delete all the documents from your index. Slicing is described in the same Delete By Query documentation.
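As a minimal sketch of what such a request could look like, the Python helper below builds the path and body for a sliced Delete By Query that removes every document not carrying the "keep" term. The field name tags and the value keep are hypothetical stand-ins for whatever marks your own logs; slices, wait_for_completion and conflicts are real query parameters of the API.

```python
import json
from urllib.parse import urlencode


def build_delete_by_query(index: str, keep_field: str, keep_value: str,
                          slices: str = "auto"):
    """Return (path, body) for POST /<index>/_delete_by_query.

    Deletes every document whose `keep_field` does NOT contain `keep_value`.
    """
    params = urlencode({
        "slices": slices,                # let Elasticsearch parallelize the work
        "wait_for_completion": "false",  # run as a background task
        "conflicts": "proceed",          # keep going on version conflicts
    })
    path = f"/{index}/_delete_by_query?{params}"
    body = {
        "query": {
            "bool": {
                # must_not inverts the match: everything WITHOUT the term
                "must_not": [{"term": {keep_field: keep_value}}]
            }
        }
    }
    return path, body


# Hypothetical field and value; replace with the term that marks your logs.
path, body = build_delete_by_query("k8s-cluster-1-2022.02.23", "tags", "keep")
print("POST", path)
print(json.dumps(body, indent=2))
```

The printed request can be pasted into the Kibana console or sent with any HTTP client. It is worth running the same query with _count first to see how many documents it would remove.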

Reindexing Existing Indices

As I said before, we were storing all indices by day of the year and by Kubernetes cluster, which means that we would have indices like:

  • k8s-cluster-1-2022.02.23
  • k8s-cluster-1-2022.02.24
  • k8s-cluster-1-2022.02.25
  • k8s-cluster-1-2022.02.26

The first thing we had to do was to reindex all of these existing indices.

At this point, we had already deleted all the logs we didn’t need, and we knew that each day of data wouldn’t go over 5MB. That puts a full year at no more than about 1.8GB (365 × 5MB), which is a perfectly fine size for an index. So we decided to reindex them by year instead.

We first ran the reindex API:

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "k8s-cluster-1-2022.*"
  },
  "dest": {
    "index": "k8s-cluster-1-2022"
  }
}

This request takes all the indices with the prefix k8s-cluster-1-2022. and reindexes them into a new one named k8s-cluster-1-2022. By setting wait_for_completion=false we’re telling the API that we don’t want to wait for the task to complete. It will respond with a task ID instead.

We can check the status of our task by requesting:

GET _tasks/<task_id>
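Reading the task document by eye works, but the check can also be scripted. The Python sketch below decides whether a task document looks like a clean finish. The two sample documents are abridged and hand-written for illustration, not real output from our cluster, and the helper name is my own.

```python
def reindex_succeeded(task_doc: dict) -> bool:
    """Return True only if the task document reports a clean finish."""
    if not task_doc.get("completed", False):
        return False  # still running
    if "error" in task_doc:
        return False  # the task itself failed
    response = task_doc.get("response", {})
    # An empty "failures" list means no document failed to reindex.
    return not response.get("failures")


# Abridged, hand-written examples of what the tasks API can return.
still_running = {"completed": False, "task": {"action": "indices:data/write/reindex"}}
finished_ok = {
    "completed": True,
    "task": {"action": "indices:data/write/reindex"},
    "response": {"total": 1200, "created": 1200, "failures": []},
}

print(reindex_succeeded(still_running))
print(reindex_succeeded(finished_ok))
```

It is also worth comparing the document count of the new index with the total of the old ones before going any further.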

And as soon as it has completed successfully, we can delete the old indices with the DELETE API. Make sure you have a backup before performing any of these operations.

DELETE /k8s-cluster-1-2022.*

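This will delete all the indices matching k8s-cluster-1-2022.*. Pay attention to how I left the . after 2022: this makes sure I’m not also deleting the index we just created, which now stores all the data. Note that Elasticsearch 8.x rejects wildcard deletes by default; they only work if action.destructive_requires_name is set to false.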
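The effect of that trailing dot is easy to check offline. The Python sketch below uses the standard library's fnmatch as a stand-in for Elasticsearch's wildcard resolution (an assumption that holds for simple * patterns like this one) and shows which index names the pattern would select.

```python
from fnmatch import fnmatchcase


def indices_to_delete(all_indices, pattern):
    """Return the index names a simple wildcard pattern would match."""
    return [name for name in all_indices if fnmatchcase(name, pattern)]


existing = [
    "k8s-cluster-1-2022.02.23",
    "k8s-cluster-1-2022.02.24",
    "k8s-cluster-1-2022",        # the new yearly index we must keep
]

# With the dot, only the daily indices match.
safe = indices_to_delete(existing, "k8s-cluster-1-2022.*")
# Without the dot, the yearly index would be caught as well.
dangerous = indices_to_delete(existing, "k8s-cluster-1-2022*")

print(safe)
print(dangerous)
```

Against a real cluster, GET _cat/indices/k8s-cluster-1-2022.* lists the matching indices and makes a good dry run before the DELETE.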

The Reindex API is covered in detail in the Elasticsearch documentation.

Stop Shipping All the Logs from the Kubernetes Cluster

The default Logstash configuration was shipping every log it could find within the cluster, which means it was also shipping logs from pods I didn’t care to analyze. That was certainly increasing the size of our indices. I had to change the configuration of Logstash to ship only the logs I needed.

Changing the Strategy for Logs Retention

As I said before, I already knew which logs I wanted to keep: all of them were tagged with a specific term. My next task was changing the way Logstash would ship the logs to Elasticsearch. The strategy we decided to follow was to split each index into two kinds of indices.

Instead of having something like:

  • my-index-1-2022.02.23

I wanted to have:

  • my-index-1-2022
  • my-index-1-ephemeral-2022.02.23

With this new strategy, I would not end up with too many indices for the logs I wanted to keep. Since I knew these logs would only add up to a couple of gigabytes per year, I could index them by year. I would also be able to discard the ephemeral indices after a couple of days, by setting up an index lifecycle policy that is applied automatically to all ephemeral indices.
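The naming rule is simple enough to sketch in a few lines of Python. In our setup the routing lives in the Logstash output configuration and the cleanup in an index policy, so this is only an illustration of the logic, not the code we run. The tag value keep and the three-day retention are hypothetical.

```python
from __future__ import annotations

from datetime import date, timedelta

KEEP_TAG = "keep"          # hypothetical tag marking logs worth keeping
EPHEMERAL = "-ephemeral-"  # marker used in the short-lived index names


def index_for(base: str, tags: list[str], day: date) -> str:
    """Pick the target index: yearly for kept logs, daily for the rest."""
    if KEEP_TAG in tags:
        return f"{base}-{day:%Y}"
    return f"{base}{EPHEMERAL}{day:%Y.%m.%d}"


def is_expired(index: str, today: date, keep_days: int) -> bool:
    """True if an ephemeral daily index is older than keep_days."""
    if EPHEMERAL not in index:
        return False  # yearly indices are never discarded here
    stamp = index.rsplit(EPHEMERAL, 1)[1]
    year, month, day = (int(part) for part in stamp.split("."))
    return today - date(year, month, day) > timedelta(days=keep_days)


kept = index_for("my-index-1", ["keep"], date(2022, 2, 23))
temp = index_for("my-index-1", [], date(2022, 2, 23))
print(kept, temp)
print(is_expired(temp, today=date(2022, 3, 1), keep_days=3))
```

The useful property is that the long-lived data lands in one index per year, while everything else lives in daily indices that can be dropped whole, which is far cheaper than deleting documents one by one.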

Conclusion

There’s nothing wrong with being a beginner. Everyone was a beginner before they mastered anything in their lives; there’s no other way to learn. This is not the only beginner mistake I have made, and it certainly won’t be the last.

However, by understanding what the problem was, we were able to apply the correct solution and strategies to prevent it from happening again. If you’re new to Elasticsearch, make sure you find the right strategy for creating and maintaining your indices, otherwise you might have to pay for it later.

I hope you have enjoyed this story. See you around!
