Running Curator as a Kubernetes CronJob

The de facto standard logging solution is the famous Elastic Stack: Elasticsearch, Logstash and Kibana (fondly referred to as ELK). It allows gathering and centralizing logs from all system components, making it easier to view and analyze them.

While running a containerized distributed application, the importance of a centralized logging solution is even greater, as you have lots of small bits of logs running around everywhere and you need some place to have an aggregated view of them all.

One of the problems you might encounter is that too many of those logs get indexed in Elasticsearch and your storage fills up. Granted, Elasticsearch is optimized to store textual data, but we live in a finite world, and eventually you’ll run out of space. One approach could be to allocate more storage to your data nodes (or more data nodes to your Elasticsearch cluster), but that doesn’t really solve the issue; it just delays the inevitable. At some point, you’ll need a way to manage those indices: archive them, delete them, and so on.

Enter Curator

Elasticsearch Curator helps you curate, or manage, your Elasticsearch indices and snapshots by:
1. Obtaining the full list of indices (or snapshots) from the cluster, as the actionable list
2. Iterate through a list of user-defined filters to progressively remove indices (or snapshots) from this actionable list as needed.
3. Perform various actions on the items which remain in the actionable list.
(from the Curator about page)
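
As a rough illustration of how these three steps map onto a Curator action file, here is a minimal sketch. The index prefix logstash- and the 30-day threshold are assumptions made for the example, not values taken from this article's chart. Step 1 (obtaining the full list of indices) happens implicitly before the filters run.

```yaml
# Sketch of a Curator 5.x action file; the prefix and threshold are illustrative.
actions:
  1:
    action: delete_indices        # step 3: the action applied to whatever remains
    description: "Delete logstash- indices created more than 30 days ago"
    options:
      ignore_empty_list: True     # an empty actionable list is not treated as an error
    filters:                      # step 2: filters run in order, each narrowing the list
    - filtertype: pattern         # keep only indices whose name starts with the prefix
      kind: prefix
      value: logstash-
    - filtertype: age             # of those, keep only the ones older than 30 days
      source: creation_date
      direction: older
      unit: days
      unit_count: 30
```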

Curator started its life as a Python script that just deleted Elasticsearch indices, but evolved into a powerful and capable utility that can run various actions against your Elasticsearch cluster. It’s an invaluable tool for anyone who runs Elasticsearch clusters and needs to manage them.

One of the strong points of Curator is its ability to store the action and host configurations in files. This allows us to define a set of actions in one file and the Elasticsearch host details in another, then pass both to Curator from the CLI to do its business. Most of the actions you’d want to run against your Elasticsearch cluster are periodic, meaning they should run on a recurring schedule.

Enter Cron

The concept of a cron job isn’t new in the *nix universe (you can read more about it on Wikipedia). Cron gives us the ability to run a periodic action (“job”) in various forms (e.g. a shell script), using a special syntax to describe when this job should run. This fits our Curator needs nicely, as “cron is most suitable for scheduling repetitive tasks” (from the Wikipedia page).

But we need to consider what happens when running a containerized application. On one hand, such systems are dynamic by nature: containers start and stop constantly, and the system is always in a state of flux. On the other hand, we want to run Curator on a periodic and stable schedule, and Curator is a short-lived job. We could install and run Curator directly on one of the nodes in our cluster, but should we lose that node (e.g. due to machine failure), we lose our Curator. That is not a very fault-tolerant or durable setup.

Enter Kubernetes CronJob

A Cron Job manages time based Jobs, namely:
* Once at a specified point in time
* Repeatedly at a specified point in time
(from the Kubernetes CronJob docs)

A CronJob resource is a higher-level controller (much like the Deployment resource), meaning it manages other controllers, a Job controller in our case. A Job resource in Kubernetes is defined as:

A job creates one or more pods and ensures that a specified number of them successfully terminate. As pods successfully complete, the job tracks the successful completions. When a specified number of successful completions is reached, the job itself is complete. (from the Kubernetes Job docs)

This allows a short-lived pod to execute code until completion (e.g. exit code 0 from a shell script), as opposed to a long-running pod, which is usually managed by a different controller (e.g. a ReplicaSet) and isn’t intended to terminate at all. Think of an nginx pod that you’d usually want to run indefinitely, serving web pages; we don’t expect such a pod to ever terminate. Now consider a short-lived pod (such as Curator) that terminates once it finishes its business successfully.
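
To make the contrast concrete, here is a minimal sketch of a standalone Job. The name, image and command are placeholders chosen for illustration; the point is only that the pod runs a command, exits with code 0, and the Job is then considered complete.

```yaml
# Minimal Job sketch; the name, image and command are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-short-lived-job
spec:
  completions: 1            # the Job is complete after one pod exits successfully
  template:
    spec:
      containers:
      - name: task
        image: busybox
        command: ["sh", "-c", "echo 'work done'"]   # exits with code 0
      restartPolicy: Never  # Job pods allow only Never or OnFailure, not Always
```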

The CronJob resource was introduced in Kubernetes 1.4 as an alpha feature named ScheduledJob, and was renamed CronJob in 1.5. Currently (as of Kubernetes 1.8) it’s considered a beta feature and is part of the batch/v1beta1 API group. As a beta feature, it is enabled by default on clusters running an API server ≥ 1.8 (see this link for details on how to enable it on earlier API servers). To find out if your cluster has the API group enabled, run kubectl api-versions and look for a batch/v1beta1 entry (or batch/v2alpha1 on an older cluster) in the output.

Note that in my cluster the batch/v2alpha1 API group is the one enabled, since I’m running an earlier version of the API server (1.7.2 in my case).

Putting it All Together

So now that we know what Curator does and have an understanding of how Kubernetes CronJobs work, let’s tie it all together.

I’ve written a Helm chart that installs Curator in a running Kubernetes cluster. It’s hosted on GitHub, so go ahead and take a look.

We’ll want to define two resources: the CronJob itself, and a ConfigMap that gets mounted as a volume into the pod and defines Curator’s actions and host details. Using a ConfigMap to manage Curator’s configuration decouples it from the pod and the job definitions, and allows greater freedom in manipulating Curator’s configuration.

The CronJob definition

The CronJob uses the fantastic bobrik/curator:5.1.1 image from Docker Hub, written by Ivan Babrou (https://twitter.com/ibobrik). Go check it out, the credit is all his.

Let’s take a closer look at the CronJob manifest. The interesting parts are under the spec field:

  • The schedule uses cron syntax and means “run daily at 01:00 AM”
  • The history limit fields (successfulJobsHistoryLimit and failedJobsHistoryLimit) define the retention policy for successful and failed jobs, which is useful if a job fails and you want to debug the cause
  • We don’t want more than one job running at a time, so concurrencyPolicy is set to Forbid
  • We give the job 120 seconds (startingDeadlineSeconds) to start after its scheduled time; if it misses that deadline, the run is counted as failed and the job waits for its next scheduled time
  • The spec.jobTemplate.spec.template field holds the actual pod definition to run. The interesting thing to note here is that we mount a ConfigMap as a volume at /etc/config and start Curator pointing to files located in that path
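
Put together, the fields described above could look roughly like the manifest below. This is a sketch rather than the chart's exact template: the resource names, the history limits of 3, and the assumption that the image's entrypoint is the curator binary are all mine.

```yaml
# Sketch of a CronJob for Curator; resource names and history limits are placeholders.
apiVersion: batch/v1beta1          # batch/v2alpha1 on clusters older than 1.8
kind: CronJob
metadata:
  name: curator
spec:
  schedule: "0 1 * * *"            # minute hour day-of-month month day-of-week: daily at 01:00
  successfulJobsHistoryLimit: 3    # finished Jobs kept around for inspection
  failedJobsHistoryLimit: 3
  concurrencyPolicy: Forbid        # skip a run if the previous one is still going
  startingDeadlineSeconds: 120     # how late a run may start before it counts as missed
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: curator
            image: bobrik/curator:5.1.1
            # Assumes the image's entrypoint is the curator binary.
            args: ["--config", "/etc/config/config.yml", "/etc/config/action_file.yml"]
            volumeMounts:
            - name: config-volume
              mountPath: /etc/config
          volumes:
          - name: config-volume
            configMap:
              name: curator-config   # hypothetical ConfigMap name
```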

The ConfigMap definition

The ConfigMap puts two files in the mounted volume: action_file.yml and config.yml. The config.yml file defines the Elasticsearch host and port to run Curator against. The action_file.yml file defines a single action (delete_indices), filtered on indices older than 7 days.
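
A ConfigMap along these lines would produce the two files. Treat it as a sketch under assumptions: the ConfigMap name is hypothetical, and the age filter assumes index names end in a date such as 2017.10.31, which may not match your naming scheme.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: curator-config          # hypothetical; the name the CronJob's volume refers to
data:
  config.yml: |-
    client:
      hosts:
        - elasticsearch-logging
      port: 9200
    logging:
      loglevel: INFO
  action_file.yml: |-
    actions:
      1:
        action: delete_indices
        description: "Delete indices older than 7 days"
        options:
          ignore_empty_list: True
        filters:
        - filtertype: age
          source: name            # derive the index age from the date in its name
          direction: older
          timestring: '%Y.%m.%d'  # assumed date format in the index name
          unit: days
          unit_count: 7
```

Each key under data becomes a file in the mounted volume, so Curator sees /etc/config/config.yml and /etc/config/action_file.yml.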

Running helm install kube-charts/curator installs the chart in the cluster. It runs a Curator job daily at 01:00 AM, deleting indices older than 7 days from the Elasticsearch host at http://elasticsearch-logging:9200. Isn’t that neat?

Summary

Curator makes managing Elasticsearch indices simpler. Combined with Kubernetes’ CronJob resource, it lets you run those pesky repetitive jobs in a manageable, configurable, and fault-tolerant setup fit for production.

While we looked at the delete_indices action, Curator can do much more and I encourage you to go check out its docs for a full list of its capabilities.

You can find the Helm chart in my GitHub repo. Feel free to have a look and leave your feedback. It was written with the assistance of the wonderful Maor Friedman, thank you!