Keep Your OpenSearch Indices Healthy With Automatic Maintenance Jobs

Published in saas-infra, Dec 19, 2022

Background

OpenSearch is a great way to monitor and query your logs. Over time, though, managing the system can strain your ability to keep log retention stable.

Creating a cron job to automate the manual fixes will save you time and money, and spare you ‘missing’ logs.

Our team manages the infrastructure of more than 25 Amazon OpenSearch Service clusters, handling over 3TB a day across them. Index State Management (ISM) is commonly used in OpenSearch to remove unneeded indices. ISM lets you define custom management policies that automate routine tasks, and apply them to indices and index patterns.

Problems

In our experience, invalid states and misconfigurations of Index State Management build up over time:

Indices never transition to the ‘delete’ state
Action timed out
Failed to rollover index
Failed to evaluate conditions for rollover
Previous action was not able to update IndexMetaData
Pending rollover of index
Attempting to rollover index
New indices added, without ISM policies configured

These problems can have serious consequences, such as the following, which can lead to outages or missing logs:

Maximum shards reached
Lack of available storage

The wrong solution

The straightforward but expensive answer to these problems is to throw more money at CPU, memory, and storage resources.

Prerequisite

Create a ‘default’ ISM policy with low priority and a wildcard index pattern that matches all indices that require state changes.

An example of such a wildcard index pattern is one that matches all indices from 2020 onwards (“*202*”):

"ism_template": [
    {
        "index_patterns": ["*202*"],
        "priority": 1
    }
]
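To show where this fragment sits, here is a minimal sketch in Python of a complete ‘default’ policy body. Only the `ism_template` part comes from the fragment above; the state names, the 30-day retention, and the description are assumptions for illustration, not the values our team uses. The `fnmatchcase` check only approximates how the wildcard pattern selects index names.

```python
import json
from fnmatch import fnmatchcase

# Hypothetical retention period; use whatever your log retention requires.
RETENTION = "30d"

default_policy = {
    "policy": {
        "description": "Fallback policy: delete indices after a fixed age",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",  # assumed state name
                "actions": [],
                "transitions": [
                    {"state_name": "delete", "conditions": {"min_index_age": RETENTION}}
                ],
            },
            {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
        ],
        # From the article: low priority and a wildcard pattern.
        "ism_template": [{"index_patterns": ["*202*"], "priority": 1}],
    }
}


def matches_default_pattern(index_name: str) -> bool:
    """Approximate the wildcard match against the template's index patterns."""
    patterns = default_policy["policy"]["ism_template"][0]["index_patterns"]
    return any(fnmatchcase(index_name, pattern) for pattern in patterns)


print(json.dumps(default_policy, indent=2))
print(matches_default_pattern("app-logs-2022.12.19"))  # True
print(matches_default_pattern("app-logs-2019.12.31"))  # False
```

Keeping the priority at 1 means any more specific policy with a higher priority still wins for the indices it matches.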

Making the script for the automated jobs

Create a script to run as a cron job (3 times a day)

Query the ISM status of all indices older than today’s.

GET _opendistro/_ism/explain/*
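How you pick out the indices “older than today’s” depends on how your indices are named. As a minimal sketch, assuming daily indices whose names end in a `YYYY.MM.DD` date (a common convention for log indices, but an assumption here), the hypothetical helper below keeps only the index names dated before today.

```python
import re
from datetime import date, datetime
from typing import Iterable, List, Optional

# Assumed naming convention: the index name ends with YYYY.MM.DD.
DATE_SUFFIX = re.compile(r"(\d{4}\.\d{2}\.\d{2})$")


def index_date(index_name: str) -> Optional[date]:
    """Return the date encoded at the end of the index name, if any."""
    match = DATE_SUFFIX.search(index_name)
    if not match:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y.%m.%d").date()
    except ValueError:  # digits that do not form a real date
        return None


def older_than_today(index_names: Iterable[str], today: Optional[date] = None) -> List[str]:
    """Keep indices whose date suffix is strictly before `today`."""
    today = today or date.today()
    result = []
    for name in index_names:
        parsed = index_date(name)
        if parsed is not None and parsed < today:
            result.append(name)
    return result


names = ["app-logs-2022.12.18", "app-logs-2022.12.19", ".kibana_1"]
print(older_than_today(names, today=date(2022, 12, 19)))
# prints ['app-logs-2022.12.18']
```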

Parse what is returned.
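Parsing can be sketched as a pure function over the decoded JSON. The field names used below (`policy_id`, `action.failed`, `retry_info.failed`, `info.message`) are assumptions about the shape of the explain response and should be checked against what your cluster version actually returns. The sample response is made up for illustration, and the list of ‘stuck’ messages is taken from the Problems section.

```python
from typing import Any, Dict, Optional

# Messages from the Problems list that we treat as a failed or stuck state.
STUCK_MESSAGES = (
    "Action timed out",
    "Failed to rollover index",
    "Failed to evaluate conditions for rollover",
    "Previous action was not able to update IndexMetaData",
    "Pending rollover of index",
    "Attempting to rollover index",
)


def policy_of(entry: Dict[str, Any]) -> Optional[str]:
    """Return the attached policy id, or None if the index is unmanaged."""
    if entry.get("policy_id"):
        return entry["policy_id"]
    # Assumed fallback: a settings-style key such as "index.<...>.policy_id".
    for key, value in entry.items():
        if key.endswith(".policy_id") and value:
            return value
    return None


def classify(explain: Dict[str, Any]) -> Dict[str, str]:
    """Map each index to 'failed', 'missing' or 'ok'."""
    result = {}
    for index_name, entry in explain.items():
        if not isinstance(entry, dict):
            continue  # skip summary fields that are not per-index objects
        if policy_of(entry) is None:
            result[index_name] = "missing"
            continue
        action = entry.get("action") or {}
        retry = entry.get("retry_info") or {}
        message = (entry.get("info") or {}).get("message") or ""
        failed = bool(action.get("failed")) or bool(retry.get("failed"))
        stuck = any(message.startswith(text) for text in STUCK_MESSAGES)
        result[index_name] = "failed" if failed or stuck else "ok"
    return result


# Made-up sample, not real cluster output.
sample = {
    "app-logs-2022.12.17": {
        "policy_id": "rollover_policy",
        "action": {"name": "rollover", "failed": True},
        "retry_info": {"failed": True, "consumed_retries": 3},
        "info": {"message": "Failed to rollover index"},
    },
    "app-logs-2022.12.18": {"index.opendistro.index_state_management.policy_id": None},
    "total_managed_indices": 1,
}
print(classify(sample))
# prints {'app-logs-2022.12.17': 'failed', 'app-logs-2022.12.18': 'missing'}
```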

If the index is in a failed state:

Remove the current policy and attach the ‘default’ policy.

POST _opendistro/_ism/remove/<index_name>

POST _opendistro/_ism/add/<index_name>
{
  "policy_id": "default_policy"
}

If the index has no policy:

Attach the ‘default’ policy.

POST _opendistro/_ism/add/<index_name>
{
  "policy_id": "default_policy"
}
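Putting the two branches together, the sketch below turns a per-index status into the list of REST calls described above. It only builds the requests and leaves sending them (authentication, retries, error handling) to whatever HTTP client you use. The `'failed'` and `'missing'` labels and the `default_policy` name are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple

Request = Tuple[str, str, Optional[dict]]  # (method, path, JSON body)

DEFAULT_POLICY_ID = "default_policy"  # assumed name of the 'default' policy


def remediation_requests(status_by_index: Dict[str, str]) -> List[Request]:
    """Build the ISM calls for indices labelled 'failed' or 'missing'."""
    requests: List[Request] = []
    for index_name, status in sorted(status_by_index.items()):
        add = (
            "POST",
            f"_opendistro/_ism/add/{index_name}",
            {"policy_id": DEFAULT_POLICY_ID},
        )
        if status == "failed":
            # Detach the broken policy first, then fall back to the default.
            requests.append(("POST", f"_opendistro/_ism/remove/{index_name}", None))
            requests.append(add)
        elif status == "missing":
            requests.append(add)
        # Any other status is left untouched.
    return requests


plan = remediation_requests({"logs-a": "failed", "logs-b": "missing", "logs-c": "ok"})
for method, path, body in plan:
    print(method, path, body)
```

Separating “decide what to do” from “send the request” keeps the decision logic easy to test without a cluster.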

Follow best practices such as the SOLID principles so that the script stays easy to update and modify as needed.
We integrated the script into our Terraform for OpenSearch deployments as an AWS Lambda function.

Summary

Every part of your organization needs logs, for troubleshooting and for alerting. Often you only find out that there are ‘no logs’ when someone actually needs them. A cron job script that handles the ‘manual’ chores you would otherwise repeat on a regular basis fixes the ‘no logs’ problem before it surfaces, and saves you from the developer asking “Where are the logs?”

-Yaakov