OpenSearch: alerting and monitoring

Gerardo Landino
Data Reply IT | DataTech
7 min read · Apr 18, 2024

Introduction

Monitoring and alerting are essential components of any modern IT infrastructure. In this article, we will delve into the world of AWS OpenSearch and how it can be utilized to set up robust monitoring and alerting systems. By leveraging the various features provided by OpenSearch, we can gain valuable insights into the health and performance of our systems, enabling us to proactively address any issues that may arise.

The article is based on a real use case for our customer. Our AWS environment has many applications; therefore, we chose to put in place a system to keep an eye on them all using centralized dashboards and an alerting system that sends notifications via Amazon SNS (Simple Notification Service).
Grab a seat and let’s get into it!

Setting up the environment

Figure 1: Architecture overview

The OpenSearch cluster is provisioned with Terraform (the provider is opensearch-project/opensearch with 2.2.0 version), which follows infrastructure-as-code principles and provides scalability and reproducibility. Terraform allows us to select several configurations, including the OpenSearch version (2.11), the instance type (r6g.large.search), and the required EBS volume type (gp3) with a maximum capacity of 2 TB. This technique promotes consistency between deployments while simplifying administration responsibilities.
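As an illustration, the domain settings described above could be sketched in Terraform roughly as follows. Note that the domain itself is created with the AWS provider's `aws_opensearch_domain` resource (the opensearch-project/opensearch provider is typically used alongside it to manage in-cluster resources); all names and counts here are illustrative, not the actual project code.

```hcl
# Sketch of the domain configuration described above (illustrative values).
resource "aws_opensearch_domain" "monitoring" {
  domain_name    = "monitoring-domain"
  engine_version = "OpenSearch_2.11"

  cluster_config {
    instance_type = "r6g.large.search"
  }

  ebs_options {
    ebs_enabled = true
    volume_type = "gp3"
    volume_size = 2048 # GiB, i.e. the 2 TB maximum mentioned above
  }

  encrypt_at_rest {
    enabled = true
  }

  node_to_node_encryption {
    enabled = true
  }
}
```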

Once the OpenSearch domain has been established (inside a VPC network), the focus moves to environmental protection and integration with other AWS services. To begin, a VPC endpoint was created for private access to our cluster. Other security measures, such as IAM roles and policies, were put in place to ensure data privacy and manage access to the OpenSearch domain. Furthermore, data encryption with the AWS Key Management Service (KMS) guarantees that data is secure both in transit and at rest. Fine-grained access controls, such as role-based access control (RBAC) and IP-based access rules, are implemented within OpenSearch to fortify the environment.

Ingesting Data into OpenSearch

In our architecture, Lambda functions are the core building blocks of each application. Since each Lambda function has its own CloudWatch log group, the logs are stored in CloudWatch and used for analysis after each execution.
The process of ingesting data into OpenSearch starts with CloudWatch. CloudWatch offers a feature called subscription filters, which provides a real-time feed of log events from CloudWatch Logs to other AWS services. Users can attach a filter to a log group so that only the logs of interest are forwarded. The target of the subscription filter is a Node.js Lambda function that receives the log data, parses it, converts it to a format accepted by OpenSearch, and finally pushes it into an OpenSearch index (to prevent our indices from growing indefinitely and slowing down search and indexing, we create daily indices).

function transform(payload) {
    if (payload.messageType === 'CONTROL_MESSAGE') {
        return null;
    }

    var bulkRequestBody = '';

    payload.logEvents.forEach(function (logEvent) {
        var timestamp = new Date(1 * logEvent.timestamp);

        // index name format: shs_qual-YYYY.MM.DD
        var indexName = [
            'shs_qual-' + timestamp.getUTCFullYear(),        // year
            ('0' + (timestamp.getUTCMonth() + 1)).slice(-2), // month
            ('0' + timestamp.getUTCDate()).slice(-2)         // day
        ].join('.');

        var source = buildSource(logEvent.message, logEvent.extractedFields);
        source['@id'] = logEvent.id;
        source['@timestamp'] = new Date(1 * logEvent.timestamp).toISOString();
        source['@message'] = logEvent.message;
        source['@owner'] = payload.owner;
        source['@log_group'] = payload.logGroup;
        source['@log_stream'] = payload.logStream;

        // each document becomes an action/metadata line plus a source line
        // in the _bulk request body
        var action = { "index": {} };
        action.index._index = indexName;
        action.index._id = logEvent.id;

        bulkRequestBody += [
            JSON.stringify(action),
            JSON.stringify(source),
        ].join('\n') + '\n';
    });
    return bulkRequestBody;
}

The indexed data are JSON documents containing specific fields such as country (the country from which an application is invoked), message (that may contain useful information such as a lambda payload), correlationID (a unique identifier used to track the execution), service (the invoked application), and other fields.

{
    "country": "Italy",
    "message": "The process starts with the following payload {'id': 'random','filename':'test'}",
    "correlationID": "6fb7c7d7-915c-40ea-83e0-44af0dd098a0",
    "service": "MLApplication",
    ...
}

Creating Dashboards and Visualizations

Creating dashboards and visualizations is a very simple process that can be easily adapted to match individual analytical needs.
We started the process by creating an index pattern, a fundamental component that determines which OpenSearch indices serve as the data sources for our dashboards. To include all of our indices across both AWS accounts, we set the index patterns to ‘shs_env-*’ and ‘e2e_env-*’: these patterns ensure coherence and consistency in data retrieval across the various visualizations.

When we moved on to the dashboard construction stage, once the index pattern was set up, we were met with a flexible wizard interface. Inside this interface we found a variety of visualization options, from simple pie charts to more complex line charts, all intended to offer insights into different aspects of our data. The data table, histogram, and pie chart are the principal visualizations we selected, and each one serves a specific purpose (e.g., a pie chart is used to identify the various countries from which an application has been invoked).
In the end, we produced the final dashboard, which consists of a collection of several visualizations.

Figure 2: Country pie chart

Index Management and Data Lifecycle

The maintenance of optimal system performance and resource utilization in OpenSearch is based upon the implementation of effective index management and data lifecycle strategies.
Index State Management (ISM) policies, which automate index maintenance tasks based on defined criteria, are one useful tool. For instance, suppose time-series data is collected and stored in OpenSearch indices over time, and these indices, left unmanaged, would grow indefinitely. An ISM policy can be established that triggers actions when indices reach a predetermined age (in our case, 15 or 30 days). When this threshold is reached, the policy first creates a rollup index, aggregating the information in the original index to minimize storage costs while preserving the crucial insights. Then, once the rollup index is created and verified, the ISM policy deletes the original index, enforcing data retention guidelines and freeing up valuable resources.
By keeping aggregated data for historical reference, this methodical index management approach not only improves storage utilization but also makes future analysis easier.
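An ISM policy of the kind described above could be sketched roughly as follows (index names, ages, and the rollup dimensions are illustrative placeholders; the real policy would aggregate over the actual document fields):

```json
{
  "policy": {
    "description": "Roll up daily indices after 15 days, then delete them",
    "default_state": "hot",
    "ism_template": [{ "index_patterns": ["shs_qual-*"], "priority": 100 }],
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          { "state_name": "rollup", "conditions": { "min_index_age": "15d" } }
        ]
      },
      {
        "name": "rollup",
        "actions": [
          {
            "rollup": {
              "ism_rollup": {
                "description": "Aggregate the daily index into a rollup index",
                "target_index": "shs_qual_rollup",
                "page_size": 1000,
                "dimensions": [
                  { "date_histogram": { "source_field": "@timestamp", "fixed_interval": "1h" } },
                  { "terms": { "source_field": "service" } }
                ],
                "metrics": []
              }
            }
          }
        ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "16d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [{ "delete": {} }],
        "transitions": []
      }
    ]
  }
}
```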

Figure 3: Index State Management

Setting up Alerts and Notifications

Alerts and notifications are essential for proactive monitoring and for responding quickly to faults or anomalies. OpenSearch allows us to set up alerts based on specific conditions through the “Alerting” feature. We used AWS API Gateway to provide access to our applications, so we needed to monitor all requests to it. We therefore built a system to monitor API error codes and detect inactivity. This system is based on three main components:

· A monitor job that runs every five minutes and checks for API error codes such as 4xx and 5xx;

· A trigger that is activated when an error code is detected;

· An action that sends an email notification.
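Put together, a monitor implementing the three components above could be sketched like this (index name, field names, schedule, and the SNS destination ID are all placeholders, not the actual project configuration):

```json
{
  "type": "monitor",
  "name": "api-gateway-error-monitor",
  "enabled": true,
  "schedule": { "period": { "interval": 5, "unit": "MINUTES" } },
  "inputs": [
    {
      "search": {
        "indices": ["shs_qual-*"],
        "query": {
          "size": 0,
          "query": {
            "bool": {
              "filter": [
                { "range": { "@timestamp": { "gte": "now-5m" } } },
                { "range": { "statusCode": { "gte": 400 } } }
              ]
            }
          }
        }
      }
    }
  ],
  "triggers": [
    {
      "name": "error-code-detected",
      "severity": "2",
      "condition": {
        "script": {
          "source": "ctx.results[0].hits.total.value > 0",
          "lang": "painless"
        }
      },
      "actions": [
        {
          "name": "notify-developers",
          "destination_id": "<sns-destination-id>",
          "message_template": {
            "source": "API errors detected in the last 5 minutes."
          }
        }
      ]
    }
  ]
}
```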

Figure 4: Alerting monitor

To receive notifications, we set up a communication channel via an Amazon Simple Notification Service (SNS) topic. This SNS topic was configured to send notifications to various destinations, in particular to developers’ email addresses, ensuring that the right team members are alerted promptly.
To improve monitoring capabilities, two CloudWatch alarms were set up to closely track key metrics of the OpenSearch cluster.
The first alarm monitors the free storage space of the OpenSearch cluster: we chose a threshold of 25 GB because the cluster blocks write operations at 20 GB. Administrators receive notifications when available storage reaches a critical level, allowing them to take early action and avoid storage-related problems.
The second alarm monitors the health status of the cluster, focusing on instances when the cluster enters a “red” state. A “red” state denotes failed health checks, indicating potential performance degradation or service interruptions.
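As a sketch, the free-storage alarm could be defined in Terraform roughly as follows. FreeStorageSpace is reported in megabytes, so the 25 GB threshold becomes 25600; the domain name, account ID, and SNS topic reference are placeholders:

```hcl
# Illustrative alarm on the cluster's free storage (names are placeholders).
resource "aws_cloudwatch_metric_alarm" "opensearch_free_storage" {
  alarm_name  = "opensearch-free-storage-low"
  namespace   = "AWS/ES"
  metric_name = "FreeStorageSpace"

  dimensions = {
    DomainName = "monitoring-domain"
    ClientId   = "123456789012" # AWS account ID
  }

  statistic           = "Minimum"
  period              = 300
  evaluation_periods  = 1
  comparison_operator = "LessThanThreshold"
  threshold           = 25600 # MB, i.e. the 25 GB threshold above

  alarm_actions = [aws_sns_topic.alerts.arn]
}
```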

Figure 5: CloudWatch alarms

Conclusion

In conclusion, setting up monitoring and alerting with AWS OpenSearch gives organizations a comprehensive solution for tracking the health and performance of their AWS-hosted applications. Administrators can build a durable environment for data analysis through careful setup steps such as Terraform-based cluster provisioning and solid security configuration. Integration with AWS services such as CloudWatch Logs and Lambda simplifies data ingestion and enables real-time monitoring.
Organizations can gain meaningful insights and proactively address issues using OpenSearch’s index management, data lifecycle, and visualization capabilities, resulting in optimal performance and user experience.
