Observability using OpenSearch + Grafana

Jishnu Srivastava
5 min read · Jul 11, 2024


Prologue

If, after reading the title, the first thought to pop into your head was “Why not just use OpenSearch with Kibana?!”, well, that makes two of us.

When this weird(ish) requirement came to me, that was my first question as well. Turns out the client already had OpenSearch with Kibana set up :)
So why the extra work of integrating OpenSearch with Grafana? Two reasons: 1) they wanted a single-pane view of their APIs, AWS infrastructure metrics, logs and APM metrics, and 2) they were more familiar and comfortable with the workings and the look and feel of Grafana.

Story Time

With that cleared up, I got down to work. Grafana has plugins for both Elasticsearch and OpenSearch. The catch is that the Elasticsearch plugin only supports versions newer than 7.16, and the clusters I needed to visualize data from were running version 7.10, so I proceeded with the OpenSearch plugin instead.

Approach

I hit the very first roadblock while configuring the data source. After entering the OpenSearch URL and the necessary credentials, I clicked Save & test and it saved without any error or warning. But I should have known my luck isn’t good enough to get it right on the first try: when I went to Explore, there was absolutely no data. I assumed it was a permission issue with the credentials I had been given, but it wasn’t, because the same credentials showed data just fine in Kibana. It took quite a lot of tinkering to figure out that the problem was the “Time field name” value in the data source configuration. This field defaults to “@timestamp”, but in all the indexes I had to work with, the time was stored in a “logtime” field and no “@timestamp” field existed at all. Once this was changed, et voilà, there was data!

The Time field name defaults to “@timestamp”
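
For anyone setting this up programmatically rather than through the UI, a minimal sketch of creating the data source via Grafana’s HTTP API looks something like the snippet below. The plugin type id, jsonData keys, URL, index pattern and credentials here are assumptions and placeholders, not the client’s actual setup; the important part is overriding the time field.

```python
# Minimal sketch: create the OpenSearch data source through Grafana's HTTP API.
# Assumptions: the plugin type id "grafana-opensearch-datasource" and the
# jsonData keys below match the plugin version in use; URL, credentials and
# index pattern are placeholders.
import requests

GRAFANA_URL = "http://localhost:3000"
GRAFANA_API_TOKEN = "<service-account-token>"

payload = {
    "name": "OpenSearch Logs",
    "type": "grafana-opensearch-datasource",   # OpenSearch plugin id (assumed)
    "access": "proxy",
    "url": "https://opensearch.example.internal:9200",
    "basicAuth": True,
    "basicAuthUser": "grafana_reader",
    "database": "app-logs-*",                   # index pattern (placeholder)
    "jsonData": {
        "timeField": "logtime",                 # <-- the fix: not "@timestamp"
        "version": "7.10.2",
        "flavor": "opensearch",
        "maxConcurrentShardRequests": 5,
    },
    "secureJsonData": {"basicAuthPassword": "<password>"},
}

resp = requests.post(
    f"{GRAFANA_URL}/api/datasources",
    json=payload,
    headers={"Authorization": f"Bearer {GRAFANA_API_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```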

Now, the next step was to build visualizations. Pretty simple and straightforward, right? Nope, definitely not! First off, there are no pre-built dashboards in the Grafana community with panels for OpenSearch data, and that actually makes sense: every OpenSearch index holds log data, and the fields in each log are not standard across applications, let alone across the industry, so every dashboard needs its own customization.
It took me a few hours to understand the log fields, the values in them and their data types, and then come up with some panels. One of the most basic panels was a table of the important fields from the logs. This was possible by simply changing the Metric from Count to Logs and then adding a few transformations, as sketched below. Easy-peasy!
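
For the curious, that table panel boils down to two small pieces: a query whose metric is Logs, and an Organize fields transformation that hides and renames columns. The sketch below mirrors the panel JSON as Python dicts; the field names are placeholders and the exact structure may differ between Grafana versions.

```python
# Sketch of the logs-table panel: the query's metric is "logs" instead of
# "count", and an "organize" transformation hides/renames columns.
# Field names (api_path, logtime, etc.) are placeholders for the real log fields.
logs_table_target = {
    "query": "*",                          # Lucene query: match everything
    "metrics": [{"id": "1", "type": "logs"}],
    "bucketAggs": [],
}

logs_table_transformations = [
    {
        "id": "organize",                  # Grafana "Organize fields" transformation
        "options": {
            "excludeByName": {"_id": True, "_index": True},
            "renameByName": {"logtime": "Time", "api_path": "API"},
        },
    }
]
```
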
Then came the actual work. It took quite a lot of attempts, a lot of reading, and trawling through the community forums and multiple articles to understand how I could combine data from various log fields and use Grafana transformations to build meaningful visualizations. Some of them cover the latency per API, the status codes and the number of hits per API; others are a bit more complex. (In the images attached below, some data is hidden to avoid the obvious data-infringement issues.)

Dashboard panels: Count of hits, API and Status, Latency per API
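
For reference, the “Latency per API” panel boils down to a query shaped roughly like the sketch below, written out as a Python dict of the query model you would see in the exported dashboard JSON. The field names latency_ms and api_path are placeholders for the real log fields, and the exact key names may vary by plugin version.

```python
# Sketch of the "Latency per API" query as it appears in the exported
# dashboard JSON. latency_ms and api_path are placeholder field names.
latency_per_api_target = {
    "query": "api_path:*",                               # Lucene filter (placeholder)
    "metrics": [
        {"id": "1", "type": "avg", "field": "latency_ms"}   # average latency per bucket
    ],
    "bucketAggs": [
        {   # one series per API
            "id": "2",
            "type": "terms",
            "field": "api_path.keyword",
            "settings": {"size": "10", "order": "desc", "orderBy": "1"},
        },
        {   # bucketed over the custom time field
            "id": "3",
            "type": "date_histogram",
            "field": "logtime",
            "settings": {"interval": "auto"},
        },
    ],
}
```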

Now that that was done, only alerting was left. Alerting was the one aspect I believed would be the easiest to set up. Boy, was I wrong! I had overlooked a crucial detail: Grafana alerts work on numeric data, not on log data (strings or JSON). This posed a significant problem. Alerting is one of the major offerings of any observability or monitoring platform, and not having it is like flying a plane without a navigation system: you’re moving forward, but you have no way to know if you’re headed in the right direction or if there’s trouble ahead.
To make matters worse, there was minimal help available in the open forums and community; most articles dealt with Loki alerting, which did not solve my problem. So I had to sit down and regroup. It was around 11:30 at night, a heavy downpour had been going on since the evening, and I decided to switch off the lights so that it was just me and my system. I was focused on the pitter-patter of the rain, not even on the problem, when it hit me like a bolt of lightning. The solution was simple: input a Lucene query for the alert condition, add a terms grouping for the metric in question, and then group it all by the timestamp. This converts the logs into time series data, which alerting can work with. Once I set these configurations in place, the alerts finally worked as expected.

Alert rules: Status not 200, Latency > 1s
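
Concretely, here is roughly what the “Status not 200” alert query looks like, again written out as a Python dict of the query model. The status_code field name is a placeholder, and the reduce/threshold step mentioned in the comments is how Grafana’s unified alerting typically turns the resulting series into a firing condition.

```python
# Sketch of the "Status not 200" alert query: a Lucene filter plus a terms
# grouping and a date_histogram turn raw log lines into numeric time series
# that Grafana's alert evaluator can handle. status_code is a placeholder
# for the real log field; exact key names may vary by plugin version.
status_alert_target = {
    "query": "NOT status_code:200",              # Lucene alert condition
    "metrics": [{"id": "1", "type": "count"}],   # how many bad responses
    "bucketAggs": [
        {   # one series per offending status code
            "id": "2",
            "type": "terms",
            "field": "status_code",
            "settings": {"size": "10", "order": "desc", "orderBy": "_count"},
        },
        {   # bucketed over the custom time field -> time series
            "id": "3",
            "type": "date_histogram",
            "field": "logtime",
            "settings": {"interval": "1m", "min_doc_count": "0"},
        },
    ],
}
# The alert rule then reduces this series (e.g. Last) and applies a threshold
# (fire when the count is above 0).
```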

Wrap-up

With the visualizations and alerts in place, all that was left was to set up the contact points that send the firing and resolved notifications to my Teams channel and to a custom Python application that creates a ticket in the ITSM tool. Configuring this was a cakewalk compared to the other milestones: all it took was a Teams incoming webhook as one contact point and another custom webhook as the second.
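
For completeness, the custom webhook side can be as small as the sketch below: a tiny Flask app that receives Grafana’s webhook payload and raises a ticket. The ITSM endpoint and ticket fields are hypothetical placeholders; only the general shape of Grafana’s unified-alerting payload (a list of alerts with labels and annotations) is assumed.

```python
# Minimal sketch of the custom webhook contact point: a small Flask app that
# receives Grafana's alert webhook payload and raises a ticket in the ITSM
# tool. The ITSM endpoint and ticket fields are hypothetical placeholders.
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
ITSM_API = "https://itsm.example.internal/api/tickets"   # placeholder URL


@app.route("/grafana-alerts", methods=["POST"])
def grafana_alerts():
    payload = request.get_json(force=True)
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # only open tickets for firing alerts
        ticket = {
            "title": alert.get("labels", {}).get("alertname", "Grafana alert"),
            "description": alert.get("annotations", {}).get("summary", ""),
            "source": "grafana",
        }
        # Create the ticket; real auth and field mapping depend on the ITSM tool.
        requests.post(ITSM_API, json=ticket, timeout=10).raise_for_status()
    return jsonify({"status": "ok"})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```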

All in all, it was a real head-scratcher and brain-bender, but a fantastic learning experience!

