Adding Retry Logic to Splunk Webhook Alerts

Splunk is the great and powerful wizard of the log analytics world. Those who have maintained it shudder at the operational headache it can be, but those who rely on its search abilities greatly miss it when moving to alternatives like Logstash-powered ElasticSearch or SumoLogic.

Aside from collecting logs and providing a great search interface, Splunk also lets you set up alerting, much like elastalert does for ElasticSearch. Splunk can save a search and alert when a threshold is exceeded or when there is a lull in collection for a defined period of time. As with everything in the year 2017, integrations are what make these alerts powerful. Integrations such as webhooks, email alerts and script executions let you feed these alerts into your custom tools and workflows.
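
In Splunk terms, a webhook alert action usually lives in savedsearches.conf alongside the saved search itself. The sketch below is only illustrative; the stanza name, the example search, the schedule and the receiver URL are placeholders for whatever you actually alert on.

[Example 5xx Alert]
search = index=web sourcetype=access_combined status>=500 | stats count
enableSched = 1
cron_schedule = */5 * * * *
counttype = number of events
relation = greater than
quantity = 0
action.webhook = 1
action.webhook.param.url = https://webhook-receiver:8443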

While the alerting, searching and indexing are top-notch, webhook support is lacking. One major pain point is the absence of webhook retries. If your webhook fails to return a successful status code, that alert is lost forever.

Mind The Gap

When alerts fire, a default entry is made in the _internal index. For webhooks, you can narrow this search down to action=webhook. You'll normally see "Sending POST …". When the alert fails, the next log entry carries the error. Here is an example of an alert webhook failure:

03-27-2017 08:00:24.685 -0700 INFO sendmodalert - Invoking modular alert action=webhook for search="<ALERT_NAME>" sid="<JOB_SID>" in app="search" owner="<ALERT_OWNER>" type="saved"
03-27-2017 08:00:24.723 -0700 INFO sendmodalert - action=webhook STDERR - Sending POST request to url=https://webhook-receiver:8443 with size=414 bytes payload
03-27-2017 08:00:24.726 -0700 ERROR sendmodalert - action=webhook STDERR - Error sending webhook request: <urlopen error [Errno 111] Connection refused>
03-27-2017 08:00:24.730 -0700 INFO sendmodalert - action=webhook - Alert action script completed in duration=44 ms with exit code=2
03-27-2017 08:00:24.730 -0700 WARN sendmodalert - action=webhook - Alert action script returned error code=2

Looking at the results, we can see there is no common unique ID shared between the messages. As a result, in order to detect which requests failed and link them back to a SID, we must use a transaction.

index=_internal action=webhook sendmodalert 
| transaction action maxspan=2s maxevents=5
| where like(message,"%error%")

The above search matches only messages with action=webhook and the string sendmodalert present, then keeps only the transactions whose message field contains the string error. Each failed webhook produces the five log events shown above, which is why we set maxevents=5. You can always play around with maxspan=2s if needed, though the right value depends on your webhook receiver and its average latency or timeouts.
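
Before wiring this into a retry, it can help to sanity-check which jobs are affected. Assuming the sid field is auto-extracted from the key=value pairs in the first log line of each transaction, a quick listing looks like:

index=_internal action=webhook sendmodalert
| transaction action maxspan=2s maxevents=5
| where like(message,"%error%")
| table _time sid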

Putting It All Together

Once we are able to detect all of the failed responses using the transaction, we can identify the original SID. That SID points at the search job that stored the original payload for the Splunk alert which fired. We need these payloads to retry the original webhook and prevent a gap in coverage and alerting.

This is where loadjob comes in.

Using loadjob we are able to take the SID from a failed webhook and pull in its results. To chain this onto our failed webhook search, we use map, which runs loadjob for each of the failed webhook results.
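
If you want to verify what a single failed alert stored before mapping across all of them, you can run loadjob by hand with one of the SIDs surfaced by the transaction (the placeholder below mirrors the log example above):

| loadjob <JOB_SID>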

Our webhook retry alert will now look like:

index=_internal action=webhook sendmodalert 
| transaction action maxspan=2s maxevents=5
| where like(message,"%error%")
| map maxsearches=10 search="|loadjob $sid$"

Summary

I have only been able to identify two gotchas. The first is that the original Search Name is not returned by the SID. You can either manually add a search_name field to your original alert search, or ignore this completely.
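
If you do want the name attached, appending something along these lines to your original alert search is enough; the alert name is a placeholder:

<your original alert search>
| eval search_name="<ALERT_NAME>"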

The second gotcha I encountered is a normal Splunk-ism: you need to set the Alert mode to Once per result. This generates an individual alert for each failed result found.
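
In savedsearches.conf terms, the finished retry alert could look roughly like the sketch below, where alert.digest_mode = 0 should correspond to the Once per result setting in the UI, and the stanza name, schedule and receiver URL are placeholders:

[Webhook Retry]
search = index=_internal action=webhook sendmodalert | transaction action maxspan=2s maxevents=5 | where like(message,"%error%") | map maxsearches=10 search="|loadjob $sid$"
enableSched = 1
cron_schedule = */5 * * * *
counttype = number of events
relation = greater than
quantity = 0
alert.digest_mode = 0
action.webhook = 1
action.webhook.param.url = https://webhook-receiver:8443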

Using alert integrations is a hard requirement for anyone, or any team, looking to mature and grow their visibility around data. This is true for SecOps, DevOps and BigData alike.