Making Splunk part of the platform

FT Product & Technology
May 25, 2018

By Peter Schubert

This is a story about how we implemented a monitoring solution by bringing Splunk, our log aggregation tool, into our platform and using it as a data store. I will talk about the decision-making process and the challenges we encountered, as well as the way forward.

Problem

The content that FT journalists produce, along with the metadata it is annotated with (for example, to say that this article is ‘about’ Tim Cook, or ‘mentions’ Apple), is published to our platform (Universal Publishing Platform — UPP) in order to be made available to clients through public APIs. Knowing how well publishing works is core to our activity, so we need reliable ways to monitor transaction success, as well as SLA compliance (i.e. publishing within 2 minutes). We also need to know at any point whether there are failed transactions that still need to be sorted out manually.

The current solution for content publishing monitoring is a single service that polls relevant endpoints for content availability until the content is either found or the SLA timer runs out. This gives us great confidence that transactions complete, but there are drawbacks: the service is highly coupled to other microservices and often becomes a chokepoint for teams working concurrently.

Approach

When we started work on monitoring annotations publishing, we decided to change the approach and break the logic down into separate concerns. Each relevant service would log its key events, which would be stored and queried by specialized services that encapsulate the success criteria for a transaction.

We initially looked at DynamoDB to store the monitoring events, but we couldn’t index on multiple fields and that proved to be a showstopper. While we looked at alternatives (such as Amazon RDS), we also started considering Splunk as a viable option.

There were some obvious advantages:

  • We were already logging to Splunk as a default
  • No need for an additional database
  • Querying and reporting are available for everyone

Collecting events

We updated our logging library to provide an API for logging structured events and aligned all relevant services to produce event logs for key events in the publishing lifecycle. We also needed to make sure our events made it to Splunk reliably.
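
To give a flavour, a structured monitoring event might look like the sketch below; the field names are illustrative rather than the exact schema of our logging library:

```go
package events

import (
	"encoding/json"
	"os"
	"time"
)

// monitoringEvent is an illustrative structured event; the exact field names
// come from our logging library and may differ.
type monitoringEvent struct {
	TransactionID string `json:"transaction_id"`
	Event         string `json:"event"`        // e.g. "PublishStart"
	ContentType   string `json:"content_type"` // e.g. "Annotations"
	Monitoring    bool   `json:"monitoring_event"`
	Time          string `json:"@time"`
}

// LogMonitoringEvent writes one JSON object per line to stdout, where the
// log collector picks it up and forwards it to Splunk.
func LogMonitoringEvent(txID, event, contentType string) error {
	return json.NewEncoder(os.Stdout).Encode(monitoringEvent{
		TransactionID: txID,
		Event:         event,
		ContentType:   contentType,
		Monitoring:    true,
		Time:          time.Now().UTC().Format(time.RFC3339Nano),
	})
}
```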

Challenge 1: Timestamp parsing

As we are using the Splunk HTTP Event Collector (HEC), the normal timestamp parsing settings do not apply.

We struggled with this for a while before learning that we instead need to use the HEC event metadata to specify the timestamp explicitly, extracting it from the message payload ourselves. The metadata only supports millisecond precision (even though Splunk itself works with nanoseconds), but that is good enough for our needs.
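
For illustration, sending an event with the timestamp set explicitly in the HEC metadata might look roughly like the sketch below; the endpoint, token and sourcetype are placeholders:

```go
package hec

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// hecEvent mirrors the HTTP Event Collector JSON envelope: the "time" field
// carries the event timestamp as epoch seconds with millisecond precision.
type hecEvent struct {
	Time       float64     `json:"time"`
	SourceType string      `json:"sourcetype,omitempty"`
	Event      interface{} `json:"event"`
}

// SendEvent posts a single event to HEC, setting the timestamp we extracted
// from the message payload ourselves. hecURL and token are placeholders for
// the real collector endpoint and access token.
func SendEvent(hecURL, token string, ts time.Time, payload interface{}) error {
	body, err := json.Marshal(hecEvent{
		Time:       float64(ts.UnixMilli()) / 1000.0, // epoch seconds, ms precision
		SourceType: "monitoring_event",
		Event:      payload,
	})
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPost, hecURL+"/services/collector/event", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Splunk "+token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("HEC returned %s", resp.Status)
	}
	return nil
}
```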

This also opened an interesting possibility, that of “forging” the event date, which came in handy later on.

Challenge 2: HEC resilience

Two years ago, Jussi Heinonen wrote a great piece about how we introduced the HTTP Event Collector. Here is a diagram showing the log collection flow at the time:

For the most part, this works well, but once we started caring more about log data accuracy, we noticed that we had events that would simply never make it to Splunk.

It turns out that HEC may return an error status code (e.g. HTTP 503) or even suffer longer outages, and we were losing significant amounts of logs. It became apparent that we needed to retry the call, which becomes problematic when you have millions of events pouring in every hour.

Splunk HEC error statistics collected in Grafana

The solution was to buffer all events externally to our services, and we chose S3 for that.

Any time a call to Splunk HEC fails, we put the message (batch) back in S3 to be retried later. Consecutive failures also trigger an exponential backoff mechanism that gives any congestion time to clear.
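
A rough sketch of that retry-and-buffer logic, with S3 hidden behind a minimal interface (the backoff values and key handling are illustrative):

```go
package forwarder

import (
	"log"
	"time"
)

// buffer abstracts the S3 bucket we use to park batches that could not be
// delivered; the interface is a stand-in for the real S3 client code.
type buffer interface {
	Put(key string, batch []byte) error
}

// deliver tries to send one batch to HEC. On failure the batch goes back to
// S3 for a later retry, and consecutive failures grow the backoff delay so a
// congested HEC gets time to recover. All values are illustrative.
func deliver(send func([]byte) error, buf buffer, key string, batch []byte, failures *int) {
	if err := send(batch); err != nil {
		// Park the batch so it is not lost; it will be picked up again later.
		if perr := buf.Put(key, batch); perr != nil {
			log.Println("failed to re-buffer batch:", perr)
		}
		*failures++
		shift := *failures
		if shift > 7 {
			shift = 7
		}
		time.Sleep(time.Second << uint(shift)) // 2s, 4s, ... capped at ~2 minutes
		return
	}
	*failures = 0 // success resets the backoff
}
```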

Challenge 3: HEC node stickiness

Splunk HEC has a load balancer, and calls get routed to different indexer nodes. Whenever a node fails, the load balancer should remove it and route requests to the remaining nodes.

However, on several occasions we found that even though HEC itself had recovered, we were still hitting a node that was down. We couldn’t quite find out why, but we suspect our pooled HTTP connections remain sticky to that node and don’t go through the load balancer again. The remedy in this case was to restart the services, which supports that theory.
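
If that theory holds, a client-side mitigation would be to keep pooled connections short-lived and drop them after repeated failures; a sketch, with illustrative values:

```go
package forwarder

import (
	"net/http"
	"time"
)

// newHECClient builds an HTTP client whose pooled connections expire quickly,
// limiting how long we can stay stuck to a single indexer node.
func newHECClient() *http.Client {
	return &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			IdleConnTimeout:     30 * time.Second,
			MaxIdleConnsPerHost: 2,
		},
	}
}

// dropConnections can be called after repeated failures, so the next request
// opens a fresh connection and goes through the load balancer again.
func dropConnections(t *http.Transport) {
	t.CloseIdleConnections()
}
```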

Challenge 4: Kubernetes

While all this was going on, another team was busy migrating the platform to Kubernetes, so we had to make our stuff work on the new setup as well.

On the old platform, we had one systemd unit bundling the logfilter and logcollector services, and we would run one instance of this unit on each VM so that we could collect that machine’s journal. In S3, we could prefix objects with the machine tag and have each forwarder read only the objects carrying its own machine’s prefix. This way, forwarder instances were not competing for the same data.
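
Concretely, the per-machine key scheme looked roughly like the sketch below (the exact key layout is illustrative):

```go
package forwarder

import (
	"fmt"
	"time"
)

// keyFor builds the S3 key for a buffered batch, prefixed with the machine tag.
func keyFor(machineTag string) string {
	return fmt.Sprintf("%s/%d", machineTag, time.Now().UnixNano())
}

// prefixFor is what a forwarder passes to an S3 List call so that it only
// re-reads batches buffered by its own machine.
func prefixFor(machineTag string) string {
	return machineTag + "/"
}
```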

In the Kubernetes world, there was a variable number of machines, so prefixing was no longer feasible. Instead, we decided to separate the forwarder into a proper microservice (living in a pod rather than a systemd unit) and to run just one instance per cluster.

Our tests showed that this single instance can cope with the volume of data it needs to process, while we still run multiple log collector instances (one per journal).

Challenge 5: Forwarder resume

The previous solutions covered errors occurring in normal operation. But what about outages of our forwarder microservice? These happened on every deploy, and the service could also simply fail for unforeseen reasons. We knew that the forwarder always picks up the journal from “now”, so we were once again faced with leaking events.

To sort this out, we recorded the time in etcd on each shutdown and read it back at each startup. We could still miss less than one second’s worth of logs, but that was a much better place to be than before.
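
A sketch of that checkpointing with the etcd v3 client (the key name is a placeholder):

```go
package forwarder

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

const checkpointKey = "/forwarder/last-read" // illustrative key

// saveCheckpoint records the time of the last journal entry we forwarded.
func saveCheckpoint(cli *clientv3.Client, t time.Time) error {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	_, err := cli.Put(ctx, checkpointKey, t.Format(time.RFC3339Nano))
	return err
}

// loadCheckpoint returns the recorded time (or zero time if there is none),
// so that on startup we can resume the journal from where we left off.
func loadCheckpoint(cli *clientv3.Client) (time.Time, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	resp, err := cli.Get(ctx, checkpointKey)
	if err != nil || len(resp.Kvs) == 0 {
		return time.Time{}, err
	}
	return time.Parse(time.RFC3339Nano, string(resp.Kvs[0].Value))
}
```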

All this resilience work around collecting events was necessary for our goals, but it also fixed some long-standing issues for the wider platform.

Monitoring

With the groundwork of getting our data into Splunk in place, let’s take a look at the actual monitoring flows. (In practice, the monitoring functionality often revealed forwarding reliability issues that we had to go back and fix.)

Here is a simplified diagram of the solution, with the new additions in orange:

The Event Writer and Event Store from the conceptual diagram became the Resilient Splunk Forwarder and Splunk itself, respectively.

On the business side, we have the Splunk Event Reader, which executes specific Splunk queries using the Splunk REST API.
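
For illustration, driving a search over the REST API means creating a search job and polling it until it is done; a rough sketch, with the endpoint, credentials and query as placeholders:

```go
package reader

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strings"
	"time"
)

// startSearch submits a search job and returns its SID. baseURL, user and
// pass are placeholders for the real Splunk REST API endpoint and account.
func startSearch(baseURL, user, pass, query string) (string, error) {
	form := url.Values{}
	form.Set("search", query) // e.g. `search index=main sourcetype=monitoring_event ...`
	form.Set("output_mode", "json")

	req, err := http.NewRequest(http.MethodPost, baseURL+"/services/search/jobs", strings.NewReader(form.Encode()))
	if err != nil {
		return "", err
	}
	req.SetBasicAuth(user, pass)
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var out struct {
		Sid string `json:"sid"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.Sid, nil
}

// waitForJob polls the job until it is done (simplified: no overall timeout).
func waitForJob(baseURL, user, pass, sid string) error {
	for {
		req, err := http.NewRequest(http.MethodGet, fmt.Sprintf("%s/services/search/jobs/%s?output_mode=json", baseURL, sid), nil)
		if err != nil {
			return err
		}
		req.SetBasicAuth(user, pass)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err
		}
		var status struct {
			Entry []struct {
				Content struct {
					IsDone bool `json:"isDone"`
				} `json:"content"`
			} `json:"entry"`
		}
		err = json.NewDecoder(resp.Body).Decode(&status)
		resp.Body.Close()
		if err != nil {
			return err
		}
		if len(status.Entry) > 0 && status.Entry[0].Content.IsDone {
			return nil
		}
		time.Sleep(2 * time.Second)
	}
}
```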

The Annotations Monitoring Service looks at open transactions and checks their specific success condition. If a transaction is successful, the service writes its own transaction termination event, back-dated with the timestamp of the last event in that transaction. This has to happen here because this service is the only place where we can determine whether a publication has finished, successfully or not.

Challenge 6: Query performance

We started out with querying for transactions grouped by transaction id, for which Splunk’s transaction command comes in handy. However, queries took too long (around 30s), which was unusable for us. The solution was to query for individual events and do the transaction grouping in code. Even though we were pulling back more events than needed (we couldn’t discard some of them until we had looked at the whole transaction), it turned out to be an order of magnitude faster.

In terms of query performance, there are other general tips that helped, such as specifying the index, source and sourcetype, as well as the list of fields returned.
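
A minimal sketch of the in-code grouping that replaced the transaction command (event and field names are illustrative):

```go
package reader

import "sort"

// event is one monitoring log line returned by the Splunk query;
// the field names are illustrative.
type event struct {
	TransactionID string
	Name          string // e.g. "PublishStart", "SaveNeo4j", "PublishEnd"
	Time          int64  // epoch millis
}

// groupTransactions rebuilds transactions from individual events, which we
// found to be an order of magnitude faster than Splunk's transaction command.
func groupTransactions(events []event) map[string][]event {
	txs := make(map[string][]event)
	for _, e := range events {
		txs[e.TransactionID] = append(txs[e.TransactionID], e)
	}
	for id := range txs {
		sort.Slice(txs[id], func(i, j int) bool { return txs[id][i].Time < txs[id][j].Time })
	}
	return txs
}

// isClosed reports whether a transaction contains its terminating event.
func isClosed(tx []event) bool {
	for _, e := range tx {
		if e.Name == "PublishEnd" { // illustrative terminating event name
			return true
		}
	}
	return false
}
```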

Challenge 7: Concurrency limits

Splunk limits the number of concurrent queries that can be executed per user, per role and system-wide, based on a scaling equation. In our case, regular users can run 3, power users can run 10 (but no more than 100 across all power users), and the system supports only 38 queries at a time (including the web UI!). With several clusters running at least 3 queries every 5 minutes, we could easily hit the concurrency limit. For added safety, we separated the Splunk REST API users between prod and non-prod, so that we have more slack in production.

Even so, if we were to extend the monitoring solution to content and other items, we would need to consider these limits and find ways around them.
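
One client-side mitigation would be to cap our own concurrency, for example with a simple semaphore; the limit below is illustrative:

```go
package reader

// searchSlots caps how many Splunk searches we run concurrently from this
// service, so that we stay well below the per-user limit.
var searchSlots = make(chan struct{}, 2)

// runSearch acquires a slot before calling Splunk and releases it afterwards.
func runSearch(search func() error) error {
	searchSlots <- struct{}{}
	defer func() { <-searchSlots }()
	return search()
}
```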

Challenge 8: Transaction limits

Another pitfall of using Splunk transactions is that there is a limit on how many transactions a query can return (5,000 by default). This is problematic when we want to run SLA checks or outstanding-transaction queries over longer periods of time (even ones as short as 16 hours). Breaking the interval down into smaller chunks helps, but it takes very keen awareness to even know the truncation is happening.
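
Splitting the interval itself is simple enough; a sketch, with the chunk size as a tuning knob rather than a recommendation:

```go
package reader

import "time"

// window is one [Start, End) query interval.
type window struct {
	Start, End time.Time
}

// splitInterval breaks [start, end) into fixed-size chunks so that no single
// query comes anywhere near the transaction limit.
func splitInterval(start, end time.Time, chunk time.Duration) []window {
	var windows []window
	for s := start; s.Before(end); s = s.Add(chunk) {
		e := s.Add(chunk)
		if e.After(end) {
			e = end
		}
		windows = append(windows, window{Start: s, End: e})
	}
	return windows
}
```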

Challenge 9: Monitoring latency

Because the job that closes transactions runs every 5 minutes, we have to ignore the last 5 minutes in our reports and alerts. That causes some awkwardness in defining the query intervals and makes things a bit harder to understand. We’re mitigating that by providing as much clarity in the documentation as possible.
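
In code, this just means pinning a query’s latest time 5 minutes in the past; a sketch:

```go
package reader

import "time"

// closerLag matches the period of the transaction-closing job; events newer
// than this may belong to transactions that have not been closed yet.
const closerLag = 5 * time.Minute

// reportWindow returns the interval a report or alert should query:
// everything up to "now minus the lag".
func reportWindow(lookback time.Duration) (time.Time, time.Time) {
	latest := time.Now().Add(-closerLag)
	return latest.Add(-lookback), latest
}
```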

Challenge 10: Index failures

Despite all of the work above, we would still get alerts firing for no reason, or transactions not being closed.

We narrowed this down to Splunk index failures. You may have noticed, at times, a small warning icon on your search job saying that an indexer node was down and that results may not be complete. Typical alerts and reports that look for the existence of events (e.g. errors) are not that affected by gaps in the data, but where every event counts, we can’t afford index failures.

For the Splunk Event Reader service, we were using the streaming endpoint, which was more than happy to return partial results without any hint of an issue. Once we switched to the regular endpoint, we were able to pick up job messages which would indicate index failures.

Still, the job itself would be marked as successful and the messages were mere warnings. We decided to be aggressive and treat any message as a failure, causing the whole call to fail and forcing the Annotations Monitoring Service to retry the same interval. This at least gives us eventual consistency of the data, although false alerts can still surface in the meantime.
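
A sketch of what that message check might look like when decoding a results response (only the fields we care about are modelled; multi-value fields are not handled):

```go
package reader

import (
	"encoding/json"
	"fmt"
	"io"
)

// resultsResponse models the parts of the Splunk results JSON we care about.
type resultsResponse struct {
	Messages []struct {
		Type string `json:"type"`
		Text string `json:"text"`
	} `json:"messages"`
	Results []map[string]interface{} `json:"results"`
}

// parseResults decodes the results body and treats *any* job message (for
// example an indexer-down warning) as a failure, so the caller retries the
// same interval instead of silently working with partial data.
func parseResults(body io.Reader) ([]map[string]interface{}, error) {
	var r resultsResponse
	if err := json.NewDecoder(body).Decode(&r); err != nil {
		return nil, err
	}
	if len(r.Messages) > 0 {
		return nil, fmt.Errorf("splunk job reported messages, treating as failure: %s: %s",
			r.Messages[0].Type, r.Messages[0].Text)
	}
	return r.Results, nil
}
```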

Typical index failures in a day

For Splunk Alerts configured in the UI, we added the $job.messages$ field to the email body, which surfaces potential index failures.

Unfortunately, we have not yet been able to fully mitigate index failures. We’re continuing to try, and Splunk are working hard to find a suitable remedy. In the meantime, we are now collecting more accurate data on these failures, which at least helps us triage false positives more quickly.

Conclusions and future considerations

All in all, this was an interesting adventure that helped us better understand what Splunk can and can’t do, while also improving cross-cutting functionality of the platform.

That said, we are still not in an ideal place with the level of reliability we need from Splunk.

Luckily, there are several things we can do to improve this, such as using a Kinesis Data Firehose or some other intermediate collection mechanism for our direct needs, with Splunk remaining as a long-term archive, or even eventually replacing Splunk altogether.
