Turning the Data Firehose on Splunk

Daniel Arrais · Feedzai Techblog · Jan 8, 2021 · 9 min read

Splunk is a software platform widely used for searching, analyzing, and visualizing machine-generated data gathered in real time from the components that make up your IT infrastructure and business.

One of the really great things about Splunk is that feeding data into the platform is as simple as pointing it at a data source, teaching it a thing or two about the source and, poof…! That source, like the ugly bird in the Ugly Duckling story, becomes a beautiful swan. Oops! Data input. I meant it becomes a beautiful data input.

But how do you actually do it?

The material presented in this article provides an overview of how we at Feedzai leveraged the power of AWS Services, in particular Amazon Kinesis Data Firehose, to build a fully managed, reliable, and scalable serverless data streaming solution to Splunk.

Data Collection Architecture

To help you ingest data as easily as possible, Splunk supports a broad variety of data collection mechanisms, each purpose-built to serve different collection needs.

The most common method to get data into Splunk is by using the universal forwarder (UF), an agent-based data collection mechanism with minimal resource requirements that has little to no impact on the performance of the machines where it is installed.

The UF provides, among other things:

  • Checkpoint/restart function for lossless data collection;
  • An efficient protocol that curtails network bandwidth use;
  • Throttling capabilities;
  • Native load-balancing capabilities across existing indexers;
  • Network encryption using SSL/TLS;
  • Data compression; and
  • Parallel ingestion pipeline support to increase throughput and reduce latency.

Indeed, this is all great. But what if you have all your infrastructure entirely built upon AWS Cloud infrastructure and AWS Services?

Imagine, for example, that you already have all your instances configured to send relevant log data into Amazon CloudWatch service, using the unified CloudWatch Logs agent. All your application logs, web server logs, system logs, and so on. All working. All going into a centralized location. Everything is all neat and tidy. And yet you’d be installing another agent on your instances.

I know that, at this point, I’m oversimplifying a bit, as there are things you can do with Splunk’s UF that you cannot do with the CloudWatch Logs agent, namely executing scheduled scripts for additional reporting, or setting default values for “index” and “sourcetype” for monitored files. The point here is simply to avoid agent sprawl, avoid duplicating outbound data from servers, and reduce operational management overhead.

The solution we’ve put in place to avoid this agent sprawl is based on Splunk’s HTTP Event Collector (HEC) and Amazon Kinesis Data Firehose, with services such as Amazon CloudWatch, Amazon Kinesis Data Stream, AWS Lambda, and Amazon S3 also making an appearance. The architecture diagram is as follows:

Infrastructure supporting cross-account log data sharing from CloudWatch to Splunk

By building upon a managed service like Amazon Kinesis Data Firehose for data ingestion at Splunk, we obtain a serverless architecture for data collection, with no additional forwarders to manage or configure, that provides out-of-the-box reliability and scalability.

Back in December 2017, a few months after Splunk and AWS jointly announced that Amazon Kinesis Data Firehose would start supporting the Splunk Platform as a delivery destination, Tarik Makota, Principal Solutions Architect at AWS, and Roy Arsan, Solutions Architect at Splunk, wrote an article outlining the steps to implement a similar architecture. Nevertheless, there are two key differences between their architecture and ours. First, our sources of data are multiple CloudWatch Logs subscription filters dispersed across several AWS accounts. Second, our receiving service is a bunch of CloudWatch Logs destinations encapsulating a single Amazon Kinesis Data Stream deployed in one centralized account.

But how does all of this actually work?

1 — Using CloudWatch Logs subscription filters, we set up real-time delivery of CloudWatch Logs to a bunch of CloudWatch Logs destinations encapsulating a single Kinesis Data Stream. And I say “a bunch” because, according to AWS, when configuring a subscription filter to send log data across accounts, the CloudWatch log group must be in the same AWS region as the destination, and we use more than one AWS region.
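
In practice, this boils down to a couple of API calls. Here is a minimal boto3 sketch with placeholder names, ARNs, and account IDs: the destination and its policy live in the central logging account, and the subscription filter lives in each source account (note the “Random” distribution, which point 2 of the next section explains).

import boto3

# Central logging account: wrap the Kinesis Data Stream in a CloudWatch Logs
# destination (one per region we collect from). Names and ARNs are placeholders.
logs_central = boto3.client("logs", region_name="eu-west-1")
logs_central.put_destination(
    destinationName="splunk-log-destination",
    targetArn="arn:aws:kinesis:eu-west-1:111111111111:stream/splunk-ingest",
    roleArn="arn:aws:iam::111111111111:role/CWLtoKinesisRole",
)

# Allow the source accounts to subscribe to the destination.
logs_central.put_destination_policy(
    destinationName="splunk-log-destination",
    accessPolicy='{"Version":"2012-10-17","Statement":[{"Effect":"Allow",'
                 '"Principal":{"AWS":"222222222222"},"Action":"logs:PutSubscriptionFilter",'
                 '"Resource":"arn:aws:logs:eu-west-1:111111111111:destination:splunk-log-destination"}]}',
)

# Each source account (client built with that account's credentials):
# subscribe a log group to the destination.
logs_source = boto3.client("logs", region_name="eu-west-1")
logs_source.put_subscription_filter(
    logGroupName="/my-app/production",
    filterName="to-splunk",
    filterPattern="",  # empty pattern forwards every event
    destinationArn="arn:aws:logs:eu-west-1:111111111111:destination:splunk-log-destination",
    distribution="Random",
)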

2 — As the data becomes available on the Kinesis Data Stream, it is almost immediately consumed by Kinesis Data Firehose, at a maximum total read rate of 2 MB/second per shard. Behind the scenes, Kinesis Data Firehose continuously pulls data from the Kinesis Data Stream, issuing get operations against each shard once per second.
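
If you want a quick sanity check on how many shards to provision, a back-of-envelope calculation based on the documented per-shard limits (1 MB/s of writes, 2 MB/s of reads) is usually enough; the peak volume below is just an assumed figure.

import math

# Assumed peak log volume across all subscribed log groups.
peak_log_volume_mb_per_s = 6.0

shards_for_writes = peak_log_volume_mb_per_s / 1.0  # 1 MB/s write limit per shard
shards_for_reads = peak_log_volume_mb_per_s / 2.0   # 2 MB/s read limit per shard
shards_needed = math.ceil(max(shards_for_writes, shards_for_reads))

print(f"Provision at least {shards_needed} shards (plus headroom for spikes)")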

3 — Data received by the Kinesis Data Stream, coming from CloudWatch Logs, is Gzip compressed. Therefore, we must first pass the data through a Lambda-based data transformation in Kinesis Data Firehose, to decompress the data and place it back into the stream. AWS has a list of available blueprint functions for this exact purpose. Taking the “Kinesis Firehose CloudWatch Logs Processor” Lambda blueprint as a reference, you can add extra logic to the Lambda function and have it enrich and filter data events based on the source records. For example, we made our Lambda route data to different indexes depending on the source event’s log group and stream.
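
The gist of that transformation looks roughly like the sketch below, loosely modeled on the blueprint (the real blueprint also handles re-ingesting oversized batches, which is omitted here); the choose_index routing rule is purely illustrative.

import base64
import gzip
import json

def choose_index(log_group):
    # Example routing rule only — ours maps log groups/streams to Splunk indexes.
    return "aws_prod" if "production" in log_group else "aws_dev"

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(gzip.decompress(base64.b64decode(record["data"])))

        # CloudWatch sends control messages to validate the subscription; drop them.
        if payload.get("messageType") != "DATA_MESSAGE":
            output.append({"recordId": record["recordId"], "result": "Dropped"})
            continue

        # Re-emit each log event in the HEC "event" JSON envelope.
        events = []
        for log_event in payload["logEvents"]:
            events.append(json.dumps({
                "time": log_event["timestamp"] / 1000.0,
                "source": payload["logGroup"],
                "sourcetype": "aws:cloudwatchlogs",
                "index": choose_index(payload["logGroup"]),
                "event": log_event["message"],
            }))

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode("".join(events).encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}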

4 — Firehose then delivers the massaged events to the Splunk HEC. As a best practice, we actually set up an Elastic Load Balancer (ELB) as the destination endpoint for our Amazon Kinesis Data Firehose stream, for high availability, throughput, and scale. As we’re planning to ingest large amounts of data, we went with a Classic Load Balancer; Splunk recommends using an Application Load Balancer (ALB) only for light traffic loads, and Network Load Balancers (NLB) are not supported.

5 — The Kinesis Data Firehose is configured to automatically backup data to an Amazon S3 bucket, in case any issues arise while trying to deliver data to Splunk. Between choosing to back up all data, or only the data that’s failed during delivery to Splunk, we went for the latter. This way, we won’t lose any valuable events and, if we find it necessary at some point in time, we can even deploy an alternate mechanism to try ingesting them automatically into Splunk again.

Awesome, right? True. But this is almost a whole new data collection mechanism in itself. Maybe that’s a topic for another time!
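
For the record, pulling steps 2 through 5 together, the Firehose side of things can be set up with a single create_delivery_stream call, roughly like the boto3 sketch below; every ARN, name, and token is a placeholder, and the exact endpoint URL depends on how your ELB is set up.

import boto3

firehose = boto3.client("firehose", region_name="eu-west-1")

firehose.create_delivery_stream(
    DeliveryStreamName="splunk-delivery",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:eu-west-1:111111111111:stream/splunk-ingest",
        "RoleARN": "arn:aws:iam::111111111111:role/FirehoseReadKinesisRole",
    },
    SplunkDestinationConfiguration={
        # The ELB fronting the HEC (step 4).
        "HECEndpoint": "https://splunk-hec.example.com:8088",
        "HECEndpointType": "Event",
        "HECToken": "00000000-0000-0000-0000-000000000000",
        # Back up only what fails to reach Splunk (step 5).
        "S3BackupMode": "FailedEventsOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::111111111111:role/FirehoseS3BackupRole",
            "BucketARN": "arn:aws:s3:::splunk-delivery-failed-events",
        },
        # The decompress/enrich Lambda from step 3.
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [{
                "Type": "Lambda",
                "Parameters": [{
                    "ParameterName": "LambdaArn",
                    "ParameterValue": "arn:aws:lambda:eu-west-1:111111111111:function:cwl-to-splunk-processor",
                }],
            }],
        },
    },
)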

6 — Splunk parsing configurations, packaged in the Splunk Add-on for Kinesis Data Firehose, are then responsible for preparing data for querying and visualization, by extracting and parsing all fields as instructed.

And, that’s it. Data is now in Splunk, ready to be searched and consumed. Woo-hoo!!!

Watch Out for Sharp Bends

Wait! This all sounded too easy, right?


You set up a few AWS components here and there, and it’s all good? That’s it? Nothing else is needed? Hmmm… What’s the catch? There has to be one.

As of this moment, you’re probably questioning this yourself. And you’re right in doing so. It seems too easy. But it’s not. It’s not all rainbows and unicorns. There are a few things here and there that you should know before diving deep into your own Splunk/Kinesis deployment. Otherwise, you might end up with a few errors. We know we did.

That said, this is what you should know and watch out for, prior to onboarding on this journey:

1 — The source log group and the destination must be in the same AWS region.

The destination can point to an AWS resource that is located in a different region though. That’s entirely fine. You don’t need to have multiple Kinesis Data Streams configured, one per each region. You can have just one, in a region of your choosing.

2 — When creating your CloudWatch Logs subscription filters, be sure to set the “distribution” property to “Random”.

The goal is to ensure that the entire throughput provided by the number of shards you provision on your Kinesis Data Stream can be utilized to the fullest. This will save you a lot of trouble.

By default, CloudWatch uses the name of the log group as the partition key when putting data into the Kinesis Data Stream. Since it is quite likely that some log groups put far more data into the stream than others, this causes data skew across the shards. The hot shards can exhaust your Kinesis Data Stream throughput, and your transformation Lambda can end up throttling because, when invoked with a payload exceeding the 6 MB threshold, it truncates the payload and re-ingests the excess records back into the source stream (using the same partition key), which only feeds the skew.

3 — Set your Splunk HEC endpoint type as an event endpoint.

The Splunk Add-on for Amazon Kinesis Firehose supports data collection using either of the two HEC endpoint types: raw and event. For a wide variety of use cases, a raw endpoint works. However, if you choose to apply an AWS Lambda blueprint to pre-process your events into a JSON structure and set event-specific fields, then you need to set the Splunk endpoint type as an event endpoint.
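
An easy way to double-check that your token and the event endpoint agree on the format is to post one event by hand, using the same JSON envelope the transformation Lambda emits. A hypothetical example with the requests library (endpoint, token, and index are placeholders):

import requests

resp = requests.post(
    "https://splunk-hec.example.com:8088/services/collector/event",
    headers={"Authorization": "Splunk 00000000-0000-0000-0000-000000000000"},
    json={
        "time": 1610000000,
        "source": "/my-app/production",
        "sourcetype": "aws:cloudwatchlogs",
        "index": "aws_prod",
        "event": "hello from the firehose",
    },
    verify=True,  # the certificate has to be valid anyway — see point 5 below
)
print(resp.status_code, resp.text)  # expect 200 and {"text":"Success","code":0}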

4 — Enable sticky sessions and disable cookie expiration for the ELB acting as destination for your Kinesis Data Firehose stream.

In the AWS Management Console, you do this after actually creating your ELB: go to the “Port Configuration” section of your ELB, click “Edit stickiness”, tick the option “Enable load balancer generated cookie stickiness”, and set the “Expiration Period” to 0 seconds.
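
The same change can be scripted against the Classic Load Balancer API. A sketch with boto3, assuming the ELB is named splunk-hec-elb and its HEC-facing listener is on port 8088:

import boto3

elb = boto3.client("elb", region_name="eu-west-1")  # Classic Load Balancer API

# Duration-based stickiness; an expiration of 0 disables scheduled cookie expiry.
elb.create_lb_cookie_stickiness_policy(
    LoadBalancerName="splunk-hec-elb",
    PolicyName="hec-stickiness",
    CookieExpirationPeriod=0,
)

# Attach the stickiness policy to the listener fronting the HEC.
elb.set_load_balancer_policies_of_listener(
    LoadBalancerName="splunk-hec-elb",
    LoadBalancerPort=8088,
    PolicyNames=["hec-stickiness"],
)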

5 — Ensure that your HEC endpoint is terminated with a valid CA-signed SSL/TLS certificate, matching the DNS hostname used to connect to the endpoint.

Amazon Kinesis Data Firehose strictly requires this, so you have to import your valid CA-signed SSL/TLS certificates into AWS Certificate Manager before creating or modifying your ELB, and then use that certificate on the ELB.
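
This is scriptable too, if that is your thing. A sketch of the two calls involved, with placeholder file names, load balancer name, and port:

import boto3

acm = boto3.client("acm", region_name="eu-west-1")

# Import the CA-signed certificate (file paths are placeholders).
with open("hec.crt", "rb") as cert, open("hec.key", "rb") as key, open("chain.pem", "rb") as chain:
    cert_arn = acm.import_certificate(
        Certificate=cert.read(),
        PrivateKey=key.read(),
        CertificateChain=chain.read(),
    )["CertificateArn"]

# Point the Classic ELB's HTTPS/SSL listener at the imported certificate.
boto3.client("elb", region_name="eu-west-1").set_load_balancer_listener_ssl_certificate(
    LoadBalancerName="splunk-hec-elb",
    LoadBalancerPort=8088,
    SSLCertificateId=cert_arn,
)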

6 — Enable indexer acknowledgement on the HEC.

This is to ensure that Amazon Kinesis Firehose knows what to do in case of delivery failure.

The HEC, upon receiving an event successfully, immediately sends an HTTP Status 200 to the sender, meaning that event data received is valid. While HEC has precautions in place to prevent data loss, outages or system failures can still occur, resulting in events being lost before they are indexed if acknowledgement is not enabled.

7 — Increase the HEC acknowledge timeout and retry duration for the Splunk destination when setting up your Amazon Kinesis Data Firehose stream.

As we set up our deployment, we almost immediately found out that the default values for these properties were not enough, as we were getting a lot of failed events splashing into the S3 backup bucket. It was only after raising these values that we stopped getting delivery errors.

By increasing the HEC acknowledge timeout and retry duration, you will be allowing Firehose more time to retry delivering the data in case of any failure.
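
Both properties can be raised on an existing delivery stream with a single update_destination call. A sketch, with illustrative values rather than a recommendation:

import boto3

firehose = boto3.client("firehose", region_name="eu-west-1")

# Look up the current version and destination IDs, then raise the timeouts.
desc = firehose.describe_delivery_stream(DeliveryStreamName="splunk-delivery")
stream = desc["DeliveryStreamDescription"]

firehose.update_destination(
    DeliveryStreamName="splunk-delivery",
    CurrentDeliveryStreamVersionId=stream["VersionId"],
    DestinationId=stream["Destinations"][0]["DestinationId"],
    SplunkDestinationUpdate={
        # Tune to your own delivery error rate.
        "HECAcknowledgmentTimeoutInSeconds": 600,
        "RetryOptions": {"DurationInSeconds": 300},
    },
)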

8 — Your ELB endpoint must be publicly accessible with a public IP address.

This is an actual limitation on AWS. For added security, we suggest setting up your ELB with a security group allowing access from Kinesis Data Firehose IP addresses only.

Pay attention to the AWS region where your Kinesis Data Firehose is deployed. To know which IP addresses are used by Kinesis, refer to: Access to Splunk in VPC — Amazon Kinesis Data Firehose.
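
Locking the security group down to those ranges might look like the sketch below; the CIDR is deliberately a placeholder, so replace it with the actual Firehose ranges for your region from the page linked above.

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # the ELB's security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8088,
        "ToPort": 8088,
        "IpRanges": [{
            "CidrIp": "203.0.113.0/26",  # placeholder — use the documented Firehose range
            "Description": "Kinesis Data Firehose (region-specific)",
        }],
    }],
)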

All Good Things Must Come to an End

At the end of the day, we achieved what we wanted. Data is in Splunk, in what we’d call an elegant way. The greatest advantage of such an architecture is the centralized management of the data collection mechanism in a single AWS account. And that is, for sure, something to value.

The only things that need to be configured on the other AWS accounts are the CloudWatch Logs subscription filters, which are fairly easy to set up and very unlikely to raise any errors. Errors may still occur, yes, but on a completely different end of the architecture, as explained.

Hope you enjoyed the article. Happy Splunking. Happy AWS-ing!!!
