Splunk Pipeline, the road to native

Airwallex Security Team
Airwallex Engineering
Aug 15, 2022

The world has changed:

In the new (post-COVID) world, our working habits, environments and (yes) security have changed. We are no longer bound to a physical location with a single main office network, and most of the services we use have long since been transitioned to SaaS providers, along with the data stored in them.

The classic model of fetching logs from internal IT systems (Syslog, I’m looking at you) no longer covers the entire landscape. In today’s world, most of the action happens in environments we don’t control or own (SaaS and other fluffy objects).

This leads to substantial challenges, with existing monolithic approaches (dare we say “legacy”) failing to adapt.

Back in 2021, Airwallex was facing a similar challenge: we wanted the ability to monitor our online assets for indicators of a threat or breach. Splunk was the data destination, but we weren’t excited by the existing approaches to obtaining data (i.e. monolithic, slow-to-deploy apps).

Unlike traditional log sources, audit logs from SaaS providers (and other cloud-based systems) are all about APIs. The days of tailing a file (and having Logstash forward it) are gone, with service providers like Google Workspace, Atlassian, LastPass and others providing APIs as the means to obtain logs.

We wanted a platform that can scale easily, requires minimal operational resources and is easy to develop and debug, so we decided to go native (Cloud Native!).

The road to native

Phase 1 — The drawing board

From the get go, we had a basic set of requirements to cover:

  1. Simple to operate and maintain: we don’t want a large set of people operating it (we’d rather focus on security ;) ).
  2. Reliable: not having the logs means no monitoring/alerting.
  3. Extensible: we would like to be able to add new providers easily.
  4. Timely: logs should arrive within an expected time range.

As a cloud native solution, the first thing that popped into our minds was, of course, a cloud function: a small program ready to be executed on demand based on a predefined trigger (a Pub/Sub message, an API call, etc.). This fits the bill perfectly when it comes to scheduled tasks that collect audit logs.
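For illustration only (a sketch, not the project’s actual code), a Pub/Sub-triggered Cloud Function entry point in Python looks roughly like this; `run_fetchers` is a hypothetical dispatcher, sketched a little further below:

```python
import base64
import json

def main(event, context):
    """Entry point for a Pub/Sub-triggered background Cloud Function.

    `event["data"]` carries the base64-encoded message body published by
    Cloud Scheduler; here we treat it as a JSON fetch instruction.
    """
    message = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    provider = message.get("provider", "all")
    run_fetchers(provider)  # hypothetical dispatcher, see the routing sketch below
```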

It is simple in the sense that we don’t need to herd machines (or cats) to operate it. It’s also simple from the UNIX philosophy perspective where each component does a single task well, enabling it to be composed with other components in arbitrary ways (functional composition anyone?).

We also decided early on that a single “monolithic” function with internal routing logic would be simpler to maintain, monitor and deploy than a function per provider. This might sound contradictory to the single-task principle, but it makes sense given the logic shared between providers, and it also makes for a simpler “packaging” process.
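As a sketch of what that internal routing could look like (the provider modules and names here are illustrative, not the project’s actual layout):

```python
# Hypothetical provider modules; each exposes a fetch() that follows the shared pattern.
from providers import google_workspace, atlassian, lastpass

FETCHERS = {
    "google_workspace": google_workspace.fetch,
    "atlassian": atlassian.fetch,
    "lastpass": lastpass.fetch,
}

def run_fetchers(provider: str) -> None:
    """Dispatch to a single provider's fetcher, or to all of them."""
    targets = FETCHERS if provider == "all" else {provider: FETCHERS[provider]}
    for name, fetch in targets.items():
        fetch()
```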

Satisfying the reliability requirement meant our basic architecture should be resilient to the intermittent failures and errors that could (and do) happen. By designing the system around a queue with a recurring schedule and persistent state, we can make sure the system recovers from such failures.

Sticking to existing pre-built core components adds another layer of simplicity and reliability, since we don’t need to worry about maintaining or scaling them (standing on the shoulders of the cloud).

Our infrastructure and code fulfil the extensibility requirement by ensuring providers follow a similar pattern for fetching and persisting state, while still supporting unique behaviour when needed.
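One way to express that shared pattern (a sketch under our own naming, not the project’s actual class hierarchy) is a small base class that fixes the flow and leaves the provider-specific API call to each implementation:

```python
from abc import ABC, abstractmethod
from datetime import datetime

class BaseFetcher(ABC):
    """Common shape every provider follows; only the API call differs."""

    name: str  # provider identifier, also the key for the persisted fetch state

    @abstractmethod
    def fetch_events(self, start: datetime, end: datetime) -> list[dict]:
        """Provider-specific API call: auth, pagination and quirks live here."""

    def run(self, state_store, splunk_client, now: datetime) -> None:
        """Shared flow: load state, fetch the window, ship to Splunk, persist state."""
        start = state_store.load(self.name)
        events = self.fetch_events(start, now)
        splunk_client.send(events)
        state_store.save(self.name, now)  # exactly which value to persist is discussed in Phase 3
```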

Lastly, timeliness means that our fetching strategy should be robust to the different providers’ SLAs, while we work to ship the logs as soon as they become available.

That said, this system isn’t real-time; being pull-based (versus reactive), we expect up to T minutes of delay (where T is the interval, in minutes, at which the scheduler wakes the function).

Phase 2 — Architecture

As hinted above, we settled on the following basic building blocks:

  • A single GCP Cloud Function triggered by Pub/Sub at scheduled intervals
  • A Cloud Scheduler job that triggers the function (every T minutes)
  • BigQuery as a persistent store for the fetch state (a minimal sketch of the state access follows below)
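For illustration, reading and writing that fetch state in BigQuery could look roughly like the sketch below (the dataset, table name and schema are assumptions, not the project’s actual layout):

```python
from google.cloud import bigquery

client = bigquery.Client()
STATE_TABLE = "my-project.splunk_pipeline.fetch_state"  # hypothetical table

def load_last_timestamp(provider: str):
    """Read the last persisted fetch position for a provider."""
    job = client.query(
        f"SELECT last_event_ts FROM `{STATE_TABLE}` WHERE provider = @provider",
        job_config=bigquery.QueryJobConfig(
            query_parameters=[bigquery.ScalarQueryParameter("provider", "STRING", provider)]
        ),
    )
    rows = list(job.result())
    return rows[0].last_event_ts if rows else None

def save_last_timestamp(provider: str, ts) -> None:
    """Persist the new fetch position for a provider."""
    client.query(
        f"UPDATE `{STATE_TABLE}` SET last_event_ts = @ts WHERE provider = @provider",
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("ts", "TIMESTAMP", ts),
                bigquery.ScalarQueryParameter("provider", "STRING", provider),
            ]
        ),
    ).result()
```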

The function supports multiple data sources, and the code is extensible, so new data sources can be added easily.

Phase 3 — The fickle nature of time

While all the fetchers are different (they call different APIs using different auth schemes, etc.), the basic fetching algorithm is shared between them. This keeps the code simpler and the system easier to track and maintain.

Our first fetching implementation kept things simple with the following algorithm (sketched below):

  1. Wake up every 5 minutes
  2. Grab the last persisted timestamp ‘t’ and fetch the ‘t’ to ‘t + 5’ minute window
  3. On success, persist ‘t + 5’ and go back to sleep.
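In (hypothetical) Python, the naive loop looked something like this, reusing the illustrative helpers from the earlier sketches:

```python
from datetime import timedelta

def naive_run(fetcher, state_store, splunk_client):
    window = timedelta(minutes=5)                 # 1. the scheduler wakes us every 5 minutes
    t = state_store.load(fetcher.name)            # 2. last persisted timestamp 't'
    events = fetcher.fetch_events(t, t + window)  #    fetch the [t, t + 5 min] window
    splunk_client.send(events)
    state_store.save(fetcher.name, t + window)    # 3. on success, persist 't + 5' and sleep
```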

On the surface, this approach works well, but it fails to take one important factor into account: the processing time itself!

In actuality, the function has a built-in delay:

‘t + 5’ min < ‘t + 5’ min + [function processing time]

In other words, by the time the window ending at ‘t + 5’ has actually been fetched and persisted, the wall clock has already moved past it by the function’s processing time. The persisted timestamp therefore falls a little further behind on every run, and this accumulation of delays makes logs arrive late. So we went back to the drawing board and came up with a new fetching algorithm.

With the new algorithm, we use the timestamp of the last event we fetched as the next starting point; this alleviates the processing-delay issue, since the next fetch no longer depends on when the previous run finished.
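A sketch of the revised approach (again with illustrative helper names): the persisted position now comes from the data itself rather than from the schedule.

```python
from datetime import datetime, timezone

def improved_run(fetcher, state_store, splunk_client):
    start = state_store.load(fetcher.name)        # timestamp of the last event we shipped
    now = datetime.now(timezone.utc)
    events = fetcher.fetch_events(start, now)     # fetch everything available up to "now"
    if not events:
        return                                    # nothing new; keep the old position
    splunk_client.send(events)
    last_event_ts = max(e["timestamp"] for e in events)  # assumes each event carries a timestamp
    state_store.save(fetcher.name, last_event_ts)        # the next run starts from the last event
```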

What’s in the box

So, what are we releasing here?

The Splunk Pipeline OSS project (imaginative name, we know!) contains a single, easy-to-deploy Cloud Function on top of GCP that supports multiple audit log providers out of the box (see the repository for the current list).

The docs folder in the repo covers everything from deploying the code to running it locally.

Keeping things tight

Being a security-oriented team, we made the following design and implementation choices:

  1. All secrets are stored in GCP Secret Manager, which brings a lot of security benefits (auditing, encryption, lifecycle management); see the short sketch after this list.
  2. A minimally scoped service account is set up in IAM (the function should only be able to access the resources it needs).
  3. The function lives in a dedicated, secure project (preferably one with only limited access to other projects, and vice versa).
  4. Internal (proprietary) transformation logic is supported; the fetching logic can be extended before data is persisted, which strikes the right balance between open-source code and the internal logic each organisation might have.
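For example, fetching a credential at runtime with the Secret Manager client library looks roughly like this (project and secret names are placeholders):

```python
from google.cloud import secretmanager

def read_secret(project_id: str, secret_id: str) -> str:
    """Fetch the latest version of a secret from GCP Secret Manager."""
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")

# Usage (identifiers are illustrative):
# splunk_token = read_secret("my-security-project", "splunk-hec-token")
```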

More food for thought:

Time to delivery

Since we are dealing with alerts and security, having the logs arrive as soon as possible (from the moment of inception to the time they are available in Splunk) is key. This time is comprised of the following:

R + S + F + I = Delivery time

Let’s break apart the above variables:

  • R — Remote SaaS delay. Different providers have different SLAs for the delay between an event being logged and it becoming available to fetch via the API. It’s important to be aware of those numbers and take note of them.
  • S — Function schedule time. The magic number should be short enough to provide a sane delay, while also allowing enough time for data collection.
  • F — Function runtime, composed of fetch time + processing time + write-into-Splunk time. The code is single-threaded (thus simple) and has proved performant enough for the number of events we need to capture.
  • I — Splunk indexing. In practice, we didn’t find this to be a huge issue, but depending on the state of your cluster, YMMV.

While we control some of the variables above (S, F), others are pre-set for us; it’s key to be aware of the existing limitations so we can better plan our internal alerting SLA.
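To make that concrete with purely illustrative numbers: if a provider makes events fetchable within about 5 minutes (R), the scheduler fires every 5 minutes (S), the function takes roughly a minute to fetch, process and write (F), and indexing adds another minute (I), the worst-case delivery time lands around 5 + 5 + 1 + 1 = 12 minutes, and any internal alerting SLA has to budget for that.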

Scaling

The current implementation fetches the logs for a single time-range query as one bulk into memory and writes them (in batches) to Splunk.

For most fetchers this scheme has proved sufficient (as long as the logs for a single run fit in memory). However, for some providers (like Google Drive) the number of logs can exceed what a single run/function can handle (mainly during spike events).
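As a rough sketch of the write side (the HEC endpoint, token handling and batch size here are assumptions, not the project’s actual configuration):

```python
import json
import requests

def send_to_splunk(events, hec_url, hec_token, batch_size=500):
    """Write events to the Splunk HTTP Event Collector in fixed-size batches."""
    headers = {"Authorization": f"Splunk {hec_token}"}
    for i in range(0, len(events), batch_size):
        batch = events[i : i + batch_size]
        # HEC accepts multiple JSON event objects concatenated in a single request body.
        payload = "".join(json.dumps({"event": e}) for e in batch)
        resp = requests.post(hec_url, headers=headers, data=payload, timeout=30)
        resp.raise_for_status()

# Usage (URL and token are placeholders):
# send_to_splunk(events,
#                "https://splunk.example.com:8088/services/collector/event",
#                read_secret("my-security-project", "splunk-hec-token"))
```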

There are a number of possible future improvements we could make:

  1. Batch the reads, not just the writes, and persist state after each read batch. This is a bit tricky since not all APIs support it (we may need to resort to splitting date ranges in our queries; see the sketch after this list).
  2. Use event-count APIs (where they exist) to split the bulk we fetch.
  3. Defer/split fetches across function invocations, so the load is spread between them.
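For the first option, splitting a query’s date range into smaller sub-ranges could look like this sketch (the chunk size is arbitrary):

```python
from datetime import datetime, timedelta

def split_range(start: datetime, end: datetime, chunk: timedelta = timedelta(minutes=1)):
    """Yield consecutive sub-ranges of [start, end), each no longer than `chunk`."""
    cursor = start
    while cursor < end:
        upper = min(cursor + chunk, end)
        yield cursor, upper
        cursor = upper

# Each (sub_start, sub_end) pair can then be fetched and persisted on its own,
# bounding how many events sit in memory at any one time.
```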

While the current setup is more than sufficient in most cases, capacity planning is key to ensure the system will continue to perform as expected.

Epilogue

We chose a cloud native approach for audit log fetching that provided us with flexibility, low maintenance cost and quick turnaround times for changes combined with high reliability and timely delivery.

By opening this up to the community, we hope that other teams and companies will benefit as well.
