ML-Free Behaviour-based Detection with Splunk

The end of static thresholds. No ML.

Vinicius Egerland
Adarma Tech Blog
6 min read · Jun 19, 2020


Explaining behaviour-based detection can make you sound smart, but in reality you don’t have to be a genius to create effective detection rules based on behaviour; Splunk makes it simple.

Machine Learning is a great tool in the detection game, but it takes some brains and can be time-consuming, so let's leave it on the future-improvements list. In the next 5 minutes, let's have a look at how to be effective with plain old statistics.

PS: we will be using the Splunk BOTS 3 dataset.

Why would you detect based on behaviour? Consider time series data:

index=botsv3 sourcetype="aws:cloudwatchlogs:vpcflow"
| timechart sum(bytes) as sum_bytes
| eval sum_mbytes=sum_bytes/1024/1024

Time series numerical data

On the left we have the sum of bytes transferred in and out of an AWS VPC, while on the right we have megabytes (bytes/1024/1024). They look the same, so I'll use megabytes throughout this article.

Say you want to detect an anomaly. How much (or little) is an anomalous variation for your curve? A few options exist, with respective pros and cons:

a. Define a static threshold: Quick, but also dirty. Conditions are likely to change, and so will the acceptable threshold. Maintaining static thresholds is a hard job and usually causes many false positives, not to mention that the same threshold rarely fits every resource in the pool (see the sketch after this list).
b. Create a fancy Machine Learning model: Yes, this is a great option, but not always viable when time and resources are limited, and it is often overkill. Variations require investigation, but they may not be as critical as ransomware detected in a company's network.
c. Create simple behaviour-based detection searches: Bring together some table-like data modelling expertise, add enough Splunk commands for the heavy lifting with a pinch of statistics, and mix it well to get a nice and easy [pun alert] solution.
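
For contrast, option (a) could be as small as the sketch below; the 100 MB cut-off is an arbitrary figure picked purely for illustration, not a recommendation:

index=botsv3 sourcetype="aws:cloudwatchlogs:vpcflow"
| timechart sum(bytes) as sum_bytes
| eval sum_mbytes=sum_bytes/1024/1024
| where sum_mbytes>100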

Building behaviour-based detection

Dynamically establish a baseline, then alert on anomalies against it. Anomalies can be many things, so there are many ways to catch them.
Consider the objective of your use case and ask the question "What does bad behaviour mean?". The answer needn't be complex; it will become your trigger.

The Recipe:
1. Establish the baseline: Transform data into time series (discrete and/or variation)
2. Smooth the curve out (optional)
3. Establish the dynamic threshold - a.k.a. anomalous behaviour
4. Filter and detect.
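
Mapped to SPL, the whole recipe might look like this rough skeleton: timechart builds the time series (step 1), trendline smooths it (step 2), eventstats derives the dynamic threshold (step 3), and where does the filtering (step 4). The 98th percentile is just one possible choice here; the rest of the article walks through each piece.

index=botsv3 sourcetype="aws:cloudwatchlogs:vpcflow"
| timechart sum(bytes) as sum_bytes
| eval sum_mbytes=sum_bytes/1024/1024
| trendline sma2(sum_mbytes) as moving_avg
| eventstats perc98(moving_avg) as perc98
| where moving_avg>perc98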

Delta

It's the variation, increment, or difference between one value and its predecessor in the series.

index=botsv3 sourcetype="aws:cloudwatchlogs:vpcflow"
| timechart sum(bytes) as sum_bytes
| eval sum_mbytes=sum_bytes/1024/1024
| fields - sum_bytes
| delta sum_mbytes as delta

Time series and delta (in green)

The delta curve looks similar to sum_mbytes; however, if you look closely, you will notice delta goes below zero. Simple: as the traffic goes down, the variation shows up as a negative value at the next reading.
What delta represents is the variation, so if your trigger is a spike or a drop, delta may be the way to go.
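
For example, if the trigger you care about is a sudden drop, a sketch along the lines below would flag readings where traffic falls sharply; the -100 MB cut-off is an arbitrary placeholder (a percentile-based threshold, covered further down, would work here too):

index=botsv3 sourcetype="aws:cloudwatchlogs:vpcflow"
| timechart sum(bytes) as sum_bytes
| eval sum_mbytes=sum_bytes/1024/1024
| fields - sum_bytes
| delta sum_mbytes as delta
| where delta < -100
| eval category="drop"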

Let's transform the variation into a percentage; humans love %'s.

index=botsv3 sourcetype="aws:cloudwatchlogs:vpcflow"
| timechart sum(bytes) as sum_bytes
| eval sum_mbytes=sum_bytes/1024/1024
| fields - sum_bytes
| delta sum_mbytes as delta
| autoregress sum_mbytes as previous_position
| eval percentage=delta*100/previous_position
| fields - previous_position

The above shows a new line, percentage. First, we bring the previous value into the current row with autoregress (by default it would create sum_mbytes_p1, meaning one position before the current one; here it is renamed previous_position) so we can calculate the percentage with eval. Also notice percentage is drawn as an overlay with its own scale on the right.

Now that we can measure variation as a percentage, we can establish a threshold. If your data is a consistent series of values, your threshold can be sensitive to considerable changes. How much is too little or too much for your environment?
If it’s not consistent, read on.
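
For the consistent case, a fixed percentage cut-off could be appended to the previous search as a sketch; the 50% figure is only an example value, and abs() catches spikes and drops alike:

index=botsv3 sourcetype="aws:cloudwatchlogs:vpcflow"
| timechart sum(bytes) as sum_bytes
| eval sum_mbytes=sum_bytes/1024/1024
| fields - sum_bytes
| delta sum_mbytes as delta
| autoregress sum_mbytes as previous_position
| eval percentage=delta*100/previous_position
| fields - previous_position
| where abs(percentage)>50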

Moving average

It's a good way to smooth out your curve. Also known as a trendline, it lets you see the trend of your data, an average that doesn't sacrifice the time dimension of the series:

index=botsv3 sourcetype="aws:cloudwatchlogs:vpcflow"
| timechart sum(bytes) as sum_bytes
| eval sum_mbytes=sum_bytes/1024/1024
| fields - sum_bytes
| trendline sma2(sum_mbytes) as moving_avg
| delta moving_avg as delta

In green, the moving average is a smoothed version of sum_mbytes, hence delta is now smooth as well.
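
Another way to use the moving average, sketched below, is to measure how far each reading sits from its own trend; the deviation_pct field name is mine, not part of the original searches:

index=botsv3 sourcetype="aws:cloudwatchlogs:vpcflow"
| timechart sum(bytes) as sum_bytes
| eval sum_mbytes=sum_bytes/1024/1024
| fields - sum_bytes
| trendline sma2(sum_mbytes) as moving_avg
| eval deviation_pct=(sum_mbytes-moving_avg)*100/moving_avg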

Percentile

This is my favourite. When it comes to numerical distributions, no average is safe when the percentile is in town.
The 50th percentile is the median, which is the value in the middle once you sort your series; it is greater than 50% of the values.
e.g. 2, 4, *6*, 9, 11.
The 90th percentile is a value greater than 90% of the values. In detection terms, it is your quality threshold, or the boundary of your normal values.
The owner/admin defines the quality threshold. The higher it is, the fewer positives you are likely to get, false positives and true positives included, because the chances of a value being that far from the rest of the series (the normal) are lower.
More on the normal curve and statistics on Wikipedia.

index=botsv3 sourcetype="aws:cloudwatchlogs:vpcflow"
| timechart sum(bytes) as sum_bytes
| eval sum_mbytes=sum_bytes/1024/1024
| trendline sma2(sum_mbytes) as moving_avg
| delta moving_avg as delta
| fields - sum_bytes sum_mbytes moving_avg
| eventstats perc50(delta) as perc50, perc95(delta) as perc95, perc98(delta) as perc98

Delta calculated from the moving average, all unneeded fields removed, and the 50th, 95th and 98th percentiles calculated. Notice the 50th/median in the middle of the graph in green.

Or the same applied to the sum_mbytes field (megabytes):

index=botsv3 sourcetype="aws:cloudwatchlogs:vpcflow"
| timechart sum(bytes) as sum_bytes
| eval sum_mbytes=sum_bytes/1024/1024
| eventstats perc50(sum_mbytes) as perc50, perc95(sum_mbytes) as perc95, perc98(sum_mbytes) as perc98
| fields - sum_bytes

No moving average. 50th, 95th and 98th percentiles calculated from the sum of megabytes.

Notice the line touches the threshold. This could easily be a false positive (even though the 98th percentile is quite far from the median).

Many times I have applied a multiplier on top of the percentile, 2x, 3x or even 4x, for the sake of reducing false positives to a minimum.
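
A sketch of that multiplier in action, using 3x the 98th percentile as the bar a value has to clear:

index=botsv3 sourcetype="aws:cloudwatchlogs:vpcflow"
| timechart sum(bytes) as sum_bytes
| eval sum_mbytes=sum_bytes/1024/1024
| eventstats perc98(sum_mbytes) as perc98
| fields - sum_bytes
| eval threshold=3*perc98
| where sum_mbytes>threshold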

And the results were amazing. I have detected credential stuffing attacks against internet-facing services, vulnerability scans coming from competitors, internal reconnaissance by a compromised account, and other anomalous behaviours just by calculating the things above.
It can considerably increase your detection accuracy with little effort.

Let's put it to the test. If I expand the time range of our time series, we see a meaningful spike:

index=botsv3 sourcetype="aws:cloudwatchlogs:vpcflow"
| timechart sum(bytes) as sum_bytes
| eval sum_mbytes=sum_bytes/1024/1024
| eventstats perc98(sum_mbytes) as perc98
| fields - sum_bytes
| eval threshold=2*perc98
| where sum_mbytes>perc98
| eval category="anomaly"

PS: This is the search for the annotation (or the alert). Remove the last two lines to visualise the graph.

Threshold breached with a chart annotation

Two times the 98th percentile is way up, very far from the median. Looks like we have a positive.

Bonus: Bi-dimensional Detection

If you want to split the series data by a category, just add a by clause to the command that creates the series (stats) and the command that reads across all events (eventstats).

index=botsv3 sourcetype="aws:cloudwatchlogs:vpcflow" src=172.16.*
| bin span=2m _time
| stats sum(bytes) as sum_bytes by _time src
| eval sum_mbytes=sum_bytes/1024/1024
| eventstats perc98(sum_mbytes) as perc98 by src
| fields - sum_bytes
| eval threshold=2*perc98

PS: I replaced timechart with bin + stats because timechart would pivot each src value into its own column, and we need src to remain a field. And don't forget to enable the trellis layout, or the graph turns into a mosaic.

Trellis layout
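
To turn the trellis view into an alert, the same kind of filtering step used earlier could be appended per source; a sketch, reusing the 2x multiplier from the search above:

index=botsv3 sourcetype="aws:cloudwatchlogs:vpcflow" src=172.16.*
| bin span=2m _time
| stats sum(bytes) as sum_bytes by _time src
| eval sum_mbytes=sum_bytes/1024/1024
| eventstats perc98(sum_mbytes) as perc98 by src
| fields - sum_bytes
| eval threshold=2*perc98
| where sum_mbytes>threshold
| eval category="anomaly"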

And there you have it. Behaviour-based detection applied to multiple internal hosts with individual dynamic thresholds, and no machine learning involved.

Happy Splunking.
