SOC-X : Part 1 — Machine Learning and Big Data Analytics in Security

As an online commerce platform, MMT counts effective incident response among the core responsibilities of its IT Security function. This five-part blog series focuses on taking a traditional SIEM implementation to the next level by implementing various tools and techniques. It aims to address the following technology and operational issues with traditional systems:

  1. Standard rule-based security technology that detects only the “known bad”
  2. Inability to process big and fast data efficiently
  3. Time-consuming forensics investigation and subsequent action
  4. Manual incident management

In Part 1, we’ll focus on #1 from the perspective of website security. Standard security technology such as a WAF, an IPS, or a traditional SIEM is good, but for a purpose: detecting the “known bad”. As long as the OEM or an analyst is able to define a signature for detecting malicious behaviour, such technology works great in mitigating the associated threat. But what about the “unknown”? Attacks are getting more complex, harder to detect, and increasingly target misconfigurations or business logic flaws. Such intrusion or misuse is extremely hard to detect using traditional signature-based systems, since attackers can use legitimate channels of access and exploit available functionality.

This left us hunting for something more: a technology that would help find a “needle in the haystack” and detect behaviour that is “not normal” compared to previous trends. Way back in early 2015, when big data and machine learning were mostly used to address business use cases such as personalization, we stumbled across Apache Spark (PySpark) and felt it could solve such security use cases. In this post, we will attempt to take you through our journey with Spark and the associated use cases.

To keep things simple, we will attempt to walk through the following two use cases that leverage big data analytics and machine learning (this assumes the reader possesses basic knowledge of Python and Apache Spark).

Use case 1: Detecting malicious communication from endpoints using Proxy logs

Let’s assume an organisation has standard malware protection technology in place (anti-malware, proxy, IPS, etc.). This use case attempts to detect communication that has bypassed the aforementioned security defences. We’ll use Apache Spark to analyse GBs of proxy logs to find interesting patterns / indicators of malicious communication.

Let’s first dissect a proxy log line:

1294011828.214 1309 TCP_MISS/200 10032 GET vik DIRECT/ text/html

Interesting fields for us:

  1. User
  2. URL
  3. Destination Host / IP
  4. Bytes transferred
  5. Time elapsed

Let’s extract them:

import re
from pyspark.sql import Row

def do_logparse(n):
    # Native Squid access-log layout assumed: timestamp.ms, time elapsed,
    # cache result/status, size, method, URI, user, hierarchy/destination host, type
    SQUID_LOG_PATTERN = (r'(?P<timestamp>\d+)\.\d{3}\s+(?P<time_elapsed>\d+)\s+'
                         r'(?P<cache>\w+)/(?P<httpresponse>\d+)\s+(?P<size>\d+)\s+'
                         r'(?P<method>[A-Z]+)\s+(?P<uri>\S+)\s+(?P<user>\S+)\s+'
                         r'(?P<hierarchy>\w+)/(?P<dhost>\S+)\s+(?P<type>\S+)')
    match =, n)
    return Row(
        log_type = 'SQUID',
        d_host ='dhost'),
        user ='user'),
        bytes ='size'),
        http_method ='method'),
        resource ='uri'),
        time_elapsed ='time_elapsed'))

What are we looking for?

We’re looking for communication to domains that belong to the “unknown” category. There are some challenges:

  1. We might see a ton of IPs instead of hostnames (we’ll convert them to hostnames via SSL and WhoIs lookup)
  2. We will need to lookup domain categorisation in real-time

#Count the number of times a domain has been accessed
#(parsed_logs is the RDD of Rows produced by do_logparse)
dhosts = x: (x.d_host, 1)).reduceByKey(lambda a, b: a + b)
#Filter out only un-categorized domains
uncategorized_dhosts = dhosts.join(reputation_matrix).map(lambda x: (x[0], x[1][1])).filter(lambda x: x[1] is None)
#Enrich IPs with SSL domain name (ssl_lookup is an illustrative helper)
sdhosts_ssl = x: (x[0], ssl_lookup(x[0])))
#Enrich IPs with whois intel (whois_lookup is an illustrative helper)
dhosts_whois = x: (x[0], whois_lookup(x[0])))

Use case 2: Web traffic anomaly detection using Apache HTTPD logs

Consider a website that deploys standard perimeter protection technology such as a WAF or an IPS that blocks known attacks such as SQL injection and cross-site scripting. We would like to identify / flag attacks that bypass such defence technology. We’ll use a setup similar to the previous example, using Apache Spark to create a model of “known” or “good” IP behaviour and then compare incoming traffic against this model to arrive at an “anomaly score” that depicts how far the IP is from normal.

Let’s assume we have parsed Apache HTTPD logs available. We will generate a K-Means model using the following dataset:

[<source_ip>, <# of 200 hits>, <# of non-200 hits>]

We then create the model using one minute of HTTPD logs as our training set, after which we test incoming HTTPD logs in one-minute batch windows against the trained model. The key here is to identify the cluster to which the source IP belongs, and how far it is from the cluster centroid. The farther it is, the more anomalous its behaviour.

The code snippet below illustrates this example:

def make_features(a, b):
    # Add up counts of 200 and non-200 responses against a source IP
    twoxx = a[0] + b[0]
    nxx = a[1] + b[1]
    return [twoxx, nxx]

#Reduce training data to get counts of 200 and non-200 response codes against an IP
rawTrainingData = rawTrainingData.reduceByKey(make_features)
#Convert data to K-Means acceptable format
training_set = s: s[1])
model = KMeans.train(training_set, 2) #Train the model
#Apply the model on incoming testing data over a 60-second window
results = rawTestingData.reduceByKeyAndWindow(make_features, lambda a, b: [a[0] - b[0], a[1] - b[1]], 60, 60).map(lambda s: predict_cluster(s, model))
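predict_cluster is left undefined above; with Spark’s KMeansModel you would typically combine model.predict with the distance to the matching entry of model.clusterCenters. The core idea can be sketched in plain Python (the centroids and sample point below are made up for illustration):

```python
import math

def predict_cluster(features, centroids):
    """Return (cluster index, distance to that cluster's centroid)
    for one [count_of_200s, count_of_non_200s] feature vector."""
    best_idx, best_dist = 0, float('inf')
    for idx, centre in enumerate(centroids):
        # Euclidean distance from the feature vector to this centroid
        dist = math.sqrt(sum((f - c) ** 2 for f, c in zip(features, centre)))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist

# Two made-up centroids: "mostly 200s" vs "mostly errors"
centroids = [[100.0, 5.0], [5.0, 100.0]]
cluster, score = predict_cluster([90.0, 8.0], centroids)
# The larger the score, the farther the IP sits from normal behaviour
```

The distance returned here is the “anomaly score” mentioned earlier: thresholding it (or ranking IPs by it) surfaces the outliers worth investigating.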

Important considerations for ML

There are a few aspects to keep in mind that largely govern the effectiveness of an ML use case:

  1. Feature selection — this is the heart of your ML model. Make sure the features (or attributes) selected or generated directly impact and are relevant to the use case. For example, if we’re looking at DNS data exfiltration as a use case, the most relevant features would be [# of DNS requests per host, size of DNS requests per host]
  2. The right algorithm — depending on your use case, seek assistance from data science teams to identify the right algorithm to generate the most effective outcome. For example, whether to use supervised learning as opposed to unsupervised learning.
  3. Training data — pick the right data for training your model, and make sure there are adequate samples of varying kinds depending upon your use case.

Sky’s the limit!

There are many other use cases that a security function can leverage; here are a few:

  1. Detecting anomalous network activity
  2. Detecting DNS data exfiltration
  3. Detecting bots / scrapers

We realised that writing all this code time and again could be cumbersome, and not really the ideal job for a security analyst. To get around this, we built a big data and ML framework called “dataShark” in 2016 and made it available for public consumption as an open source project —

More on dataShark in “Part 2 — The dataShark Framework”.