Scale-out Security Event Processing for detection and analysis.

Attackers have skills and ingenuity — defenders have tools and budgets. This needs to change. Now.

TL;DR: Whether you’re into DFIR, Threat Hunting or Threat Intelligence, you can build a scalable framework that enables you to shine light into the darkness of your infrastructure. You can apply your knowledge, experience, skills and ingenuity towards detection and response. This is neither hard nor expensive and can be done in near-real-time.

Background

We have a pretty big infrastructure that is capable of generating lots of events per second. How do you ingest, process, store and analyse something like 500,000 events per second? How do you ensure that, once you have gone through all the trouble of collecting it, the data is turned into usable insights? How can log data provide business value in the context of Information Security? How do you detect evil in a firehose of events?

I’ll give you a somewhat detailed view of how we address those questions, the toolchain we are using and what value we see in the generated data when it comes to applying detection methodologies. You should be able to stand up similar capabilities in a short amount of time without spending much (or any) cash to play around with the concepts outlined below.

Security Events are Big Data

Big Data is all about volume, variety and velocity — exactly the same challenges that most organisations face when dealing with security related events.

By treating Information Security as a Big Data problem you can hook into at least a decade of experience, practices and tools developed for that exact purpose.

You’ll need two distinct Big Data capabilities for Information Security events: stream processing and data warehousing. I’ll mostly be talking about the stream processing stuff today.

Data ingestion

We log everything that provides value, resulting in an unwieldy firehose of hundreds of thousands of events per second. The majority of those end up in Apache Kafka.

Kafka is the backbone that makes our entire processing pipeline possible.

Kafka lets you publish and subscribe to streams of records, much like a traditional MQ, and lets you store those records in a fault-tolerant way. Unlike most MQ solutions, Kafka was designed with strong durability requirements without sacrificing performance or scalability. In fact, all data ingested into Kafka gets persisted on disk … but without killing performance or requiring enormous hardware.

A cluster of a handful of modern servers will happily ingest 1M+ events per second.

Kafka organises records into topics — a high level construct that resembles categories. You publish records to topics and you consume records from topics. You can also control for how long data should be kept on disk per topic.

Topics are split into partitions to scale operations against a single topic. Each partition is an ordered, immutable sequence of messages that is continually appended to — a commit log. The messages in the partitions are each assigned a sequential id number called the offset that uniquely identifies each message within the partition.
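
To make those concepts concrete, here is a minimal sketch using the Python kafka-python client; the broker address and topic name are made-up examples, not our setup:

    from kafka import KafkaConsumer, KafkaProducer

    # Publish a record to a topic (broker and topic are example values).
    producer = KafkaProducer(bootstrap_servers="kafka01:9092")
    producer.send("raw-events", b'{"event_id": 4624, "src_ip": "10.1.2.3"}')
    producer.flush()

    # Consume the same topic; every record carries its partition and offset.
    consumer = KafkaConsumer("raw-events", bootstrap_servers="kafka01:9092")
    for record in consumer:
        print(record.partition, record.offset, record.value)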

Operationally Kafka is mature and battle tested by many of the companies that we all use on a daily basis.

Lots of tools offer Kafka integration so you can publish logs directly into Kafka. If you need commercial support or a nice UI for Kafka, Confluent is there to help you out.

Stream processing

This is where the fun begins. Instead of using Logstash or Graylog to simply consume the Kafka topics, we run a lot of transformations against the individual records:

  • we normalise events to give them a bit of structure for easier analysis.
  • we enrich events with data from other sources or systems.
  • we have the ability to react to incoming events.
  • we have the ability to generate realtime statistics.

We have written some custom code that lets us do this, but you can do this too without writing any code (although perhaps not quite as fast) … more about that later in this post.

The idea is actually very simple: events enter one topic, are picked up by our stream processing code, and the modified events are then written to a different topic that in turn gets consumed by our Graylog cluster for indexing.
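
The skeleton of such a pipeline is surprisingly small. A hedged sketch using kafka-python (the topic names, broker address and transform logic are placeholders, not our actual code):

    import json
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer("raw-events",
                             bootstrap_servers="kafka01:9092",
                             group_id="enrichers")
    producer = KafkaProducer(bootstrap_servers="kafka01:9092",
                             value_serializer=lambda d: json.dumps(d).encode())

    def transform(event):
        # Placeholder: normalise field names, enrich, update counters ...
        if "SourceIp" in event:
            event["src_ip"] = event["SourceIp"]
        return event

    # Consume from the input topic, transform, publish to the output topic.
    for record in consumer:
        event = json.loads(record.value)
        producer.send("enriched-events", transform(event))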

This however introduces latency (the time between an event entering the pipeline and "leaving" for its final destination). This latency can come either from our stream processing code itself or from Graylog. But this can be monitored and managed by starting additional consumers. More about that later :)

Kafka is neat because multiple consumers can, well, consume the same topic at their own pace. You could have one consumer that consumes as fast as possible while another chews through the data at a much slower rate. This lets you build logical pipelines that cater for different latency requirements (fast vs slow pipelines).

This also means that it is very easy to test new ideas or changes — simply ask your new or modified consumer to process all data stored in a topic from the beginning.

You can also pack your consumers into consumer groups where each group is composed of many consumer instances for scalability and fault tolerance. This is something we utilise heavily in our processing framework.
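
In kafka-python terms this is little more than picking a group_id; a sketch (names again invented) that also shows how a brand-new group can replay a topic from the start:

    from kafka import KafkaConsumer

    # Every instance started with the same group_id shares the partitions of
    # the topic: start more instances to scale out, stop some to scale in.
    # auto_offset_reset="earliest" makes a new group begin at the oldest
    # stored record, which is how you reprocess a topic from the beginning.
    consumer = KafkaConsumer("enriched-events",
                             bootstrap_servers="kafka01:9092",
                             group_id="graylog-feeders",
                             auto_offset_reset="earliest")

    for record in consumer:
        print(record.partition, record.offset)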

Normalisation

We have a set of normalised field names (src_ip, dst_ip, user, process, etc) that are added to events right alongside their own, individual nomenclature. Why? A person familiar with Windows Eventlogs should recognise an event as such … so enforcing a synthesised schema has proven counterproductive for us. Having those common, normalised fields however greatly enhances the ability to search or analyse events later.
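
A tiny, mapping-based normaliser is enough to illustrate the idea; the source-specific field names below are examples, not our complete mapping:

    # Source-specific field name -> normalised field name (examples only).
    FIELD_MAP = {
        "SourceIp": "src_ip",        # sysmon / Windows style
        "id.orig_h": "src_ip",       # Bro style
        "DestinationIp": "dst_ip",
        "TargetUserName": "user",
        "Image": "process",
    }

    def normalise(event):
        # Leave the original fields untouched, just add the normalised aliases.
        for src, dst in FIELD_MAP.items():
            if src in event and dst not in event:
                event[dst] = event[src]
        return event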

Enrichment

We do a lot of enrichment. Why? Because every layer of metadata that we add is also an added dimension in our detection matrix. And it’s tremendously helpful for our analysts since they don’t have to pivot between 15 different browser tabs or tools in order to gather context about an event.

Here is a quick list of some of the enrichment we’re doing (a small code sketch follows the list):

  • we do DNS lookups for every IP address, add those to the events and store them in our Passive DNS database. We also count occurrences.
  • we do Geo IP lookups for both external and internal IP addresses (yes, we tag all internal IPs with their internal locations).
  • we do LDAP queries against Active Directory for every username and store data such as the individual’s name, title, role, department, etc.
  • we integrate various, deduped Threat Intelligence sources and check those against fields in our events. Results are appended to the event.
  • we map Vulnerability scan results against events to check if those ports are currently exploitable (sadly, remediation takes longer than scanning in most organisations).
  • we count and check file hashes of processes being executed and DLLs being loaded.
  • we do lookups against our data warehouse (analytical queries over large datasets) and fast memory caches and include the results.
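
A stripped-down sketch of two of those steps, reverse DNS and a threat intelligence match; the lookup uses the Python standard library and the indicator set is a made-up placeholder for whatever feeds you consume:

    import socket

    # Placeholder: in reality a deduplicated set built from your TI feeds.
    TI_BAD_IPS = {"203.0.113.66", "198.51.100.23"}

    def enrich(event):
        ip = event.get("dst_ip")
        if not ip:
            return event
        # Reverse DNS; the same answers feed our Passive DNS database.
        try:
            event["dst_hostname"] = socket.gethostbyaddr(ip)[0]
        except OSError:
            pass
        # Threat intelligence hits get appended to the event itself.
        if ip in TI_BAD_IPS:
            event["ti_match"] = True
        return event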

Why do we go to such lengths to collect and process all that data? Well, like I mentioned earlier it helps with detection and response. We can suddenly slice and dice events based on different criteria, such as user roles or other organisational aspects.

We do a lot of statistics (well, counting really) on incoming data to identify some specific traits. We flag all first-time occurrences, such as the first time a user logs into a system, the first time we see a hash for a process, the first time we see traffic between two systems. We also count every user, login, process, DLL, hash, domain, etc. Why?

Because most malicious events are also rare events.
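
One way to implement the first-time and counting logic is with a couple of Redis primitives; a hedged sketch (the key layout and field names are examples):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def flag_rarity(event):
        # e.g. user -> host pairs for logons, or the hash for process starts.
        key = "seen:logon:%s:%s" % (event.get("user"), event.get("host"))
        count = r.incr(key)             # atomic counter per combination
        if count == 1:
            event["first_seen"] = True  # never observed before: worth a look
        event["seen_count"] = count
        return event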

Let’s look at one simple but concrete example, a sysmon event:

UtcTime: 2015-12-28 18:32:54.437
ProcessGuid: {B4A06220-8BD6-567D-0000-001061755A0C}
ProcessId: 1328
Image: C:\Windows\System32\cmd.exe
CommandLine: cmd /C "echo 3"
CurrentDirectory: C:\Windows\system32\
User: NT AUTHORITY\SYSTEM
LogonGuid: {B4A06220-C3FE-5671-0000-0020E7030000}
LogonId: 0x3e7
TerminalSessionId: 0
IntegrityLevel: System
Hashes: SHA1=0F3C4FF28F354AEDE202D54E9D1C5529A3BF87D8
ParentProcessGuid: {B4A06220-C446-5671-0000-0010B3340200}
ParentProcessId: 2276
ParentImage: C:\foo.exe
ParentCommandLine: "c:\foo.exe" -c "c:\foof.conf"

There is a lot of potentially interesting stuff here!

  • unusual parent processes
  • unusual image execution locations
  • unusual image names
  • unusual user contexts
  • unusual integrity levels (i.e. Run as admin)
  • suspicious command line arguments
  • unusual execution times

… and of course there’s always the hash of the actual file.

There are a number of techniques you can use to pinpoint “unusual” or “suspicious” … counting occurrences per timespan, user, role, department, hash, image name, parent process, etc is just one of them.

You might also want to simply flag the execution of certain Windows PE files such as quser, regedit, wmic, cmd, powershell, net, etc … many of them can serve as red flags in ordinary office environments. Other events can be used to find injections, track migrations, persistence mechanisms, etc. Fun fact: you can find most fileless malware by looking at the size of the registry keys being written.
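
The simplest version of such a flagging rule is just a watch-list; the list below is an example and should be tuned to your environment:

    import ntpath

    # Built-in binaries that are rarely run by ordinary office users.
    WATCHED_IMAGES = {"quser.exe", "regedit.exe", "wmic.exe",
                      "cmd.exe", "powershell.exe", "net.exe"}

    def flag_image(event):
        image = event.get("Image", "")
        if ntpath.basename(image).lower() in WATCHED_IMAGES:
            event["flag"] = "watched_binary"
        return event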

You don’t need lists of known-bad or known-good … which in turn leads to much higher detection rates.

You can detect insane amounts of evil by simply flagging rare occurrences of events when you are looking in the right places.

To be entirely fair you can also find evil by looking at the opposite — brute-forcing being the classic example here.

I have written something about this concept here and here as well, utilising Graph databases for the same concepts.

At the same time we would like to add context to our events. If we have seen that process hash 5000 times across hundreds of users distributed over 2k machines in the last 3 days we should add this to the event to help the analyst make decisions.

All this is actually quite costly in terms of processing time, especially lookups against external systems over the network.

Processing time equals latency, so we use a few tricks to minimise the impact this stuff has. We cache the results of those operations heavily and we keep those caches as close to the processing code as possible.

Most caching is done using Redis; the DNS stuff is cached using local unbound instances. Redis does persist its data at regular intervals so that we can re-initialise those instances with their caches already warmed up.
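
The caching pattern itself is plain; a sketch of wrapping an expensive lookup with Redis (the ldap_lookup function is a placeholder and the TTL is an arbitrary example):

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def cached_user_lookup(username, ldap_lookup, ttl=86400):
        # ldap_lookup is whatever function actually queries Active Directory.
        key = "cache:user:" + username
        hit = r.get(key)
        if hit is not None:
            return json.loads(hit)
        data = ldap_lookup(username)
        r.setex(key, ttl, json.dumps(data))  # cache for a day, tune as needed
        return data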

Since our stream processing code is stateless we run the entire thing inside containers and can basically scale them as long as we have available compute resources … going from 50 to 500 or 5000 running containers is a matter of minutes.

By monitoring the position (offset) of our consumers we can implement autoscaling so that containers are launched and stopped accordingly.
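
With kafka-python you can read the numbers that drive such a decision; a hedged sketch that sums the lag of one consumer group over one topic (group, topic and broker names are examples):

    from kafka import KafkaConsumer, TopicPartition

    # Connect with the same group_id as the processing consumers, only to ask
    # the brokers about offsets, not to consume anything.
    consumer = KafkaConsumer(bootstrap_servers="kafka01:9092",
                             group_id="enrichers")

    partitions = [TopicPartition("raw-events", p)
                  for p in consumer.partitions_for_topic("raw-events")]
    end = consumer.end_offsets(partitions)

    lag = 0
    for tp in partitions:
        committed = consumer.committed(tp) or 0
        lag += end[tp] - committed

    print("total consumer lag:", lag)  # feed this into whatever drives scaling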

Data warehousing

Besides indexing stuff in Elastic we also need a way to store and analyse potentially hundreds of terabytes of data in an efficient manner (read: not in Elastic). Hadoop really is the elephant in that realm (pun intended).

The data that we need for long term analysis is stored in HDFS and available for processing by Apache Hive. Hive is neat because its SQL-like query language is familiar to most people and it can work on very, very large datasets within a reasonable processing time. Utilising something like Apache Hue as a frontend for operations against Hive makes this actually rather enjoyable.
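
To give a taste of what those analytical queries look like, here is a hedged example using the PyHive client; the host, table and column names are invented for the example:

    from pyhive import hive

    conn = hive.connect(host="hive01", port=10000)
    cursor = conn.cursor()

    # Rare parent/child process pairs over the last 30 days.
    cursor.execute("""
        SELECT parent_image, image, COUNT(*) AS executions
        FROM sysmon_events
        WHERE event_id = 1 AND event_date >= date_sub(current_date, 30)
        GROUP BY parent_image, image
        HAVING COUNT(*) < 5
        ORDER BY executions ASC
    """)
    for row in cursor.fetchall():
        print(row)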

We have certain Kafka topics where records are searched against this warehouse as well, like checking reports from our sandboxing software against the last weeks or months of collected data.

It also serves as an excellent platform for Threat Hunting since we also store a lot of the generated statistics in there :)

Getting data from Kafka to Hive is a bit more involved; there is a sink that can automagically write Kafka data to Hadoop, but that requires serialising your records in something like Avro.

Tools

If you have made it this far I would like to leave you with a few pointers about tooling that can help you generate events and get those into Kafka and show you a way to get started without having to write code.

  • Bro IDS can collect awesome NSM logs and ship those to Kafka.
  • Sysmon and auditd can log pretty awesome system level events.
  • Threat Intelligence is often freely available from sources such as AlienVault OTX.

You can ship Windows Eventlogs directly to Kafka using the latest winlogbeat beta. Generic logfiles can be shipped using the latest filebeat beta. Syslog can be shipped to Kafka using either rsyslog or syslog-ng. Netflow data can be shipped to Kafka using logstash.

Much of this can be achieved by using logstash containers with the help of a few plugins …

… but you should be aware that if you need raw performance you should probably develop something using the Kafka stream processing API. Nevertheless the logstash stuff will help with rapid prototyping.

Conclusion

In a platform like this you own and control the intelligence, the workflow, the decisions and the data. No magic whiz-bang devices, no vendor lock-ins and no capital bound in tooling that will not let you adapt quickly to the rapidly changing landscape that Information Security represents.

We have integrated a number of response capabilities as well (think automated or human-assisted actions against our infrastructure) but that’s left for another article.

You should step up and own your detection or response problems; leaving this up to vendors is fighting a losing (and costly) battle. I am not advocating that you should build everything yourselves, but architecturally a framework like the one I have outlined will allow you to switch out most components of your security infrastructure without major headaches.

If you’re in a position that requires support contracts for the stuff you want to run, everything I have talked about has commercial support available.

This is not our only defensive mechanism — we do a lot more than this and so should you.