SecOps: Pluggable Security Infrastructure at Scale
During a recent conversation I mentioned our ‘Event Core’ and promised to do a quick write-up.
We have spent the last few months giving our security-related infrastructure some serious thought — we wanted a higher degree of flexibility and control, we knew we needed to automate a lot of trivial tasks so that our SOC can concentrate on the work that needs human interpretation, and we wanted the freedom to experiment with new tools and methods without having to reinvent the wheel every time.
So we ended up building an SOA-inspired architecture around a number of loosely coupled services. The overall architecture somewhat resembles the diagram below:

The blue boxes represent examples of the various tooling that we utilise for our Security Operations — all of them interface with our Event Core, enhancing the speed and precision of our detection, response, mitigation and remediation capabilities.
We divide them into two categories: active and passive. Active services push events into our Event Core; passive services, on the other hand, are consumed by the Event Core to provide additional data or capabilities.
Most of the interactions within the Event Core are asynchronous — a design choice that has proven itself when dealing with a large number of concurrent events that need to be handled efficiently, though we can opt for synchronous behaviour where and when it is needed.
The Event Core
The Event Core is constructed around a robust, fault-tolerant and scalable HTTP API, while the majority of the work is handled by distributed background processing jobs — this enables us to handle tens of thousands of jobs per second by scaling out the number of API and background processing machines.
We expose a number of HTTP API endpoints to integrate a range of services — a simplified example looks something like this:
require 'sinatra'
require 'json'

# endpoint for Suricata events
post '/suricata' do
  payload = JSON.parse(request.body.read)
  # enqueue a background job for asynchronous processing
  SuricataWorker.perform_async(payload)
  status 202
end
Once an event enters the Event Core it is enqueued as a background job and picked up by a worker process which evaluates and normalises the event. Depending on the content of the event several other jobs are enqueued and subsequently picked up and processed by other workers.
All of the background processing is handled by machines running Sidekiq — Sidekiq is a fantastic tool for running large job processing farms and it enables us to do so easily and reliably.
A worker class might look something like this:
require 'sidekiq'
require_relative '../helpers/resolver.rb'

class SuricataWorker
  include Sidekiq::Worker
  sidekiq_options queue: 'suricata', retry: 1

  def perform(payload)
    # main code logic goes here
  end
end
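To make the fan-out concrete, here is a minimal, hypothetical sketch of the kind of dispatch logic such a worker might contain. The worker names and event fields below are invented for illustration and are not our actual job names:

```ruby
# Hypothetical dispatch logic: given a normalised event, decide which
# follow-up jobs should be enqueued. Names and fields are illustrative.
module EventDispatch
  # Returns the list of follow-up worker names for an event.
  def self.followup_jobs(event)
    jobs = ['ClassifierWorker'] # classification always runs first
    jobs << 'ReputationLookupWorker' if event['src_ip'] || event['url']
    jobs << 'SandboxWorker' if event['file_hash']
    jobs << 'TicketWorker' if event['severity'].to_i >= 3
    jobs
  end
end
```

In the real worker, each returned name would correspond to a Sidekiq worker class, and the worker would call `perform_async` on it.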
We have a range of different jobs that aid in detection, response and remediation, such as:
- Looking up URLs, hashes, IP addresses, etc. in tools like VT
- Executing searches in log or metric stores
- Interfacing with our Sandbox cluster
- Taking actions
- Classifying assets
- Classifying events
- Enriching events
- Creating and updating alerts and tickets
- …
Classifying Assets and Events
We deal with security-related events from many different systems — most of them have their own nomenclature for classifying events and assigning severities. This creates a contextual problem for analytic decisions, so we run them through a classifier job that normalizes the events into a set of common criteria. This is usually the first job that gets triggered for an event — depending on the content of the event, this job also kicks off a variety of other asynchronous jobs, such as the ones mentioned below.
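As a rough illustration of what such a normalisation step could look like, here is a small sketch that maps source-specific severity labels onto a common scale. The mappings, source names and default value are assumptions for the example, not our actual criteria:

```ruby
# Hypothetical normaliser: map source-specific severity labels onto a
# common 1-4 scale (4 = most severe). Mappings are illustrative only.
module Classifier
  SEVERITY_MAP = {
    'suricata' => { 1 => 4, 2 => 3, 3 => 1 }, # Suricata counts 1 as most severe
    'av'       => { 'critical' => 4, 'high' => 3,
                    'medium' => 2, 'low' => 1 }
  }.freeze

  # Unknown sources or labels fall back to a medium severity of 2.
  def self.normalise(source, raw_severity)
    SEVERITY_MAP.fetch(source, {}).fetch(raw_severity, 2)
  end
end
```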
All our events contain something that can identify an asset — an IP address, hostname, user, etc. For every event we run a job that looks up certain aspects of the asset in question and adds that data to the event. This is information such as the asset type (Server, Workstation, Laptop, ICS, etc.), the physical location of the asset, its risk profile, and so on.
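A simplified sketch of such an asset lookup might look like the following; the inventory data, the field names and the idea of keying on a `src_ip` field are purely illustrative:

```ruby
# Hypothetical asset lookup: enrich an event with metadata about the
# asset it refers to. The inventory and fields are invented examples.
module AssetClassifier
  INVENTORY = {
    '10.0.0.5' => { type: 'Server', location: 'DC1', risk: 'high' },
    '10.0.1.7' => { type: 'Laptop', location: 'HQ',  risk: 'medium' }
  }.freeze

  # Returns a copy of the event with asset data merged in,
  # or the event unchanged if the asset is unknown.
  def self.enrich(event)
    asset = INVENTORY[event['src_ip']]
    asset ? event.merge('asset' => asset) : event
  end
end
```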
Taking actions
Depending on the nature of the event we might want to trigger a number of automatic actions against the asset in question — we usually split them into two categories:
Triggering non-invasive actions
This could be anything from fetching forensic artifacts from an asset or triggering a memory dump to running aggregations and searches against other systems.
Triggering invasive actions
Having code that allows you to lock out users or isolate machines from the network is invaluable for handling events that have the right mix of severity, business impact and confidence.
Creating alerts and tickets
Having the capability to selectively route tickets or alerts to different teams or parts of the organization is a must-have for larger organizations — we delegate certain events to other teams in order to save time and keep the SOC focused on the more interesting work. The key here is to do so without any human involvement, so we have jobs that handle this for us automagically.
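A routing job of this kind can be as simple as a set of ordered rules. The following sketch assumes hypothetical team names and event fields, not our actual routing policy:

```ruby
# Hypothetical routing rules: pick the team that should receive a
# ticket, based on normalised event fields. Teams and rules are invented.
module TicketRouter
  def self.route(event)
    return 'soc' if event['severity'].to_i >= 3 # SOC keeps the serious stuff
    return 'it-ops' if event['asset_type'] == 'Server'
    return 'helpdesk' if event['asset_type'] == 'Workstation'
    'soc' # fall back to the SOC for anything unmatched
  end
end
```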
Mutating / enriching Events
Sometimes it makes sense to add valuable information to an event, such as the last logged-on user, the last 10 executed binaries, etc. Once a ticket or alert is created, our jobs can update it with relevant information.
Conclusion
We process many thousands of jobs in our Event Core each day — the entire system is easily maintainable, robust and fault tolerant, making it a cornerstone of our SOC operations. While it may sound complicated, it's actually a simple, open and extensible system that also makes a lot of economic sense.
It also lets us move fast — integrating new systems or developing and deploying new jobs is easy and straightforward.
Running a distributed system like our Event Core has actually never been easier — there are lots of recommended practices and tools available that help with scaling, maintaining and monitoring such systems.
Could we just have bought a commercial solution? Perhaps.
Would it have been as flexible, scalable, fault tolerant and effective as our Event Core? Probably not.