How ManoMano manages and responds to millions of security events: a do-it-yourself spirit and automation

Jules Duvivier
ManoMano Tech team
Published Dec 7, 2020 · 9 min read
Figure 1 — The management of an incident flow

Information systems are becoming more and more complex and interconnected, and the technologies that drive them are becoming more and more volatile. We must therefore equip ourselves to absorb a flood of information, detections and signals from multiple sources of varying quality, while keeping processing times reasonable by automating time-consuming tasks wherever possible.

Manual security operations are becoming increasingly expensive and ineffective as the number of alerts grows exponentially, driven by the scale of digital transformation and the rising volume of new threats.

At ManoMano, one of the mantras of the security team is to focus on the right threats, and thus to automate everything that can be automated in order to free up as much bandwidth as possible for the team.
Our platform was therefore designed to fit this paradigm, getting creative rather than necessarily following all of the main principles of a SOC.

In this post, we will present the SOAR (Security Orchestration, Automation and Response) platform we are working on to deal with this flood of threats.
This platform aims to automate the management of an incident as much as possible by taking advantage of the strengths of two open-source tools, TheHive and n8n.

Work smarter, respond faster, and leverage the true capability of our security infrastructure

1. Collect and aggregate logs to detect suspicious behaviours

Datadog is our observability solution for everyone. It provides logs, metrics and traces that allow us to fully observe our infrastructure and applications, from integration to production environments. Alongside it, we have deployed open-source solutions that provide additional functionality for discovering our information system.

The logs of our infrastructure, applications and services are a very good baseline for detecting security incidents.

As Datadog is our log collector, the security team also sends it the logs and events from many of our tools (e.g. WAF events, GSuite logs, Falco events…). These logs and events are only visible in Datadog to the security team, which allows us to retrieve and aggregate all potential security signals in one place.

While Datadog allows us to trigger monitors and webhooks when a metric exceeds a given threshold, we still need to define which threshold, on which metric, indicates a potential security incident.

In this post I will not go into detail about how these rules are defined, since that depends greatly on the technologies used, the security culture of the company and its ability to observe events.

In my opinion, it is better to start with a few alerts that are all relevant, and then reinforce them step by step, rather than to create a multitude of alerts that are not always relevant, generate a lot of false positives and drown out the important flags.

Let’s take a deliberately simple use-case that will follow us for the rest of the post. Let’s imagine that we want to trigger a security incident when our e-commerce website is the target of a security scan. Consider that this incident would be triggered by the following rule:

A security scan is detected if an IP is blocked more than 50 times by the WAF in the last 5 minutes.

As we receive all the WAF events in Datadog it is possible to create a monitor to detect this specific use-case with the query being:

sum(last_5m):avg:mm.waf.blockbyip{*} by {network.client.ip,profile}.as_count() > 50
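To make the rule concrete outside of Datadog, here is a minimal sketch of the same logic as a sliding-window counter. The class and method names are hypothetical and chosen for illustration only; the window and threshold come from the rule above.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 5 * 60  # "in the last 5 minutes"
THRESHOLD = 50           # "blocked more than 50 times"


class ScanDetector:
    """Counts WAF block events per client IP over a sliding time window."""

    def __init__(self, window=WINDOW_SECONDS, threshold=THRESHOLD):
        self.window = window
        self.threshold = threshold
        self._events = defaultdict(deque)  # ip -> timestamps of recent blocks

    def record_block(self, ip, timestamp):
        """Register one block event; return True if the IP now exceeds the threshold."""
        events = self._events[ip]
        events.append(timestamp)
        # Drop events that have fallen out of the window.
        while events and events[0] <= timestamp - self.window:
            events.popleft()
        return len(events) > self.threshold


detector = ScanDetector()
# One block per second from the same (documentation-range) IP.
alerts = [detector.record_block("203.0.113.7", t) for t in range(51)]
```

In this sketch the first 50 blocks stay below the threshold and only the 51st one flags the IP, which mirrors the "more than 50 times" wording of the rule.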

Now that we are able to detect security incidents, we need to catalogue them, then orchestrate and automate their remediation.

Note: Datadog offers a feature called “Anomaly detection”, an algorithmic feature that identifies when a metric is behaving differently than it has in the past, taking into account trends, seasonal day-of-week, and time-of-day patterns. Although it can be useful for monitoring metrics with strong trends and recurring patterns that are hard to monitor with threshold-based alerting, we only use monitors with simple thresholds to detect security incidents, so that the same rules can be applied with any log collector.

Anything done with Datadog today can be done with a classic ELK stack, coupled with Elastalert to detect suspicious behaviour. It’s just a matter of having a log collector and aggregator to detect incidents to be sent to our SOAR.

2. SOAR Platform — Automate the response to security incidents

TheHive — Security incident response for the masses

Figure 2 — TheHive logo

Most case management systems are just help desk ticketing systems that have been adapted to fit a security use case.

The creators of TheHive define it as a:

“scalable 4-in-1 open source and free Security Incident Response Platform designed to make life easier for SOCs, CSIRTs, CERTs and any information security practitioner dealing with security incidents that need to be investigated and acted upon swiftly”

Basically, TheHive is an incident response platform that allows our security team to:

  • Collect security events at a centralised point,
  • Qualify them and easily start an investigation in case of a potential incident,
  • Share tasks while ensuring that all participants are at the same level of information,
  • Find out whether an observable has ever been seen in an investigation and how it has been processed.

TheHive’s API allows us to create cases regardless of the source that detected the security incident. In our case, incidents are detected either via Datadog monitors, when the logs/events are present in Datadog, or directly via security tools for which pushing information into Datadog would bring no benefit (for example, when a hunter submits a vulnerability report through our bug bounty program, a call is made directly to the TheHive API to create a case).

In reality, as shown in Figure 1, there is an intermediate stage between the detection of the incident and the creation of the case, which formats the information (hello n8n, but we will come back to this later).

A case corresponds to an investigation. In addition to metadata such as the creation date, the person in charge of the investigation, a description, a level of criticality, a TLP and relationships with other investigations, a case is made up of a set of tasks (ban the IP, run the malware in a sandbox, communicate internally, etc.) and observables.

Although cases can be created from scratch, they can also be created from templates that predefine tasks, custom fields, criticality, TLP…

We (almost) never create a case from scratch. Each creation is based on a template created manually beforehand, which most importantly allows us to associate a list of tasks with a case.
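The relationship between a template and the cases created from it can be sketched as a small data model. This is only an illustration: the class names, field names and status values below are hypothetical and do not reproduce TheHive's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    title: str
    owner: str = ""          # an analyst, or an automation user
    status: str = "Waiting"  # hypothetical status value


@dataclass
class CaseTemplate:
    name: str
    severity: int
    tlp: int
    task_titles: list = field(default_factory=list)
    task_owner: str = ""


@dataclass
class Case:
    title: str
    description: str
    severity: int
    tlp: int
    tasks: list = field(default_factory=list)
    observables: list = field(default_factory=list)


def case_from_template(template, title, description, observables):
    """Create a case that inherits criticality, TLP and the task list from a template."""
    tasks = [Task(title=t, owner=template.task_owner) for t in template.task_titles]
    return Case(title, description, template.severity, template.tlp,
                tasks, list(observables))


scan_template = CaseTemplate(name="security-scan", severity=2, tlp=2,
                             task_titles=["Ban the IP"], task_owner="n8n")
case = case_from_template(scan_template, "Security scan detected",
                          "An IP was blocked repeatedly by the WAF",
                          [{"type": "ip", "value": "203.0.113.7"}])
```

The point of the design is that whoever (or whatever) creates the case only supplies the incident-specific details; the list of response tasks always comes from the template.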

In the same way, the creation of a case from a template is (almost) always automatic, triggered by the detection of abnormal behaviour.

Let’s take the example from earlier, when an IP is blocked more than 50 times by the WAF in less than 5 minutes:

  • The Datadog monitor is automatically triggered
  • All the relevant information (IP, WAF profile, link to the logs, etc.) is summarized in a case
  • Because the case was automatically created from a template for this type of incident, a list of actions to resolve the incident is created. In this case the only action is “Ban the IP”.
Figure 3 — TheHive case when a security scan is detected

Thus, for each incident, whatever its source, a TheHive case is automatically created with all the relevant information and a list of tasks to respond to it.
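As an illustration of that formatting step, the sketch below turns an alert into an HTTP request that would create a case from a template. Everything specific here is an assumption to check against the documentation of your TheHive version: the `/api/case` endpoint, the `template` field, the Bearer authentication header, the placeholder URL and the shape of the `alert` dictionary. The request is only built, not sent.

```python
import json
import urllib.request

THEHIVE_URL = "https://thehive.example.internal"  # placeholder URL
API_KEY = "REPLACE_ME"                            # placeholder credential


def build_case_request(alert):
    """Turn a (hypothetical) monitor alert dict into a case-creation request."""
    payload = {
        "title": f"Security scan from {alert['ip']}",
        "description": (f"WAF profile: {alert['profile']}\n"
                        f"Logs: {alert['logs_url']}"),
        "template": "security-scan",  # assumed template name
        "tags": ["waf", "scan"],
    }
    return urllib.request.Request(
        url=f"{THEHIVE_URL}/api/case",  # assumed endpoint
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


request = build_case_request({
    "ip": "203.0.113.7",
    "profile": "ecommerce",
    "logs_url": "https://logs.example.internal/search?ip=203.0.113.7",
})
# Sending it would be: urllib.request.urlopen(request)
```

Keeping the payload construction separate from the network call makes the formatting logic easy to test for each detection source.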

The response to an incident is divided into one or more tasks, and different incidents often share some of the same tasks.
Many of these tasks can be automated, and some incidents can be fully automated, from detection to remediation.

n8n — Extendable workflow automation

Figure 4 — n8n logo

Automation is the key to success. Automating the response to an incident whenever possible is essential and enables two things:

  1. Work Smarter: Automate repetitive tasks to force multiply our team’s efforts and better focus our attention on mission-critical decisions.
  2. Respond Faster: Reduce dwell times with automated investigations. Reduce response times with playbooks that execute at machine speed.

n8n is a great, open-source, extendable workflow automation tool.

There are several reasons why we decided to use n8n, but the main one is to stop creating technical debt by writing a multitude of custom scripts.

In our platform, n8n allows us to automate 3 key features:

Figure 5 — SOAR platform flow
  1. It is the entry point for all of our incident detection. We have one n8n workflow per source, which manages the creation of a properly formatted TheHive case (with the associated template and the necessary information).
  2. A large generic workflow automates the life of the case on TheHive. In particular, it can:
Figure 6 — n8n workflow to orchestrate a TheHive case
  • Start the investigation automatically.
  • Close the case when all tasks are done.
  • Alert the team on Slack as soon as a case with a certain criticality has been created, with the list of associated actions.
  • And most importantly, trigger another n8n sub-workflow when a task belongs to n8n.

  3. And the core of incident response automation: n8n takes care of most of the tasks required to resolve the incident.

When a case is created on TheHive, n8n retrieves the list of associated tasks and, for each task it owns, triggers the sub-workflow that matches the task name.
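The dispatch-by-task-name idea can be sketched in a few lines. This is not how n8n is implemented; it is a hypothetical Python analogue in which a registry maps task names to handler functions, and the case is represented as a plain dictionary with assumed keys.

```python
SUB_WORKFLOWS = {}  # task name -> handler function


def sub_workflow(task_name):
    """Register a function as the handler for tasks with this exact name."""
    def register(func):
        SUB_WORKFLOWS[task_name] = func
        return func
    return register


@sub_workflow("Ban the IP")
def ban_ip(case):
    # Placeholder: a real handler would call the WAF or firewall here.
    return f"banned {case['observables']['ip']}"


def dispatch(case, automation_user="n8n"):
    """Run the matching sub-workflow for every task owned by the automation user."""
    results = {}
    for task in case["tasks"]:
        if task["owner"] != automation_user:
            continue  # left for a human analyst
        handler = SUB_WORKFLOWS.get(task["title"])
        if handler is not None:
            results[task["title"]] = handler(case)
    return results


example_case = {
    "observables": {"ip": "203.0.113.7"},
    "tasks": [
        {"title": "Ban the IP", "owner": "n8n"},
        {"title": "Communicate internally", "owner": "analyst"},
    ],
}
results = dispatch(example_case)
```

Because handlers are looked up by task name only, any case template that includes a task with that name and assigns it to the automation user gets the same remediation.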

It is in this strong relationship between TheHive and n8n that the power of the solution lies.

Taking the simple use-case from earlier, here is what happens when an IP is blocked more than 50 times by the WAF in less than 5 minutes:

  • The Datadog monitor is automatically triggered and sends all the information to an n8n workflow,
  • All the relevant information (IP, WAF profile, link to the logs, etc.) is summarized in a TheHive case,
  • Because the case was automatically created from a template for this type of incident, a list of actions to resolve the incident is created. In this case the only task is “Ban the IP”,
  • As the task “Ban the IP” is owned by the user n8n on TheHive, the main n8n workflow starts the investigation and the execution of the task,
  • The n8n sub-workflow whose name matches the task name is triggered, in this case “Ban the IP”.
Figure 7 — n8n workflow to ban an IP
  • This sub-workflow fetches the observables and additional fields it needs in order to run,
  • Once it has run, the task is closed with proof of execution as a comment. As all tasks are then completed, the incident is closed.

The incident was thus automatically resolved without any human action in a couple of seconds.
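The last two steps (closing the task with proof, then closing the case once nothing is left to do) boil down to a simple rule. Here is a minimal sketch, assuming a case represented as a dictionary; the key names and status values are hypothetical.

```python
def complete_task(case, task_title, proof):
    """Mark a task as completed with its proof, then close the case if all tasks are done."""
    for task in case["tasks"]:
        if task["title"] == task_title:
            task["status"] = "Completed"
            task["log"] = proof  # proof of execution, kept as a comment
    if all(task["status"] == "Completed" for task in case["tasks"]):
        case["status"] = "Resolved"
    return case


incident = {
    "status": "Open",
    "tasks": [{"title": "Ban the IP", "status": "InProgress"}],
}
complete_task(incident, "Ban the IP", "IP 203.0.113.7 added to the WAF deny list")
```

With a single automated task, as in the security scan example, completing that task is enough to resolve the whole case.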

Like all other “case_task” workflows, the “Ban the IP” workflow is generic, so it can (and must) be reused for any type of incident that requires this remediation.
As soon as a TheHive case contains this task and assigns it to the n8n user, the task is automatically executed.

In some cases, the resolution of the incident cannot be 100% automated. But it can be divided into several tasks, some of which can be automated. This ensures that our team can concentrate their time on high value-added actions.

Security Orchestration, Automation and Response (SOAR) is changing the world of security operations, incident response and governance.

ManoMano has chosen to take advantage of two powerful open-source tools to build a SOAR platform best suited to our ecosystem and automate everything that can be automated.

Credit to Florian Gaultier (@agixid), for sharing with me about the combination of n8n and TheHive to automate security events ❤

Thanks to the creators of TheHive and n8n for their innovative products and their great community!
Edit: At the time of publication of this article, n8n@0.96.0 has just been released with new nodes for TheHive 🎉
