Building scalable SOAR infrastructure

Costin Canciu
Jan 30, 2022

Cybersecurity teams today face a massive burden: they must handle a volume of incidents that exceeds their available resources. Manual tasks, missing documentation and a lack of personnel have left security operations teams struggling to keep up with the growing volume of threats. This is where security orchestration, automation and response (SOAR) comes into play. SOAR platforms typically provide case management capabilities for security incidents to support efficient investigation, along with automations such as playbooks, scripts and enrichment. Together, these help companies improve their overall KPIs and save valuable time and money.

Why do we need scalability?

Cybersecurity teams have to cope with more vulnerabilities than ever because of the rapidly expanding attack surface. The number of attack vectors has grown substantially over the past years as data has migrated to the cloud, and it is expected to grow exponentially.

Large companies can expect hundreds of thousands of incidents per year, while the alert count can exceed one million. At the same time, multiple security analysts will be working in parallel. This generates a huge number of simultaneous jobs and puts a massive load on the SOAR platform.

Moreover, the number of SOAR scripts and automations will grow over time. The source code needs to be written efficiently in order to achieve resilience and to shorten future development efforts.

I. Choosing the architecture

One of the earliest and probably most important stages is the planning and design phase, when the hardware and software deployment specifications are built.

Most SOAR solutions offer a distributed deployment option, where multiple servers can be added as “nodes” to load-balancing groups. This makes it possible to split the workload between different integrations or instances, and can greatly improve platform performance.

Another out-of-the-box feature is a backup server for the master server. Users should not experience any downtime on a platform that handles security incidents. Besides the risk of critical alerts not being handled, downtime can also lead to inconsistencies in historical data and reports. If the SOAR platform does not come with a failover solution, OS-level backups should at least be performed.

II. Dropping duplicate data

Since most incidents are generated by a SIEM, duplicate alerts are likely, especially if throttling is not well configured. These duplicates occupy unnecessary space and trigger unnecessary script and automation calls. Having an automated process in place for handling duplicate alerts has a huge impact on the overall incident handling process.
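One possible way to automate this is to fingerprint each incoming alert and drop any alert whose fingerprint was already seen within a short time window. The sketch below is only an illustration: the field names (`rule_name`, `source_ip`, `destination_ip`), the 15-minute window and the `AlertDeduplicator` class are assumptions, not part of any specific SOAR platform or SIEM.

```python
import hashlib
from datetime import datetime, timedelta

# Hypothetical alert fields that decide whether two alerts are "the same".
FINGERPRINT_FIELDS = ("rule_name", "source_ip", "destination_ip")


def fingerprint(alert: dict) -> str:
    """Hash the identifying fields of an alert into a stable key."""
    raw = "|".join(str(alert.get(field, "")) for field in FINGERPRINT_FIELDS)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class AlertDeduplicator:
    """Flags alerts whose fingerprint was already seen within a time window."""

    def __init__(self, window: timedelta = timedelta(minutes=15)):
        self.window = window
        self._last_seen = {}  # fingerprint -> time of the most recent alert

    def is_duplicate(self, alert: dict, now: datetime) -> bool:
        key = fingerprint(alert)
        previous = self._last_seen.get(key)
        # Always record the latest sighting, so the window slides forward
        # for as long as the same alert keeps arriving.
        self._last_seen[key] = now
        return previous is not None and now - previous <= self.window
```

In a real deployment the in-memory dictionary would likely be replaced by shared storage, so that every node of a distributed setup sees the same fingerprints, and old entries would be expired periodically.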

III. Development

Most SOAR platforms allow scripts and automations to be developed in Python. Since Python 2 is no longer maintained, all scripts should be developed in Python 3.

A critical aspect is ensuring the quality and maintainability of the scripts. Source code must be written according to clean code principles such as DRY (Don’t Repeat Yourself) and KISS (Keep It Simple, Stupid).

Naming conventions should be enforced for both scripts and variables, and applied consistently across the whole SOAR codebase. A version control system such as Git and a CI/CD pipeline can also help ensure a healthy SOAR development process.
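As an illustration of how a naming convention might be enforced automatically, a small check like the following could run as a CI step and fail the build when a script name does not match. The convention itself (a category prefix followed by lowercase snake_case) and the function name `find_badly_named` are assumptions made for this sketch only.

```python
import re

# Hypothetical convention: <category>_<snake_case_name>.py,
# where the category is one of a small, agreed list of prefixes.
SCRIPT_NAME_PATTERN = re.compile(
    r"^(enrich|contain|notify|util)_[a-z0-9]+(_[a-z0-9]+)*\.py$"
)


def find_badly_named(filenames):
    """Return the script names that do not follow the convention."""
    return [name for name in filenames if not SCRIPT_NAME_PATTERN.match(name)]


if __name__ == "__main__":
    scripts = ["enrich_ip_reputation.py", "BlockIP.py", "notify_slack.py"]
    offenders = find_badly_named(scripts)
    for name in offenders:
        print(f"Naming convention violation: {name}")
```

A pipeline would typically collect the file names from the repository and exit with a non-zero status when the list of offenders is not empty.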

Large components should be split into smaller, reusable ones. The number of API calls should also be kept to a minimum to reduce overhead. In the end, each script should be optimized for high performance.
