How OWASP Coraza Improved Performance by 100x

Juan Pablo Tosso
Mar 21, 2023

Special thanks to the OWASP Coraza team, especially anuraaga, jcchavezs, and m4tteoP from Tetrate, and to the companies that provided valuable research and benchmarks: Intel, Tetrate, and Traceable.

In this article, we will discuss the process of optimizing the code of OWASP Coraza, a web application firewall library written in Go. The Coraza team, myself included, improved the performance of the WAF from 700 events per second to 70,000. We will walk through the various techniques and strategies we employed to achieve such a significant improvement. This article will particularly interest developers looking to optimize their Go applications, as well as anyone who wants to learn more about web application firewalls and their underlying technology.

Your results may vary from the numbers in the title. Benchmarks were created using https://github.com/jptosso/coraza-benchmark-2 on a MacBook Pro M1 with 16 GB of memory, measuring a simple GET transaction. Benchmarks are also available for more complex use cases.

About the author

Juan Pablo Tosso is the author and co-leader of OWASP Coraza. He has more than ten years of professional experience in web application security and works as a Research Engineer at Traceable AI.

Coraza Internals

Coraza WAF uses a set of directives from files, strings, or code configurations to create a WAF instance, including its rules.
A WAF instance can be used to create transactions. A transaction is an object that is enriched with data by consuming the provided helpers. Certain helpers invoke rule phases 1 through 5; for example, tx.ProcessRequestHeaders invokes phase 1 rules and, if configured, might also perform a disruptive action (an interruption).

Other elements are involved in evaluating transactions, like Transaction Variables, Body Processors, Audit Loggers, Rule Operators, Rule Actions, and Rule Transformations. More info can be found at https://www.coraza.io/

Coraza is designed to work similarly to ModSecurity but focuses on v2 rather than v3, which provides better compatibility with OWASP Core Ruleset.

  • Actions are intelligent: Actions are programmatically assigned, and they run only when they are required. For example, metadata actions are evaluated at bootstrap but won’t make it into the rule itself, unlike ModSecurity, which runs all actions every time a rule is evaluated.
  • Caching transformations: Coraza implements a transformation caching mechanism to avoid repeating the same transformation multiple times in a single transaction.
  • Variables are generic: ModSecurity uses one class per variable; in Coraza, we only consume six (6) optimized collection types, considerably reducing the complexity of variables.
  • Rules are immutable: Unlike ModSecurity, which copies the whole RuleSet object once per transaction, Coraza never copies its rules, so multiple transactions can consume the same rules without worrying about race conditions.
  • Plugins: Internal modules like Rule Actions, Rule Transformations, Rule Operators, Audit Loggers, and Body Processors can be extended by consuming the plugin helpers.
  • Response body processing: Coraza provides response body processing features, as most APIs return JSON or a similarly structured response.
  • The URL path is also a variable: Modern web applications often use RESTful paths that contain important information, like a resource ID. That’s why Coraza is capable of parsing URLs and adding the parameters as input variables in ARGS and ARGS_PATH.

Go profiling

pprof is a profiling tool in the Go programming language that helps developers identify performance bottlenecks in their applications. It works by generating a CPU profile or a heap profile of a running Go program, which can then be analyzed to identify slow or memory-intensive sections of the code.

To use pprof from the terminal without changing your code, you can profile your program’s tests or benchmarks with the built-in profiling flags of go test. Specifically, use the -cpuprofile flag to specify the output file for the CPU profile data, and the -memprofile flag to specify the output file for the heap profile data. (Note that go run does not accept these flags; they belong to go test.) For example:

go test -bench=. -cpuprofile=cpu.prof -memprofile=heap.prof

This command will run your benchmarks and generate two profile files: cpu.prof for the CPU profile data and heap.prof for the heap profile data.

Once you have generated the profile data, you can use the go tool pprof command to analyze it and serve an interactive report over HTTP. Add the -http flag with the address of the web server that will host the report, and pass the profile data file as an argument. For example:

go tool pprof -http=:8080 cpu.prof

Challenges of maintaining cross-platform support

We currently support two compilers, go and tinygo, so we must ensure that everything we optimize works for both and doesn’t hurt performance on either.
The benefit of this is that Coraza can be compiled to WebAssembly, so it can be consumed by Wasm VMs and proxy-wasm implementations.

First profiling

Macro Expansion optimization: Compiling messages with macro expansions into strings was one of the first bottlenecks: it required us to tokenize the message on each run and look for macros (%{tx.request_id}) using regular expressions. To optimize this, we replaced that with a struct that tokenizes the string once into static and variable parts. Each time the macro expansion is evaluated, we simply concatenate the tokens, resolving each variable programmatically.

Aho-Corasick optimization: We previously used Cloudflare’s implementation of Aho-Corasick, whose hardcoded configurations weren’t efficient enough for our use case. Replacing it allowed us to analyze content with the pm operator efficiently.

Optimizing rule matching: Another bottleneck was the mechanism used to match a rule and generate variables like MATCHED_VARS and MATCHED_VARS_NAMES. To optimize this, we manually nullified old variables instead of relying on the GC and added conditions to avoid duplicated allocations.

General allocation reduction: We went through the whole code to find general performance issues, mostly:

  • Reducing loop complexity and steps
  • Removing unnecessary or duplicated tasks
  • Preallocating resources using approximations of the exact values

Sync Pools

In OWASP Coraza, transactions that contain hundreds of variables are created and used frequently during runtime. Since creating and initializing these transactions can be expensive in terms of time and resources, it’s important to find a way to reuse them efficiently to improve performance.

This is where sync.Pool comes in. sync.Pool is a type in Go’s standard library sync package that provides a way to reuse objects of a certain type, which helps to reduce the overhead of creating and initializing new objects.

When a transaction is finished, it is returned to the sync.Pool instead of being garbage collected. The next time a transaction is needed, instead of creating a new one, the application can retrieve a previously used transaction from the pool.

By reusing transactions in this way, the application can avoid the overhead of creating new transactions and allocating memory each time, which can improve performance and reduce memory allocation.

Refactoring Variables/Collections

Variables in Coraza v2 are handled as map[string][]string, which means we have to allocate at least two resources per variable, and we have to iterate a map and a slice to get even a single-valued variable like REQUEST_ID.
To solve this problem, Coraza v3 implements what we call collections. Each collection is generic in terms of how it is consumed by the transactions, but internally each one solves different use cases. The following collections are implemented:

  • Map: It stores a map[string][]string and returns the list of elements belonging to the key string or regex, for example, TX and ARGS_GET.
  • Concat: It takes multiple maps as input and returns the results for all maps. For example, ARGS returns ARGS_POST, ARGS_GET, and ARGS_PATH.
  • Named: It returns the list of “key” names for multiple map collections, for example, ARGS_NAMES.
  • Noop: It does nothing. It’s used for non-implemented collections.
  • Single: It can only store a single value.
  • Sized: It’s a single programmatic collection that only returns the count of values.

Other types of collections can be implemented for special dynamic variables like TIME and ENV.

As a result of this refactor, we see fewer allocations and faster data access:

  • Concat, Sized, and Named variables are abstract and connected to other variables, so we just point to other collections and avoid duplicating data.
  • Single variables only allocate a single string instead of a complex map, making access a single, fast step.
  • Non-implemented variables don’t allocate anything because they are Noops.

Performance regression tests in CI

A benchmark regression GitHub Action was implemented to reject PRs that would make Coraza slower. In case of a performance regression, the PR is blocked and a detailed description of the benchmark differences is posted as a comment.
This matters because it keeps Coraza’s performance from regressing, and it also stores a history of benchmark results for all commits.

Immutability refactor

OWASP Coraza uses immutability as a design pattern to ensure that data resources shared across functions are always interfaces, which helps to prevent the allocation of unnecessary variables. This can help to optimize memory usage and improve performance in Go applications.

In addition, immutability also enhances security by preventing developers from making certain mistakes that could introduce vulnerabilities, such as inadvertently modifying shared data or creating race conditions. By using immutable data structures, developers can ensure that data is always consistent and predictable, which can help to reduce the risk of security vulnerabilities.

Overall, the immutability design pattern used in OWASP Coraza helps to promote good programming practices and improve the security and performance of Go applications.

Bytes to string memory hack

It is not a core optimization, but it is an exciting example of reducing memory allocation as much as possible. This hack, implemented by anuraaga, provides a mechanism for byte slices to be transformed into strings without allocation. If you do []byte("text") or string([]byte("text")), you get an allocation, as you are creating a copy of the string/[]byte.
You can use the following code to transform the slice into a string without allocating memory:

// WrapUnsafe wraps the provided buffer as a string.
func WrapUnsafe(buf []byte) string {
    return *(*string)(unsafe.Pointer(&buf))
}

Note that this only works for certain use cases: because the returned string shares the byte slice’s memory, the slice must not be modified after the conversion, or the supposedly immutable string will change underneath you.

Zero allocation debug logs

Coraza relies on debug logs for development and rule debugging; therefore, some trace-level logs live in critical functions that can easily be called millions of times per second.
Intel’s research team submitted a document attributing multiple unnecessary allocations to these debug logs. We could take two paths: implement a noop debug logger that discards its arguments, preventing allocation; or, as suggested by Intel, replace our debug engine with Zerolog, which provides fast, reliable logging with low to zero allocations.

Quoting Intel:

The Zerolog is a simple fast JSON-oriented logging framework in Go that aims to avoid allocations and reflection. It is widely used.

Analysis so far shows the impact of Debug log statements in coraza library (and other heap allocations) on the overall performance of coraza library. We need a better approach for logging in Coraza library so that Debug statements in the code doesn’t allocate memory in heap unnecessarily (For example, if debug logs are not enabled then heap allocations should be low).

Let’s try using zerolog library which is optimized in terms of memory allocation. Let’s replace some of the existing Debug log statements with zerolog APIs. We’ll use zerolog Debug API in doEvaluate() function which had lot of memory allocation as we saw in previous section.

Changes to use zerolog in coraza library can be found in following commit

https://github.com/manojgop/coraza/commit/65dbac0e15a1634cd856cac61b6794d7c4909257

Following is the output after running the benchmark tests using zerolog library for logging. For the convenience of the reader, the output of previous run is also pasted below for comparison.

Total improvement in performance was about 10% to 11%.

Pending problems

Although OWASP Coraza is fast and lightweight, it can still be faster. We are constantly finding new ways to improve performance; here are some problems we are already trying to figure out.

  • Replace the regex engine. The current engine (Go’s RE2-based regexp package) consumes about 18% of the used resources.
  • Maintain the current Variables/Collections performance but provide a plugin architecture to allow custom variables. The current implementation requires many references to the collection pointer in the transaction, transaction state, and rules execution workflow.
  • Implement session persistence without affecting performance.

Conclusion

In conclusion, the performance optimization of OWASP Coraza by the Coraza team resulted in a significant improvement in the WAF’s speed, from 700 events per second to 70,000 events per second. The team employed various techniques, including profiling with pprof, caching transformations, reducing allocations, optimizing rule matching, and using sync pools. These optimizations have made Coraza a more efficient and faster WAF, making it an excellent choice for engineers looking to implement a WAF or those interested in learning more about web application firewalls and their underlying technology.

Some general tips:

  • Use strings.Builder; if you know your string’s length, or have an approximation, you can use the Grow function to preallocate the memory.
  • If you know the length of the slice that will be created, use the make function to preallocate all the memory, or use an array.
  • Prefer passing pointers instead of large values to functions.
  • Use pprof to find which functions are spending the most resources.
  • Recycle memory if possible. Sync Pool is a great alternative.
