Throughput of Splunk Ingest Actions with Regular Expressions: Best Practices

Brent Davis
Splunk Engineering
Jan 5, 2022 · 13 min read

Outline

  • Background
    - Regex Implementations
    - The Regex Engine in Splunk
    - The PCRE2+JIT engine
    - Backtracking
  • Principles for crafting performant Regex expressions
    - Principle #1: Specific is Better than General
    - Principle #2: Be Lazy if you can
    - Principle #3: Use Anchors when Possible
    - Principle #4: Tune Alternation and Character Classes
    - Principle #5: Use Negation in Character Classes
    - Principle #6: Use Capture Groups Sparingly
  • Other Considerations
  • Summary

Executive Summary

Splunk Ingest Actions enables customers to leverage regular expressions (regexes) to filter and mask incoming data. Regexes are highly flexible, and the means by which a regular expression is crafted can have a large impact on the overall performance of that regex.

After presenting some background on the regex engine, this discussion presents principles aiming to help customers optimize their regex expressions. Utilizing real-world examples with Ingest Actions, we show six techniques that make a dramatic improvement in data processing throughput.

We investigate the value of these principles in a simple test case with the following results:

  1. Specific is Better than General — 500% improvement over base case
  2. Be Lazy if You Can — 25% improvement
  3. Use Anchors when Possible — up to 72% improvement
  4. Tune Alternation and Character Classes — 10% improvement
  5. Use Negation in Character Classes — up to 350% improvement
  6. Use Capture Groups Sparingly — 7% improvement

Background

Introduction

Quite simply, a regex is a way to describe, match, and transform a piece of text.

For any system applying a regex to thousands or millions of events per second, the performance of that regex can become a dominating factor in the overall performance, latency, and resource usage of the system.

In the pages below, we examine what this means to Splunk customers as they adopt the Ingest Actions product. First, we cover some key background details of how regexes work. General knowledge of the regex language is assumed.

Regex Implementations

Regular expressions can be thought of in two parts: the regex language and the regex engine. The regular expression language is effectively a specialized programming language designed around processing text. The regular expression engine is fundamentally an implementation of how the regex language is interpreted and applied.

There is much flexibility in the regex language, and there are many implementations of the regex engine. Consequently, both the way we write a regular expression and the regex engine itself have implications for the overall performance of our data processing system.

The Regex Engine in Splunk

Splunk customers may already be familiar with regex expressions in Splunk, using the | rex SPL command. This command allows the user to apply a regular expression to a SPL search query. Splunk recently introduced a new beta feature, Ingest Actions, which allows Splunk administrators to apply regular expressions to data as it is ingested. Ingest Actions enables masking or filtering data on the ingest stream, and then indexing or routing that data elsewhere.

With either search or ingest, the same core regex engine is invoked. And in both cases, we are constructing a regular expression statement and running it through a regex engine over a potentially huge number of events.

In the 70 years that regex expressions have been in use, many different regex engines have been developed, with different feature sets and optimizations. Therefore, whenever we talk about regex performance, it is important to understand a bit about Splunk’s regex engine and its particular properties.

The PCRE2+JIT engine

As a regex engine, Splunk uses the open source PCRE2 (short for Perl Compatible Regular Expressions) library.

PCRE2 is a good choice for Splunk’s regex engine: it is widely adopted, well performing, full featured, and standardizes on a pattern-matching language using the same syntax and semantics as Perl.

Splunk also enables an optional feature of PCRE2: JIT, or just-in-time compiling. JIT is an optimization that can greatly improve the speed of pattern matching by compiling patterns with architecture-specific code. With this optimization, there is additional processing that must be done before a match is performed, but once a pattern is compiled, matches are much quicker.

JIT is most helpful when the same pattern is going to be matched many times — for Ingest Actions customers, this is the common case: a relatively small number of regexes applied on a per-event basis to millions of events.

Backtracking — a Dominant Consideration in Regex Performance

Most modern regex engines, including PCRE2, implement a recursive matching algorithm called backtracking. Backtracking is how regex engines achieve their flexibility, and it is what a regex engine spends most of its time doing over the course of searching a string.

To understand backtracking, consider a case where a quantifier (such as +, *, or ?) or an alternation (|) exists in a regex pattern. When the regex engine tries to match this pattern against a string, a greedy quantifier will first attempt to match the longest substring possible. If the rest of the pattern then fails to match, the engine will move backwards in the text and try to match a shorter substring. This backtracking continues, character by character, storing state along the way, and the engine returns to each previously saved state in the search for a match.

Backtracking is a very important concept in regular expression performance, and is often the biggest factor in the performance of a regex pattern. The construction of a regex can have a significant influence on the amount of backtracking the engine must do. As we enumerate best practices for crafting performant regex expressions, we’ll find, in large part, the better performing regexes will do the least backtracking.
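Backtracking cost can be sketched outside of Splunk, since Python's re module is also a backtracking engine (the absolute numbers in this article come from Splunk's PCRE2+JIT engine, not Python). The pattern below is a deliberately pathological example, not one of the Ingest Actions tests: nested quantifiers give the engine exponentially many ways to split the input before it can conclude there is no match.

```python
import re
import time

# Nested quantifiers: (a+)+ can split a run of "a"s in exponentially
# many ways, and the trailing "b" forces every split to be tried
# before the engine can report failure.
pathological = re.compile(r'(a+)+b')

for n in (10, 18, 22):
    text = 'a' * n          # no "b", so the match must ultimately fail
    start = time.perf_counter()
    result = pathological.match(text)
    elapsed = time.perf_counter() - start
    print(f'n={n:2d}  match={result}  seconds={elapsed:.4f}')
```

Each additional "a" roughly doubles the time to fail, which is exactly the kind of runaway backtracking the principles below aim to avoid.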

Principles for crafting performant Regex expressions

Test Methodology

We want to broadly describe and understand the best practices for well-performing regex expressions when processing data with Splunk. As such, we survey several examples of regex processing utilizing Splunk Ingest Actions and Splunk’s PCRE2 regex engine.

The test methodology is simple: we utilize a few samples of single-line events; for these events, we exercise a regex expression and explore several best practices to make it perform better. Our benchmark consists of a set of identical events, with monotonically increasing timestamps, all fed through an Ingest Actions ruleset.

The Splunk Ingest Actions feature allows Splunk Administrators to use regex in Masking or Filtering. These tests apply to either scenario.

Principle #1: Specific is Better than General

For our first test, let’s compare two regexes trying to extract some information from a line of DHCP log data, below:

<22> 2021-01-02 15:04:05 10.10.10.10 dhclient[3209]: DHCPREQUEST on eth0 to 10.234.1.1 port 67 (xid=0x5b9566d8)

Let’s say we want to extract the second IP address, the port, and the DHCP message out of this log.

A very general way to do this would be to think of it positionally: construct a regex that moves down the string, looks at the characters surrounding what we want, and wildcard everything else. The regex below uses this technique to successfully extract our data:

.*]: (.*).* on.*to (.*\..*\..*\..*) port (.*)

Indeed we get the data we want — we simply move forward in the string until we reach ]:, capture the DHCP message, and proceed with a similar strategy for the other two fields.

Executing this in our Ingest Actions test where the regex pattern is applied to 10 million events, we see this regex can process data through Ingest Actions at 36,918 events per second (EPS).

Is there a way we could improve this? There are many wildcards in this regex expression. As we discussed earlier, these optional quantifiers create a lot of backtracking and increase overhead.

What happens if we rewrite the expression to be as specific as possible? Examine the regex below:

<\d{2}> [12]\d{3}-[01]\d-[0-3]\d [012]\d:[0-6]\d:[0-6]\d (?:[0-9]{1,3}\.){3}[0-9]{1,3} \w+\[[0-9]+\]: ([A-Za-z]*) on [a-z]{3}[0-9]{1,2} to ([.0-9]*) port ([0-9]{1,5})

This is quite a bit longer, and very specific. We are clear when to expect a digit versus a non-numeric character, how long of a string to expect, and we eliminate most wildcards. Stated another way, we eliminate a lot of the possible paths for the regex engine to backtrack.

The results are striking: running our same benchmark, Ingest Actions is able to process the same data at a rate of 218,718 events per second. This is a nearly 500% improvement over the previous regex.

This example shows just how much difference a well-tuned regex can make, and it leads us to our first principle of regex expressions: specific is better than general. The improvement in this regex against many millions of events per second is dramatic in the increased processing throughput.
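The two extractions can be compared side by side in a small Python sketch (Python's re engine backtracks like PCRE2, though the throughput figures above come from Splunk's PCRE2+JIT engine, not Python). It also reveals a correctness bonus of being specific: the general pattern over-captures the port field.

```python
import re

log = ('<22> 2021-01-02 15:04:05 10.10.10.10 dhclient[3209]: '
       'DHCPREQUEST on eth0 to 10.234.1.1 port 67 (xid=0x5b9566d8)')

# The general, wildcard-heavy pattern.
general = re.compile(r'.*]: (.*).* on.*to (.*\..*\..*\..*) port (.*)')

# The specific pattern: each field pinned to its expected shape.
specific = re.compile(
    r'<\d{2}> [12]\d{3}-[01]\d-[0-3]\d [012]\d:[0-6]\d:[0-6]\d '
    r'(?:[0-9]{1,3}\.){3}[0-9]{1,3} \w+\[[0-9]+\]: ([A-Za-z]*) on '
    r'[a-z]{3}[0-9]{1,2} to ([.0-9]*) port ([0-9]{1,5})'
)

# The trailing greedy (.*) in the general pattern swallows everything
# after "port", not just the port number.
print(general.search(log).groups())
# -> ('DHCPREQUEST', '10.234.1.1', '67 (xid=0x5b9566d8)')

print(specific.search(log).groups())
# -> ('DHCPREQUEST', '10.234.1.1', '67')
```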

Principle #2: Be Lazy if You Can

In the optimization above, we eliminate any “.*” wildcards from our regex. “*” is an example of a greedy quantifier. Greedy quantifiers strive to match as many characters in a string as possible, and then backtrack, character by character, to find the overall match. Often, we don’t need this behavior, and can use a lazy quantifier instead. Lazy quantifiers match the minimal number of characters possible.

While not every greedy quantifier can be rewritten with lazy quantifiers, carefully considering whether lazy quantifiers can be used in a regex can yield sizable performance improvements.

Using our example, if we simply take our poorly-performing regex and change each greedy quantifier to a lazy one (“*?” is the lazy version of “*”), we get the expression below:

.*?]: (.*).*? on.*?to (.*?\..*?\..*?\..*?) port (.*)

Even though we are not as specific as in the fully-optimized regex, with this change alone, we can see how powerful this optimization can be: increasing throughput from 36,918 EPS to 46,325 EPS.

This is a 25% improvement, solely because lazy quantifiers reduce the amount of backtracking. Note that we can’t change all quantifiers to lazy, however: we still need to move to the end of the string and check different states to get our full match.

One note here: greediness or laziness influences the order in which paths are checked, but not necessarily which paths are checked. This means that if no match is found, the performance of greedy and lazy expressions is equivalent.
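The greedy and lazy variants can be compared in a Python sketch (a backtracking engine like PCRE2, though relative timings will differ from the PCRE2+JIT figures quoted above):

```python
import re
import timeit

log = ('<22> 2021-01-02 15:04:05 10.10.10.10 dhclient[3209]: '
       'DHCPREQUEST on eth0 to 10.234.1.1 port 67 (xid=0x5b9566d8)')

greedy = re.compile(r'.*]: (.*).* on.*to (.*\..*\..*\..*) port (.*)')
lazy = re.compile(r'.*?]: (.*).*? on.*?to (.*?\..*?\..*?\..*?) port (.*)')

# Both variants extract the same fields from this event; they differ
# only in how much backtracking they do to get there.
assert greedy.search(log).groups() == lazy.search(log).groups()

# Timings are environment-dependent; print them rather than assume.
for name, rx in (('greedy', greedy), ('lazy', lazy)):
    print(name, timeit.timeit(lambda: rx.search(log), number=50_000))
```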

Principle #3: Use Anchors when Possible

Continuing our example, let’s look at another log type. This log entry is an sshd log, with a slightly different format:

Nov 17 15:08:39 localhost sshd[621893]: Failed password for nemo from 192.168.0.7 port 8132 ssh2

Now, we want to extract the time of all failed password attempts, with an expression such as:

((?:[A-Z]|[a-z]){3} \d{1,2} \d{1,2}:\d{2}:\d{2})

Using what we’ve learned so far, we’ve avoided greedy quantifiers, and are as specific as possible. In our benchmark, we see this regex processes our test case at 180,122 EPS — not too bad.

Let’s introduce the idea of an anchor. The anchor (^) asserts the start of a line; a pattern beginning with ^ can only match from the first character, which limits the regex traversal. This should limit the backtracking:

^((?:[A-Z]|[a-z]){3} \d{1,2} \d{1,2}:\d{2}:\d{2})

This action to anchor the line and reduce backtracking helps considerably, as our test throughput is now raised to 294,559 EPS.

This is a good point to discuss the case of a non-match. Non-matching regexes can have their own performance characteristics. The same principles apply, but the results could be different — as mentioned in “be lazy if you can,” the amount of backtracking needed for a non-matching case could vary significantly from a matching case.

So let’s check a case of a non-match. For non-matching lines being processed by this regex, we reintroduce the DHCP log data from our previous example. Our current regex pattern does not find any matches on this data:

<22> 2021-01-02 15:04:05 10.10.10.10 dhclient[3209]: DHCPREQUEST on eth0 to 10.234.1.1 port 67 (xid=0x5b9566d8)

The regex traverses the entire string to determine there is no match. Our benchmark shows processing completes at a somewhat comparable 257,463 EPS.

When we run the same regex, but anchor it, we tell the regex engine to stop very quickly if there’s no match at the beginning of the string:

^((?:[A-Z]|[a-z]){3} \d{1,2} \d{1,2}:\d{2}:\d{2})

In this case, we can process at 442,663 EPS — 72% better throughput.

Anchors can help in many situations: we avoid greedy quantifiers in this regex, but we would also see a strong improvement if they did exist — anchoring a string is a generally good practice to reduce the backtracking cost of a regex.
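The matching and non-matching behavior can be sketched in Python (same caveat as before: Python's re is a stand-in for Splunk's PCRE2+JIT here):

```python
import re

unanchored = re.compile(r'((?:[A-Z]|[a-z]){3} \d{1,2} \d{1,2}:\d{2}:\d{2})')
anchored = re.compile(r'^((?:[A-Z]|[a-z]){3} \d{1,2} \d{1,2}:\d{2}:\d{2})')

sshd_log = ('Nov 17 15:08:39 localhost sshd[621893]: '
            'Failed password for nemo from 192.168.0.7 port 8132 ssh2')
dhcp_log = ('<22> 2021-01-02 15:04:05 10.10.10.10 dhclient[3209]: '
            'DHCPREQUEST on eth0 to 10.234.1.1 port 67 (xid=0x5b9566d8)')

# On a matching line, both return the timestamp.
print(unanchored.search(sshd_log).group(1))  # Nov 17 15:08:39
print(anchored.search(sshd_log).group(1))    # Nov 17 15:08:39

# On a non-matching line, the unanchored pattern must be retried at
# every starting position; the anchored one can give up after failing
# at position 0.
print(unanchored.search(dhcp_log))  # None
print(anchored.search(dhcp_log))    # None
```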

Principle #4: Tune Alternation and Character Classes

Often, we can use alternation (x|y) or character classes [xy] to achieve the same result. In either expression, we match an x or y.

How we create an alternation influences performance. Continuing with our DHCP log example, suppose we want to match a log entry with any possible DHCP messages except DHCPOFFER. Possible DHCP messages are (DHCPREQUEST, DHCPACK, DHCPDECLINE, DHCPDISCOVER, DHCPNAK, DHCPOFFER). Again, we use our example log snippet below:

<22> 2021-01-02 15:04:05 10.10.10.10 dhclient[3209]: DHCPREQUEST on eth0 to 10.234.1.1 port 67 (xid=0x5b9566d8)

A straightforward way to achieve our aim would be to create a regex that will match each desired possibility:

(DHCPACK|DHCPDECLINE|DHCPDISCOVER|DHCPNAK|DHCPREQUEST)

This gets us our desired match, and processes at 254,992 EPS.

However, we have a common string, “DHCP” in each match, so we can shorten this a bit, and just alternate on the second half of the string:

(DHCP(?:ACK|DECLINE|DISCOVER|NAK|REQUEST))

This improves performance a bit, to 262,902 EPS.

But let’s go one step further — we know the set of possible DHCP messages, and we can safely just select the first letter to hunt for a match on:

(DHCP(?:A|D|N|R))

Again, we get a little performance boost, to 278,088 EPS.

We’re using the alternation construct here, which means each alternative must be tried at each starting position. The character class [ADNR] would achieve the same result, so what if we rewrote our expression as a character class instead?

(DHCP[ADNR])

This leads us to a final throughput of 282,225 EPS, a nice 10% improvement from our initial regex performance of 254,992 EPS.
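All four variants can be checked for equivalence in a quick Python sketch (throughput numbers above are from Splunk's PCRE2+JIT engine, not Python):

```python
import re

event = 'DHCPREQUEST on eth0'
offer = 'DHCPOFFER on eth0'

patterns = [
    r'(DHCPACK|DHCPDECLINE|DHCPDISCOVER|DHCPNAK|DHCPREQUEST)',
    r'(DHCP(?:ACK|DECLINE|DISCOVER|NAK|REQUEST))',
    r'(DHCP(?:A|D|N|R))',
    r'(DHCP[ADNR])',
]

for p in patterns:
    # Every variant matches the wanted messages and rejects DHCPOFFER.
    assert re.search(p, event) is not None
    assert re.search(p, offer) is None
    print(p, '->', re.search(p, event).group(1))
```

One trade-off to note: the first two patterns capture the full message name ('DHCPREQUEST'), while the shortened ones capture only 'DHCPR', which is fine when we only need to accept or reject the event.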

Principle #5: Use Negation in Character Classes

Continuing with character classes, we next consider character class negation. Inside a character class, using a leading “^” allows us to list the characters not to be included in the match. For example, [^x] means to match any character that isn’t “x”.

In our example with DHCP messages, recall we want to match any message except DHCPOFFER. Using negated characters, we can rewrite our last expression:

(DHCP[^O])

This gives us a slightly simpler expression, and we get about the same performance as in our previous test: 282,177 EPS.

But let’s compare a more interesting example, and go back to our very first, very poorly performing regex which processed at an abysmal 36,918 EPS:

.*]: (.*).* on.*to (.*\..*\..*\..*) port (.*)

Looking at this from the case of character negations, we know there are never space characters inside the capture groups, so we can replace “.*” with “[^ ]*” — i.e., zero or more characters that are “not a space”:

.*]: ([^ ]*) on.*to ([^ ]*) port ([^ ]*)

This simple change eliminates a lot of greedy quantifiers which must backtrack. We see the benefit in throughput: 165,673 EPS, a 350% increase.

While this isn’t as good as the 500% improvement we achieved in our first, very specific regex, this expression illustrates the power of this character negation technique in reducing backtracking.
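A quick Python sketch confirms the negated-class version extracts the intended fields (and, as a side benefit over the greedy original, the port group stops cleanly at the next space):

```python
import re

log = ('<22> 2021-01-02 15:04:05 10.10.10.10 dhclient[3209]: '
       'DHCPREQUEST on eth0 to 10.234.1.1 port 67 (xid=0x5b9566d8)')

# "[^ ]*" stops at the first space, so each field is delimited cheaply
# instead of forcing ".*" to backtrack from the end of the string.
negated = re.compile(r'.*]: ([^ ]*) on.*to ([^ ]*) port ([^ ]*)')

print(negated.search(log).groups())
# -> ('DHCPREQUEST', '10.234.1.1', '67')
```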

Principle #6: Use Capture Groups Sparingly

When writing a regex, we may need to use parentheses to specify an alternation or to apply a quantifier to a group of characters. As a side effect, any match in these parentheses is captured to an internal variable — this is called an implicit capture. Unsurprisingly, capturing and saving text to a variable does have an overhead, and it’s useful to avoid it when possible.

Throughout the examples so far, I occasionally use the notation (?:) inside a set of parentheses to do just this. I use these in previous examples so as to not influence a benchmark with an implicit capture group.

What is the real cost of these unnecessary captures? Since alternation has the side effect of creating a capture group, let’s examine again our DHCP messages example, below:

<22> 2021-01-02 15:04:05 10.10.10.10 dhclient[3209]: DHCPREQUEST on eth0 to 10.234.1.1 port 67 (xid=0x5b9566d8)

In this log entry, when we search for possible DHCP message types, we implicitly capture anything inside the alternation. For the sake of demonstration, let’s add several alternations to our expression:

(DHCP((ACK|DECLINE)|(DISCOVER|NAK)|REQUEST))

This executes at 248,218 EPS. We can benchmark the cost of alternations by eliminating the unnecessarily captured text with ?:, as below:

(DHCP(?:(?:ACK|DECLINE)|(?:DISCOVER|NAK)|REQUEST))

Here we get the same result at a slightly faster 267,447 EPS, a 7% improvement. The difference is small, but with many capture groups, it can add up.

As another example, all through this document, I use a lot of capture groups to indicate and capture desired data. What if I just eliminated those capture groups? A customer’s ability to do this will depend on their use case and if they need those capture groups for later processing, but for illustration purposes, remember that our regex against the DHCP log with character negation completes at 165,673 EPS:

.*]: ([^ ]*) on.*to ([^ ]*) port ([^ ]*)

Let’s remove all capture groups entirely:

.*]: [^ ]* on.*to [^ ]* port [^ ]*

Simply removing all capture groups brings execution to 168,793 EPS, a 2% improvement. So the lesson here is: we shouldn’t use capture groups unless we need them, and it is helpful to disable any unnecessary implicit capture groups with (?:).
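The bookkeeping cost of implicit captures is easy to see in Python, where .groups() exposes every capture slot the engine had to track:

```python
import re

captured = re.compile(r'(DHCP((ACK|DECLINE)|(DISCOVER|NAK)|REQUEST))')
noncapturing = re.compile(r'(DHCP(?:(?:ACK|DECLINE)|(?:DISCOVER|NAK)|REQUEST))')

line = 'DHCPREQUEST on eth0'

# Every plain parenthesis allocates and fills a capture slot...
print(captured.search(line).groups())
# -> ('DHCPREQUEST', 'REQUEST', None, None)

# ...while (?:...) groups the alternatives without the bookkeeping.
print(noncapturing.search(line).groups())
# -> ('DHCPREQUEST',)
```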

Other Considerations

Splunk has built-in limits to keep a regular expression from consuming too many resources. In poorly written regexes, recursive backtracking can have a very large number of possibilities and thus many states to traverse. Splunk exposes two limits of the PCRE2 engine to prevent runaway regexes from overly impacting the rest of the system. These limits are in props.conf for search, and transforms.conf for ingest.

MATCH_LIMIT: match_limit effectively limits the amount of backtracking the PCRE2 engine can do by limiting the number of times the engine can call its internal match() function. This limit defaults to 100,000 in Splunk, after which an error is returned.

DEPTH_LIMIT: depth_limit limits the depth of the backtracking recursion. Since not all calls to match() are recursive, this is simply another dimension on which to limit runaway recursion. This limit defaults to 1,000 in Splunk.
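As a sketch of how these might look, here is a hypothetical transforms.conf stanza; the stanza name and values are placeholders, so verify the setting names against the transforms.conf spec for your Splunk version before relying on them:

```ini
# transforms.conf -- hypothetical stanza; name and values are placeholders
[my_ingest_regex]
REGEX = ^((?:[A-Z]|[a-z]){3} \d{1,2} \d{1,2}:\d{2}:\d{2})
# Cap the number of calls to the PCRE2 match() function (default 100000)
MATCH_LIMIT = 50000
# Cap the backtracking recursion depth (default 1000)
DEPTH_LIMIT = 500
```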

Summary

Regular expression performance is a complex and important topic. Regex’s tremendous flexibility and power has led to their widespread adoption; for the Splunk customer, regexes can be an essential part of data processing logic in search and ingest.

However, the performance of a regular expression can vary widely, and poorly written regexes against large streams of data could have a considerable impact on the scalability of a system as a whole.

We’ve explained how Splunk utilizes PCRE2 regex, and benchmarked a number of expressions directly against Splunk’s regex engine. With a few best practices around crafting regular expression patterns, we’ve demonstrated principles that can be applied to achieve order-of-magnitude differences in overall regex performance. In short, when applying a regex to many millions of events, knowledge of the regex engine and these best practices can make an appreciable difference to overall performance.
