Building an Effective NiFi Flow — ReplaceText

Mark Payne
Mar 27, 2023

The ReplaceText Processor is fairly well known. It provides the ability to perform a “search and replace” action against text as it flows through the system. Additionally, it can be used to append and/or prepend text to each line of a file or to the entirety of a FlowFile. There are, however, a few tips that can transform the Processor from a helpful tool to a powerhouse.

Beware the “Entire text” Evaluation Mode

I often see the “Evaluation Mode” property set to “Entire text.” This property controls whether the search-and-replace action is performed on each line individually or against the entirety of the text all at once. For the vast majority of use cases, this should be set to “Line-by-Line.” Setting this to “Entire text” means that the Processor may have to load the entire contents of the FlowFile into memory, which is something that we want to avoid.

In particular, each Concurrent Task may load into memory the entirety of a FlowFile. So consider that if the Processor is configured with 8 concurrent tasks, and each concurrent task is evaluating a 1 GB FlowFile, up to 8 GB of NiFi’s heap could be used up by this Processor. This can quickly lead to OutOfMemoryErrors and/or large amounts of garbage collection. Both can cause cluster instability, processing failures, and generally poor performance of the NiFi instance.

The “Maximum Buffer Size” property limits how much each concurrent task can buffer into memory. However, if this is set smaller than the size of a FlowFile, it means that the entire contents of the FlowFile will not be loaded into memory, and as a result, the replacement will only run against the first part of the FlowFile.

The only time that the Evaluation Mode should be set to “Entire text” is if the “Search Value” property includes carriage returns or newlines. Otherwise, stick to the default value of “Line-by-Line.” This is much more efficient and will keep NiFi operating at peak performance.
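
To make the memory difference concrete, here is a rough Python model of the two modes. This is not NiFi code (NiFi is written in Java), and the function names replace_line_by_line and replace_entire_text are made up for illustration. It only shows the general idea: one approach holds a single line at a time, the other buffers everything before matching.

```python
import io
import re


def replace_line_by_line(src, dst, pattern, replacement):
    """Model of Line-by-Line mode: only one line is held in memory at a time."""
    regex = re.compile(pattern)
    for line in src:  # iterating a text stream yields one line at a time
        dst.write(regex.sub(replacement, line))


def replace_entire_text(src, dst, pattern, replacement):
    """Model of Entire text mode: the whole content is buffered before matching."""
    content = src.read()  # memory use grows with the size of the input
    dst.write(re.sub(pattern, replacement, content))


text = "a=1\nb=2\nc=3\n"

out_lines = io.StringIO()
replace_line_by_line(io.StringIO(text), out_lines, r"=", ": ")

out_entire = io.StringIO()
replace_entire_text(io.StringIO(text), out_entire, r"=", ": ")

# Both give the same output when the pattern never spans a line break.
same_result = out_lines.getvalue() == out_entire.getvalue()
```

The two functions only diverge when the pattern itself contains a line break, which is exactly the case where “Entire text” is needed.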

Regex Replace with Capturing Groups

A powerful feature that may not be immediately obvious is the ability to use Regular Expressions with Capturing Groups. Capturing Groups allow you to capture a portion of text that is matched by your Regular Expression.

For example, suppose that the incoming FlowFile has the following text:

Oddly|Delimited#Text_With#Different|Delimiters
Mark|NiFi#Data_Processing#Effective|Flows

Now, suppose we want to change “Delimited#Text_With#Different” to “Filtered#Text_With#Filtered” and change “NiFi#Data_Processing#Effective” to “Filtered#Data_Processing#Filtered”. Essentially, we want to replace the words between the | and # delimiters, and the words between # and | delimiters with the word “Filtered”. But we want to leave the middle alone.

We can accomplish this using Capturing Groups. For example, we could use a Search Value of:

(.+)\|.+#(.+)#.+\|(.+)

And a Replacement Value of:

$1|Filtered#$2#Filtered|$3

This provides the expected result:

Oddly|Filtered#Text_With#Filtered|Delimiters
Mark|Filtered#Data_Processing#Filtered|Flows
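
If you want to try this pattern outside of NiFi, a small Python sketch can reproduce the result. Keep in mind that NiFi uses Java regular expressions, where group references in the replacement are written $1, $2, and so on. Python writes the same references as \g<1>, \g<2>, so the replacement string below is a translation, not what you would type into the Processor.

```python
import re

# Search Value from the example; Java and Python agree on this syntax.
search_value = r"(.+)\|.+#(.+)#.+\|(.+)"

# NiFi's "$1|Filtered#$2#Filtered|$3", written with Python's \g<n> references.
replacement_value = r"\g<1>|Filtered#\g<2>#Filtered|\g<3>"

lines = [
    "Oddly|Delimited#Text_With#Different|Delimiters",
    "Mark|NiFi#Data_Processing#Effective|Flows",
]

# Apply the replacement to each line separately, as Line-by-Line mode would.
result = [re.sub(search_value, replacement_value, line) for line in lines]

for line in result:
    print(line)
# Oddly|Filtered#Text_With#Filtered|Delimiters
# Mark|Filtered#Data_Processing#Filtered|Flows
```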

Capturing Only a Portion of the Line

A common mistake with ReplaceText is that users assume that if they use the “Regex Replace” Replacement Strategy, the Regular Expression must match the entire line. However, it only needs to match some part of the text. So, in the example above, we can simplify our configuration to use a Search Value of:

\|.+#(.+)#.+\|

And our Replacement Value then becomes:

|Filtered#$1#Filtered|

It’s still a bit complex here because the Regular Expressions are matching some odd characters. But this allows us to simplify the expression and ignore the parts that we don’t care about. Those will simply be left unchanged.

While this does make things simpler, it is also much more efficient. Matching patterns such as .* and .+ in a Regular Expression is very expensive, because the regex engine may have to backtrack many times before it finds a match. By leaving out the ones we don’t need, NiFi is able to evaluate the Regular Expression much more quickly, reducing the CPU resources necessary.
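
A quick way to convince yourself that the shorter pattern is equivalent is to run both against the same line. The sketch below does that in Python, again using \g<1> where NiFi would use $1. It only checks that the outputs agree and makes no attempt to measure the speed difference.

```python
import re

line = "Oddly|Delimited#Text_With#Different|Delimiters"

# Full-line pattern from the previous section.
full = re.sub(
    r"(.+)\|.+#(.+)#.+\|(.+)",
    r"\g<1>|Filtered#\g<2>#Filtered|\g<3>",
    line,
)

# Shorter pattern: text outside the match is simply left untouched.
partial = re.sub(r"\|.+#(.+)#.+\|", r"|Filtered#\g<1>#Filtered|", line)

print(partial)  # Oddly|Filtered#Text_With#Filtered|Delimiters
```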

Mixing in Expression Language

Capturing Groups can be useful for capturing a portion of a line and retaining that value while replacing the rest of the line. But we can also combine Capturing Groups with NiFi’s powerful Expression Language. This gives us the ability to perform any type of Expression Language function we want against the text.

Take, for example, the following text as an incoming FlowFile:

The angry wolf huffed and puffed.
Then the wolf blew the house down!

We can then use Expression Language to perform any number of functions against some value that is captured by our Capturing Group. For example, we can change the word “wolf” to uppercase. For this, we’d use a Search Value of simply:

(wolf)

With a replacement value of:

${'$1':toUpper()}

Which yields us:

The angry WOLF huffed and puffed.
Then the WOLF blew the house down!

Do note that in the Search Value we use parentheses around the word “wolf” in order to make it a Capturing Group. And in the Replacement Value, we have to enclose the $1 in single-quotes (or double-quotes, either will work) because it contains the special character $.
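
For readers who think in code, the closest analogy is a regex replacement with a callback: the function receives each match and returns the text to put in its place. Here is a minimal Python sketch of the same transformation. Python’s re.sub callback stands in for the Expression Language call and is not how NiFi implements it.

```python
import re

text = (
    "The angry wolf huffed and puffed.\n"
    "Then the wolf blew the house down!"
)

# The callback receives each match; group(1) plays the role of $1.
result = re.sub(r"(wolf)", lambda m: m.group(1).upper(), text)

print(result)
# The angry WOLF huffed and puffed.
# Then the WOLF blew the house down!
```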

We could get a bit more interesting, as well. Perhaps we want to perform some hashing function. So consider the input text:

name: Mark
password: super-secret
favoriteProject: nifi

If we want to hash the password, we can use a Search Value of:

password: (.*)

And our Replacement Value becomes:

password: ${'$1':hash('SHA-256')}

Which yields our expected result:

name: Mark
password: aec8084845b41a6952d46cbaa1c9b798659487ffd133796d95d05ba45d9096c2
favoriteProject: nifi

Also, note here that because our Regular Expression matches the “password: ” part, we need to include it also in our Replacement Value, or it will be replaced.
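
The same idea can be sketched in Python with hashlib. The helper name sha256_hex is made up for this example, and it assumes the captured text is hashed as UTF-8 bytes and rendered as lowercase hex. That is an assumption about how the hash function encodes its input, so check the Expression Language Guide before relying on matching digests.

```python
import hashlib
import re


def sha256_hex(value: str) -> str:
    """Hex-encoded SHA-256 of the UTF-8 bytes of value (assumed encoding)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


text = "name: Mark\npassword: super-secret\nfavoriteProject: nifi"

# '.' does not match a newline, so (.*) captures only to the end of the line.
# The literal "password: " is part of the match, so the callback re-emits it.
result = re.sub(
    r"password: (.*)",
    lambda m: "password: " + sha256_hex(m.group(1)),
    text,
)

print(result)
```

Note how the callback has to write “password: ” back out, for the same reason the Replacement Value above repeats it.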

This ability to use Expression Language in conjunction with Capturing Groups is extremely powerful and lends itself well to all sorts of transformations.

Surround Replacement Strategy

The last pointer to mention is the addition of the “Surround” Replacement Strategy. This was introduced in version 1.20.0. Prior to this, if you wanted to insert some text at the beginning of the file and at the end, you had two options:

  1. Use a ReplaceText with a Replacement Strategy of Prepend, and a second ReplaceText with a Replacement Strategy of Append. This was inefficient, as it meant copying the content twice.
  2. Use Regex Replace with the Evaluation Mode set to “Entire text” and use “(.*)” as the Search Value, with a value of “<prefix>$1<suffix>” as the Replacement Value. This is even more inefficient, as it buffers the entire text into memory, and this can quickly lead to OutOfMemoryErrors, excessive garbage collection, and cluster instability.

Starting with 1.20.0, we can now use a value of “Surround” for the Replacement Strategy. Choosing this option then provides two additional (required) properties: Text to Prepend and Text to Append. This gives us the ability to easily insert text at the beginning and the end of the file, such as a Header and a Footer.
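
Conceptually, the advantage is that the content is copied once, in bounded chunks, with the prefix written first and the suffix written last. The Python sketch below models that idea. The surround function and its chunk_size parameter are hypothetical and only illustrate the single-pass approach, not NiFi’s actual implementation.

```python
import io


def surround(src, dst, text_to_prepend, text_to_append, chunk_size=8192):
    """Copy src to dst in one pass, writing a prefix first and a suffix last."""
    dst.write(text_to_prepend)
    while True:
        chunk = src.read(chunk_size)  # bounded buffer, never the whole content
        if not chunk:
            break
        dst.write(chunk)
    dst.write(text_to_append)


out = io.StringIO()
surround(io.StringIO("row1\nrow2\n"), out, "HEADER\n", "FOOTER\n")

print(out.getvalue(), end="")
# HEADER
# row1
# row2
# FOOTER
```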

Conclusion

ReplaceText is used pretty heavily, by experienced NiFi users and newbies alike. It’s incredibly powerful for dealing with textual data, but it can also be dangerous, because it can exhaust the heap if you load the entire contents of a FlowFile into memory.

General guidelines for using the processor:

  • Never use an Evaluation Mode of “Entire text” unless you plan to match newline characters in your search expression. And then, ensure that you have a reasonable limit set for the Maximum Buffer Size (the default 1 MB is often very appropriate).
  • Prefer a Replacement Strategy of “Surround” instead of two Processors in order to prepend text and append text. This option is available only in version 1.20.0 and later.
  • Avoid using expensive expressions such as .* in your Regular Expression when you can. Instead, include only the part of the text that you need to match. You don’t have to match the entire line of text.
  • Use $1, $2, $3, etc. to reference Capturing Groups along with Expression Language in order to apply Expression Language to the contents of your FlowFile.
  • When possible, prefer Record-based processors, such as UpdateRecord, over ReplaceText, as they are more efficient and less error-prone than using Regular Expressions.

See Also

Previous: RouteText
