Building an Effective NiFi Flow

Mark Payne
Cloudera
Dec 5, 2022


I sat down with a customer recently. A very large customer who is using Apache NiFi for many use cases. The use case of interest this time was processing massive volumes of firewall data. Every access attempt logged. From many different systems.

The customer already had a mechanism for getting that raw log data into S3 from all of the different systems. Now, the need was to gather the log data from S3 and send any BLOCKED events to a database so that they could easily query the information. They also needed to send any and all firewall events to a second S3 bucket, with the data transformed into Parquet.

Of course, there were some gotchas. There are always some gotchas.

Some of the data had millisecond-precision timestamps. Some had second-precision timestamps. The customer wanted to normalize everything to second precision in order to be consistent. But one of the downstream systems that reads the data expects timestamps in millisecond format. So we needed to normalize the timestamp field: in effect, truncate the value to the second while keeping the millisecond format, so that, say, 08:30:15.123 becomes 08:30:15.000.

Moreover, some of the timestamps were just completely wrong. Something like 12-04-2022 08 or an empty string would creep into the data. It appears to happen because of the intermediate system that pushes the raw firewall logs to S3 in the first place.

The use case did not require automating the correction of the data, but rather detecting any invalid records and pushing those to a separate S3 bucket. The data could be in either the raw syslog form in which it was received, or in JSON format. But it needed to be human readable so that they could manually review the bad data.

Now, when I was approached by the customer, they already had a NiFi dataflow built, and it was working well, fulfilling their requirements.
But they reached out because the data rates that they could achieve paled in comparison to what they needed.

They then told me that over the next six months, they expect the incoming data rate to increase by more than tenfold! They needed this system to process data a LOT faster. Or they needed to know right away that NiFi wasn’t up to the task.

And their cloud spend was already massive. They couldn’t afford to increase that spend by more than 10x in order to match the incoming data rate. They needed something more efficient.

Fortunately, I already knew that NiFi was up to the task. How could I be so confident? The use case is very similar to one that I wrote about a couple of years ago: Processing One Billion Events Per Second with NiFi. In that article, I explored NiFi’s efficiency and scalability as I scaled a NiFi cluster vertically from 4 cores per node to 96 cores per node, with linear scalability. I then scaled out from 1 node to 1,000 nodes and again achieved linear scalability.

Now, the customer was using a lot of nodes. But they weren’t using anywhere near the 1,000 nodes that I demonstrated in that article. That’s okay, though: they also didn’t need 1 billion events per second.

So I knew that NiFi was up to the task. But the dataflow would probably need to be refactored a bit. At the end of the series, I’ll write up a Case Study explaining exactly the flow that we started with and the revised flow that we came up with. And we’ll analyze the performance difference.

But it got me thinking.

I know that NiFi can perform at breakneck speeds. I’ve proven this out. But the flows that users sometimes build just don’t perform well. And I can dig in and understand why they don’t perform well and help users to address those problems.

This raises the question: why aren’t users building dataflows that are as efficient as mine? What information can I provide to ensure that users are able to build their flows just as efficiently? What resources are out there to teach users exactly how to build high-performance flows in the first place?

There’s already a huge amount of documentation in NiFi for most processors — especially the most critical ones. Last year, I published a series of YouTube videos on Apache NiFi Anti-Patterns — things that I see users doing over and over again that lead to dataflows that are slow, overly complex, difficult to understand and maintain, and/or inefficient. The series calls out specific patterns that we see and how to handle the issues more effectively. I’ve received fantastic feedback from these videos, and they seem to have helped many.

While chatting with the customer, I realized that what’s missing from the documentation is not information about how to use a given Processor. Or mistakes to avoid when building your flows. What’s missing is knowing which Processors to use. NiFi includes something like 400+ Processors now. But which are the most important? Is there a list somewhere of the Processors that make up the NiFi A-Team?

It turns out there is. But as far as I know, that list really only exists in the heads of people who’ve been working with NiFi for a long time. We’re really missing any sort of documentation that tells you:

Here are the most important processors. Learn these. Here’s what they are. Here’s what they do. Here’s when to use them and what they can do for you. Here’s HOW to use them.

And so, that’s what I’ve set out to write in this series. I want to expose you to my list of go-to Processors. The Processors that I use day in and day out to build flows that are straightforward to implement, understand, and maintain. Flows that are blazing fast. And especially now, with ever-increasing data rates and soaring cloud costs, performance is important. Because the faster the flow runs, the fewer resources we need, and the less it costs us to run it. And the simpler the flow is, the easier (and cheaper) it is to train others to build and maintain it.

I’m assuming that you, as the reader, already have some familiarity with NiFi. That you know what a Processor and a FlowFile and a Controller Service are. That you know how to connect Processors together via Connections and that the route that a FlowFile takes out of a Processor is referred to as a Relationship. But I don’t expect you to know all the knobs to turn.

In short, I assume that you’re not a NiFi expert, but that you’ve spent more than an hour playing with NiFi and have already built at least a few flows.

This series will introduce you to Processors that allow you to route your data. Processors that transform your data from one format to another — or one schema to another. Processors that perform data validation. Processors that enrich your data, using a CSV file, a database, a web service, or whatever is appropriate for your situation.

Processors like QueryRecord, which can be used to take in a stream of data and fork off many other streams based on whether or not the data matches a given set of SQL queries. This processor can also be used to perform transformations and filtering.
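
To make that concrete, here’s a minimal sketch of how QueryRecord might be configured for the firewall use case above. Each user-defined property that you add becomes an outbound Relationship, and its value is a SQL query evaluated against the FlowFile’s records, which are exposed as a table named FLOWFILE. (The field name action is my own assumption about the log schema, purely for illustration.)

```
# QueryRecord configuration sketch (property name => SQL query).
# Each user-defined property name becomes a Relationship on the processor.
# "action" is a hypothetical field name in the firewall records.

blocked    => SELECT * FROM FLOWFILE WHERE action = 'BLOCKED'
everything => SELECT * FROM FLOWFILE
```

With a configuration along these lines, records matching the first query would be routed to a blocked Relationship (feeding the database), while the second query passes every record through to the Parquet-bound branch.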

UpdateRecord is a powerful processor for easily updating the values of fields in your data, while JoltTransformRecord excels at altering the structure of hierarchical data. PartitionRecord can take in a single FlowFile that consists of many Records and split the data so that only “like data” is kept together (you decide what it means for two Records to be “like data”). That way, the data can be routed and distributed efficiently.
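
To give a quick taste of how these record-oriented processors are configured (we’ll dig into the details later in the series), here’s an illustrative sketch. As before, the field names source and action are assumptions made up for the example:

```
# UpdateRecord sketch: the property name is a RecordPath identifying the
# field to update. With Replacement Value Strategy set to "Literal Value",
# this sets every record's "source" field to the literal string below.
/source => firewall

# PartitionRecord sketch: the property value is a RecordPath. Records whose
# paths evaluate to the same value are grouped into the same outbound
# FlowFile, and that value is added as an attribute named "action".
action => /action
```

With the PartitionRecord configuration above, a FlowFile containing both BLOCKED and ALLOWED events would be split into two outbound FlowFiles, each carrying an action attribute that downstream processors can route on.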

We’ll cover all of these processors and more in this series. And we’ll close out the series by taking a look at that case study.

Oh, and by the way, the end result of refactoring the customer’s flow? A complex and difficult-to-maintain flow became trivial to understand and maintain. And the throughput was increased by about 2,000% — which translated to a reduction in cloud costs of about 95%!

So with that in mind, here begins my attempt at providing you with the information you need to build effective flows. Flows that are efficient, effective, easy to understand, and easy to maintain.

Starting with QueryRecord.

Cheers!
