Building a Real-Time Data Pipeline: A Comprehensive Tutorial on MiNiFi, NiFi, Kafka, and Flink

Tim Spann
Cloudera
May 26, 2023 · 13 min read

Tech: MiNiFi Java Agent, Java, Apache NiFi 1.20, Apache Kafka, Apache Flink, Cloudera SQL Stream Builder, Cloudera Streams Messaging Manager, Cloudera Edge Flow Manager.

Note: Content Seeded with ChatGPT

For future use cases, I’ll use my own LLM safely with my own enterprise data.

https://www.youtube.com/watch?v=WBH9hYDyHKU

To Build and Run Your Own LLM

Introduction

In today’s data-driven world, organizations face the challenge of processing and analyzing vast amounts of data in real time. To address this challenge, a robust and efficient data pipeline is crucial. In this comprehensive video tutorial, we will delve into the integration of MiNiFi, NiFi, Kafka, and Flink, four powerful open-source technologies, to build a real-time data pipeline that enables seamless data ingestion, processing, and analytics.

Understanding the Components

To lay a solid foundation, we will begin by introducing each component of the data pipeline:

  • MiNiFi: Learn about the lightweight counterpart of Apache NiFi, MiNiFi, designed for edge devices and IoT environments. Understand its role in efficient data collection and routing.
  • NiFi: Explore the core functionalities of Apache NiFi, a powerful data integration and flow management tool. Discover its intuitive user interface and the ability to create complex data pipelines with ease.
  • Kafka: Delve into Apache Kafka, a distributed streaming platform that provides fault-tolerant, scalable, and real-time data streaming capabilities. Understand its role as a highly efficient message queue and data streaming platform.
  • Flink: Discover Apache Flink, a fast and reliable stream processing framework. Explore Flink’s ability to process and analyze streaming data with low latency, fault tolerance, and support for event-time processing.

Designing the Real-Time Data Pipeline

Next, we will guide you through the step-by-step process of designing and building a real-time data pipeline using the integrated components:

  • Data Ingestion: Learn how to configure MiNiFi to collect data from various edge devices and route it to NiFi for further processing.
  • Data Routing, Transformation and Enrichment: Understand how to utilize NiFi’s powerful data transformation capabilities to cleanse, enrich, and manipulate the collected data as per your requirements.
  • Data Streaming with Kafka: Explore the integration of NiFi with Kafka, enabling seamless data streaming and ensuring fault tolerance and scalability.
  • Stream Processing with Flink SQL: Dive into Flink SQL and discover how to set up stream processing jobs to analyze and derive insights from the data flowing through the pipeline.

Implementing Real-World Use Cases

To demonstrate the practical application of the data pipeline, we will walk you through real-world use cases that showcase the power and versatility of the integrated technologies:

  • Internet of Things (IoT) Analytics: Explore how the data pipeline can handle streaming sensor data from IoT devices, process it in real time, and extract valuable insights.
  • Social Media Sentiment Analysis: Discover how the pipeline can ingest and analyze social media data streams to extract sentiment analysis insights, enabling businesses to monitor brand reputation and customer sentiment.
  • Fraud Detection: Learn how the data pipeline can identify potential fraudulent activities by analyzing transaction data in real time, minimizing losses for businesses.

Troubleshooting and Best Practices

Building a complex data pipeline requires attention to detail and consideration of best practices. We will address common challenges and provide troubleshooting tips to ensure a smooth implementation:

  • Handling Data Latency: Explore techniques to minimize data latency in the pipeline, ensuring that real-time insights are generated promptly.
  • Scalability and Performance Optimization: Discover strategies to optimize the performance and scalability of the data pipeline, enabling it to handle increasing data volumes.
  • Data Security and Governance: Understand the importance of data security and governance within the pipeline. Learn how to implement security measures and adhere to compliance regulations.

Getting Started

Today’s Data Sources — ADSB Planes and Breakout Garden Sensors

I am ingesting an ADSB feed via a REST service running on my Raspberry Pi 4 that has an ADSB antenna.

I have converted my Python application into a simpler MiNiFi agent flow that reads the REST JSON feed every 30 seconds and sends it to NiFi.
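For reference, here is a minimal sketch of what that original Python poller did before it became a MiNiFi flow: fetch the ADSB JSON every 30 seconds and trim it to the fields used downstream. The endpoint URL is a hypothetical placeholder (dump1090-style receivers commonly expose `/data/aircraft.json`); adjust it to your own receiver.

```python
import json
import time
import urllib.request

# Hypothetical ADSB REST endpoint; the article does not name the exact path.
ADSB_URL = "http://localhost:8080/data/aircraft.json"
POLL_SECONDS = 30  # the article's 30-second interval


def fetch_adsb(url: str = ADSB_URL) -> dict:
    """Fetch the current ADSB snapshot as parsed JSON."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def extract_flights(snapshot: dict) -> list:
    """Keep only the fields the later Flink SQL queries use."""
    wanted = ("hex", "flight", "alt_baro", "alt_geom", "gs")
    return [{k: a.get(k) for k in wanted} for a in snapshot.get("aircraft", [])]


if __name__ == "__main__":
    while True:
        try:
            print(json.dumps(extract_flights(fetch_adsb())))
        except OSError as err:
            print(f"fetch failed: {err}")
        time.sleep(POLL_SECONDS)
```

In the MiNiFi version, InvokeHTTP on a 30-second schedule replaces this loop entirely, which is why the agent flow is simpler than the script.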

Data Source: Raspberry Pi with FlightAware Pro Stick Plus FA-PRO STICK PLUS-1 ADS-B USB Receiver with Built-in Filter

Data Source: Raspberry Pi with Pimoroni Breakout Garden, SGP30 Air Quality Sensor Breakout (TVOC/eCO2), ICP-10125: Ultra-precise Barometric Pressure and Temperature Sensor, SCD41 CO2 Sensor Breakout (Carbon Dioxide / Temperature / Humidity).
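The sensor side works differently: a Python application reads the Breakout Garden sensors and appends JSON records to a file, one per line, which MiNiFi later tails. A sketch of that writer is below, using the field names that appear in the Flink SQL later in this article; the log path and the exact reading logic are assumptions, since the real app talks to the SGP30, ICP-10125, and SCD41 libraries.

```python
import json
from datetime import datetime, timezone

# Hypothetical path tailed by MiNiFi's TailFile processor.
SENSOR_LOG = "/home/pi/sensors.jsonl"


def sensor_record(co2, equivalentco2ppm, pressure, temperature,
                  temperatureicp, totalvocppb, cputempf) -> dict:
    """Assemble one reading using the field names the Flink SQL selects."""
    return {
        "co2": co2,
        "equivalentco2ppm": equivalentco2ppm,
        "pressure": pressure,
        "temperature": temperature,
        "temperatureicp": temperatureicp,
        "totalvocppb": totalvocppb,
        "cputempf": cputempf,
        "datetimestamp": datetime.now(timezone.utc).isoformat(),
    }


def append_record(record: dict, path: str = SENSOR_LOG) -> str:
    """Write one JSON object per line so TailFile emits one record per line."""
    line = json.dumps(record)
    with open(path, "a") as f:
        f.write(line + "\n")
    return line
```

Writing one complete JSON object per line matters: TailFile hands each new line to the flow as-is, so a pretty-printed multi-line object would arrive as fragments.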

Step 1 — Install MiNiFi 1.21.x Agent (Java Edition)

On the Raspberry Pi 4, install Java 8 (https://sdkman.io/), then install MiNiFi 1.21.x: https://www.apache.org/dyn/closer.lua?path=/nifi/1.21.0/minifi-1.21.0-bin.zip

Step 2 — Connect to EFM Server

On a server or Data Hub, install Cloudera Edge Flow Manager (or you can design MiNiFi flows with NiFi and use the NiFi-to-MiNiFi converter).

Step 3 — Design Edge Flow in EFM Design

It is similar to Cloudera DataFlow Designer: you drag over Processors and connect them. For my flow, I am using InvokeHTTP to collect JSON from the local REST endpoint of my ADSB website every 30 seconds.

I am also using TailFile on a file populated one line at a time with JSON records by a Python application reading the sensors attached to the Pi.

Both flows are annotated via UpdateAttribute to include a User-Agent that identifies their datatype (we could use the schema name or another identifier).

Then both record streams are sent to my local NiFi server via InvokeHTTP. I could also use a Remote Process Group, a Kafka producer, an MQTT producer, or another TCP/IP or UDP protocol to communicate with NiFi, but InvokeHTTP is the easiest and most portable option.
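The edge-to-NiFi handoff described above can be sketched in plain Python: POST each JSON record with a User-Agent header carrying the datatype, so the universal receiver in NiFi can route on it. The NiFi listener URL here is a hypothetical placeholder (a ListenHTTP-style endpoint); the real flow uses whatever host, port, and path the receiving processor is configured with.

```python
import json
import urllib.request

# Hypothetical NiFi HTTP listener endpoint; adjust host/port/path to your flow.
NIFI_URL = "http://nifi1:9999/contentListener"


def tagged_headers(datatype: str) -> dict:
    """Build headers whose User-Agent carries the datatype,
    mirroring what UpdateAttribute sets in the edge flow."""
    return {"Content-Type": "application/json", "User-Agent": datatype}


def send_to_nifi(record: dict, datatype: str, url: str = NIFI_URL) -> int:
    """POST one JSON record to NiFi; the receiving flow routes on User-Agent."""
    req = urllib.request.Request(
        url,
        data=json.dumps(record).encode("utf-8"),
        headers=tagged_headers(datatype),
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

For example, the plane feed would call `send_to_nifi(record, "adsb")` and the sensor feed `send_to_nifi(record, "pisensor")`, and the NiFi side can route on those two User-Agent values without inspecting the payload.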

See: https://github.com/tspannhw/FLaNK-Edge/tree/main/flows

Download Edge Flow Manager (CEM/EFM) Flow

curl -v --output flow.json http://nifi1:10090/efm/api/designer/rpi4thermal/flows/export

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Trying 192.168.1.157:10090...
* Connected to nifi1 (192.168.1.157) port 10090 (#0)
> GET /efm/api/designer/rpi4thermal/flows/export HTTP/1.1
> Host: nifi1:10090
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Sun, 28 May 2023 01:38:33 GMT
< Set-Cookie: XSRF-TOKEN=dd9e041c-149b-43ea-89bc-d9f466b71fd8; Path=/efm
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Content-Type: application/json
< X-Content-Type-Options: nosniff
< X-XSS-Protection: 1; mode=block
< X-Frame-Options: DENY
< Vary: Accept-Encoding, User-Agent
< Transfer-Encoding: chunked
<
{ [14128 bytes data]
100 1080k    0 1080k    0     0  15.6M      0 --:--:-- --:--:-- --:--:-- 16.7M
* Connection #0 to host nifi1 left intact

Step 4 — Edge Flow Deployment

Step 5 — Receive Data in Apache NiFi

Universal REST and MiNiFi Receiver in NiFi
Parsing ADSB Plane Data 1
ADSB Plane Data Processing to Kafka

Sensors

Clean Sensor Data and Send to Kafka

Step 6 — Schema for Data Contract and Quality
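To give the pipeline a data contract, the records are described by a schema in Schema Registry. Below is a sketch of what an Avro schema for the pisensor records might look like, using the field names queried later in Flink SQL; the namespace, types, and nullability are assumptions for illustration, not the article's actual registered schema.

```json
{
  "type": "record",
  "name": "pisensor",
  "namespace": "dev.datainmotion",
  "fields": [
    {"name": "co2", "type": ["null", "double"], "default": null},
    {"name": "cputempf", "type": ["null", "double"], "default": null},
    {"name": "equivalentco2ppm", "type": ["null", "double"], "default": null},
    {"name": "pressure", "type": ["null", "double"], "default": null},
    {"name": "temperature", "type": ["null", "double"], "default": null},
    {"name": "temperatureicp", "type": ["null", "double"], "default": null},
    {"name": "totalvocppb", "type": ["null", "double"], "default": null},
    {"name": "datetimestamp", "type": ["null", "string"], "default": null}
  ]
}
```

Making every field a nullable union with a null default keeps the contract tolerant of a sensor that fails to report on a given cycle, while still letting downstream consumers evolve the schema compatibly.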

Step 7 — Send to Apache Kafka

Step 8 — Query in Flink SQL via SQL Stream Builder

Sensor Data in Flink SQL
ADSB Plane Data in Flink SQL

SQL for Sensor Topic

select pisensor.co2, pisensor.cputempf, pisensor.equivalentco2ppm,
       pisensor.pressure, pisensor.temperature, pisensor.temperatureicp,
       pisensor.totalvocppb, pisensor.datetimestamp
from `schema_reg`.`default_database`.`pisensor`

SQL For ADSB Topic

select max(alt_baro) as MaxAltitudeFeet, min(alt_baro) as MinAltitudeFeet, avg(alt_baro) as AvgAltitudeFeet,
       max(alt_geom) as MaxGAltitudeFeet, min(alt_geom) as MinGAltitudeFeet, avg(alt_geom) as AvgGAltitudeFeet,
       max(gs) as MaxGroundSpeed, min(gs) as MinGroundSpeed, avg(gs) as AvgGroundSpeed,
       count(alt_baro) as RowCount,
       hex as ICAO, flight as IDENT
from `schema_reg`.`default_database`.`adsb`
group by flight, hex

Conclusion

Hopefully, you have gained a deep understanding of how to design, build, and implement a real-time data pipeline using MiNiFi, NiFi, Kafka, and Flink. Equipped with this knowledge, you can harness the power of these integrated technologies to process, analyze, and derive valuable insights from your data in real time.

With our universal listener in NiFi, we can add as many MiNiFi flows and agents as we like and process their records as they arrive.

References

Longer Reads

Events

https://www.linkedin.com/posts/cloudera-partners_llm-opensource-llms-activity-7064751460844015616-gF45?

https://web.cvent.com/event/7598f981-2f7e-4915-b662-bd7be9b5f48d/summary?RefId=homepage_impact24

https://www.cloudera.com/about/events/cloudera-now-cdp.html

May 30: https://tanzu.vmware.com/developer/tv/golden-path/35/

June 14: 12PM EDT Cloudera Now — Virtual

June 26–28, 2023: NLIT Summit. Milwaukee.

June 28, 2023: NiFi Meetup. Milwaukee and Hybrid.

July 19, 2023: 2-Hours to Data Innovation: Data Flow

October 18, 2023: 2-Hours to Data Innovation: Data Flow

Cloudera Events

More Events

