Building a Real-Time Data Pipeline: A Comprehensive Tutorial on MiNiFi, NiFi, Kafka, and Flink
Tech: MiNiFi Java Agent, Java, Apache NiFi 1.20, Apache Kafka, Apache Flink, Cloudera SQL Stream Builder, Cloudera Streams Messaging Manager, Cloudera Edge Flow Manager.
Note: Content Seeded with ChatGPT
For future use cases, I’ll use my own LLM safely with my own enterprise data.
To Build and Run Your Own LLM: https://www.youtube.com/watch?v=WBH9hYDyHKU
Introduction
In today’s data-driven world, organizations face the challenge of processing and analyzing vast amounts of data in real-time. To address this challenge, a robust and efficient data pipeline is crucial. In this comprehensive video tutorial, we will delve into the integration of MiNiFi, NiFi, Kafka, and Flink, four powerful open-source technologies, to build a real-time data pipeline that enables seamless data ingestion, processing, and analytics.
Understanding the Components
To lay a solid foundation, we will begin by introducing each component of the data pipeline:
- MiNiFi: Learn about the lightweight counterpart of Apache NiFi, MiNiFi, designed for edge devices and IoT environments. Understand its role in efficient data collection and routing.
- NiFi: Explore the core functionalities of Apache NiFi, a powerful data integration and flow management tool. Discover its intuitive user interface and the ability to create complex data pipelines with ease.
- Kafka: Delve into Apache Kafka, a distributed streaming platform that provides fault-tolerant, scalable, and real-time data streaming capabilities. Understand its role as a highly efficient message queue and data streaming platform.
- Flink: Discover Apache Flink, a fast and reliable stream processing framework. Explore Flink’s ability to process and analyze streaming data with low latency, fault tolerance, and support for event-time processing.
Designing the Real-Time Data Pipeline
Next, we will guide you through the step-by-step process of designing and building a real-time data pipeline using the integrated components:
- Data Ingestion: Learn how to configure MiNiFi to collect data from various edge devices and route it to NiFi for further processing.
- Data Routing, Transformation and Enrichment: Understand how to utilize NiFi’s powerful data transformation capabilities to cleanse, enrich, and manipulate the collected data as per your requirements.
- Data Streaming with Kafka: Explore the integration of NiFi with Kafka, enabling seamless data streaming and ensuring fault tolerance and scalability.
- Stream Processing with Flink SQL: Dive into Flink SQL and discover how to set up stream processing jobs to analyze and derive insights from the data flowing through the pipeline.
Implementing Real-World Use Cases
To demonstrate the practical application of the data pipeline, we will walk you through real-world use cases that showcase the power and versatility of the integrated technologies:
- Internet of Things (IoT) Analytics: Explore how the data pipeline can handle real-time sensor data from IoT devices, process it in real-time, and extract valuable insights.
- Social Media Sentiment Analysis: Discover how the pipeline can ingest and analyze social media data streams to extract sentiment analysis insights, enabling businesses to monitor brand reputation and customer sentiment.
- Fraud Detection: Learn how the data pipeline can identify potential fraudulent activities by analyzing transaction data in real-time, minimizing losses for businesses.
Troubleshooting and Best Practices
Building a complex data pipeline requires attention to detail and consideration of best practices. We will address common challenges and provide troubleshooting tips to ensure a smooth implementation:
- Handling Data Latency: Explore techniques to minimize data latency in the pipeline, ensuring that real-time insights are generated promptly.
- Scalability and Performance Optimization: Discover strategies to optimize the performance and scalability of the data pipeline, enabling it to handle increasing data volumes.
- Data Security and Governance: Understand the importance of data security and governance within the pipeline. Learn how to implement security measures and adhere to compliance regulations.
Getting Started
Today’s Data Sources — ADSB Planes and Breakout Garden Sensors
I am ingesting an ADSB feed via a REST service running on my Raspberry Pi 4 that has an ADSB antenna.
I have converted my Python application into a simpler MiNiFi agent flow that reads the REST JSON feed every 30 seconds and sends it to NiFi.
Data Source: Raspberry Pi with FlightAware Pro Stick Plus FA-PRO STICK PLUS-1 ADS-B USB Receiver with Built-in Filter
Data Source: Raspberry Pi with Pimoroni Breakout Garden, SGP30 Air Quality Sensor Breakout (TVOC/eCO2), ICP-10125: Ultra-precise Barometric Pressure and Temperature Sensor, SCD41 CO2 Sensor Breakout (Carbon Dioxide / Temperature / Humidity).
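Before moving the ingest into a MiNiFi agent, the polling logic can be sketched in plain Python. The endpoint URL and payload shape below are assumptions modeled on a typical dump1090-style aircraft.json feed; the field names (hex, flight, alt_baro, gs) match the columns queried later in Flink SQL.

```python
import json
import time
import urllib.request

# Hypothetical local REST endpoint exposed by the ADSB receiver on the Pi;
# adjust the host and path to match your own ADSB web service.
ADSB_URL = "http://raspberrypi:8080/data/aircraft.json"

def parse_adsb(payload: str) -> list:
    """Extract the per-aircraft records from an ADSB JSON payload."""
    doc = json.loads(payload)
    return doc.get("aircraft", [])

def poll_forever(url: str = ADSB_URL, interval_s: int = 30):
    """Poll the REST feed every 30 seconds, mirroring the MiNiFi InvokeHTTP schedule."""
    while True:
        with urllib.request.urlopen(url, timeout=10) as resp:
            for plane in parse_adsb(resp.read().decode("utf-8")):
                print(plane.get("hex"), plane.get("flight"), plane.get("alt_baro"))
        time.sleep(interval_s)

# Example payload shaped like a dump1090 response:
sample = '{"aircraft": [{"hex": "a1b2c3", "flight": "UAL123", "alt_baro": 32000, "gs": 410.5}]}'
planes = parse_adsb(sample)
```

In the finished pipeline this loop is replaced by MiNiFi's InvokeHTTP processor on a 30-second schedule, which handles retries and back-pressure for you.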
Step 1 — Install MiNiFi 1.21.x Agent (Java Edition)
On the Raspberry Pi 4, install Java 8 (https://sdkman.io/), then install the MiNiFi 1.21.x Java agent: https://www.apache.org/dyn/closer.lua?path=/nifi/1.21.0/minifi-1.21.0-bin.zip
Step 2 — Connect to EFM Server
On a server or Data Hub, install Cloudera Edge Flow Manager (or you can design MiNiFi flows in NiFi and use the NiFi-to-MiNiFi converter).
Step 3 — Design Edge Flow in EFM Design
It is similar to Cloudera DataFlow Designer: you drag over processors and connect them. For my flow, I am using InvokeHTTP to collect JSON from the local REST endpoint of my ADSB web service every 30 seconds.
I also use TailFile on a file that is populated one line at a time with JSON records by a Python application reading the sensors attached to the Pi.
Both streams are annotated via UpdateAttribute with a user-agent that identifies their datatype (we could use a schema name or another identifier).
Both record streams are then sent to my local NiFi server via InvokeHTTP. I could also use a Remote Process Group, Kafka produce, MQTT produce, or another TCP/IP or UDP protocol to communicate with NiFi; InvokeHTTP is the easiest and most portable.
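The TailFile, UpdateAttribute, and InvokeHTTP steps above can be sketched in Python to make the edge-flow logic concrete. The NiFi listener URL and sensor log path are assumptions; in the real flow MiNiFi handles offsets, retries, and queuing.

```python
import json
from pathlib import Path
from urllib import request

# Hypothetical NiFi ListenHTTP endpoint and sensor log path; both are assumptions.
NIFI_URL = "http://nifi1:9999/contentListener"
SENSOR_LOG = Path("/opt/demo/sensors.jsonl")

def datatype_headers(datatype: str) -> dict:
    """Mirror the UpdateAttribute step: tag each record with a User-Agent
    that identifies its datatype so NiFi can route on it."""
    return {"Content-Type": "application/json", "User-Agent": datatype}

def tail_lines(path: Path, already_read: int = 0):
    """Return lines appended since the last offset (a much-simplified TailFile)."""
    lines = path.read_text().splitlines()
    return lines[already_read:], len(lines)

def send_to_nifi(record: dict, datatype: str):
    """POST one JSON record to NiFi, as InvokeHTTP does in the edge flow."""
    req = request.Request(NIFI_URL, data=json.dumps(record).encode("utf-8"),
                          headers=datatype_headers(datatype), method="POST")
    return request.urlopen(req, timeout=10)
```

Routing on the User-Agent attribute keeps a single NiFi listener able to fan out many different edge datatypes without separate ports per source.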
See: https://github.com/tspannhw/FLaNK-Edge/tree/main/flows
Download Edge Flow Manager (CEM/EFM) Flow
curl -v --output flow.json http://nifi1:10090/efm/api/designer/rpi4thermal/flows/export
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
*   Trying 192.168.1.157:10090...
* Connected to nifi1 (192.168.1.157) port 10090 (#0)
> GET /efm/api/designer/rpi4thermal/flows/export HTTP/1.1
> Host: nifi1:10090
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Sun, 28 May 2023 01:38:33 GMT
< Set-Cookie: XSRF-TOKEN=dd9e041c-149b-43ea-89bc-d9f466b71fd8; Path=/efm
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Content-Type: application/json
< X-Content-Type-Options: nosniff
< X-XSS-Protection: 1; mode=block
< X-Frame-Options: DENY
< Vary: Accept-Encoding, User-Agent
< Transfer-Encoding: chunked
<
{ [14128 bytes data]
100 1080k    0 1080k    0     0  15.6M      0 --:--:-- --:--:-- --:--:-- 16.7M
* Connection #0 to host nifi1 left intact
Step 4 — Edge Flow Deployment
Step 5 — Receive Data in Apache NiFi
Sensors
Step 6 — Schema for Data Contract and Quality
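The data contract for the sensor records can be pinned down as a schema. Below is an Avro-style sketch, with field names taken from the columns queried in the Flink SQL later in this post; the exact schema registered in Cloudera Schema Registry may differ in types and metadata.

```python
# Avro-style schema sketch for the Pi sensor records; field names follow the
# columns queried in Flink SQL (co2, cputempf, ...). Types are assumptions.
PISENSOR_SCHEMA = {
    "type": "record",
    "name": "pisensor",
    "fields": [
        {"name": "co2", "type": "double"},
        {"name": "cputempf", "type": "double"},
        {"name": "equivalentco2ppm", "type": "double"},
        {"name": "pressure", "type": "double"},
        {"name": "temperature", "type": "double"},
        {"name": "temperatureicp", "type": "double"},
        {"name": "totalvocppb", "type": "double"},
        {"name": "datetimestamp", "type": "string"},
    ],
}

def validate(record: dict, schema: dict) -> bool:
    """Cheap data-quality gate: every schema field must be present in the record."""
    return all(f["name"] in record for f in schema["fields"])
```

In the real pipeline this check is enforced by record readers and writers in NiFi against the registered schema rather than by hand-rolled code.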
Step 7 — Send to Apache Kafka
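In the pipeline, NiFi's Kafka processors do the producing; as a sketch of what happens per record, here is how one sensor reading would be serialized into a Kafka key/value pair. The topic name, key field, and the commented kafka-python usage are assumptions.

```python
import json

def to_kafka_record(record: dict, key_field: str = "datetimestamp"):
    """Serialize one sensor reading into the (key, value) byte pair handed to
    a Kafka producer; in the post, NiFi's publish processor does this work."""
    key = str(record.get(key_field, "")).encode("utf-8")
    value = json.dumps(record, sort_keys=True).encode("utf-8")
    return key, value

# With the kafka-python package installed, producing would look like:
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="kafkabroker:9092")
# k, v = to_kafka_record({"co2": 412.0, "datetimestamp": "2023-05-28T01:38:33"})
# producer.send("pisensor", key=k, value=v)
```

Keying on a timestamp spreads records across partitions; keying on a device ID would instead keep each device's readings ordered within one partition.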
Step 8 — Query in Flink SQL via SQL Stream Builder
SQL for Sensor Topic
select pisensor.co2, pisensor.cputempf, pisensor.equivalentco2ppm,
       pisensor.pressure, pisensor.temperature, pisensor.temperatureicp,
       pisensor.totalvocppb, pisensor.datetimestamp
from `schema_reg`.`default_database`.`pisensor`
SQL For ADSB Topic
select max(alt_baro) as MaxAltitudeFeet, min(alt_baro) as MinAltitudeFeet, avg(alt_baro) as AvgAltitudeFeet,
       max(alt_geom) as MaxGAltitudeFeet, min(alt_geom) as MinGAltitudeFeet, avg(alt_geom) as AvgGAltitudeFeet,
       max(gs) as MaxGroundSpeed, min(gs) as MinGroundSpeed, avg(gs) as AvgGroundSpeed,
       count(alt_baro) as RowCount,
       hex as ICAO, flight as IDENT
from `schema_reg`.`default_database`.`adsb`
group by flight, hex
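To sanity-check what the grouped ADSB query computes before deploying it in SQL Stream Builder, the same aggregation can be reproduced over a handful of sample rows in plain Python. The sample rows below are hypothetical; the aggregate names match the SQL aliases above.

```python
from collections import defaultdict

def aggregate_adsb(rows):
    """Group ADSB rows by (hex, flight) and compute the same min/max/avg
    aggregates as the Flink SQL query."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["hex"], r["flight"])].append(r)
    out = {}
    for key, rs in groups.items():
        alts = [r["alt_baro"] for r in rs]
        speeds = [r["gs"] for r in rs]
        out[key] = {
            "MaxAltitudeFeet": max(alts),
            "MinAltitudeFeet": min(alts),
            "AvgAltitudeFeet": sum(alts) / len(alts),
            "MaxGroundSpeed": max(speeds),
            "RowCount": len(rs),
        }
    return out

# Hypothetical sample rows for two readings from one aircraft:
rows = [
    {"hex": "a1b2c3", "flight": "UAL123", "alt_baro": 30000, "gs": 400.0},
    {"hex": "a1b2c3", "flight": "UAL123", "alt_baro": 32000, "gs": 420.0},
]
```

Note that Flink runs this as a continuously updating streaming aggregate over the Kafka topic, whereas this sketch computes it once over a static batch.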
Conclusion
By now you should have a solid understanding of how to design, build, and implement a real-time data pipeline using MiNiFi, NiFi, Kafka, and Flink. Equipped with this knowledge, you can harness the power of these integrated technologies to process, analyze, and derive valuable insights from your data in real time.
With our universal listener in NiFi, we can add as many MiNiFi flows, agents or more and process them as they arrive.
Events
https://web.cvent.com/event/7598f981-2f7e-4915-b662-bd7be9b5f48d/summary?RefId=homepage_impact24
https://www.cloudera.com/about/events/cloudera-now-cdp.html
May 30: https://tanzu.vmware.com/developer/tv/golden-path/35/
June 14: 12PM EDT Cloudera Now — Virtual
June 26–28, 2023: NLIT Summit. Milwaukee.
June 28, 2023: NiFi Meetup. Milwaukee and Hybrid.
July 19, 2023: 2-Hours to Data Innovation: Data Flow
October 18, 2023: 2-Hours to Data Innovation: Data Flow