Building a Real-Time Data Pipeline: A Comprehensive Tutorial on MiNiFi, NiFi, Kafka, and Flink

Tim Spann
Cloudera
May 26, 2023 · 13 min read

Tech: MiNiFi Java Agent, Java, Apache NiFi 1.20, Apache Kafka, Apache Flink, Cloudera SQL Stream Builder, Cloudera Streams Messaging Manager, Cloudera Edge Flow Manager.

Note: Content Seeded with ChatGPT

For future use cases, I’ll use my own LLM safely with my own enterprise data.

https://www.youtube.com/watch?v=WBH9hYDyHKU

To Build and Run Your Own LLM

Introduction

In today’s data-driven world, organizations face the challenge of processing and analyzing vast amounts of data in real time. To address this challenge, a robust and efficient data pipeline is crucial. In this comprehensive video tutorial, we will delve into the integration of MiNiFi, NiFi, Kafka, and Flink, four powerful open-source technologies, to build a real-time data pipeline that enables seamless data ingestion, processing, and analytics.

Understanding the Components

To lay a solid foundation, we will begin by introducing each component of the data pipeline:

  • MiNiFi: Learn about the lightweight counterpart of Apache NiFi, MiNiFi, designed for edge devices and IoT environments. Understand its role in efficient data collection and routing.
  • NiFi: Explore the core functionalities of Apache NiFi, a powerful data integration and flow management tool. Discover its intuitive user interface and the ability to create complex data pipelines with ease.
  • Kafka: Delve into Apache Kafka, a distributed streaming platform that provides fault-tolerant, scalable, and real-time data streaming capabilities. Understand its role as a highly efficient message queue and data streaming platform.
  • Flink: Discover Apache Flink, a fast and reliable stream processing framework. Explore Flink’s ability to process and analyze streaming data with low latency, fault tolerance, and support for event-time processing.

Designing the Real-Time Data Pipeline

Next, we will guide you through the step-by-step process of designing and building a real-time data pipeline using the integrated components:

  • Data Ingestion: Learn how to configure MiNiFi to collect data from various edge devices and route it to NiFi for further processing.
  • Data Routing, Transformation and Enrichment: Understand how to utilize NiFi’s powerful data transformation capabilities to cleanse, enrich, and manipulate the collected data as per your requirements.
  • Data Streaming with Kafka: Explore the integration of NiFi with Kafka, enabling seamless data streaming and ensuring fault tolerance and scalability.
  • Stream Processing with Flink SQL: Dive into Flink SQL and discover how to set up stream processing jobs to analyze and derive insights from the data flowing through the pipeline.

Implementing Real-World Use Cases

To demonstrate the practical application of the data pipeline, we will walk you through real-world use cases that showcase the power and versatility of the integrated technologies:

  • Internet of Things (IoT) Analytics: Explore how the data pipeline can handle streaming sensor data from IoT devices, process it in real time, and extract valuable insights.
  • Social Media Sentiment Analysis: Discover how the pipeline can ingest and analyze social media data streams to extract sentiment analysis insights, enabling businesses to monitor brand reputation and customer sentiment.
  • Fraud Detection: Learn how the data pipeline can identify potential fraudulent activities by analyzing transaction data in real time, minimizing losses for businesses.

Troubleshooting and Best Practices

Building a complex data pipeline requires attention to detail and consideration of best practices. We will address common challenges and provide troubleshooting tips to ensure a smooth implementation:

  • Handling Data Latency: Explore techniques to minimize data latency in the pipeline, ensuring that real-time insights are generated promptly.
  • Scalability and Performance Optimization: Discover strategies to optimize the performance and scalability of the data pipeline, enabling it to handle increasing data volumes.
  • Data Security and Governance: Understand the importance of data security and governance within the pipeline. Learn how to implement security measures and adhere to compliance regulations.

Getting Started

Today’s Data Sources — ADSB Planes and Breakout Garden Sensors

I am ingesting an ADSB feed via a REST service running on my Raspberry Pi 4 that has an ADSB antenna.

I have converted my Python application into a simpler MiNiFi agent flow that reads the REST JSON feed every 30 seconds and sends it to NiFi.
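For reference, here is a minimal sketch of what that original Python poller did before it became a MiNiFi flow: fetch the ADSB JSON every 30 seconds and trim it to the fields used downstream. The endpoint URL is a hypothetical placeholder (dump1090-style receivers commonly expose `/data/aircraft.json`); adjust it to your own receiver.

```python
import json
import time
import urllib.request

# Hypothetical ADSB REST endpoint; the article does not name the exact path.
ADSB_URL = "http://localhost:8080/data/aircraft.json"
POLL_SECONDS = 30  # the article's 30-second interval


def fetch_adsb(url: str = ADSB_URL) -> dict:
    """Fetch the current ADSB snapshot as parsed JSON."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def extract_flights(snapshot: dict) -> list:
    """Keep only the fields the later Flink SQL queries use."""
    wanted = ("hex", "flight", "alt_baro", "alt_geom", "gs")
    return [{k: a.get(k) for k in wanted} for a in snapshot.get("aircraft", [])]


if __name__ == "__main__":
    while True:
        try:
            print(json.dumps(extract_flights(fetch_adsb())))
        except OSError as err:
            print(f"fetch failed: {err}")
        time.sleep(POLL_SECONDS)
```

In the MiNiFi version, InvokeHTTP on a 30-second schedule replaces this loop entirely, which is why the agent flow is simpler than the script.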

Data Source: Raspberry Pi with FlightAware Pro Stick Plus FA-PRO STICK PLUS-1 ADS-B USB Receiver with Built-in Filter

Data Source: Raspberry Pi with Pimoroni Breakout Garden, SGP30 Air Quality Sensor Breakout (TVOC/eCO2), ICP-10125: Ultra-precise Barometric Pressure and Temperature Sensor, SCD41 CO2 Sensor Breakout (Carbon Dioxide / Temperature / Humidity).
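The sensor side works differently: a Python application reads the Breakout Garden sensors and appends JSON records to a file, one per line, which MiNiFi later tails. A sketch of that writer is below, using the field names that appear in the Flink SQL later in this article; the log path and the exact reading logic are assumptions, since the real app talks to the SGP30, ICP-10125, and SCD41 libraries.

```python
import json
from datetime import datetime, timezone

# Hypothetical path tailed by MiNiFi's TailFile processor.
SENSOR_LOG = "/home/pi/sensors.jsonl"


def sensor_record(co2, equivalentco2ppm, pressure, temperature,
                  temperatureicp, totalvocppb, cputempf) -> dict:
    """Assemble one reading using the field names the Flink SQL selects."""
    return {
        "co2": co2,
        "equivalentco2ppm": equivalentco2ppm,
        "pressure": pressure,
        "temperature": temperature,
        "temperatureicp": temperatureicp,
        "totalvocppb": totalvocppb,
        "cputempf": cputempf,
        "datetimestamp": datetime.now(timezone.utc).isoformat(),
    }


def append_record(record: dict, path: str = SENSOR_LOG) -> str:
    """Write one JSON object per line so TailFile emits one record per line."""
    line = json.dumps(record)
    with open(path, "a") as f:
        f.write(line + "\n")
    return line
```

Writing one complete JSON object per line matters: TailFile hands each new line to the flow as-is, so a pretty-printed multi-line object would arrive as fragments.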

Step 1 — Install MiNiFi 1.21.x Agent (Java Edition)

On the Raspberry Pi 4, install Java 8 (https://sdkman.io/), then install MiNiFi 1.21.x: https://www.apache.org/dyn/closer.lua?path=/nifi/1.21.0/minifi-1.21.0-bin.zip

Step 2 — Connect to EFM Server

On a server or Data Hub, install Cloudera Edge Flow Manager (or you can design MiNiFi flows with NiFi and use the NiFi-to-MiNiFi converter).

Step 3 — Design Edge Flow in EFM Design

It is similar to Cloudera DataFlow Designer: you drag over Processors and connect them. For my flow, I am using InvokeHTTP to collect JSON from the local REST endpoint of my ADSB website every 30 seconds.

I am also using TailFile on a file populated one line at a time with JSON records by a Python application reading the sensors attached to the Pi.

Both flows are annotated via UpdateAttribute to include a User-Agent that identifies their datatype (we could use the schema name or another identifier).

Then both record streams are sent to my local NiFi server via InvokeHTTP. I could also use a Remote Process Group, a Kafka producer, an MQTT producer, or another TCP/IP or UDP protocol to communicate with NiFi, but InvokeHTTP is the easiest and most portable option.
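The edge-to-NiFi handoff described above can be sketched in plain Python: POST each JSON record with a User-Agent header carrying the datatype, so the universal receiver in NiFi can route on it. The NiFi listener URL here is a hypothetical placeholder (a ListenHTTP-style endpoint); the real flow uses whatever host, port, and path the receiving processor is configured with.

```python
import json
import urllib.request

# Hypothetical NiFi HTTP listener endpoint; adjust host/port/path to your flow.
NIFI_URL = "http://nifi1:9999/contentListener"


def tagged_headers(datatype: str) -> dict:
    """Build headers whose User-Agent carries the datatype,
    mirroring what UpdateAttribute sets in the edge flow."""
    return {"Content-Type": "application/json", "User-Agent": datatype}


def send_to_nifi(record: dict, datatype: str, url: str = NIFI_URL) -> int:
    """POST one JSON record to NiFi; the receiving flow routes on User-Agent."""
    req = urllib.request.Request(
        url,
        data=json.dumps(record).encode("utf-8"),
        headers=tagged_headers(datatype),
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

For example, the plane feed would call `send_to_nifi(record, "adsb")` and the sensor feed `send_to_nifi(record, "pisensor")`, and the NiFi side can route on those two User-Agent values without inspecting the payload.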

See: https://github.com/tspannhw/FLaNK-Edge/tree/main/flows

Download Edge Flow Manager (CEM/EFM) Flow

curl -v --output flow.json http://nifi1:10090/efm/api/designer/rpi4thermal/flows/export

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Trying 192.168.1.157:10090...
* Connected to nifi1 (192.168.1.157) port 10090 (#0)
> GET /efm/api/designer/rpi4thermal/flows/export HTTP/1.1
> Host: nifi1:10090
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Sun, 28 May 2023 01:38:33 GMT
< Set-Cookie: XSRF-TOKEN=dd9e041c-149b-43ea-89bc-d9f466b71fd8; Path=/efm
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Content-Type: application/json
< X-Content-Type-Options: nosniff
< X-XSS-Protection: 1; mode=block
< X-Frame-Options: DENY
< Vary: Accept-Encoding, User-Agent
< Transfer-Encoding: chunked
<
{ [14128 bytes data]
100 1080k    0 1080k    0     0  15.6M      0 --:--:-- --:--:-- --:--:-- 16.7M
* Connection #0 to host nifi1 left intact

Step 4 — Edge Flow Deployment

Step 5 — Receive Data in Apache NiFi

Universal REST and MiNiFi Receiver in NiFi
Parsing ADSB Plane Data 1
ADSB Plane Data Processing to Kafka

Sensors

Clean Sensor Data and Send to Kafka

Step 6 — Schema for Data Contract and Quality
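To give the pipeline a data contract, the records are described by a schema in Schema Registry. Below is a sketch of what an Avro schema for the pisensor records might look like, using the field names queried later in Flink SQL; the namespace, types, and nullability are assumptions for illustration, not the article's actual registered schema.

```json
{
  "type": "record",
  "name": "pisensor",
  "namespace": "dev.datainmotion",
  "fields": [
    {"name": "co2", "type": ["null", "double"], "default": null},
    {"name": "cputempf", "type": ["null", "double"], "default": null},
    {"name": "equivalentco2ppm", "type": ["null", "double"], "default": null},
    {"name": "pressure", "type": ["null", "double"], "default": null},
    {"name": "temperature", "type": ["null", "double"], "default": null},
    {"name": "temperatureicp", "type": ["null", "double"], "default": null},
    {"name": "totalvocppb", "type": ["null", "double"], "default": null},
    {"name": "datetimestamp", "type": ["null", "string"], "default": null}
  ]
}
```

Making every field a nullable union with a null default keeps the contract tolerant of a sensor that fails to report on a given cycle, while still letting downstream consumers evolve the schema compatibly.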

Step 7 — Send to Apache Kafka

Step 8 — Query in Flink SQL via SQL Stream Builder

Sensor Data in Flink SQL
ADSB Plane Data in Flink SQL

SQL for Sensor Topic

select pisensor.co2, pisensor.cputempf, pisensor.equivalentco2ppm,
       pisensor.pressure, pisensor.temperature, pisensor.temperatureicp,
       pisensor.totalvocppb, pisensor.datetimestamp
from `schema_reg`.`default_database`.`pisensor`

SQL For ADSB Topic

select max(alt_baro) as MaxAltitudeFeet, min(alt_baro) as MinAltitudeFeet, avg(alt_baro) as AvgAltitudeFeet,
       max(alt_geom) as MaxGAltitudeFeet, min(alt_geom) as MinGAltitudeFeet, avg(alt_geom) as AvgGAltitudeFeet,
       max(gs) as MaxGroundSpeed, min(gs) as MinGroundSpeed, avg(gs) as AvgGroundSpeed,
       count(alt_baro) as RowCount,
       hex as ICAO, flight as IDENT
from `schema_reg`.`default_database`.`adsb`
group by flight, hex

Conclusion

Hopefully, you have gained a deep understanding of how to design, build, and implement a real-time data pipeline using MiNiFi, NiFi, Kafka, and Flink. Equipped with this knowledge, you can harness the power of these integrated technologies to process, analyze, and derive valuable insights from your data in real time.

With our universal listener in NiFi, we can add as many MiNiFi flows and agents as we like and process their records as they arrive.

References

Longer Reads

Events

https://www.linkedin.com/posts/cloudera-partners_llm-opensource-llms-activity-7064751460844015616-gF45?

https://web.cvent.com/event/7598f981-2f7e-4915-b662-bd7be9b5f48d/summary?RefId=homepage_impact24

https://www.cloudera.com/about/events/cloudera-now-cdp.html

May 30: https://tanzu.vmware.com/developer/tv/golden-path/35/

June 14: 12PM EDT Cloudera Now — Virtual

June 26–28, 2023: NLIT Summit. Milwaukee.

June 28, 2023: NiFi Meetup. Milwaukee and Hybrid.

July 19, 2023: 2-Hours to Data Innovation: Data Flow

October 18, 2023: 2-Hours to Data Innovation: Data Flow

Cloudera Events

More Events

