Open Source Data Infrastructure Meetup — Feb 2024

Tim Spann
Published in Cloudera
6 min read, Feb 9, 2024

Apache NiFi, Apache Kafka, Apache Flink, PostgreSQL, Python, GTFS

It was an amazing event with a great talk, A Guide to Product Experimentation (Erin Mikail Staples, LaunchDarkly). This talk has made me want to improve my slide formatting and interactivity. When the video comes out, definitely check it out.

Building Real-time Pipelines: A Case Study with Transit Data (Tim Spann)
In this session, we will explore the powerful combination of Apache Flink, Apache NiFi, and Apache Kafka for building real-time data processing pipelines. We will present a case study using the FLaNK-Transit projects, which leverage these technologies to process and analyze real-time data from the New York City Metropolitan Transportation Authority (MTA) and other transit systems like Halifax and Sao Paulo. By integrating Flink, NiFi, and Kafka, the FLaNK-Transit projects demonstrate how to efficiently collect, transform, and analyze high-volume data streams, enabling timely insights and decision-making.

My talk followed, using Apache NiFi and PostgreSQL to ingest, enrich, transform, route and store data from various transit systems in real time.

We use my Python GTFS code to process the feeds.

I ran the demo on Apache NiFi 2.0.0-M2 on my Mac.

A number of people asked about getting data from databases when they change. Here are a few examples and some deeper context.
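One common way to pick up changed rows is incremental polling against a high-water mark: remember the largest value seen in an always-increasing column and ask only for rows beyond it. NiFi's QueryDatabaseTable processor works along these lines with its maximum-value columns. The sketch below is not the demo's flow. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the table `vehicle_positions`, its columns, and the function names are all hypothetical.

```python
import sqlite3


def fetch_changes(conn, last_seen):
    """Return rows updated after last_seen, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, route_id, updated_at FROM vehicle_positions "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (last_seen,),
    ).fetchall()
    # Advance the mark only when something new arrived.
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark


def demo():
    """Poll twice against an in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vehicle_positions ("
        "id INTEGER PRIMARY KEY, route_id TEXT, updated_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO vehicle_positions VALUES (?, ?, ?)",
        [(1, "M15", 100), (2, "B44", 105)],
    )
    first, mark = fetch_changes(conn, 0)      # initial load: both rows
    conn.execute("INSERT INTO vehicle_positions VALUES (3, 'Q58', 110)")
    second, mark = fetch_changes(conn, mark)  # only the row added since
    conn.close()
    return first, second, mark


if __name__ == "__main__":
    print(demo())
```

Polling like this only sees inserts and updates that bump the tracked column. It misses deletes, and a strict `>` comparison can skip a row that arrives later with the same timestamp as the mark. Log-based change data capture avoids those gaps at the cost of more setup.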

For real-time local travel, we need to look at things like:

We also need to look at travel advisories on where we should or should not travel.

We need to check out planes, airports and the weather. Sometimes these require your own sensors, SDR and antennas.

In the demo, I showed Halifax:

I also showed Sao Paulo.

Then I showed grabbing all the public GTFS feeds available that don’t require a login.

These all use the GTFS Realtime format, which is based on Protocol Buffers.

General Transit Feed Specification
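Because GTFS Realtime feeds are Protocol Buffers messages, they need to be decoded before a tool like NiFi can route them as records. Here is a minimal sketch of that step, assuming the `gtfs-realtime-bindings` package is installed. It is an illustration rather than the code used in the demo, and the function names and output field names are my own choices.

```python
import urllib.request


def fetch_feed(url, timeout=10):
    """Download a GTFS Realtime feed and parse it into a FeedMessage."""
    # Imported lazily so vehicle_records() below works without the package.
    # Assumes: pip install gtfs-realtime-bindings
    from google.transit import gtfs_realtime_pb2

    feed = gtfs_realtime_pb2.FeedMessage()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        feed.ParseFromString(response.read())
    return feed


def vehicle_records(feed):
    """Yield one flat dict per vehicle position found in the feed."""
    for entity in feed.entity:
        # An entity may carry a trip update or an alert instead of a vehicle.
        if not entity.HasField("vehicle"):
            continue
        vehicle = entity.vehicle
        yield {
            "entity_id": entity.id,
            "trip_id": vehicle.trip.trip_id,
            "route_id": vehicle.trip.route_id,
            "latitude": vehicle.position.latitude,
            "longitude": vehicle.position.longitude,
            "timestamp": vehicle.timestamp,
        }
```

Calling `list(vehicle_records(fetch_feed(url)))` with the URL of any public vehicle-positions feed gives a list of dicts that can be serialized to JSON and handed to the rest of the pipeline.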

We did the local MTA for buses.

Sometimes the sky turns brown and we wonder whether we should go outside. We will use real-time analytics to check with sensors and feeds.

If I missed anything, check this out:

MEETUP RESOURCES

MEETUP PHOTOS

Tim Spann
Cloudera

Principal Developer Advocate, Zilliz. Milvus, Attu, Towhee, GenAI, Big Data, IoT, Deep Learning, Streaming, Machine Learning. https://www.datainmotion.dev/