Future of Data Streaming — A Look at 2024

Tim Spann
Dec 21, 2023

What does the near future hold in store? NiFi — Kafka — Flink — More…

In 2023, we saw Apache Flink drive acquisitions and strategy, and move beyond just the Uber-Internet-scale giants. Flink is backed or hosted by Cloudera, Confluent, Ververica, Decodable, Amazon, Aiven, Azure, Alibaba and more.

Keywords to remember for the future: Apache Flink, RisingWave, Streaming Data Lake Platform, Apache Paimon, Apache Kafka, Apache NiFi, Clickhouse+, Apache Iceberg, LLM, Generative AI

After seeing how well DuckDB is doing, it is probably time to look at small, single-node streaming. Apache NiFi and MiNiFi do that extremely well. I am not sure a full-sized Kafka or Flink fits as well. This may need some tweaking and innovation, perhaps RedPanda or a single Docker instance.

I made predictions for 2023 about Pulsar, Apache Flink SQL as a Service, NiFi as a Service, Pinot, ML Streaming, Streaming Edge Devices and MQTT. All have been rising slowly and are still growing heading into 2024. Flink SQL as a Service was everywhere and is still growing.

RisingWave

With RisingWave being open source and spreading the word about the streaming database, this category, along with both open source RisingWave and cloud-hosted RisingWave, will continue to expand and be everywhere. There are few reasons not to have this as a stop on your streaming pipeline. Also, the RisingWave team are some really cool people and this technology looks very cool.

Apache NiFi Beyond ELT

In 2024, NiFi will be in version 2, running on JDK 21+ and supporting Python 3 components. NiFi continues to improve, add stateless features and run everywhere. The newest versions are game changing for Generative AI and Open Data Lakehouse pipelines. Drag and drop is the only way to quickly build streaming pipelines, sometimes thousands of them a day. NiFi will be the way to do streaming machine learning, deep learning and generative AI. Get your LLMs (Large Language Models) and vector databases (Milvus, Pinecone, …) ready for a ton of data.

Streaming Data Lake Platform

It is time to kick the tires with Apache Paimon. This is an interesting idea. For now, NiFi + Kafka + Flink + Iceberg is a hard combination to beat.

My friend, Giannis Polyzos, is doing good work promoting this interesting technology.

Kafka Compatible

More and more Kafka-compatible systems are here, and even more are coming. The Kafka protocol is the key, but how and where data is stored will vary, from standard disk to object storage to distributed datastores, with more variants on the way. I expect to see more variety in the languages used to build Kafka-compatible messaging.

Never underestimate Apache Pulsar here as performance and features continue to grow at a pace much faster than Apache Kafka.

Clickhouse +

Clickhouse seems to be at the end of a lot of Kafka topics and there are interesting enhancements going on. I am not sure what’s next, but there is potential. I have seen a number of organizations move from Druid to Clickhouse. This is one to keep an eye on.

Apache Pinot

If you weren’t clicking in the house, you were sipping the Pinot at tremendous scale. I think the interesting question for a lot of these technologies is at what size Clickhouse, Pinot and others start to become superstars. It’s the small and mid-size use cases that Snowflake loves that tend to slip through the cracks.

I did a few meetups with the StarTree people behind Pinot and they are pretty awesome. Some of my favorite people in technology work at StarTree spreading the good tech that is Pinot.

Open Data Lakehouse

Apache Iceberg, Apache Hudi and Delta Lake are driving more data to open and massive storage. This storage is needed to augment, train and feed generative AI and machine learning in general. This will expand as more and more streaming pipelines add Lakehouses as an important stop.

The Open Source Data Lakehouse, built on open table formats that allow for data evolution and time travel, will continue to grow in 2024. This is the best way to build a general data lake or data warehouse now. Augment it with Kafka, Flink, Spark, NiFi, Kudu, Hive, Impala, HBase, Presto, Pinot and Clickhouse and you are set.

Edge

This may seem like a broken record, but edge and IoT are important. From containers and devices to industrial systems and automated logs, data streaming from beyond the cloud and beyond the data center is here and will continue to grow.

From a simple sensor that pushes messages to MQTT to complex MiNiFi agents that run local machine learning and process data at the edge, there are a lot of options for streaming data out of these devices. A simple one is to have a Java, Rust, Go or Python app that streams sensor readings over Kafka to a local edge gateway server, or to a cloud or data center Kafka cluster.
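As a rough illustration of that last option, here is a minimal Python sketch of an edge app that serializes sensor readings and hands them to a Kafka-style producer. The topic name, device id and JSON field names are all hypothetical, and the transport is passed in as a plain `send` function so the sketch runs without a broker.

```python
import json
import time
from typing import Callable, Iterable, Tuple


def encode_reading(device_id: str, sensor: str, value: float, ts: float) -> bytes:
    """Serialize one sensor reading as a compact UTF-8 JSON message."""
    record = {"device_id": device_id, "sensor": sensor, "value": value, "ts": ts}
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def stream_readings(
    readings: Iterable[Tuple[str, float]],
    send: Callable[[str, bytes, bytes], None],
    topic: str = "edge.sensors",        # hypothetical topic name
    device_id: str = "edge-device-1",   # hypothetical device id
    clock: Callable[[], float] = time.time,
) -> int:
    """Send each (sensor, value) pair to `topic`, keyed by device id.

    `send(topic, key, value)` is whatever transport you plug in.
    Returns the number of messages handed to `send`.
    """
    sent = 0
    key = device_id.encode("utf-8")
    for sensor, value in readings:
        payload = encode_reading(device_id, sensor, value, clock())
        send(topic, key, payload)
        sent += 1
    return sent


if __name__ == "__main__":
    # Stand-in transport that just prints; swap in a real Kafka producer.
    def print_send(topic: str, key: bytes, value: bytes) -> None:
        print(topic, key.decode("utf-8"), value.decode("utf-8"))

    stream_readings([("temperature_c", 21.5), ("humidity_pct", 40.0)], print_send)
```

With the confluent-kafka Python client (an assumption, any Kafka client would do), `send` could wrap `Producer.produce(topic, key=key, value=value)`, with a call to `Producer.flush()` before the app exits. Keying by device id keeps each device's readings in order within a partition.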

LLM / Generative AI

In 2023, we started experimenting with using streaming to feed models, vectorize data to vector databases, build prompts and run classification via REST endpoints. In the coming year, this will expand as tools like Apache NiFi can load text, Google Sheets, PDFs, Word documents, RTFs, Emails, images, videos and more. All data types are on the table. This will get a ton of investment and interesting new models and use cases will start appearing. This is exciting.
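To make the "vectorize, then build prompts" flow a little more concrete, here is a toy Python sketch. It is not how NiFi or any vector database does it: the hashing "embedding" is a stand-in for a real model, and a plain list stands in for a vector database such as Milvus. All function names are made up for illustration.

```python
import hashlib
import math
from typing import List


def chunk_text(text: str, max_words: int = 50) -> List[str]:
    """Split incoming text into chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Toy hashing embedding (NOT a real model): count words into hashed buckets,
    then scale the vector to unit length."""
    vec = [0.0] * dim
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def build_prompt(question: str, chunks: List[str], top_k: int = 2) -> str:
    """Rank chunks by similarity to the question and put the best ones in the prompt."""
    q = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    context = "\n".join(f"- {c}" for c in ranked[:top_k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    docs = chunk_text("Flink processes streams. NiFi moves data between systems.", max_words=3)
    print(build_prompt("What moves data?", docs, top_k=1))
```

In a real pipeline, the chunks would arrive as streaming events, the embedding would come from an actual model, and the similarity search would be a query against the vector database rather than a sort over a list.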

This project is super exciting.

Pipelines

More and more streaming pipelines will have Open Lakehouses and Generative AI as parts of their live events. Pipelines are hard and growing, but tools are coming to help. With NiFi improving, it is an obvious choice, but Flink is another. With Nussknacker you can build Flink pipelines much more easily.

Another interesting option for Flink is DataSQRL.

There are also more coming and we will see what emerges as the leader in 2024.

More Conferences Focusing on Streaming

Beyond Pulsar Summit, Current Event (Kafka), Flink Forward and Kafka Summit, more streaming conferences are coming, and streaming tracks will expand at Community Over Code and elsewhere. Streaming plus Generative AI will be interesting in the next five years. I expect to see some interesting meetups and events during 2024. And if you run one of these, invite me to speak!

Azure Generated

Now Google Bard’s Take

Future of Data Streaming — A Look at 2024

In the information age, data is the new oil, and its value lies in its real-time processing and utilization. Data streaming, the continuous ingestion and processing of data in real-time, has emerged as a crucial tool for businesses to gain insights, make informed decisions, and react swiftly to changing conditions. As we approach 2024, several exciting trends are shaping the future of data streaming:

1. Serverless Stream Processing: This trend simplifies building and managing streaming applications by leveraging the power of cloud-based platforms. Developers don’t need to worry about infrastructure management, scaling, or resource allocation, allowing them to focus on building innovative applications.

2. Data Fabric for AI: Data streaming will become a cornerstone of AI infrastructure. Real-time data will be used to train AI models, perform real-time model scoring, and integrate with various AI services like large language models (LLMs) and SaaS offerings. This will enable businesses to make smarter and faster decisions based on real-time insights.

3. Embracing Multi-Cloud: Businesses will utilize multiple cloud platforms to take advantage of the unique strengths and functionalities of each provider. This requires streaming solutions that are cloud-agnostic and offer seamless integration with different cloud environments.

4. Edge Computing: Data will be processed and analyzed closer to its source, at the edge of the network, for faster and more efficient decision-making. This is particularly relevant for applications with low latency requirements, such as industrial automation and autonomous vehicles.

5. Democratization of Data Streaming: Streaming technologies will become more accessible and user-friendly, allowing businesses of all sizes to leverage their power without needing extensive technical expertise. Open-source frameworks like Apache Kafka and Apache Flink will play a key role in this trend.

6. Security and Privacy: As data streaming becomes increasingly prevalent, concerns around security and privacy will become paramount. Streaming solutions will need to be built with robust security measures and adhere to strict data privacy regulations.

7. Interoperability and Standardization: Efforts will be made to standardize streaming technologies and protocols to improve interoperability between different solutions. This will enable businesses to build more flexible and scalable streaming architectures.

8. The Rise of Streaming Analytics: Advanced analytics tools will be developed to extract valuable insights from real-time data streams. These tools will help businesses identify trends, predict future events, and optimize their operations in real-time.

9. Real-time Integration with Business Applications: Streaming data will be integrated with business applications in real-time, enabling businesses to make data-driven decisions across various departments, such as marketing, sales, and customer service.

10. The Rise of Stream Native Applications: Traditional applications will be gradually replaced by stream native applications designed specifically to handle and process data streams. These applications will be more efficient, scalable, and resilient, making them ideal for the demands of modern businesses.

Overall, the future of data streaming is incredibly promising. New technologies, innovative applications, and increased accessibility will drive widespread adoption across various industries. Data streaming will be a key driver of digital transformation, enabling businesses to unlock the full potential of real-time data and achieve competitive advantage in the dynamic landscape of the information age.

Here are some additional trends to watch out for in 2024:

  • The rise of stream processing in the cloud: More businesses will move their streaming workloads to the cloud for scalability, elasticity, and easier management.
  • The growth of real-time data pipelines: Businesses will build more complex data pipelines that capture, process, and analyze data streams in real-time.
  • The emergence of edge-to-cloud streaming solutions: Businesses will leverage the combined power of edge and cloud computing to handle diverse data streaming scenarios.
  • The increasing use of streaming for machine learning: Streaming data will play a crucial role in training and deploying machine learning models for various applications.
  • The growing importance of data governance for streaming: Businesses will implement robust data governance frameworks to ensure data security, privacy, and compliance.

With its transformative power and endless possibilities, data streaming is set to revolutionize the way we collect, process, and analyze data, paving the way for a data-driven future.

Now OpenAI ChatGPT’s Take

Title: The Future of Data Streaming: A Glimpse into 2024

In the fast-paced world of technology, data is the lifeblood that fuels innovation and drives decision-making. As we step into the year 2024, the landscape of data streaming is undergoing a transformation that promises to revolutionize the way we collect, process, and utilize information. In this blog post, we will explore the key trends and advancements shaping the future of data streaming.

1. Real-time Analytics Evolution:

In 2024, real-time analytics will reach new heights, thanks to the continuous evolution of data streaming technologies. Businesses are increasingly recognizing the importance of making decisions based on up-to-the-minute information, and data streaming is at the forefront of enabling this shift. Whether it’s monitoring user behavior on a website, analyzing social media trends, or processing sensor data in real time, the demand for instantaneous insights is driving the development of more efficient and scalable data streaming platforms.

2. Edge Computing Integration:

The integration of edge computing with data streaming is set to become a game-changer in 2024. Edge computing involves processing data closer to the source, reducing latency and enhancing overall system performance. By combining the power of edge computing with data streaming, organizations can make quicker and more informed decisions, particularly in scenarios where low latency is critical, such as autonomous vehicles, industrial IoT, and augmented reality applications.

3. Enhanced Security Measures:

As the volume of data being streamed continues to grow, so does the need for robust security measures. In 2024, we can expect to see advancements in encryption, authentication, and access control within data streaming platforms. Privacy concerns and the increasing sophistication of cyber threats are pushing developers to implement end-to-end encryption and other security protocols to ensure the integrity and confidentiality of streaming data.

4. Convergence of AI and Data Streaming:

Artificial Intelligence (AI) and machine learning algorithms are becoming integral components of data streaming systems. In 2024, we will witness the convergence of AI and data streaming technologies to enhance predictive analytics, anomaly detection, and automated decision-making processes. This synergy will empower organizations to extract valuable insights from streaming data, uncover patterns, and optimize operations in real time.

5. Cross-Platform Compatibility:

The future of data streaming lies in its ability to seamlessly integrate with diverse platforms and technologies. In 2024, we can anticipate the development of more standardized and interoperable data streaming protocols. This will facilitate the exchange of information across different systems, making it easier for organizations to adopt and scale their data streaming solutions without compatibility issues.

6. Sustainability and Green Data Streaming:

Environmental consciousness is influencing every facet of technology, and data streaming is no exception. In 2024, there will be a growing emphasis on developing sustainable and energy-efficient data streaming solutions. Innovations such as low-power devices, optimized algorithms, and eco-friendly data centers will contribute to a greener and more sustainable data streaming ecosystem.

In conclusion, the future of data streaming in 2024 is marked by a convergence of cutting-edge technologies, a heightened focus on security and sustainability, and an increased demand for real-time analytics. As businesses strive to stay ahead in a data-driven world, embracing these trends will be crucial for unlocking the full potential of data streaming and driving innovation across industries.

Tim Spann

Principal Developer Advocate, Zilliz. Milvus, Attu, Towhee, GenAI, Big Data, IoT, Deep Learning, Streaming, Machine Learning. https://www.datainmotion.dev/