Apache Pulsar as One Storage System for Both Real-time and Historical Data Analysis

Yijie Shen
StreamNative
Jul 16, 2019 · 9 min read

This blog post introduces the trend among processing engines toward unifying streaming and batch execution, and explains why existing messaging systems struggle to serve as the unified storage layer such engines need. It then explains why Apache Pulsar is well suited as a one-stop solution. Finally, it uses the Pulsar Spark Connector as an example to show how Apache Pulsar stores all real-time data in a single system and supports data analysis over the full time range.

The state-of-the-art real-time data storage and processing approach

In the field of massively parallel data analysis, AMPLab’s “One stack to rule them all” proposed using Apache Spark as a unified engine to support all commonly used data processing scenarios, such as batch processing, stream processing, interactive queries, and machine learning. Structured Streaming, the Spark API that became generally available in Spark 2.2.0, lets you express computation on streaming data in the same way you express a batch computation on static data, and the Spark SQL engine performs a wide range of optimizations for both scenarios internally.
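As a minimal illustration of this “same API for batch and streaming” idea, the Scala sketch below runs the same aggregation first on static JSON files and then on a streaming JSON source. The paths, schema, and the eventType field are hypothetical placeholders, not taken from the original post.

```scala
import org.apache.spark.sql.SparkSession

object UnifiedApiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("unified-api").getOrCreate()
    import spark.implicits._

    // Batch: read static JSON files and aggregate.
    val staticEvents = spark.read.json("/data/events/2019-07/")
    val batchCounts  = staticEvents.groupBy($"eventType").count()
    batchCounts.show()

    // Streaming: the same transformation, expressed on a streaming source.
    val streamingEvents = spark.readStream
      .schema(staticEvents.schema)          // reuse the schema inferred from the static data
      .json("/data/events/incoming/")
    val streamingCounts = streamingEvents.groupBy($"eventType").count()

    streamingCounts.writeStream
      .outputMode("complete")               // emit the full updated counts on each trigger
      .format("console")
      .start()
      .awaitTermination()
  }
}
```

The transformation (`groupBy(...).count()`) is written once; only the source and sink differ, which is exactly the unification the Spark SQL engine optimizes for.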

Apache Flink, on the other hand, came into the public eye around 2016 with many appealing features, such as better stream processing support at the time, built-in watermark support, and exactly-once semantics. Flink quickly became a strong competitor to Spark. Regardless of the platform, users today are more concerned with how quickly they can discover the value of their data. Streaming data and static data are no longer separate entities; they are two different representations of data across its lifecycle.

A natural idea arises: can we keep all streaming data in the messaging system as it is collected? For traditional systems, the answer is no. Take Apache Kafka as an example: in Kafka, topic storage is partition-based. A topic partition is stored entirely on, and served by, a single broker, so its capacity is limited by the capacity of the smallest node. As data grows, capacity can only be expanded through partition rebalancing, which requires recopying whole partitions to spread both data and traffic onto newly added brokers (a sketch of such a reassignment follows below). Recopying data is expensive and error-prone, and it consumes network bandwidth and I/O. To make matters worse, Kafka is designed to run on physical machines, as we are moving towards…
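For concreteness, here is what such a rebalance looks like in practice: you write a reassignment plan and feed it to Kafka’s kafka-reassign-partitions.sh tool, after which Kafka physically copies each moved partition’s entire log to its new broker before the old replica is retired. The topic name, partition, and broker IDs below are hypothetical.

```
# Hypothetical reassignment plan (reassign.json): move partition 0 of topic
# "events" onto the newly added broker 4.
{"version": 1,
 "partitions": [
   {"topic": "events", "partition": 0, "replicas": [4]}
 ]}

# Execute the plan; Kafka recopies the whole partition log to broker 4
# before dropping the old replica.
$ bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
    --reassignment-json-file reassign.json --execute
```

The cost of this copy grows with the amount of data retained in the partition, which is why keeping all historical data in Kafka makes cluster expansion increasingly painful.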