Introducing Astra Streaming: a Multi-Cloud, Managed Solution for Real-Time Data Streaming
Author: David Dieruf
Apache Pulsar™ provides highly scalable, open-source event streaming capable of keeping up with even the largest fleets of IoT devices. But traditionally, Pulsar requires real developer time and effort to configure and manage. With DataStax Astra Streaming, you no longer have to tinker with all of those configurations yourself.
Apache Pulsar’s superior scalability and resilience have earned it a reputation as a top distributed messaging and streaming platform among enterprises and cloud-native developers today.
When getting started with Pulsar as a developer, you don’t want to fuss with configurations. Instead, you just want Pulsar right away. This is what DataStax Astra Streaming is for: a multi-cloud, fully-managed Pulsar service.
Astra Streaming offers a unified event streaming, queuing, and publisher-subscriber service at any scale, with massive throughput, and low latency. This lets you easily build different streaming applications on any cloud of your choice without cloud vendor lock-in.
In this post, we take a look at what makes Pulsar so powerful, how Astra Streaming simplifies the configuration and management of Pulsar instances, and how it streams critical data in real time.
What is Apache Pulsar?
Apache Pulsar is a distributed, open-source publisher-subscriber (pub-sub) messaging and streaming platform for real-time data workloads with the ability to manage hundreds of billions of events per day. Some of its strengths include:
- Stateless brokers that scale independently of storage
- Guaranteed message delivery
- Stream processing with serverless functions
- Geo-aware replication
- Tiered storage
A Pulsar cluster is a logical grouping of the components that give Pulsar its low-latency, high-throughput capabilities. When multiple Pulsar clusters are combined (usually for geo-replication), they form an overall Pulsar instance.
A typical cluster is made up of the following:
- Broker. This stateless component serves as the endpoint for administrative tasks and topic lookups, and acts as the message dispatcher between producers and consumers. Depending on the message load, a cluster can have one or more broker instances running.
- Apache ZooKeeper. This is the metadata store for the cluster, holding information such as topic details, schema details, and each component’s configuration.
- Apache BookKeeper. Pulsar guarantees message delivery: if a message reaches the broker, it will be delivered to its consumers. This guarantee requires that unacknowledged messages be stored durably until they are delivered. BookKeeper provides this persistence layer, with at least one bookie present in every cluster.
- Proxy. While optional, a proxy is highly recommended in production so that the brokers themselves are never exposed to outside access. Because a cluster can have multiple brokers running at any given time, the proxy acts as a single secure gateway to all of them.
- Function workers. If a topic relies on a function to consume and process its messages, bottlenecks can form quickly. By default, functions run inside the brokers, but production clusters can quickly outgrow that design. Function workers scale independently of the brokers and grow with the cluster.
But if you just want to get going with Pulsar instead of fiddling with all these components, you’ll want to meet Astra Streaming.
What is Astra Streaming?
Astra Streaming is a fully-managed, cloud-native streaming-as-a-service built on Apache Pulsar.
As a managed solution, Astra Streaming eliminates the overhead to install, operate, and scale Pulsar. You can quickly create Pulsar clusters and tenants, scale across cloud regions, and manage Pulsar resources such as topics, connectors, functions, and subscriptions.
Astra Streaming also offers out-of-the-box support and interoperability between Java Message Service (JMS), RabbitMQ, Apache Kafka, and Pulsar in a single platform. This means that if your existing applications rely on these technologies, you can immediately convert them into streaming apps without any code changes.
It’s also the simplest way to build real-time data pipelines. With the built-in capabilities of Pulsar functions, you can quickly write code to perform in-stream processing of your event data. Using Java or Python, you can enrich, filter, and transform message data for efficient and lightweight event stream processing.
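As a sketch of what such in-stream processing looks like, here is the shape of a Python transform. (Real Pulsar Functions receive a `Context` object from the runtime; the logic below is shown as a plain function, and the event field names are purely illustrative.)

```python
import json
from typing import Optional

def enrich(raw: str) -> Optional[str]:
    """Filter out empty events and tag the rest with a processing flag."""
    event = json.loads(raw)
    if not event.get("payload"):   # filter: drop events without a payload
        return None                # in a real function, no return value means no output message
    event["processed"] = True      # enrich: add a field to the event
    return json.dumps(event)       # the result is published to the output topic
```

The same enrich/filter/transform pattern applies whether the function is written in Java or Python.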
Astra Streaming also integrates with DataStax Astra DB, the multi-cloud database-as-a-service built on Apache Cassandra®. Together, they let you capture both data-at-rest and data-in-motion in real time.
Here are some key features of Astra Streaming to take note of:
- Better together with Astra DB. Naturally complementing Astra DB, Astra Streaming allows users to build real-time data pipelines in and out of their Astra DB instances.
- Cloud agnostic. Avoid vendor lock-in and deploy on any major public cloud compatible with Pulsar, such as Amazon Web Services, Google Cloud Platform, or Azure.
- Complex messaging at scale. High-volume queuing, pub-sub messaging, and other complex messaging patterns are all taken care of.
- Compatibility. Astra Streaming interoperates with existing messaging technologies like Apache Kafka and JMS.
Getting started with Astra Streaming
To get started with Astra Streaming, sign up for a free Astra DB account or log into the platform using your GitHub account. Then, create a new tenant. Tenants are Pulsar’s way of logically separating cluster workloads.
Within a tenant, you can set different permission levels to manage namespaces and topics. The first namespace is called “default”. Namespaces are a logical way of grouping a tenant’s topics into categories. A development team could create namespaces to represent different environments, like dev, test, and production, while a data science team might use namespaces to represent different models or experiments.
A namespace is made up of topics, which are the lowest level of message separation; typically, a namespace has multiple topics. You can mark a topic as persistent, meaning its messages are durably stored until they are delivered and acknowledged, or non-persistent, meaning messages are held only in memory and are lost if they can’t be delivered right away.
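This tenant/namespace/topic hierarchy, together with the persistence mode, is encoded in a topic’s fully qualified name. A small helper (the tenant and topic names here are illustrative) shows the format:

```python
def topic_name(tenant: str, namespace: str, topic: str, persistent: bool = True) -> str:
    """Build a fully qualified Pulsar topic name from its parts."""
    mode = "persistent" if persistent else "non-persistent"
    return f"{mode}://{tenant}/{namespace}/{topic}"

# topic_name("my-tenant", "default", "orders")
#   -> "persistent://my-tenant/default/orders"
```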
Topics maintain cursors, which track a subscription’s position so that reading can resume from a given point. Through sources and sinks, topic data can also be queued for further processing or written to an external system. And topics can carry versioned schemas: incoming message data is validated against the schema, guaranteeing data integrity.
Producers write a message containing data to a topic, while consumers subscribe to a particular topic to receive that message. However, the speed at which messages are produced and consumed is typically quite different.
The combination of Pulsar’s guaranteed message delivery and persistent design makes it ideal for high-throughput applications. Whether your clients are written in Java, Python, or Go, Pulsar provides a simple model for processing data in real time.
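A minimal producer/consumer round trip with the `pulsar-client` Python package might look like the following sketch. The service URL, tenant, and topic names are placeholders, and a reachable Pulsar endpoint is required, so the connection logic is kept inside an uncalled `main()`:

```python
TOPIC = "persistent://my-tenant/default/my-topic"  # placeholder names

def main():
    import pulsar  # requires: pip install pulsar-client

    # Placeholder URL; a managed tenant would supply its own
    # service URL and authentication token.
    client = pulsar.Client("pulsar://localhost:6650")

    # The producer writes a message to the topic...
    producer = client.create_producer(TOPIC)
    producer.send("hello there".encode("utf-8"))

    # ...and a subscribed consumer receives and acknowledges it.
    consumer = client.subscribe(TOPIC, subscription_name="my-subscription")
    msg = consumer.receive()
    print(msg.data().decode("utf-8"))
    consumer.acknowledge(msg)

    client.close()
```

Call `main()` against a live endpoint to run the round trip.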
Once your topics are created and a function is loaded, you can navigate to your tenant’s ‘Try Me’ tab to see everything in action.
You can produce messages on one topic and consume them on another, with a function in between augmenting the message data. For example, if you send a message containing “hello there”, it comes back to you with augmented values attached.
Functions are powerful for processing real-time data as they can watch multiple topics for messages as well as chain together to create real-time data pipelines.
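As a toy illustration of chaining (the function names are made up), each stage in a pipeline consumes the previous stage’s output topic; stripped of the messaging layer, the chain is just function composition:

```python
def exclaim(msg: str) -> str:
    """Stage 1: augment the message."""
    return msg + "!"

def shout(msg: str) -> str:
    """Stage 2: transform the augmented message further."""
    return msg.upper()

# In Pulsar, stage 2's input topic would be stage 1's output topic.
def pipeline(msg: str) -> str:
    return shout(exclaim(msg))

# pipeline("hello there") -> "HELLO THERE!"
```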
Speaking of data, a Pulsar cluster is just one stop along data’s journey; it can be created and stored in many different systems. Pulsar provides sources and sinks to fast-track connections between those systems and make the process more efficient.
- Sources. A source tells Pulsar to watch for data in a specific external system and produce it on a given topic. Sources run under Pulsar’s control.
- Sinks. A sink tells Pulsar to watch certain topics and write each message’s data to another system. Sinks essentially feed data from Pulsar into external systems, typically SQL and NoSQL databases.
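Conceptually, a sink definition comes down to a handful of settings: which topics to watch, what kind of external system to write to, and how to connect. The dict below is only a sketch of that shape; the key names and values are illustrative placeholders, not an exact Pulsar or Astra Streaming schema.

```python
# Illustrative shape of a sink definition (keys and values are placeholders).
sink = {
    "tenant": "my-tenant",
    "namespace": "default",
    "name": "orders-db-sink",
    "inputs": ["persistent://my-tenant/default/orders"],  # topics to watch
    "sink_type": "cassandra",          # the external system to write to
    "configs": {"connection": "..."},  # system-specific connection info
}
```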
Astra Streaming makes sources and sinks easy. Just open the corresponding tab, give the source or sink a name, and select its type from the list of supported systems. Then provide the connection information, and Astra Streaming takes care of the rest.
In addition to the native Pulsar port, each tenant you create in Astra Streaming offers dedicated ports for Apache Kafka messaging and for RabbitMQ messaging over AMQP. Each port accepts messages in that platform’s native protocol:
- If your applications are using Kafka, you can simply change their connection information to your Pulsar streaming tenant.
- Since RabbitMQ formats messages with the AMQP protocol, you can point the producing application at your Pulsar streaming tenant’s AMQP port, and messages will flow to the appropriate topics.
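For example, an existing Kafka producer typically needs only its connection settings changed; the client code itself stays the same. The hostname, port, and security setting below are placeholder assumptions, not real Astra Streaming values; consult your tenant’s connection details.

```python
# Existing Kafka client code is unchanged; only the connection settings
# are repointed at the tenant's Kafka-compatible endpoint.
# Hostname and port are placeholders, not real Astra Streaming values.
kafka_settings = {
    "bootstrap.servers": "my-tenant.streaming.example.com:9093",
    "security.protocol": "SASL_SSL",  # assumption: token-based auth over TLS
}
```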
Pulsar’s cloud-native, open-source design makes no distinction between pub-sub messaging and stream processing. In other words, you won’t need to separate different types of workloads when using Astra Streaming; all messages simply flow through at high velocity.
Astra Streaming lets you produce and consume messages in minutes, with durable, production-ready connections for your existing systems, and gives you the real-time streaming data needed to power all your responsive applications.