Distributed Deployment Pattern for WSO2 Stream Processor
WSO2 Stream Processor is a Streaming SQL based, high-performance, lightweight, open-source stream processing platform offered by WSO2. It allows you to collect events, analyze them in real time, identify patterns, map their impacts, and react within milliseconds.
Take a deep dive into WSO2 SP using this link:
https://docs.wso2.com/display/SP400/Stream+Processor+Documentation
There are three WSO2 SP deployment patterns:
- Minimum High Availability Deployment
- Distributed Deployment
- Multi Datacenter High Availability Deployment
In this post I will focus on the architecture of the Distributed Deployment pattern of WSO2 Stream Processor and how a distributed Siddhi application is executed after it gets deployed. Before diving into that, let’s see what a distributed system is.
Distributed systems
A distributed system is one that is split across multiple running machines, all of which work together in a cluster to appear as a single system to the end user. The benefits of this approach are high availability and fault tolerance.
This deployment pattern suits scenarios where a single SP instance cannot manage a high volume of data, as it allows complex event processing to be carried out in a distributed manner; i.e., the high volume of data is distributed among multiple SP instances instead of being accumulated at a single point.
Let’s understand the architecture of the Distributed Deployment.
The following figure gives you an idea of the workflow of this deployment pattern.
Manager cluster
The manager node is the essential component in the distributed cluster, as it functions as the brain of the cluster. The manager cluster contains two or more WSO2 SP instances, where one is the active node and the others are passive nodes. These nodes are configured to run in high availability mode, so that when the active manager node goes down, another node in the cluster is elected as the manager to handle the resource cluster. The manager cluster is responsible for parsing a user-defined distributed Siddhi application, dividing it into multiple Siddhi applications according to the execution groups, creating the required topics, and then deploying the applications in the available resource nodes. The manager cluster also handles resource nodes that join/leave the distributed cluster and re-schedules the Siddhi applications accordingly.
Resource cluster
A resource cluster contains multiple WSO2 SP instances. Each instance sends a periodic heartbeat to the manager cluster so that the manager can identify the active resource nodes. The resource nodes are responsible for running the Siddhi applications assigned to them by the manager nodes. A resource node continues to run its Siddhi applications until a manager node undeploys them, or until it is considered unresponsive. A node is considered inactive when it fails to send heartbeats for a specified amount of time. If a resource node is identified as inactive, the manager node undeploys its Siddhi applications and waits for an available resource node on which to reschedule them.
Deployed Siddhi applications communicate among themselves via Kafka topics.
You can identify which manager node is active and which resource nodes are active through the logs printed in their corresponding terminals.
Kafka cluster
Kafka is a distributed streaming platform. Kafka is distributed in the sense that it stores, receives, and sends messages on different nodes (called brokers). A Kafka cluster stores streams of records in categories called topics. Here, the Kafka cluster holds all the topics used by the distributed Siddhi applications. All communication between execution groups takes place via Kafka, and you can only publish data to and receive data from distributed Siddhi applications via Kafka. To do so, you can either define a Kafka source in the initial distributed Siddhi application or use the Kafka source created by the distributed implementation. Note that installing Kafka and ZooKeeper is a prerequisite for configuring a distributed deployment.
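For instance, to feed your own events into the cluster, you could declare a Kafka source on the input stream of the initial distributed Siddhi application. The following is a minimal sketch using the siddhi-io-kafka extension; the topic name, consumer group, bootstrap server address, and stream attributes are illustrative assumptions.
-- consume TempStream events from an externally managed Kafka topic (illustrative values)
@source(type='kafka', topic.list='temp-readings', group.id='temp-group',
        threading.option='single.thread', bootstrap.servers='localhost:9092',
        @map(type='json'))
define stream TempStream (deviceID long, roomNo int, temp double);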
Take a deep dive into Kafka:
https://kafka.apache.org/intro
Now let’s see an example of a distributed Siddhi application and how it is deployed in this cluster.
Distributed Siddhi applications
In a distributed Siddhi application, an execution group is a single unit of execution. All the executional elements with the same execution group are executed in the same Siddhi application. For each execution group, a specified number of parallel Siddhi application instances are created. This is done via the @dist annotation.
e.g., The following distributed Siddhi application contains two execution groups named group1 and group2. No specific number of parallel instances is specified for group1, so only one instance is created for it at runtime by default, while two parallel instances are specified for group2.
@App:name('wso2-app')

-- input stream definition (attribute types assumed from the queries below)
define stream TempStream (deviceID long, roomNo int, temp double);

@info(name = 'query1')
@dist(execGroup='group1')
from TempStream#window.time(2 min)
select avg(temp) as avgTemp, roomNo, deviceID
insert all events into AvgTempStream;

@info(name = 'query3')
@dist(execGroup='group1')
from every( e1=TempStream ) -> e2=TempStream[e1.roomNo == roomNo and (e1.temp + 5) <= temp] within 10 min
select e1.roomNo, e1.temp as initialTemp, e2.temp as finalTemp
insert into AlertStream;

@info(name = 'query4')
@dist(execGroup='group2', parallel='2')
from TempStream[(roomNo >= 100 and roomNo < 110) and temp > 40]
select roomNo, temp
insert into HighTempStream;
The following is an illustration of how each parallel instance is created as a separate Siddhi application.
The above figure shows how this Siddhi application is executed once it is deployed.
Each Siddhi application is deployed in the available resource nodes of the distributed cluster. All these Siddhi applications communicate with each other using Kafka topics. The system creates Kafka topics representing each stream and configures the Siddhi applications to use these topics as required.
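To make this concrete, the following is a rough sketch of what one of the two generated Siddhi applications for group2 could conceptually look like. The application name, topic names, and source/sink parameters shown here are assumptions for illustration; the manager generates the actual names and configuration at deployment time.
-- hypothetical partial application generated for one parallel instance of group2
@App:name('wso2-app-group2-1')

-- reads TempStream events from a Kafka topic created by the manager (names assumed)
@source(type='kafka', topic.list='wso2-app.TempStream', group.id='wso2-app-group2',
        threading.option='single.thread', bootstrap.servers='localhost:9092',
        @map(type='xml'))
define stream TempStream (deviceID long, roomNo int, temp double);

-- publishes results to a Kafka topic so downstream consumers can read them (names assumed)
@sink(type='kafka', topic='wso2-app.HighTempStream', bootstrap.servers='localhost:9092',
      @map(type='xml'))
define stream HighTempStream (roomNo int, temp double);

@info(name = 'query4')
from TempStream[(roomNo >= 100 and roomNo < 110) and temp > 40]
select roomNo, temp
insert into HighTempStream;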
For detailed information, see Converting to a Distributed Streaming Application.
I hope you now have a clear idea about this deployment pattern. We will see how to configure a distributed cluster in the next post.