Realtime Stream Processing Architectural Solution

Nuwan Kaduruwana
May 31, 2020 · 7 min read


In my previous post, I described the feasibility study on technology selection for a real-time stream processing application. In this post, I will walk through the high-level architecture of the application and some of its non-functional aspects.

As described in my previous post, the proposed application should cater to the following requirements:

  • Capture a large volume of live streaming data from different input sources
  • Process data in real time
  • Support a client-server architecture in which clients can view the live data produced as the output of real-time processing
  • Support historical and live data analysis

1. Logical Component View

The goal is to provide a decomposition of the sub-systems and components shown in the System Architecture Overview diagram.

Logical Component View

1.1 Enterprise Hosting

1.1.1 Message Broker

Apache Kafka is utilized as the Message Broker technology.

Technology Aspects

Apache Kafka is a publish/subscribe message broker which combines queuing with message persistence on a disk-based log file. Think of it as a commit log that is distributed over several systems in a cluster. Messages are organized into topics and partitions, and each topic can support multiple publishers (producers) and multiple subscribers (consumers). Messages are retained by the Kafka cluster in a well-defined manner for each topic:

  • For a specific amount of time
  • For a specific total size of messages in a partition
  • Based on a key in the message, storing only the most recent message per key

Kafka provides reliability, resiliency, and retention, all while performing at high throughput.
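
To make these retention options concrete, here is a small sketch using the Kafka AdminClient. The topic names, partition counts, replication factor, and broker address are illustrative assumptions rather than values from the actual design.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateTopics {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092"); // assumed address
    try (AdminClient admin = AdminClient.create(props)) {
      NewTopic timeBased = new NewTopic("gps-input", 6, (short) 3)
          .configs(Map.of("retention.ms", "604800000"));        // keep messages for 7 days
      NewTopic sizeBased = new NewTopic("ticketing-events", 6, (short) 3)
          .configs(Map.of("retention.bytes", "1073741824"));     // keep ~1 GB per partition
      NewTopic compacted = new NewTopic("gps-routing-rules", 6, (short) 3)
          .configs(Map.of("cleanup.policy", "compact"));         // keep only the latest message per key
      admin.createTopics(List.of(timeBased, sizeBased, compacted)).all().get();
    }
  }
}
```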

Logical Component View — Kafka Message Broker

A topic is a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log that looks like this:

Logical Component View — Kafka Message Broker Partitioning
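
The following is a brief producer sketch; the broker address, topic name, and payload are assumed for illustration. Because Kafka's default partitioner hashes the message key, all messages with the same key land in the same partition, which preserves per-key ordering.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class GpsProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka-broker:9092");
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", StringSerializer.class.getName());
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Key by train id so all readings for one train land in the same partition.
      producer.send(new ProducerRecord<>("gps-input", "train-42", "{\"lat\":6.93,\"lon\":79.85}"));
    }
  }
}
```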

Kafka also provides adaptors for multiple types of consumers, as shown in the diagram below.

Logical Component View — Kafka Message Broker Adaptors
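
As one concrete example of a downstream consumer, here is a minimal sketch using the standard Kafka consumer API; in this architecture, Storm's Kafka spout plays this role, and the group id and topic name below are assumptions.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class GpsConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka-broker:9092");
    props.put("group.id", "gps-dashboard");
    props.put("key.deserializer", StringDeserializer.class.getName());
    props.put("value.deserializer", StringDeserializer.class.getName());
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(List.of("gps-output"));
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          System.out.printf("train=%s position=%s%n", record.key(), record.value());
        }
      }
    }
  }
}
```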

1.1.2 Message Processor

The responsibility of the message processor is to process the data feeds received through Kafka and send the processed data to the respective destinations. Apache Storm has been selected as the real-time message processor.

Technology Aspects

  • Storm is a distributed real-time computation system for processing large volumes of high-velocity data.
  • Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size.
  • There are two kinds of nodes on a Storm cluster: the master node and the worker nodes.
  • The master node runs a daemon called “Nimbus” which is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
  • Each worker node runs a daemon called the “Supervisor”. The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.
Logical Component View — Storm

Apache Storm has connectors to Kafka topics, Hadoop HDFS, and MapReduce jobs. Storm loads all the GPS data routing rules from Hadoop HDFS into an in-memory cache. Once stream data is consumed from Kafka, Storm first publishes the processed data to the destination Kafka topics based on the routing rules. In parallel, it processes the consumed input data together with the analytical data produced by Hadoop MapReduce jobs and publishes the output to the respective destination Kafka topics.
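
A minimal sketch of how such a topology could be wired is shown below. The component names, topic names, and the RoutingBolt are hypothetical placeholders rather than the actual implementation, and the spout relies on the storm-kafka-client integration.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class GpsRoutingTopology {

  // Hypothetical bolt: in the real system it would apply the cached GPS routing rules
  // and publish the result to the appropriate destination Kafka topic.
  public static class RoutingBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
      String record = input.getStringByField("value");   // the Kafka spout emits a "value" field
      // ... look up the routing rule for this record and forward it downstream ...
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      // no downstream Storm streams in this sketch
    }
  }

  public static void main(String[] args) throws Exception {
    KafkaSpoutConfig<String, String> spoutConfig =
        KafkaSpoutConfig.builder("kafka-broker:9092", "gps-input").build();

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("gps-spout", new KafkaSpout<>(spoutConfig), 2);
    builder.setBolt("routing-bolt", new RoutingBolt(), 4).shuffleGrouping("gps-spout");

    Config conf = new Config();
    conf.setNumWorkers(2);   // worker processes spread across the Storm cluster
    StormSubmitter.submitTopology("gps-routing", conf, builder.createTopology());
  }
}
```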

1.1.3 Data Persistence — Hadoop HDFS, HBASE and MapReduce

Hadoop HDFS will be utilized to store historical ticketing information, and MapReduce jobs will help generate train scheduling and future-planning recommendations. Hadoop HBase will be used to store the output of the MapReduce jobs, the GPS routing rules, etc.
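
For illustration, here is a minimal sketch of how a GPS routing rule could be written to and read back from HBase using the standard client API. The table name, column family, and row layout are hypothetical assumptions, not part of the original design.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RoutingRuleStore {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("gps_routing_rules"))) {

      // Store a routing rule: route "route-001" should be published to the "station-alerts" topic.
      Put put = new Put(Bytes.toBytes("route-001"));
      put.addColumn(Bytes.toBytes("rule"), Bytes.toBytes("destination_topic"), Bytes.toBytes("station-alerts"));
      table.put(put);

      // Read the rule back, e.g. when Storm warms up its in-memory cache.
      Result result = table.get(new Get(Bytes.toBytes("route-001")));
      String destination = Bytes.toString(
          result.getValue(Bytes.toBytes("rule"), Bytes.toBytes("destination_topic")));
      System.out.println("route-001 -> " + destination);
    }
  }
}
```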

Technology Aspects

  • HDFS is a fault-tolerant and self-healing distributed filesystem designed to turn a cluster of industry-standard servers into a massively scalable pool of storage.
  • Developed specifically for large-scale data processing workloads where scalability, flexibility, and throughput are critical, HDFS accepts data in any format regardless of schema, optimizes for high-bandwidth streaming, and scales to proven deployments of 100PB and beyond.
  • MapReduce is a powerful batch execution engine. MapReduce is designed to process unlimited amounts of data of any type that's stored in HDFS by dividing workloads into multiple tasks across servers that run in parallel and generate the expected outcomes.
Logical Component View — Hadoop
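
As a simplified illustration of the MapReduce model described above (this is not the actual scheduling-recommendation job; the input format and field positions are assumptions), the following job counts ticket records per station from CSV files stored in HDFS.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TicketCountJob {

  // Map phase: each task reads a split of the ticketing files and emits (stationId, 1).
  public static class TicketMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");   // assumed layout: timestamp,stationId,fare
      context.write(new Text(fields[1]), ONE);
    }
  }

  // Reduce phase: tasks run in parallel per station key and sum the counts.
  public static class TicketReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text station, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(station, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "ticket-count-per-station");
    job.setJarByClass(TicketCountJob.class);
    job.setMapperClass(TicketMapper.class);
    job.setReducerClass(TicketReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. the historical ticketing directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. an output directory for station counts
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```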

1.1.4 Web Server

The proposed solution should support multiple web services in order to communicate with client applications.

Technology Aspects

All web services will be built using the Apache Axis2 web service development framework.

Apache Axis2 is a core engine for web services and one of the most commonly used, mature, and stable web service development frameworks; it supports multiple languages.

Apache Axis2 facilitates following core functionalities and features:

  • Provide a framework to process SOAP messages. The framework should be extensible, and users should be able to extend the SOAP processing on a per-service or per-operation basis. Furthermore, it should be able to model different Message Exchange Patterns (MEPs) using the processing framework.
  • Ability to deploy a Web service (with or without WSDL)
  • Provide a Client API that can be used to invoke Web services. This API should support both the Synchronous and Asynchronous programming models.
  • Ability to configure Axis2 and its components through deployment.
  • Ability to send and receive SOAP messages with different transports.

Apart from the above functionalities, performance in terms of memory and speed is a major consideration for Axis2.
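
As a sketch of the synchronous client programming model, an Axis2-hosted web service could be invoked as shown below. The service name, namespace, operation, and endpoint URL are hypothetical placeholders, not part of the actual design.

```java
import org.apache.axiom.om.OMAbstractFactory;
import org.apache.axiom.om.OMElement;
import org.apache.axiom.om.OMFactory;
import org.apache.axiom.om.OMNamespace;
import org.apache.axis2.addressing.EndpointReference;
import org.apache.axis2.client.Options;
import org.apache.axis2.client.ServiceClient;

public class TrainStatusClient {
  public static void main(String[] args) throws Exception {
    ServiceClient client = new ServiceClient();
    Options options = new Options();
    // Hypothetical endpoint; the real URL depends on where the service is deployed.
    options.setTo(new EndpointReference("http://localhost:8080/axis2/services/TrainStatusService"));
    client.setOptions(options);

    // Build the SOAP payload for a hypothetical getTrainStatus operation.
    OMFactory factory = OMAbstractFactory.getOMFactory();
    OMNamespace ns = factory.createOMNamespace("http://example.org/trainstatus", "ts");
    OMElement request = factory.createOMElement("getTrainStatus", ns);
    OMElement trainId = factory.createOMElement("trainId", ns);
    trainId.setText("train-42");
    request.addChild(trainId);

    // Synchronous (blocking) call; sendReceiveNonBlocking() supports the asynchronous model.
    OMElement response = client.sendReceive(request);
    System.out.println(response);
  }
}
```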

2. Deployment View

This section describes one or more physical network (hardware) configurations on which the proposed solution can be deployed and run. It is a view of the deployment model along with a mapping of the logical components onto the physical nodes.

The following deployment view represents how the solution can be deployed on the Amazon cloud.

Deployment View — Physical

3. Scalability View

3.1 Scalability of Message Broker (Apache Kafka)

Message broker Kafka’s ability to scale is achieved through its clustering technology. It relies on the concept of data partitioning in order to distribute the load across members in a cluster. If you need more data throughput in a particular location, you simply add more nodes to the Kafka cluster. Many cloud providers, like ours, and infrastructure automation frameworks provide the mechanisms to automate the creation and scaling of Kafka clusters.
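As a small illustration (the topic name, broker address, and target partition count are assumptions), the Kafka AdminClient can be used to increase a topic's partition count so that more consumers and brokers share the load.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;
import java.util.Map;
import java.util.Properties;

public class ScaleTopic {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092");
    try (AdminClient admin = AdminClient.create(props)) {
      // Increase the "gps-input" topic to 12 partitions, allowing up to
      // 12 consumers in a consumer group to read in parallel.
      admin.createPartitions(Map.of("gps-input", NewPartitions.increaseTo(12))).all().get();
    }
  }
}
```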

The following diagram portrays how the environment can be scaled-out as the load grows.

Apache Kafka Scalability Overview Diagram

3.2 Scalability of Message Processor (Apache Storm)

Apache Storm provides a highly scalable architecture by utilizing its inherent parallelism. Storm executes real-time data processing jobs called topologies. A Storm topology contains one or more worker processes. A worker process belongs to a specific topology and may run one or more executors. An executor is a thread that is spawned by a worker process.

Storm topologies are inherently parallel and run across a cluster of machines, since a topology consists of many such worker processes running on many machines within a Storm cluster. Different parts of the topology can be scaled individually by tweaking their parallelism.

Using Storm's "rebalance" command, a client can dynamically adjust the parallelism of running topologies on the fly.
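
The following fragment builds on the hypothetical GpsRoutingTopology sketch from section 1.1.2 (so it assumes the spoutConfig and RoutingBolt defined there) and shows where the worker, executor, and task counts are declared; the commented rebalance command shows how the same numbers can be changed at runtime.

```java
// Each part of the topology gets its own parallelism: worker processes per topology,
// executors per component, and tasks per component.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("gps-spout", new KafkaSpout<>(spoutConfig), 4);         // 4 executor threads
builder.setBolt("routing-bolt", new RoutingBolt(), 8).setNumTasks(16)    // 8 executors, 16 tasks
       .shuffleGrouping("gps-spout");

Config conf = new Config();
conf.setNumWorkers(4);   // 4 worker processes (JVMs) spread across the cluster

// At runtime the same numbers can be adjusted without redeploying, e.g.:
//   storm rebalance gps-routing -n 6 -e routing-bolt=12
StormSubmitter.submitTopology("gps-routing", conf, builder.createTopology());
```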

Apache Storm Scalability Overview Diagram

3.3 Scalability of Web Server (Apache Axis2)

The web server utilizes the Apache Axis2 web service framework. Apache Axis2 provides clustering support to add scalability, failover, and high availability to your web services.

In order to maintain the same level of serviceability (QoS) during an increase in load you need the ability to scale. Axis2 provides replication support to scale horizontally. That is, you can deploy the same service in more than one node to share the work load, thereby increasing or maintaining the same level of serviceability (throughput etc).

By deploying the solution on Amazon cloud services, it can utilize Amazon's elastic scaling features. When the load increases, new nodes are automatically assigned, and when the load decreases, those nodes are deallocated, as shown below.

Apache Axis2 Elastic Scale in Amazon Cloud

3.4 Scalability of Data Storage (Apache Hadoop)

Hadoop HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.

Hosting Hadoop on the Amazon cloud provides on-demand scaling by expanding horizontally to accommodate growing data volumes, and it can process unstructured and structured data in the same environment. Today, you can spin up a performance-optimized Hadoop cluster in the AWS cloud within minutes on the latest high-performance computing hardware and network without making a capital investment to purchase the hardware. You have the ability to expand and shrink a running cluster on demand. This means that if you need answers to your questions faster, you can immediately scale up the size of your cluster to crunch the data more quickly. You can analyze and process vast amounts of data by using Hadoop's MapReduce architecture to distribute the computational work across a cluster of virtual servers running in the AWS cloud.

Apache Hadoop Scale in Amazon Cloud
