Introduction to Apache Storm

Sai Ananth Vishwanatha
4 min read · Apr 15, 2020


Apache Storm is a free and open-source distributed realtime computation system. It makes it easy to reliably process unbounded streams of data, it is simple, and it can be used with any programming language.

Apache Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Components of a Storm cluster

A Storm cluster is similar to a Hadoop cluster. Whereas on Hadoop you run MapReduce jobs, on Storm you run “topologies”. There are two kinds of nodes in a Storm cluster: the master node and the worker nodes.

The master node runs a daemon called “Nimbus”, which is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.

Each worker node runs a daemon called the “Supervisor”, which listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process executes a subset of a topology.

All coordination between Nimbus and the Supervisors is done through a ZooKeeper cluster, and both the Nimbus and Supervisor daemons are fail-fast and stateless.

Topologies

To do computation on Storm, we create what are called “topologies”. A topology is a graph of computation: each node in a topology contains processing logic, and the links between nodes indicate how data should be passed around.

To run a topology, package all your code and its dependencies into a single jar, then run the following command:

storm jar all-my-code.jar org.apache.storm.MyTopology arg1 arg2
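The graph-of-computation idea can be illustrated with a toy model in plain Python (this is a conceptual sketch only, not the Storm API; the sentence data and function names are made up for illustration):

```python
# Toy model of a topology: a source node (spout) feeds tuples into
# processing nodes (bolts) along the graph's edges.
# Plain Python illustrating the idea -- not Storm's API.

def sentence_spout():
    # A spout is a source of tuples.
    for sentence in ["the cow", "the dog"]:
        yield sentence

def split_bolt(sentence):
    # A bolt transforms incoming tuples; here one sentence -> many words.
    for word in sentence.split():
        yield word

def count_bolt(word, counts):
    # A downstream bolt aggregating word counts.
    counts[word] = counts.get(word, 0) + 1

# "Wiring" the graph: spout -> split bolt -> count bolt.
counts = {}
for sentence in sentence_spout():
    for word in split_bolt(sentence):
        count_bolt(word, counts)

print(counts)  # {'the': 2, 'cow': 1, 'dog': 1}
```

In a real topology the same wiring is declared once and Storm runs the nodes in parallel across the cluster, rather than in a single loop as above.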

Data model

Storm uses tuples as its data model. A tuple is a named list of values, and a field in a tuple can be an object of any type. Storm supports all the primitive types, strings, and byte arrays as tuple field values.
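A “named list of values” can be pictured as field names paired with values; here is a minimal Python sketch (not Storm’s actual Tuple class, and the field names are invented for illustration):

```python
# A Storm tuple pairs declared field names with values of any type.
# Sketch only: fields could hold primitives, strings, or byte arrays.
fields = ["word", "count", "payload"]
values = ["storm", 42, b"\x00\x01"]  # string, integer, byte array

tup = dict(zip(fields, values))
print(tup["word"])   # storm
print(tup["count"])  # 42
```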

Streams

The stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. Every stream is given an id when declared. OutputFieldsDeclarer has convenience methods for declaring a single stream without specifying an id. In this case, the stream is given the default id of “default”.

Spouts

A spout is a source of streams in a topology. Generally, spouts read tuples from an external source and emit them into the topology.

Spouts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on SpoutOutputCollector.
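The effect of declaring multiple named streams and choosing one at emit time can be sketched with a toy collector in plain Python (this mimics the idea behind declareStream and emit, but is not the SpoutOutputCollector API):

```python
# Toy collector that routes emitted tuples by stream id, mimicking
# declareStream + emit-to-a-stream. Not the Storm API.
class ToyCollector:
    def __init__(self, stream_ids):
        # Streams must be declared up front, like declareStream.
        self.streams = {sid: [] for sid in stream_ids}

    def emit(self, tup, stream_id="default"):
        # As in Storm, emitting without a stream id targets "default".
        self.streams[stream_id].append(tup)

collector = ToyCollector(["default", "errors"])
collector.emit(("event", 1))                        # goes to "default"
collector.emit(("bad record",), stream_id="errors") # goes to "errors"

print(collector.streams["default"])  # [('event', 1)]
print(collector.streams["errors"])   # [('bad record',)]
```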

Bolts

All processing in topologies is done in bolts. Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts.

Bolts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on OutputCollector.

When you declare a bolt’s input streams, you always subscribe to specific streams of another component.

The main method in bolts is the execute method, which takes a new tuple as input. Bolts emit new tuples using the OutputCollector object, and they must call the ack method on the OutputCollector for every tuple they process so that Storm knows when tuples are completed.
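The execute-then-ack contract can be illustrated with a toy bolt in plain Python (a sketch of the idea, not Storm’s OutputCollector API; the ExclamationBolt example is hypothetical):

```python
# Toy illustration of the ack contract: a bolt processes each input
# tuple in execute(), emits results, and acks so the framework knows
# the tuple is complete. Not the real Storm API.
class ToyOutputCollector:
    def __init__(self):
        self.emitted, self.acked = [], []

    def emit(self, tup):
        self.emitted.append(tup)

    def ack(self, tup):
        self.acked.append(tup)

class ExclamationBolt:
    def execute(self, tup, collector):
        collector.emit(tup + "!!!")  # transform and emit a new tuple
        collector.ack(tup)           # signal the input tuple is done

collector = ToyOutputCollector()
bolt = ExclamationBolt()
for word in ["hello", "storm"]:
    bolt.execute(word, collector)

print(collector.emitted)  # ['hello!!!', 'storm!!!']
print(collector.acked)    # ['hello', 'storm']
```

If a bolt failed to ack, a real Storm topology would eventually consider the tuple failed and replay it from the spout, which is how the processing guarantee works.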

What makes a running topology

Storm distinguishes between the following three main entities that are used to actually run a topology in a Storm cluster:

  1. Worker processes
  2. Executors (threads)
  3. Tasks

A worker process executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors.

An executor is a thread that is spawned by a worker process. It may run one or more tasks.

A task performs the actual data processing: each spout or bolt that you implement in your code runs as one or more tasks across the cluster.
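The containment hierarchy above (workers contain executors, executors run tasks) can be sketched numerically; the counts below are illustrative assumptions, not Storm defaults:

```python
# Toy arithmetic for the worker > executor > task hierarchy.
# All numbers are illustrative assumptions.
workers = 2               # worker processes (JVMs) for the topology
executors_per_worker = 3  # threads spawned inside each worker
tasks_per_executor = 2    # tasks run serially within each executor

executors = workers * executors_per_worker
tasks = executors * tasks_per_executor
print(executors, tasks)  # 6 12
```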

The following illustration shows how a simple topology looks in operation. The topology consists of three components: one spout called BlueSpout and two bolts called GreenBolt and YellowBolt. The components are linked such that BlueSpout sends its output to GreenBolt, which in turn sends its own output to YellowBolt.
