Overview of Apache Storm: Architecture & Benefits

upGrad
Aug 28, 2020 · 5 min read

Data is ubiquitous, and with increasing digitization, new challenges in managing and processing that data come up every day.

Having access to real-time data might just seem like a “nice-to-have” feature, but for an organization with significant investments in the digital sphere, it is almost a necessity.

Data that isn't analyzed in time often becomes stale and loses its value. Companies need to analyze their data for patterns they can act on, and those patterns don't have to be deduced over long periods; often, only the relevant data that reflects current, real-time trends needs to be extracted.

Considering the need for, and the returns from, analyzing real-time data, organizations have built various analytics tools. One such tool is Apache Storm.

What is Apache Storm?

Released by Twitter, Apache Storm is a distributed, open-source computation system that processes large volumes of data from various sources. It analyzes the data and pushes the results to a UI or any other designated destination, without storing any of the data itself.

Apache Storm does real-time processing of unbounded streams of data, much as Hadoop does batch processing of bounded data sets.

Originally created by Nathan Marz at BackType, a social analytics company, it was later acquired and open-sourced by Twitter. Written in Java and Clojure, it continues to be a standard for real-time data processing in the industry.

Apache Storm Architecture

1. Nimbus (Master Node)

Nimbus is a daemon, i.e. a program that runs in the background without the control of an interactive user. It plays a role in Apache Storm similar to that of the JobTracker in Hadoop: it distributes code across the cluster, assigns tasks to machines, and monitors their performance and failures.

2. Supervisor Service (Worker Node)

The worker nodes in Storm run a service called the Supervisor. Each Supervisor receives the work that Nimbus assigns to its machine and starts or stops worker processes as required.

Each worker process started by a Supervisor executes a subset of the topology.

3. Topology

A Storm topology is a graph made up of spouts and bolts. Each node in the graph holds a piece of processing logic, and the links between nodes define the paths along which data flows.

Whenever a topology is submitted to Storm, Nimbus consults the Supervisors and assigns its tasks to available worker nodes.
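
To make this concrete, here is a minimal sketch of wiring and submitting a topology, assuming the Apache Storm 2.x Java API and the hypothetical SentenceSpout and SplitSentenceBolt classes sketched under the Spout and Bolts sections below; it is an illustration, not a production setup.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

public class DemoTopology {
    public static void main(String[] args) throws Exception {
        // The topology is a graph: a spout feeding a bolt.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("splitter", new SplitSentenceBolt(), 2)
               .shuffleGrouping("sentences"); // distribute tuples randomly across bolt tasks

        Config conf = new Config();
        conf.setDebug(true);

        // Run in-process for local testing; on a real cluster, Nimbus would
        // hand the same topology to the Supervisors on the worker nodes.
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("demo-topology", conf, builder.createTopology());
            Thread.sleep(10_000); // let it run briefly before shutting down
        }
    }
}
```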

4. Stream

Streams are unbounded sequences of tuples that are created and processed in a parallel, distributed fashion. But what are tuples? They are the main data structure in Storm: named lists of values of any type, such as integers, floats, bytes, byte arrays, and strings.
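
As a small, self-contained illustration of the tuple model (the field names "user" and "score" are made up for this example), Storm's Fields and Values helpers pair named fields with positional values:

```java
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

import java.util.List;

public class TupleSchemaDemo {
    public static void main(String[] args) {
        // The field names a spout or bolt would declare for its output stream.
        Fields schema = new Fields("user", "score");

        // One tuple's values, lining up positionally with the declared names.
        List<Object> values = new Values("alice", 42);

        // Storm resolves a field name to a position in the tuple.
        int idx = schema.fieldIndex("score");
        System.out.println("score = " + values.get(idx)); // prints: score = 42
    }
}
```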

5. Spout

A spout is the entry point for all data flowing into a topology as tuples. It is responsible for connecting to the actual data source, receiving data continuously, transforming it into tuples, and finally emitting them to bolts for processing.
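
Below is a minimal spout sketch, again assuming the Storm 2.x Java API; the hard-coded sentences stand in for a real external source such as a message queue:

```java
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

import java.util.Map;
import java.util.Random;

// Hypothetical spout: in a real topology this would read from a queue, API, or log.
public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] sentences = {
            "storm processes streams", "spouts feed bolts", "bolts hold the logic"};
    private final Random random = new Random();

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        // Called once when the spout task starts; keep a handle to the collector.
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // Storm calls this in a loop; each call emits one tuple downstream.
        collector.emit(new Values(sentences[random.nextInt(sentences.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Name the single field carried by every tuple this spout emits.
        declarer.declare(new Fields("sentence"));
    }
}
```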

6. Bolts

Bolts are at the heart of all the logic processing in Storm: all of the topology's processing is performed inside bolts. They can be used for a variety of operations, including filtering, transformation functions, aggregations, joins, and connecting to databases.
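
And a matching, equally hypothetical bolt that splits each sentence into words, a simple transformation-style bolt; a filtering or aggregating bolt would have the same shape:

```java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt: splits incoming sentences into words and re-emits them.
public class SplitSentenceBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // Read the incoming field by name and emit one tuple per word.
        String sentence = input.getStringByField("sentence");
        for (String word : sentence.split("\\s+")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```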

Why Apache Storm?

The workings of Apache Storm are quite similar to those of Hadoop. Both are distributed frameworks used for processing Big Data, both offer scalability, and both are widely used for business intelligence. So why choose Storm, and what makes it different?

Here are the key reasons to choose Storm:

  • Storm does real-time stream processing, while Hadoop mostly does batch processing.
  • A Storm topology runs continuously until it is explicitly shut down by the user, whereas a Hadoop job runs its stages to completion and then ends (see the sketch after this list).
  • Storm can process large volumes of messages on a cluster within seconds, while a Hadoop MapReduce job over a comparable amount of data typically takes minutes or hours.
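
To illustrate the second point, here is a hedged sketch of submitting the same hypothetical topology to a real cluster rather than running it locally; unlike a MapReduce job, it keeps running until someone explicitly kills it (for example from the Storm UI or with the storm kill command):

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SubmitToCluster {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 2);
        builder.setBolt("splitter", new SplitSentenceBolt(), 4)
               .shuffleGrouping("sentences");

        Config conf = new Config();
        conf.setNumWorkers(3); // spread the executors over three worker processes

        // Nimbus distributes this topology to the Supervisors, and it runs
        // continuously until it is explicitly killed.
        StormSubmitter.submitTopology("live-analytics", conf, builder.createTopology());
    }
}
```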

Organizations that use Apache Storm

Once deployed, Storm is not only easy to operate but is also able to process data in seconds. Considering the ample benefits of Storm, many organizations have put it to use.

1. Twitter

Apache Storm powers a range of functions at Twitter. Storm integrates well with the rest of Twitter's infrastructure, including database systems like Cassandra, the Memcached cache, the Mesos cluster manager, and the messaging, monitoring, and alerting systems.

2. Infochimps

Infochimps uses Storm as the basis for one of its cloud data services, Data Delivery Services. It employs Storm to provide linearly scalable data collection, transport, and complex in-stream processing for its cloud services.

3. Spotify

Spotify is undoubtedly one of the leading platforms for streaming music. With 50 million users around the world and 10 million subscribers, it offers a massive array of real-time features like music recommendations, analytics, and ad creation, and Apache Storm helps Spotify deliver these features accurately.

Storm has also enabled the company to build low-latency, fault-tolerant distributed systems with relative ease.

4. RocketFuel

RocketFuel is a company that harnesses the power of Artificial Intelligence to improve marketing ROI in digital media. The company is building a platform on Storm that tracks impressions, clicks, bid requests, etc. in real time, by cloning the critical workflows of its Hadoop-based ETL pipeline.

5. Flipboard

Flipboard is a one-stop shop for browsing and saving the news that interests you. At Flipboard, Apache Storm is integrated with systems like Hadoop, ElasticSearch, HBase, and HDFS to create a highly scalable platform.

Services like content search, real-time analytics, and custom magazine feeds are all provided with the help of Apache Storm.

6. Wego

Wego is a travel metasearch engine that originated in Singapore. Its data arrives from all over the world at different times. With the help of Storm, Wego searches real-time data, resolves concurrency issues, and delivers the best results to the end-user.

Conclusion

Before Storm was written, real-time data was typically processed with queues and worker threads: some processes would continuously write data to queues while others constantly read and processed it. This approach was not just extremely fragile but also labor-intensive; much of the time went into handling data loss, maintaining the plumbing, and serializing/deserializing messages rather than doing the actual work.

Apache Storm offers a cleaner way: you describe your data sources as spouts, your processing logic as bolts, and the wiring between them as a topology, and Storm takes care of the rest.

Apache Storm is a widely used, open-source stream processing framework for analyzing data in real time. Many organizations are already using it, and some are building better and more helpful software on top of it.

If you are interested in knowing more about Big Data, check out our PG Diploma in Software Development Specialization in Big Data program, which is designed for working professionals and provides 7+ case studies & projects, covers 14 programming languages & tools, includes practical hands-on workshops, and offers more than 400 hours of rigorous learning and job placement assistance with top firms.

This article was originally published on the upGrad blog.
