Storm Topology Design Paradigm: Breaking Down a Topology into Functional Components
If we’re building a racecar, we need to keep performance in mind starting on day one. We can’t refactor our engine to improve it later if it wasn’t built for performance from the ground up.
Here I will show a functional design approach: look at a stream processing problem, break it down into constructs that fit within a Storm topology, and gain performance and scalability by decomposing the work into functional components and optimizing them.
We are working with big data, so every efficiency enhancement we make counts, whether it comes as a performance gain, a productivity gain, or better scalability.
The first thing to discuss is the problem domain. If your knowledge of the problem domain is limited, it may work against you if you try to scale too early. By knowledge of the problem domain, we mean both the nature of the data flowing through your system and the inherent choke points within your operations. It’s always okay to defer scaling concerns until you understand both well. As with building an expert system, once you truly understand the problem domain, you might have to scrap your initial solution and start over.
Therefore, my approach is to break down the operations within a topology into a series of functional components. The benefit of the effort spent designing your components this way is the productivity gain you reap later. Decompose the topology so that each bolt has one specific responsibility and represents a functional whole. Why is this important? Because parallelism is tuned at the bolt level, so whether you are scaling or troubleshooting a problem, you can zoom in and focus your attention on a single component.
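To make the idea of single-responsibility components concrete, here is a minimal sketch in plain Java. It is a model of the idea, not the Storm API: the class FunctionalPipeline, the stage names "parse", "normalize" and "tag", and the string-based tuples are all hypothetical and chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Plain-Java model of a topology, NOT the Storm API. All names are hypothetical.
class FunctionalPipeline {

    // One functional component: a single responsibility and its own parallelism.
    private static final class Stage {
        final Function<String, String> operation;
        int parallelism;

        Stage(Function<String, String> operation, int parallelism) {
            this.operation = operation;
            this.parallelism = parallelism;
        }
    }

    // LinkedHashMap keeps the stages in the order they were added.
    private final Map<String, Stage> stages = new LinkedHashMap<>();

    FunctionalPipeline add(String name, Function<String, String> operation, int parallelism) {
        stages.put(name, new Stage(operation, parallelism));
        return this;
    }

    // Tune a single component without touching any of the others.
    void setParallelism(String name, int parallelism) {
        stages.get(name).parallelism = parallelism;
    }

    int parallelismOf(String name) {
        return stages.get(name).parallelism;
    }

    // Pass one tuple through every stage, in order.
    String process(String tuple) {
        String current = tuple;
        for (Stage stage : stages.values()) {
            current = stage.operation.apply(current);
        }
        return current;
    }

    // Three stages, each with exactly one responsibility.
    static FunctionalPipeline example() {
        return new FunctionalPipeline()
                .add("parse", line -> line.trim(), 1)
                .add("normalize", text -> text.toLowerCase(Locale.ROOT), 1)
                .add("tag", text -> "event:" + text, 1);
    }

    public static void main(String[] args) {
        FunctionalPipeline pipeline = example();
        pipeline.setParallelism("normalize", 4); // scale only the stage under pressure
        System.out.println(pipeline.process("  LOGIN "));
        System.out.println(pipeline.parallelismOf("normalize"));
    }
}
```

Because each stage does one thing, raising the parallelism of "normalize" leaves "parse" and "tag" untouched. In Storm itself, the equivalent knob is the parallelism hint given when a bolt is registered with TopologyBuilder.setBolt.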
Another factor to keep in mind: minimize the number of partitions
We can always start with the simplest functional components and then move toward combining different operations to reduce the number of partitions.
Now that we have factored the bolts into functional components, we can look for bolts that can be collapsed together. Consider the following factors, based on each bolt’s behaviour, when deciding which bolts to join:
a. Whether a bolt interacts with an external entity, because that external entity will dictate how its executors and tasks are allocated.
b. Whether a bolt buffers data in memory and can’t proceed to the next step until a time interval has elapsed.
c. Whether a few responsibilities, such as parsing, extracting, and converting, fit within the responsibility of the spout.
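Collapsing two components can be pictured as composing their operations into one, which removes the hand-off between them. The sketch below is plain Java rather than Storm code, and it assumes a simple linear pipeline in which every boundary between consecutive stages counts as one hand-off. The class CollapseSketch and its example stages are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Plain-Java illustration, NOT the Storm API. All names are hypothetical.
class CollapseSketch {

    // In a linear pipeline, each boundary between two stages is one hand-off.
    static int handOffs(List<Function<String, String>> stages) {
        return Math.max(0, stages.size() - 1);
    }

    // Merge the stage at index i with the stage at index i + 1 into one stage.
    static List<Function<String, String>> collapse(List<Function<String, String>> stages, int i) {
        List<Function<String, String>> result = new ArrayList<>(stages.subList(0, i));
        result.add(stages.get(i).andThen(stages.get(i + 1)));
        result.addAll(stages.subList(i + 2, stages.size()));
        return result;
    }

    // Pass one input through every stage, in order.
    static String run(List<Function<String, String>> stages, String input) {
        String current = input;
        for (Function<String, String> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    // Three simple stages: extract, convert, decorate.
    static List<Function<String, String>> example() {
        List<Function<String, String>> stages = new ArrayList<>();
        stages.add(text -> text.trim());
        stages.add(text -> text.toUpperCase(Locale.ROOT));
        stages.add(text -> text + "!");
        return stages;
    }

    public static void main(String[] args) {
        List<Function<String, String>> before = example();
        List<Function<String, String>> after = collapse(before, 0);
        System.out.println(handOffs(before) + " hand-offs before, " + handOffs(after) + " after");
        System.out.println(run(before, " hello ").equals(run(after, " hello ")));
    }
}
```

The collapsed pipeline produces the same result with one fewer hand-off. The trade-off is that the merged stage can no longer be tuned as two separate components, which is why the factors listed above matter when choosing what to join.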
Scalability factors when creating Storm topologies
If we don’t consider scalability early on and leave it for later, the amount of work needed to refactor or redesign the topology will increase by an order of magnitude. Therefore, keep the following scalability factors in mind:
a. Always examine your data stream and determine the input tuples you are starting with. Then determine the tuples you need to end up with in order to achieve your end goal (the end tuples).
b. Create a series of operations (as bolts) that transform the input tuples into end tuples.
c. Carefully examine each operation to understand its behavior, and scale it (by adjusting its executors/tasks) using educated guesses based on your domain understanding of that behavior.
d. At points of contention where you can no longer scale, rethink your design and refactor the topology into scalable components.
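One way to turn an educated guess into a starting number is a back-of-the-envelope calculation. The sketch below assumes that each executor handles one tuple at a time and that you have a rough estimate of the arrival rate and the per-tuple processing time. This heuristic, the class ParallelismEstimate and its parameter names are illustrative assumptions, not something Storm provides.

```java
// Rough sizing heuristic, NOT part of Storm. All names are hypothetical.
class ParallelismEstimate {

    // Assumes each executor handles one tuple at a time:
    // executors ~ arrival rate (tuples/s) * processing time (s/tuple), rounded up.
    static int executorsNeeded(double tuplesPerSecond, double millisPerTuple) {
        int estimate = (int) Math.ceil(tuplesPerSecond * millisPerTuple / 1000.0);
        return Math.max(1, estimate); // a component always needs at least one executor
    }

    public static void main(String[] args) {
        // Example guess: 200 tuples per second arriving, 50 ms of work per tuple.
        System.out.println(executorsNeeded(200, 50));
    }
}
```

Treat the result only as a first guess to be checked against the running topology. If the number keeps growing without relieving the bottleneck, that is the point of contention where the design itself needs rethinking.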
We have discussed an approach to topology design that starts with the simplest functional components and then minimizes the number of repartitions by combining different operations together. Later, we can always add multiple responsibilities to tasks.