Riders on the (Apache) Storm
I’ve been working with Apache Storm for the last six months. The choice of tool was not mine, though it did sound reasonable at the time. Today I know we were wrong, and Apache Storm is no longer with us.
Nevertheless, there are a few lessons I took from my adventure with Apache Storm, which will hopefully lead to better technological choices in the future.
Here they are:
Is it computing or software engineering?
The border between those two can get blurry. Let me try a few distinctions: if your primary concern is what to do to get the correct answer, it’s computing. If it’s obvious what to do, but not that obvious how to do it (in terms of design, performance, etc.), it’s software engineering. If you need a piece of paper to confirm you have a correct result, it’s computing. If you deal with numbers, it’s computing; if you deal with objects, it’s software engineering. If you care only about the correct answer, it’s computing. If you care about side effects, it’s software engineering.
In my view, Apache Storm is primarily a computation engine, defined by its creators as a “distributed realtime computation system”, and I suspect it would be a great tool for doing just that. We mixed the two worlds: we used it primarily with objects and cared not only about the result, but mostly about side effects. We found ourselves fighting strange cases and problems: “How will TCP connections behave when opened from bolts? How do we do dependency injection in Apache Storm? Oh no, I need to write a custom serializer!” We couldn’t use object-oriented code the way we wanted, and that led to a lot of frustration.
The advice for the future would be: if you need to compute something, do it on raw data.
I guess that if we had needed computations, the trouble of learning Storm and its concepts would have been balanced by the advantages it gave us. I suspect that if we had tried to extract the computations from our solution, we would have realized that, for the most part, we weren’t transforming data, but rather re-packaging the structures that held it to conform with various surrounding APIs and interfaces.
If developers find a tool developer-unfriendly, they may be onto something
Since I started working with Storm, I complained about the difficulties of local development and of integrating Storm with a CI/CD pipeline. I considered the long feedback loop a problem (build a jar, submit the jar, trigger a computation, see the results). I couldn’t use the patterns I wanted and had problems with dependency injection.
I kept blaming and being blamed: one day Storm was a crappy tool with bad documentation, another day I was incompetent and unable to learn it. I still second the motion about the documentation (try to find anything on the test API!), but those radical opinions had their source somewhere else: in a mismatch between the task at hand and the tool chosen for it.
The advice would be: developers’ happiness with a given tool should be an argument when choosing technologies and when evaluating those choices. Prolonged unhappiness is often a sign of a mismatch between the tool and the task.
Carefully consider what “reliable” means in your case — fail fast may be the best option.
One of the primary reasons for choosing Storm was its “at least once” delivery guarantee for tuples (tuples are the units passed from one Storm bolt to another), and also its elegant way of handling failures. Bolts implement the tasks performed for every tuple that appears on a stream the bolt subscribes to. For example, you may have one bolt that writes to storage, another performing data transformation, and another doing some filtering. They communicate with each other by emitting streams of tuples. Tuples can be anchored in tree structures, that is, if one tuple down the tree fails, the root tuple (and any resulting tuples) will get replayed.
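The replay semantics can be sketched without Storm itself. The toy code below (plain Java, no Storm dependency; all names are made up for illustration) models the essence of anchoring: each bolt in a pipeline either acks or fails a tuple, and any failure replays the whole tree from the root tuple until every step acks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy model of "at least once" delivery with anchored tuples.
// This is NOT the Storm API, only the replay behavior described above.
public class AnchoredReplay {
    // Replays the whole pipeline from the root tuple until every step acks.
    // Each Predicate plays the role of a bolt: true = ack, false = fail.
    static int attemptsUntilAcked(String rootTuple, List<Predicate<String>> pipeline) {
        int attempts = 0;
        while (true) {
            attempts++;
            boolean acked = true;
            for (Predicate<String> bolt : pipeline) {
                if (!bolt.test(rootTuple)) { acked = false; break; } // any failure replays the root
            }
            if (acked) return attempts;
        }
    }

    public static void main(String[] args) {
        List<Predicate<String>> pipeline = new ArrayList<>();
        pipeline.add(t -> true);                      // transformation bolt: always acks
        int[] storageFailures = {2};                  // storage bolt fails twice, then acks
        pipeline.add(t -> storageFailures[0]-- <= 0);
        System.out.println(attemptsUntilAcked("event-1", pipeline)); // prints 3
    }
}
```

Note that even the steps that succeeded on earlier attempts are re-executed on every replay; that detail becomes important below.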
In our case, when designing how to split tasks between bolts, and deciding which of them must never be skipped (that is, we require Storm to replay them indefinitely, perhaps alerting humans in the meantime) and which can fail from time to time (a write to cache, for example), we completely overlooked the fact that our processing was the entry point to the rest of the system. That, combined with tuple anchoring, led to a solution that flooded its clients with data whenever something went wrong.
Let’s look at a very simplified example:

We had an input spout (the component that reads tuples from a source; unlike a bolt, it has no input streams) that sent data to a data transformation bolt. The transformation bolt would then, in turn, send data to bolts that wrote it to the cache and to permanent storage. Only after that succeeded would the data be sent to clients. The problem at the time was that the write call to storage took a while, and it was important to us that clients received data quickly. So one day someone came along and “fixed” it:

“It worked well, when it worked”, but here is where we overlooked the anchoring: the permanent storage bolt was not allowed to fail under any circumstances. So when it did fail, the whole tree of tuples got replayed. As a result, with each failure to write a transformed piece of data to storage, the same data was sent to clients over and over again.
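The failure mode is easy to reproduce in a few lines of plain Java (again no Storm; the names are hypothetical). The client-send step succeeds every time, but because it shares a tuple tree with the storage write, every storage failure replays it too:

```java
// Toy reproduction of the replay amplification described above.
// The client send acks on every attempt, yet it is re-executed on
// every replay caused by the storage bolt's failures.
public class ReplayAmplification {
    static int clientDeliveries(int storageFailures) {
        int deliveries = 0;
        int failuresLeft = storageFailures;
        boolean acked = false;
        while (!acked) {
            deliveries++;                     // "send to clients" acks every time...
            if (failuresLeft-- > 0) continue; // ...but a storage failure replays the whole tree
            acked = true;
        }
        return deliveries;
    }

    public static void main(String[] args) {
        // four storage failures => one intended delivery plus four duplicates
        System.out.println(clientDeliveries(4)); // prints 5
    }
}
```

With indefinite replays and a persistently failing storage backend, the duplicate count is unbounded, which is exactly how we ended up flooding our clients.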
Choosing the behavior in case of failure is an architectural choice and should be driven by requirements, but in the cloud there is one more factor to consider. Replays may not be harmful from a logical point of view (your software may be idempotent), but they may be harmful financially, if you pay for CPU time or per function invocation. One more reason to pay attention.
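If replays cannot be avoided, one common way to make them logically harmless is to make the receiving side idempotent, for example by deduplicating on a message id. A generic sketch (not anything Storm provides; the class and ids are invented for illustration):

```java
import java.util.HashSet;
import java.util.Set;

// Minimal idempotent receiver: duplicate deliveries of the same
// message id are accepted but the side effect is applied only once.
public class IdempotentReceiver {
    private final Set<String> seen = new HashSet<>();
    private int applied = 0;

    // Returns true if this delivery was applied, false if it was a duplicate.
    public boolean receive(String messageId) {
        if (!seen.add(messageId)) return false; // already seen: ignore
        applied++;                              // first delivery: apply side effect
        return true;
    }

    public int appliedCount() { return applied; }

    public static void main(String[] args) {
        IdempotentReceiver r = new IdempotentReceiver();
        r.receive("m-1"); r.receive("m-1"); r.receive("m-2"); r.receive("m-1");
        System.out.println(r.appliedCount()); // prints 2
    }
}
```

Note that idempotency fixes only the logical side: every duplicate delivery is still paid for in CPU time or invocations.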
The advice would be: make sure you know how your clients would prefer you to handle failures. An outage may cost them less than “reliability” at all costs.
