Dynaflow: Our Open-source Node Driver for DynamoDB

This is the story of how (and why) we at Vice Tech built Dynaflow, our high-level DynamoDB driver.

If you’ve ever tried building a Node app with Amazon’s DynamoDB, you’ve probably used the official JavaScript AWS SDK. There’s nothing inherently wrong with the SDK, but depending on what you need out of DynamoDB, using it directly can lead you into the trap of writing a very messy application. Read on if you’d like to avoid that.

It was a cold winter day when we unsuspecting developers were tasked with building a high-volume firehose for collecting anonymous data about the way our users interact with our website. Theoretically, if we collected the right data, we could use it to personalize our users’ experiences. Since the data is completely action-centric, we knew we didn’t need it to be relational. We considered MongoDB and a few others, but in the end we decided to give DynamoDB a try — it’s a managed service, it has auto-scaling, and it seemed ideal as high-performance, high-capacity, indexed storage.

We noticed something inspiring in the DynamoDB API: batch writes. In a single request we could save many new data events at once, saving the precious bandwidth which would certainly be the bottleneck of a high-volume firehose. At first glance this seemed like a no-brainer, but the devil is in the details. You see, the batch write feature only allows sending up to 25 items per request, and if every item fails, you get an error response. Nothing out of the ordinary there, but here’s the kicker: if only some of the items fail, you get a successful response containing an object that describes which ones failed. In cases where you only need to make a single batch request, that’s no problem. But for a continuously flowing firehose of batch requests, partial failure was a very strange thing to deal with.
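To make that concrete, here’s roughly what a single batch write with partial-failure handling looks like against the SDK. The table name and item shape are placeholders, and a production version would also need retry backoff:

const AWS = require('aws-sdk');
const client = new AWS.DynamoDB.DocumentClient();

// "Events" is a made-up table name, used only for illustration.
async function writeBatch(items) {
  const response = await client.batchWrite({
    RequestItems: {
      Events: items.map(item => ({ PutRequest: { Item: item } })),
    },
  }).promise();

  // A "successful" response can still carry unprocessed items,
  // which the caller has to re-queue and retry itself.
  const leftovers = (response.UnprocessedItems || {}).Events || [];
  return leftovers.map(request => request.PutRequest.Item);
}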

Being seasoned Node developers, the first place we looked when implementing a firehose was the built-in Node Stream API. In theory, we could create a writable stream that our app uses to push data events. The stream would buffer those items until 25 were collected (remember, only 25 items per batch request) or until a maximum timeout was exceeded (to clamp down on latency). After that, it would construct the batch request and send it off. Partial failures could just be unwrapped and fed back into the original writable stream. Not a bad solution, right? Well, it turns out the solution was sound, but the tools we were using were not.
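In sketch form, that batching writable looked something like this (simplified, with illustrative names; sendBatch would be something like the writeBatch function above):

const { Writable } = require('stream');

// A rough sketch of the batching idea, not our exact implementation.
class BatchingStream extends Writable {
  constructor(sendBatch, { maxSize = 25, maxDelay = 1000 } = {}) {
    super({ objectMode: true });
    this.sendBatch = sendBatch;
    this.maxSize = maxSize;
    this.maxDelay = maxDelay;
    this.buffer = [];
    this.timer = null;
  }

  _write(item, encoding, callback) {
    this.buffer.push(item);
    if (this.buffer.length >= this.maxSize) {
      this.flush();
    } else if (!this.timer) {
      this.timer = setTimeout(() => this.flush(), this.maxDelay);
    }
    callback();
  }

  flush() {
    clearTimeout(this.timer);
    this.timer = null;
    const batch = this.buffer.splice(0);
    if (batch.length === 0) return;
    // Items that failed in the batch get written back into the same stream.
    this.sendBatch(batch)
      .then(failed => failed.forEach(item => this.write(item)))
      .catch(err => this.destroy(err));
  }
}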

After meticulously crafting the pipeline within our app, we ran into 3 annoying problems:

#1: Node streams don’t propagate errors.

Contrary to the elegant error handling of promises, streams are much more like callbacks when it comes to error handling. Each stream must individually listen for the “error” event and respond accordingly. It’s a bit silly to say that streams are composable just because you can write:

input.pipe(transform1).pipe(transform2).pipe(output);

When you also have to write:

const failure = (err) => {
  input.destroy();
  report(err);
};
input.on('error', failure);
transform1.on('error', failure);
transform2.on('error', failure);
output.on('error', failure);

This is not intuitive, nor is it explained at all in the standard Node documentation, which left us with silently swallowed errors before we realized what was going on. It gets even worse when parts of the pipeline are hidden away within libraries and utility functions, as they usually are in real-world applications.
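For contrast, a promise chain needs only one handler at the end to catch an error thrown at any step along the way (a toy example with made-up function names):

fetchInput()
  .then(transform1)
  .then(transform2)
  .then(writeOutput)
  .catch(report); // a single handler covers every step above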

#2: Node streams don’t compose with promises.

We ended up with promise constructors wrapping stream callbacks wrapping async functions, and more. It was a mess. Nothing fit together naturally. Trying to come up with one coherent way of handling errors in all the different places became more than enough reason to toss our computers in the trash and take up something more sane, like law school (this is a joke). But seriously, it was bad.
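To give a flavor of the glue code this produces, here’s the kind of hand-rolled promise wrapper we found ourselves writing over and over (simplified):

// Wrapping a stream's lifecycle in a promise by hand, yet again.
function writeAll(items, stream) {
  return new Promise((resolve, reject) => {
    stream.on('error', reject);
    stream.on('finish', resolve);
    for (const item of items) stream.write(item);
    stream.end();
  });
}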

#3: There was no clean way to query the data that we had saved.

When you query DynamoDB it paginates the result set, requiring you to make successive requests to get the entire payload. Without resorting to callbacks or Node streams, there was no way to operate on query results as they came in, page by page. The cleanest way to handle them was to buffer them and provide a promise for an array of pages once they had all finally arrived — but this goes against the entire streaming design of the application!
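That buffering approach looks roughly like this against the SDK (reusing the DocumentClient from the earlier sketch; the query parameters are whatever your table and index require):

// Collect every page of a query into one array before resolving.
async function queryAllPages(baseParams) {
  const pages = [];
  let params = { ...baseParams };
  do {
    const page = await client.query(params).promise();
    pages.push(page.Items);
    params = { ...baseParams, ExclusiveStartKey: page.LastEvaluatedKey };
  } while (params.ExclusiveStartKey);
  return pages; // nothing downstream sees a single page until all of them are here
}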

The Solution

It was clear we were missing something. We needed a higher-level, more general-purpose abstraction for handling these continuous asynchronous events. And whatever we used, it needed to compose well with promises.

A while back, some of our team members learned about something called Observables, a way of elegantly handling streaming events. Most Node developers have never used these little objects. We see this as a huge knowledge gap in our community. Without them, in a very academic sense, your async toolset is incomplete. What are observables, you might ask?

+----------------+--------------------+--------------------+
| | Single value | Multiple values |
+----------------+--------------------+--------------------+
| Synchronous | regular values | iterables (arrays) |
+----------------+--------------------+--------------------+
| Asynchronous | promises | observables |
+----------------+--------------------+--------------------+

Simply put, Observables are like object streams in Node, except they propagate errors like promises and are much more general-purpose. Streams in Node were originally designed to deal with chunks of bytes, with heavy emphasis on handling backpressure and keeping memory usage low. They support “object mode” so they can deal with things besides bytes, but that feature was tacked on later. They were built for a very specific purpose, and they fall short when applied to the generic problem of streaming asynchronous events in JavaScript. Observables were the true solution to the problem we were dealing with.
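To make the idea less abstract, here is a toy, hand-rolled observable over the paginated query from earlier. This is emphatically not the actual river API, just an illustration of the shape: values are pushed one at a time as they arrive, and errors funnel into a single channel, much like a promise’s .catch.

// A deliberately minimal, hypothetical observable (not the river API).
function pagesObservable(baseParams) {
  return {
    subscribe({ onPage, onError, onDone }) {
      (async () => {
        let params = { ...baseParams };
        do {
          const page = await client.query(params).promise();
          onPage(page.Items); // each page is delivered as soon as it arrives
          params = { ...baseParams, ExclusiveStartKey: page.LastEvaluatedKey };
        } while (params.ExclusiveStartKey);
        onDone();
      })().catch(onError); // a single error channel, like a promise's .catch
    },
  };
}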

You may have heard of ReactiveX, a popular family of Observable libraries. Personally, when I first heard about Observables I was pretty ecstatic, but after looking into ReactiveX for a while I decided its implementation didn’t fit well into the Node ecosystem. You see, it was designed to be cross-language, and therefore it couldn’t be too Node-y. I was inspired to write my own Observables just for JavaScript. Being the (not so) hilarious guy that I am, I used the name “river” as a pun on “stream”. You can check out the package here.

When we finally understood the problem in our Dynamo app, we tried rewriting it with Rivers at its core. We used rivers to represent the paginated results of a query, and we used them to replace the Node streams in our batch-writing pipeline. After an inevitably tedious redesign, the fit was seamless and our app worked flawlessly. Our code was clean, error handling was a breeze, and every asynchronous operation was handled with precision. We separated the DynamoDB abstractions into their own package, and thus Dynaflow was born.

It was a long journey, but we’re really happy with the way things turned out. If you’re a JavaScript developer unfamiliar with the concept of Observables or Rivers, we really encourage you to explore them. And, after learning about everything they can offer, ask yourself, “is this the tool my app was missing too?”