Donut Factory Line, photo by Irina Slutsky, CC BY 2.0

Bake Your Data With BakeryJS

What Is BakeryJS & Why We Made It

BakeryJS is a Node.js framework for data processing. It helps you split your data processing logic into independent components; you can think of it as “React for data”. It is also the first result of Socialbakers’ open-source initiative; more projects will follow.

What Problems Does It Solve?

At Socialbakers, we process over 50 million posts, comments, and direct messages from various social networks each day. To complicate things, we fetch different sets of data for each of the 30 million social media profiles. For example, some clients are interested only in reporting features while others monitor social media and respond to customer inquiries. We need to support these different use cases in our code base. Data are further converted and distributed in different formats to various subsystems. We also calculate various metrics on the data, such as the sentiment of comments. As features get added and reworked, code gets more complex and usually ends up like spaghetti. We have built BakeryJS to address these complexities and we hope it will help you simplify your code base as well.

The basic piece of logic in BakeryJS is a component or box. The component can generate, process, convert and enrich data, which gets delivered in messages. You declare how the components get wired up into a data flow. BakeryJS takes care of delivering messages between components and gives you tools for observing, debugging, and instrumenting your application.

Components in BakeryJS should help you split your problem domain into small, reusable pieces, each containing a specific piece of functionality. Since React popularized component-based development in the frontend world, we would like BakeryJS to provide a similar service for backend JavaScript development.

How To Use It?

Note: BakeryJS is still in beta and its API is subject to change.

Let’s take a look at a sample component in BakeryJS which counts words in a text:
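A minimal sketch of such a component follows. The local boxFactory is only a stand-in so that the snippet runs without BakeryJS installed, and the metadata key names requires and provides are assumptions based on the description in this article rather than a confirmed API.

```javascript
// Stand-in for BakeryJS's boxFactory so this sketch runs on its own.
// It only bundles its three arguments; the real function does more.
function boxFactory(name, metadata, body) {
  return { name, metadata, body };
}

const wordcount = boxFactory(
  // 1) Component name, used in error reporting.
  'wordcount',
  // 2) Metadata (key names are assumed).
  {
    requires: ['text'],  // property the incoming message must carry
    provides: ['words'], // property this component adds
  },
  // 3) Body: serviceProvider is unused here; message holds the required data.
  function (serviceProvider, message) {
    const words = message.text.split(/\s+/).filter(Boolean);
    return { words: words.length };
  }
);

// Calling the body directly, outside of any flow:
console.log(wordcount.body({}, { text: 'Hello world!' })); // { words: 2 }
```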

A component is created using the boxFactory function, which takes three arguments. The first one is the name of the component; this is used in error reporting.

The second argument is the component’s metadata. Each component provides or requires some kinds of data. Here, another component generates a message that includes the text property; the wordcount component consumes it and in turn provides the words property. A component can also emit multiple messages at once or aggregate several messages into one; we will take a look at these later.

The last argument is the component’s body, a function which defines its behavior. In this case, it takes two arguments. The first, serviceProvider, is an object used for sharing services, such as a logger, among components. The second, message, is an object which should contain the properties specified in the requires metadata; hence we can access the incoming text inside message as message.text.

Messages Generator

The component above won’t work on its own since there is no other component providing the text property for it. We need a component which creates “something from nothing”, i.e. generates data. This is how you actually get some data into the flow:
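A minimal sketch of such a generator follows. As before, the local boxFactory is a stand-in so the snippet runs on its own, and the metadata key name emits is an assumption; only the three-argument body and the array passed to emit come from the description in this article.

```javascript
// Stand-in for BakeryJS's boxFactory so this sketch runs on its own.
function boxFactory(name, metadata, body) {
  return { name, metadata, body };
}

const helloworld = boxFactory(
  'helloworld',
  {
    requires: [],    // a generator needs no input
    emits: ['text'], // each emitted message carries a text property (assumed key)
  },
  // The third argument, emit, takes an array of messages to send to the flow.
  async function (serviceProvider, message, emit) {
    emit([{ text: 'Hello world!' }]);
    emit([{ text: 'Hello BakeryJS!' }, { text: 'Bye for now.' }]);
  }
);

// Collect what the generator emits to see it work in isolation.
const received = [];
const done = helloworld.body({}, {}, (batch) => received.push(...batch));
done.then(() => console.log(received.length)); // 3
```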

The component has a familiar structure, except it does not require anything and emits text. When the emit property is not empty, BakeryJS expects the component to generate multiple messages over time. Note that the component body is an async function and there is a third argument now: emit. This function expects an array of messages to be sent to the flow. While this example is rather primitive, you can imagine a more sophisticated generator which, for example, listens for HTTP requests, consumes a queue, or reads a database or a CSV file.

Running the Flow

There is one last step before we can see these components in action: creating a program.

The Program in BakeryJS loads components and runs a given data flow. The constructor takes two arguments: the first is serviceProvider, with the services available to each component. The second is an options object; its componentPaths property tells the program where to look for the components used in data flows. In this case, we put both our components into the components directory.

Now it’s show time. The process property in the job defines a data flow, that is, the order in which the components are executed. Each array corresponds to one “level” of the flow; in this case, messages from helloworld are passed to wordcount. Several components can also process the same messages on the same level:

process: [
  ['helloworld'],
  ['wordcount', 'punctcount'],
  ['checksum'],
]

In this example, both wordcount and punctcount components process messages from helloworld, and checksum can receive messages from both of them.
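To make the level semantics concrete, here is a toy runner written for this article; it is not BakeryJS’s Program. The shape of the components (generate and process functions), the punctcount and checksum logic, and the choice to merge the results of same-level components into one message are all assumptions made for illustration.

```javascript
// Toy components: a generator returns new messages,
// a processor returns the properties it adds to a message.
const components = {
  helloworld: { generate: () => [{ text: 'Hello, world!' }] },
  wordcount: {
    process: (m) => ({ words: m.text.split(/\s+/).filter(Boolean).length }),
  },
  punctcount: {
    process: (m) => ({ punct: (m.text.match(/[.,!?;:]/g) || []).length }),
  },
  checksum: { process: (m) => ({ checksum: `${m.words}-${m.punct}` }) },
};

// Runs the levels in order and returns the final messages.
// This toy does not support mixing generators and processors on one level.
function runFlow(levels, registry) {
  let messages = [{}]; // the flow starts from a single empty message
  for (const level of levels) {
    messages = messages.flatMap((message) => {
      const boxes = level.map((name) => registry[name]);
      if (boxes.every((box) => box.generate)) {
        // Generators fan out: every generated message continues down the flow.
        return boxes.flatMap((box) => box.generate(message));
      }
      // Processors on one level all read the same incoming message;
      // their results are merged into it (an assumption of this toy).
      const added = boxes.map((box) => box.process(message));
      return [Object.assign({}, message, ...added)];
    });
  }
  return messages;
}

const job = {
  process: [['helloworld'], ['wordcount', 'punctcount'], ['checksum']],
};

console.log(runFlow(job.process, components));
// [ { text: 'Hello, world!', words: 2, punct: 2, checksum: '2-2' } ]
```

Because checksum sits one level below wordcount and punctcount, it can read both words and punct from the message it receives.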

Finally, the run method takes the job we specified and an optional “drain” function, which receives all the final messages. When you run this script, BakeryJS prints fairly verbose output.

First, it prints the flow to be run, and then we get the output of our drain function as the messages pass through the flow. Each message contains our original text and its word count.

The program also emits events which let you observe the messages as they flow between components. We use these events to track each component’s average run time with StatsD and graph the results with Grafana.
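The idea of computing an average run time from events can be sketched with Node’s built-in EventEmitter. The event name componentRun and its payload are hypothetical, chosen only for this illustration; BakeryJS’s actual events may look different. Here the numbers are kept in a Map, where a production setup would forward them to StatsD.

```javascript
const { EventEmitter } = require('events');

// Stand-in for a program that reports each component run.
// The 'componentRun' event and its payload are hypothetical.
const program = new EventEmitter();

// Running totals per component so an average can be reported.
const totals = new Map();
program.on('componentRun', ({ component, durationMs }) => {
  const entry = totals.get(component) || { runs: 0, totalMs: 0 };
  entry.runs += 1;
  entry.totalMs += durationMs;
  totals.set(component, entry);
});

// Average run time in milliseconds, or undefined if the component never ran.
function averageRunTime(component) {
  const entry = totals.get(component);
  return entry ? entry.totalMs / entry.runs : undefined;
}

// Simulate a few runs.
program.emit('componentRun', { component: 'wordcount', durationMs: 4 });
program.emit('componentRun', { component: 'wordcount', durationMs: 8 });
program.emit('componentRun', { component: 'helloworld', durationMs: 1 });

console.log(averageRunTime('wordcount')); // 6
```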

You can find a more complete example in the repository.

How Does It Compare To…?

BakeryJS wasn’t built in a vacuum, and we took a look at existing projects and paradigms which approach data processing in a similar way.

Streams and observables (like RxJS) let you create complex data flows. These are useful tools, albeit a bit low-level for our purpose. Since BakeryJS is a framework, the goal is to provide the necessary scaffolding for developers, which lets them easily add new components. We would like to provide streams and observables as an option for communication between components in BakeryJS.

Close to BakeryJS is the programming paradigm called Flow-based Programming (FBP), and related projects like NoFlo, Node-RED, and Apache NiFi. We actually built a production system on Apache NiFi before starting BakeryJS, but the visually oriented approach to programming turned out to be quite painful for our needs. Although we are working on a visualization of data flows for BakeryJS, it won’t be the primary way to build flows. Components in most FBP systems also don’t declare the types of data they accept or provide; we would like to use components’ metadata to provide type checking and static analysis of flows. Finally, BakeryJS will support only directed flows, without the cyclic structures commonly seen in FBP systems.

BakeryJS shares some similarities with workflow engines like Luigi (which was also an inspiration) and Apache Airflow. These engines can run in a cluster and support large-scale operations. BakeryJS is designed to run within a single Node.js process where we don’t need CPU-heavy processing. This makes BakeryJS easy to gradually integrate into existing infrastructure.

There is also a family of projects focused on building large-scale stream processors, such as Apache Storm, Apache Apex, and notably Apache Beam (which provides a unified API for the former). While building on such robust platforms would be an interesting engineering challenge, it would bring another layer of complexity to our projects. We have rather simple scalability needs, but it is possible that some of our services will outgrow the single-process model, and Apache Beam might be the next step.

One distinct feature of BakeryJS compared to the aforementioned projects is support for dynamic data flows. Data flows don’t need to be defined in advance but can be provided and executed at runtime. Instead of having specific components which decide where the data in the flow should go, we can simply add or remove components from the flow as needed. This is also the rationale behind components’ metadata: it lets us validate a data flow before executing it.
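As a rough illustration of how metadata could support such a check, the sketch below verifies that every property a component requires is supplied by an earlier level. This is not BakeryJS’s validator; the function validateFlow, the metadata key names, and the requirements listed for punctcount and checksum are assumptions made for this example.

```javascript
// Metadata per component; the key names requires/provides/emits are assumed.
const metadata = {
  helloworld: { requires: [], provides: [], emits: ['text'] },
  wordcount: { requires: ['text'], provides: ['words'], emits: [] },
  punctcount: { requires: ['text'], provides: ['punct'], emits: [] },
  checksum: { requires: ['words', 'punct'], provides: ['checksum'], emits: [] },
};

// Returns a list of problems; an empty list means the flow looks valid.
function validateFlow(levels, registry) {
  const available = new Set(); // properties supplied by earlier levels
  const problems = [];
  for (const level of levels) {
    for (const name of level) {
      const meta = registry[name];
      if (!meta) {
        problems.push(`unknown component: ${name}`);
        continue;
      }
      for (const prop of meta.requires) {
        if (!available.has(prop)) {
          problems.push(`${name} requires "${prop}", which no earlier level supplies`);
        }
      }
    }
    // Outputs become visible only to later levels, not to same-level peers.
    for (const name of level) {
      const meta = registry[name];
      if (meta) {
        [...meta.provides, ...meta.emits].forEach((prop) => available.add(prop));
      }
    }
  }
  return problems;
}

// A complete flow passes the check.
console.log(validateFlow([['helloworld'], ['wordcount', 'punctcount'], ['checksum']], metadata)); // []

// Removing the middle level at runtime is caught before anything executes.
console.log(validateFlow([['helloworld'], ['checksum']], metadata).length); // 2
```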

What’s Next?

Next week we are conducting a BakeryJS workshop as part of ReactiveConf and we look forward to feedback from attendees. Since BakeryJS is in early beta, there are still many rough edges. Our first goal is to improve, stabilize, and document the public API. We also plan to make BakeryJS extensible and provide plugins for observability of data flows through instrumentation, tracing, visualizations, and time tracking. Finally, we want to establish best practices for component testing.

In the long term, we would like to experiment with different approaches to data delivery between components, including streams, observables, or offloading messages to an external queue. Since BakeryJS is built with TypeScript, we hope to leverage types for messages and provide tools for data validation.

Check out BakeryJS on GitHub. What will you bake with it? 🎂