Building data processing and visualisation pipelines in the browser with Exynize
If you’ve ever had to build a data processing and visualisation system, you know how much work it is.
And I’m not even talking about writing the actual processing algorithms and creating the visualisations. Unless such systems are the core of your business and you’re creating them on a regular basis, you’ll spend most of your time scaffolding out your backend and front-end and making sure they interact correctly.
And we all know that doing front-ends in 2015 is quite a challenge — just look at all the tools, frameworks and libs you may want to use for it!
The backend is not too different — there are plenty of languages, frameworks and approaches to pick from.
And then you have to worry about storing results in a database, caching, and all those other minor things that I’ve forgotten right now.
Oh, and what if you need real-time processing? And don’t forget the scaling — what if you need to process large amounts of data? Can your system scale?
As you can see, there are a lot of things to do here. And that’s something that people who deal with data processing have been doing for quite some time. I’ve been doing it too: I’ve written a bunch of code scaffolders for most of those tasks. The problem is, they never quite work 100% of the time; you always need to tweak something here and there (especially when you need scalable systems). And that still takes a lot of time away from writing the actual data processing code.
The Solution — Exynize platform
To solve all of those issues, my colleagues at AKSW and I came up with the idea of a platform that takes care of all that boilerplate. Delegating the boring parts to the platform lets developers focus on the most important bit: data processing and visualisation.
Thus Exynize (short for “Extract, Synchronize, Analyze”) was born.
After a few very early prototypes that worked but didn’t simplify the workflow enough, we arrived at the current version of the platform.
A platform that allows:
- constructing pipelines right in your browser with very little effort,
- writing processing components as if you were dealing with a single data item,
- re-using existing processing modules in new pipelines,
- creating real-time processing and visualisation without thinking about doing real-time at all,
- spending time on doing actual work, not fiddling with scaffolding.
Sounds interesting? Then read on — I’m going to show you two simple demos and explain how the system works under the hood.
If you fancy video presentations rather than text, here’s my screencast with those demos. If you like reading, just scroll down a bit; everything in the video is covered in text below as well.
We’ll start with looking at the demo cases first and after that we’ll look under the hood of the platform.
But before that I want to mention that the Exynize platform is currently built with javascript. The backend is based on node.js, express.js and RethinkDB, while the front-end uses React.js and Twitter Bootstrap. All of that with the sweet taste of Babel.js (so don’t be surprised to see ES6 code). Once again, we’ll go into more detail on that after going through the use cases.
Use Case 1: Twitter product comparison
For the first use case, let’s compare how people talk about three new smartphones (let’s say iPhone 6s, Nexus 6p and Galaxy S6) on Twitter. Here’s what we need to do:
- Take the Twitter feed, filtered by the phone models (and by English language, for simplicity)
- Calculate sentiments for the text of the tweets
- Display the resulting sentiments and the last 10 tweets in a three-column layout
Let’s start by writing a simple Twitter source component. Here’s how it’ll look:
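A minimal sketch, assuming the “twit” npm package (the argument names and credential handling are illustrative):

```js
// minimal sketch of a Twitter source component;
// argument names and credential handling are illustrative
import Twit from 'twit';

export default (consumerKey, consumerSecret, accessToken, accessTokenSecret, filter, obs) => {
    // init the twitter client with the credentials given in the UI
    const twitter = new Twit({
        consumer_key: consumerKey,
        consumer_secret: consumerSecret,
        access_token: accessToken,
        access_token_secret: accessTokenSecret,
    });
    // open a stream filtered by the given keywords, English tweets only
    const stream = twitter.stream('statuses/filter', {track: filter, language: 'en'});
    // dispatch every incoming tweet into the pipeline
    stream.on('tweet', tweet => obs.onNext(tweet));
    stream.on('error', err => obs.onError(err));
};
```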
Hopefully the comments are enough to make sense of the code; it’s pretty straightforward.
There are three Exynize-specific things here:
- As you might’ve noticed, I imported the NPM package “twit”. You are indeed allowed to use npm packages, but for the moment they are limited to a whitelisted subset (for security reasons). That might change in the future once I figure out a better way to sandbox components.
- The main function is exposed using ES6 “export default”. This is a rule for all components. All the non-default arguments (so, all aside from “obs”) will be turned into input fields in the UI. But we’ll go into more detail on that in the architecture part.
- The last parameter of the function, “obs”, is an Rx.Observer that is used to dispatch new data. It is part of the awesome RxJS library that is used to assemble the pipelines; we’ll also look at it in more detail in the architecture part.
If you are not familiar with Observables, for the moment just think of them as promises that can resolve more than once.
As you can see, it’s pretty easy to create a reusable Twitter source in Exynize.
Now let’s create a sentiment analysis processor. We’re not going to do anything crazy here, just basic AFINN-based sentiment analysis.
Here’s how the source code will look:
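Something like this, assuming the AFINN-based “sentiment” npm package (the attached field name is illustrative):

```js
// sketch of a sentiment processor using the AFINN-based "sentiment" package;
// the attached field name is illustrative
import Rx from 'rx';
import sentiment from 'sentiment';

export default (tweet) => {
    // calculate the sentiment of the tweet text
    const result = sentiment(tweet.text);
    // return an Observable, since processors are applied with flatMap
    return Rx.Observable.just(Object.assign({}, tweet, {sentiment: result}));
};
```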
It’s also pretty trivial. The only thing to note here is that we return an Rx.Observable in the end. This is because all the processors are applied to the source using the “flatMap” method, which expects an Rx.Observable as a result.
And for the last (and probably most complex) bit in this pipeline, we’re going to create a rendering component that’ll display the result for us as a nice web page. Note that even without it we can already save the pipeline and send JSON-type GET requests to it to get the results as JSON (e.g. to use them in another tool).
The code for the renderer will look like this:
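Something along these lines (the column layout and the data fields are assumptions carried over from the sketches above):

```js
// sketch of a render component; the three-column layout and the
// data fields are assumptions carried over from the sketches above
import React from 'react';

export default () => React.createClass({
    render() {
        // "data" is the array of processed items the platform passes in
        const data = this.props.data || [];
        const phones = ['iphone', 'nexus', 'galaxy'];
        return (
            <div className="row">
                {phones.map(phone => {
                    // pick tweets that mention this phone
                    const tweets = data.filter(t => t.text.toLowerCase().indexOf(phone) !== -1);
                    // sum up the sentiment scores
                    const total = tweets.reduce((sum, t) => sum + t.sentiment.score, 0);
                    return (
                        <div className="col-md-4" key={phone}>
                            <h3>{phone}: {total}</h3>
                            {tweets.slice(-10).map((t, i) => <p key={i}>{t.text}</p>)}
                        </div>
                    );
                })}
            </div>
        );
    },
});
```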
The idea’s pretty simple: we create a function that returns a new React component. Once the pipeline is called from a browser, the Exynize platform will serve that component with a wrapper that does all the real-time data fetching. All you have to care about is the “data” property that’ll be passed to your component.
Here’s how the result will look in the browser (see the video for a more detailed walkthrough):
Use Case 2: BBC World News heat map
The Twitter use case was very simple and didn’t really have any asynchronicity (aside from the asynchronously incoming tweets, but that was handled by the platform). Let’s try to do something a bit more interesting and asynchronous.
For this case we’ll take BBC World News RSS feed, calculate sentiments for each article, then extract locations from text and render that on a map as red (negative article), gray (neutral article) or green (positive article) circles.
So, for the first step we need to create an RSS feed reader. Here’s the source:
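Something like this, loosely following the feedparser README:

```js
// sketch of an RSS source, loosely following the feedparser README
import request from 'request';
import FeedParser from 'feedparser';

export default (url, obs) => {
    const req = request(url);
    const feedparser = new FeedParser();
    req.on('error', err => obs.onError(err));
    req.on('response', function (res) {
        if (res.statusCode !== 200) {
            return this.emit('error', new Error('Bad status code'));
        }
        // pipe the HTTP response into the parser
        this.pipe(feedparser);
    });
    feedparser.on('error', err => obs.onError(err));
    feedparser.on('readable', function () {
        // dispatch every article in the feed into the pipeline
        let item;
        while ((item = this.read()) !== null) {
            obs.onNext(item);
        }
    });
};
```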
This is really straightforward and more or less a copy of the feedparser tutorial.
In the next step we need to fetch the full text of the articles, since the BBC RSS feed doesn’t provide it. The task on its own is pretty trivial, but it does require running an asynchronous HTTP request and waiting for the response. Here’s how the code looks:
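In sketch form (using cheerio for the HTML parsing; the CSS selector for the BBC article body is an assumption):

```js
// sketch of a full-text fetching processor; the CSS selector
// for the BBC article body is an assumption
import Rx from 'rx';
import request from 'request';
import cheerio from 'cheerio';

export default (article) => Rx.Observable.create(obs => {
    // fetch the full article page
    request(article.link, (err, res, body) => {
        if (err) {
            return obs.onError(err);
        }
        // extract the article text from the HTML
        const $ = cheerio.load(body);
        const text = $('.story-body').text();
        obs.onNext(Object.assign({}, article, {text}));
        obs.onCompleted();
    });
});
```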
As you might note, the code returns an Rx.Observable right away and only does the actual fetching later. As I mentioned before, you can think of Observables as promises that can resolve more than once (although here it resolves only once, after the text has been fetched).
Now that we have the full text, we can calculate sentiments for it. This time around we don’t actually have to write anything, because we already have the sentiments processor from our Twitter use case. So we can just plug it in (this is where the Exynize platform really shines) and get the results we want.
Now for the last bit of processing: getting locations from the text. It’s actually a two-step procedure: first we have to extract entities using an NLP tool, and then we have to resolve the extracted location names into coordinates.
Let’s start with extracting entities. For that I’m going to use the REST API of a tool called FOX (Federated knOwledge eXtraction Framework) that was developed by my colleagues at AKSW. So the code will basically look like a simple POST request, like so:
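Something like the following sketch (the endpoint URL, request parameters and the JSON-LD mapping are all assumptions):

```js
// sketch of an entity-extraction processor calling the FOX REST API;
// the endpoint URL, request parameters and JSON-LD mapping are assumptions
import Rx from 'rx';
import request from 'request';

export default (article) => Rx.Observable.create(obs => {
    request.post({
        url: 'http://fox-demo.aksw.org/api',
        json: {input: article.text, type: 'text', task: 'ner', output: 'JSON-LD'},
    }, (err, res, body) => {
        if (err) {
            return obs.onError(err);
        }
        // shortcut: map the JSON-LD graph into a simpler structure
        // instead of processing it with a proper JSON-LD library
        const annotations = (body['@graph'] || []).map(node => ({
            name: node['ann:body'],
            types: [].concat(node['@type']),
        }));
        obs.onNext(Object.assign({}, article, {annotations}));
        obs.onCompleted();
    });
});
```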
As you see, it’s pretty basic. I did take a small shortcut here: instead of processing the resulting JSON-LD with a proper library, I simply mapped it into a simpler structure. For the sake of the demo that works fine, but for a real project it’d be better to process it with something like jsonld.js.
Now that we have annotations that include location names, we need to get coordinates for them so that we can render them on a map. To do that we’ll use a tool from OSM called Nominatim.
This is going to be one of the trickiest bits of code you’ll see in this article, mostly due to the fact that it has to run an async Nominatim request for each annotation that has the type “Location”. If you are familiar with RxJS, the code will look pretty simple though. Here it is:
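In sketch form (the annotation field names and the exact Nominatim query are assumptions):

```js
// sketch of a geocoding processor; annotation field names and the
// exact Nominatim query are assumptions
import Rx from 'rx';
import request from 'request';

// wrap a single Nominatim lookup into an Observable
const geocode = (name) => Rx.Observable.create(obs => {
    request({
        url: 'http://nominatim.openstreetmap.org/search',
        qs: {q: name, format: 'json', limit: 1},
        json: true,
        headers: {'User-Agent': 'exynize-demo'},
    }, (err, res, body) => {
        if (err) {
            return obs.onError(err);
        }
        obs.onNext(body[0]);
        obs.onCompleted();
    });
});

export default (article) => Rx.Observable
    // take all annotations of type "Location"
    .from(article.annotations.filter(a => a.types.indexOf('Location') !== -1))
    // run an async Nominatim request for each of them
    .flatMap(a => geocode(a.name))
    // drop names Nominatim couldn't resolve
    .filter(place => place !== undefined)
    // collect the resolved places back into a single array
    .toArray()
    // attach the locations to the article
    .map(locations => Object.assign({}, article, {locations}));
```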
If you’re having trouble understanding that code, read on to the architecture part; there I’ll provide links that explain functional reactive programming in general, and RxJS specifically, in depth.
And for the last piece, we need to create a render component that will draw all of this on a map. I used the Leaflet.js library for that. Here’s how the code looks:
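A sketch under the same data-shape assumptions as above (the tile provider is also illustrative):

```js
// sketch of a map render component; the tile provider and the
// data field names are assumptions carried over from earlier sketches
import React from 'react';
import L from 'leaflet';
import 'leaflet/dist/leaflet.css';

export default () => React.createClass({
    componentDidMount() {
        // init the map once the container div is in the DOM
        this.map = L.map('map').setView([30, 0], 2);
        L.tileLayer('http://{s}.tile.osm.org/{z}/{x}/{y}.png').addTo(this.map);
        this.drawn = 0;
    },
    componentDidUpdate() {
        const data = this.props.data || [];
        // only draw articles that arrived since the last update
        data.slice(this.drawn).forEach(article => {
            // color encodes sentiment: red negative, gray neutral, green positive
            const color = article.sentiment.score < 0 ? 'red' :
                article.sentiment.score > 0 ? 'green' : 'gray';
            (article.locations || []).forEach(loc => {
                L.circleMarker([+loc.lat, +loc.lon], {color}).addTo(this.map);
            });
        });
        this.drawn = data.length;
    },
    render() {
        return <div id="map" style={{height: '100vh'}} />;
    },
});
```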
As you can see, we can easily inject css into the page by using “import”. That’s because rendering components are assembled by webpack, so we can easily import styles, fonts and even images.
Here’s how the result will look in the browser (see the video for a more detailed walkthrough):
Exynize platform architecture
Now let’s talk about the architecture of Exynize platform and what makes it tick.
As I already mentioned (and as you might’ve noticed), everything in this project was developed using javascript. The reasoning here was pretty simple:
- Node.js provides a nice way to sandbox code (the VM module)
- React.js components can be easily written as standalone pieces and integrated into the complete app later (which is exactly what user components are)
- The easiest way to use websockets (used for real-time data delivery) is from node.js
- There are plenty of tools, like esprima, that can help shape the experience of creating components in the browser
- There’s ES6 and babel now, so javascript code can be actually pretty nice
The platform is currently split into two parts: a REST API and a single-page application front-end that communicates with it. All requests must be signed with a valid JSON Web Token (which makes it easy to do such requests from the CLI as well).
The REST API is powered by express.js and RethinkDB.
The front-end is built using React.js, with Twitter Bootstrap for styles and webpack for assembly.
But that’s all your typical, boring javascript web application stuff.
Now let’s talk about the interesting parts: component creation and testing, as well as pipeline execution and communication.
Components are written by users right in the browser, but must be executed on the server. Which is quite scary if you think about it. Luckily, there are enough tools to help us here.
First of all, I’m using esprima in the browser to parse the component. That helps me figure out whether the component is actually valid, which arguments it needs, and whether the component is a source (dispatches data using the last argument as an Observer), a processor (transforms incoming data) or a renderer (returns a React component).
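A hypothetical sketch of that kind of analysis (not the platform’s actual detection logic):

```js
// hypothetical sketch of analysing a component with esprima;
// not the platform's actual detection logic
import {parse} from 'esprima';

const analyseComponent = (code) => {
    const ast = parse(code, {sourceType: 'module'});
    // find the "export default" declaration, required for all components
    const exported = ast.body.find(node => node.type === 'ExportDefaultDeclaration');
    if (!exported) {
        throw new Error('Component must have a default export');
    }
    // the arguments of the exported function become input fields in the UI
    const args = exported.declaration.params.map(param => param.name);
    // a component whose last argument is the Observer is a source
    const isSource = args[args.length - 1] === 'obs';
    return {args, isSource};
};
```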
Testing and execution of components is done using the aforementioned node.js VM module. A component is compiled into a function and then executed with the given arguments; no magic here.
The interesting things begin once the user starts assembling the pipeline. But before we talk about that, you have to understand what functional reactive programming (FRP) is and how to do it using RxJS. I recommend reading this great piece by André Staltz: The introduction to Reactive Programming you’ve been missing.
Once you’ve grasped it a bit, we can talk about pipelines.
Internally, pipelines consist of three different parts:
- Source component: there can be only one, and it must use its last parameter as an Observer to dispatch values
- Processor components: there can be one or more, and each must return an Observable
- Render component: there can be only one, and it will be replaced with a simple React JSON tree if not given
The way the pipeline is assembled from those components is actually pretty straightforward; here’s some pseudocode that illustrates it:
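Something to this effect, where the saveAndPublish helper is hypothetical:

```js
// simplified sketch of pipeline assembly; the saveAndPublish helper
// is hypothetical, and the real runner also handles errors and persistence
import Rx from 'rx';

const runPipeline = (source, processors, args) => {
    // wrap the source component into an Observable
    let pipeline = Rx.Observable.create(obs => source(...args, obs));
    // chain every processor with flatMap, since each one
    // returns an Observable for every incoming item
    processors.forEach(processor => {
        pipeline = pipeline.flatMap(item => processor(item));
    });
    // subscribing starts the data flow
    return pipeline.subscribe(
        result => saveAndPublish(result),
        err => console.error(err)
    );
};
```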
The tricky bits are actually in the communication between the assembled pipeline runner (which is executed in a forked node process; see the VM package notes on security), the REST API and the front-end.
For that purpose I settled on a publish-subscribe approach using repubsub.js from the RethinkDB guys. That means that all the forked processes are always in contact with RethinkDB and are listening to commands from it. Potentially that means we can move the forked processes to completely different docker containers or even physical servers, as long as they are connected to the same RethinkDB cluster.
Those forked processes also write every processed bit of data into the DB, so we can easily fetch everything from the DB at any moment (e.g. with a JSON request to the pipeline URL).
But then again, this is an alpha, so I expect lots of those bits to change quite a lot over time.
Conclusion
The Exynize platform seems to be shaping up quite nicely. Even though it’s in alpha now, it already lets you do all those data processing tasks at least twice as fast.
There are also lots of plans for the future. I want to allow creating, editing and testing components and pipelines directly from the CLI (because a web-based editor will never be better than the one you use locally). I want to move those forked processes into new docker containers (or VMs?). I want to change that simple “flatMap” into communication over a message bus that’ll allow simple horizontal scaling. And there are many more things in my head.
Update: for people asking about the project going open source: yes, we are planning to open source it at some point in the near future. But first we want to take a bit of time to make it work.
Update 2: Exynize Platform is now on GitHub, with simple deployment options using docker-compose. You can find it here: https://github.com/Exynize/exynize-platform
Let me know what you think about it.
And feel free to request access to the platform at http://alpha.exynize.com/
I’ll be happy to chat!