How to Visualize a Real Time Data Stream: What Miley Cyrus Can Teach Us
At Conductor, we collect a lot of data from and about the web, and it won't come as a shock to you that sometimes we may even want to see that data. However, when your web crawlers are generating gigabytes of data per minute, it's hard to create visualizations of the entire data set they produce, let alone get insight into the individual objects that are flowing through our systems at any given time.

I was thinking a lot about this problem one day and wondered if I could come up with a solution. Specifically, I was interested in which one of our customers' websites we had just crawled as part of our continuous process of generating recommendations to help them improve their visibility in organic search. I wanted to have a simple visual display of the stream of URLs fetched by our crawlers that would make it easy for someone to see what the system was up to at any given time. Beyond being visually interesting, it would enable them to determine that our backend systems were functioning and producing useful pieces of data. So I decided to write it.

I spend my days at Conductor building data processing pipelines in the quiet, familiar, typesafe world of Java, but I created this tool using Node.js and some related browser-based technologies. I'll walk you through the process of building it in this post, with a few instructive Miley Cyrus GIFs as an added bonus.
Here’s the problem: We want to see fresh data as soon as it’s available, but we don’t know when that’s going to be. Traditionally, when a user of a web application wants to get updated data, they need to request it explicitly (for example, by refreshing the page). A page refresh causes a user’s browser to discard the already-rendered page and make a new request to the web server. In turn, the web server needs to check an underlying data source for new information and return it to the client.
The downside of this approach is that it makes the user responsible for requesting new data, and it unnecessarily ties load on the server to user behavior rather than the volume of updates.

A more scalable pattern for frequently-updated data is instead to have clients navigate to an application once and then have the server push updates to the clients as they arrive. This requires that each client create a persistent connection to the web server through which it receives updates, but the user doesn't have to do anything else, which is great if it's difficult to predict the frequency of updates. (Can't trust people, amirite?)
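To make the contrast concrete, the pull model looks roughly like this on the client. This is a hypothetical sketch: the /api/latest endpoint, the five-second interval, and the use of jQuery (which we'll also rely on later) are all invented purely for illustration.

// The "pull" approach: the browser asks over and over,
// whether or not anything has actually changed.
setInterval(function () {
  $.getJSON('/api/latest', function (data) {
    $('#newData').text(data.url);
  });
}, 5000);

Every client fires a request every five seconds even when nothing new has arrived; the push model we're about to build inverts that, so the work the server does scales with the volume of updates instead.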
So what are the ingredients for building this type of streaming web application? First of all, we need to figure out how our application's going to receive updates from our crawlers when they have a new piece of data ready. We could modify them to send data to, say, an internal web server on a POST endpoint, but that would require the crawlers to write the same data several times, likely decreasing their overall throughput.

As I mentioned above, our web crawlers already output a stream of collected HTML pages. They do so by writing to Apache Kafka, a distributed, persistent message queue. Kafka supports multiple concurrent writers, as well as multiple groups of readers that maintain their own offsets within the queue (which Kafka calls a 'topic'). This enables us to build applications that consume data from a topic at their own pace without disrupting access from other groups of readers. It's a great technology for creating asynchronous data stream processors, which is exactly what we're going to do here.

In order to read that Kafka stream, though, we're going to need some technology that breaks the rules of traditional web applications. First we're going to use…
Blending Node.js and Socket.IO to publish a data stream
…Node.js. I know, right? Gross. You've probably heard of Node.js as a way of building server-side applications in JavaScript, but it's also got a very efficient asynchronous execution model that allows it to move large volumes of data to large numbers of connected clients at very low latencies — perfect for broadcasting a data stream to a group of connected browsers.

The other technology I chose to use was Socket.IO, a JavaScript library that provides a consistent API for creating persistent, browser-to-server socket connections across a wide variety of browsers and transports (including Flash and WebSockets). Socket.IO integrates well with Node.js, making it easy to listen for events that arrive at the browser on a socket attached to a Node.js endpoint — without refreshing the page.

Like I said earlier, I write Java code. So why did I pick this stack? Because, as Miley might say, this is our house, this is our rules. Or, to put another spin on it, because both of these technologies are well-suited to building a pipe that allows messages to flow all the way from our crawlers to the browser on my desktop.

Hopefully, you can see how the ingredients for this technology panini are coming together now. When our Java-based page fetchers finish collecting a page and queue it to Kafka, a Node.js server listening on the topic will pick up the message and broadcast it over Socket.IO to all connected clients so they can see the URL we just crawled. There you have it: real time streaming data with Node.js, Socket.IO, and Kafka.
Let’s go look at the code.
Creating a lean, mean Node.js streaming machine
The first thing we need to do is create the skeleton of our server application in Node.js. Don't worry; it's going to be super simple, thanks to Express, a nice web application development framework for Node.js that provides HTTP request routing so that we can create a RESTful endpoint for our application. Our clients will hit this endpoint and then… do nothing else. At all. Ever. In the meantime, Socket.IO will bring them all the data they need.

Our app.js file (which defines our main application entry point) looks like the following:
var express = require('express');
var http = require('http');
var sio = require('socket.io');
var routes = require('./routes'); // routes/index.js, shown below

var app = express();
app.set('port', process.env.PORT || 3000); // same port the client connects to later

// Our one and only route: serve up the index page.
app.get('/', routes.index);

var server = http.createServer(app);
server.listen(app.get('port'), function() {
  console.log('Express server listening on port ' + app.get('port'));
});

// Attach Socket.IO to the same HTTP server.
var io = sio.listen(server);
This creates our index route, starts the server, and listens for new Socket.IO connections. All we need our index route to do is serve up our index page to the client. So in our routes/index.js file we’ll have:
exports.index = function(req, res) {
  res.render('index', {
    title: 'Super awesome streaming data app #itsthebest'
  });
};
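One detail the skeleton glosses over: res.render assumes the Express app has a views directory and a view engine configured. A minimal version of that setup, which would sit near the top of app.js, might look like the following (Jade is just an assumption on my part; any Express-compatible template engine works the same way):

// View configuration for app.js. Jade is an assumption here;
// swap in whatever template engine you prefer.
app.set('views', __dirname + '/views');
app.set('view engine', 'jade');

The index template itself only needs to render the title, pull in jQuery and the Socket.IO client script (Socket.IO serves its client at /socket.io/socket.io.js once it's attached to our server), and include an empty element with the id newData for the code in the next section to write into.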
Delivering the data to the client
Now we have a place for browsers to go to get the index page. In addition to a basic HTML layout, this page will include some JavaScript to process new messages that arrive over a Socket.IO connection from our server. Let’s take a look at what we need to initiate that connection and handle the messages.
(function($) {
  var socket = io.connect('http://localhost:3000', {
    transports: ['websocket', 'flashsocket']
  });

  socket.on('newData', function(data) {
    $('#newData').text(data.url);
  });
})(jQuery);
All this JavaScript does is establish a connection back to our Node.js server and register a listener for the newData event. When it receives this event, it’ll update the contents of the newData div on the index page with the text of the URL we just processed.
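If you'd like the page to feel more like a stream than a single flickering line, here's a small variation on that handler. It's purely illustrative, and it assumes #newData is an empty ul element rather than a div, so it can keep a rolling window of the most recent URLs:

socket.on('newData', function(data) {
  // Prepend the newest URL and keep only the 20 most recent entries.
  $('<li>').text(data.url).prependTo('#newData');
  $('#newData li').slice(20).remove();
});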
Feeding the data stream from Kafka
Now we have a channel through which we can push data from the server to the browser. So what about that other part — having our Node.js server listen on a Kafka queue for new messages? Luckily, there is a Node.js library called franz-kafka that offers access to Kafka topics. (Get it?) It's one of several Kafka-Node.js integrations out there, but franz-kafka's API exposes the topic as a Node.js Stream object, which allows us to propagate events to clients without doing explicit polling in application code.

First off, we need to create our Kafka configuration:
var Kafka = require('franz-kafka');

var kafka = new Kafka({
  brokers: [
    { id: 1, host: 'kafka-01', port: 9092 },
    { id: 2, host: 'kafka-02', port: 9092 },
    { id: 3, host: 'kafka-03', port: 9092 }
  ],
  compression: 'gzip',
  logger: console,
  batchSize: 1
});
This sets our Kafka broker hostnames, the compression codec, and a few consumer tuning options. Now we need to connect to our Kafka cluster and start consuming data from the topic:
kafka.connect(function () {
  var topic = kafka.topic('collected_data_topic', {
    compression: 'gzip',
    minFetchDelay: 4000,
    maxFetchDelay: 8000,
    maxFetchSize: 5 * 1024 * 1024,
    partitions: { consume: offsets } // offsets: wherever this consumer left off (see below)
  });

  topic.on('data', sendData);
  topic.on('error', function (err) { console.error('STAY CALM', err); });
  topic.resume();
});
This piece of code establishes the connections to the Kafka cluster and then creates a new consumer listening to the topic called collected_data_topic. One thing it doesn't do is tell Kafka where in the topic to start reading. franz-kafka exposes Kafka's so-called "Simple" consumer API, which means that you have to manage the current message offsets of your consumer group yourself. How best to handle this for your application is something I'm going to leave for you, brave pioneer of real time applications, to figure out. There are many ways to do it: you can use ZooKeeper, Redis, local files, etc. (I'll sketch the local-file option at the end of this section.)

When we have a new data message on the topic, the data event will be fired, at which point we'll call the sendData function.
function sendData(data) {
  // io is the Socket.IO server we attached to our HTTP server in app.js.
  io.sockets.emit('newData', data);
}
All this function does is publish the newData event to all clients connected via Socket.IO with the data attached to the message. That should be pretty much all the pieces we need. Clients connect to our Node.js server via Socket.IO, the server listens for new data on Kafka, and when it receives updates, it broadcasts them to all connected clients. None of those clients will ever need to click ‘Refresh’.
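Before we wrap up, one loose end: the offset management I cheerfully left to you a moment ago. As a purely illustrative sketch of the local-file option, you could persist the consumer's position in a small JSON file between runs. The file path, the shape of the offsets object that franz-kafka expects in partitions.consume, and how you read the latest offsets back out of the library are all assumptions here, so check the franz-kafka docs for the specifics.

var fs = require('fs');

// Hypothetical location for the saved offsets.
var OFFSET_FILE = __dirname + '/offsets.json';

// Load previously saved offsets, falling back to an empty object on the first run.
function loadOffsets() {
  try {
    return JSON.parse(fs.readFileSync(OFFSET_FILE, 'utf8'));
  } catch (e) {
    return {};
  }
}

// Persist the current offsets so a restart picks up where we left off.
function saveOffsets(offsets) {
  fs.writeFileSync(OFFSET_FILE, JSON.stringify(offsets));
}

// At startup, feed the saved offsets into the topic configuration
// (the offsets variable used in partitions: { consume: offsets } above),
// and call saveOffsets() periodically as messages are processed.
var offsets = loadOffsets();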
We built our data stream: Go eat some chocolate chip waffles
This is a simple, fun example, but it demonstrates how to use these technologies to create real time applications driven by a continuous data stream. I used this technique to create a visualization of an internal system, but it could easily be applied to a reporting application like Searchlight to show users changes to their data in real time.