Grab your hardhat!

Data streaming, foreman and crew members with Node.js, Angular and Redis.

Kyle
Jan 19, 2017 · 7 min read

This is part 28 of my Node / Redis series. The previous part was I’m mod for neural networks (Chapter 2).

Let’s say you are constructing a building. You hire a company to help you. The first to arrive at the site is a foreman. The foreman’s role on a construction site is to make sure that the rank-and-file crew members have something to do and what they are doing contributes to the final goal (a building). The foreman also manages the materials as they arrive. Things like lumber, plywood, wallboard, beams — you get the idea. The foreman can’t assign a crew member to put up the wallboard if it hasn’t arrived yet. So, at this point, the foreman is managing workers (a resource) and materials (an input).

Crew members need to keep the foreman informed of how much capacity they have, and the foreman must ensure that the work is distributed fairly among the workers. Let’s say the foreman walks around and checks in with the crew members as they are doing their jobs. Distributing the work evenly makes handling new tasks easier and moves the work faster, since you are utilizing more hands. The crew members tell the foreman when they have finished a job so the foreman can mark it off the list.

The foreman needs to be flexible too, because not all is perfect on our construction site. Crew members can be unavailable, for good or bad reasons:

  • Lunch break,
  • Diverted to another job,
  • Injury.

Let’s introduce another role: the manager. The manager can monitor the work site to make sure that the foreman is distributing work properly and that there are enough crew members. We don’t want to add reporting to the manager to the foreman’s workload, so maybe the manager walks around too and listens to the reports that crew members give the foreman.

As you might have guessed, I’m not randomly writing an article about a fictitious, abstracted construction site, but rather I’m writing about Redis. The construction site is a metaphor for a project I’m working on that involves real-time processing of tweets. In our metaphor, tweets represent the materials coming into the work site. The foreman is a Node.js management process and the crew members are distributed worker processes. The manager is actually a human-monitored dashboard.

This project uses pub/sub in Redis to stream the data back and forth between the foreman and the crew members. I’m using the CPU load of the crew member processes to determine where the work should be sent.

The Foreman

The foreman is a headless process that takes the metaphorical materials (tweets that contain a particular keyword or hashtag) and doles them out to the resources (the crew member processes).

Publishes

The foreman handles the incoming tweets from the Twitter API and publishes them to specific crew members. Each tweet goes out on a channel that looks like tweet:processname, and it is published to the crew member process that currently has the least CPU load (see below).
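As a rough sketch of that routing step (the function and variable names here are mine, not necessarily the repo’s), assuming a node_redis-style client with a publish method:

```javascript
// Sketch of the foreman's routing step (names are illustrative,
// not taken from the repo). `crew` maps process names to their
// last-reported CPU load.
function pickLeastBusy(crew) {
  const names = Object.keys(crew);
  if (names.length === 0) { return null; } // no crew members available
  names.sort((a, b) => crew[a] - crew[b]); // least CPU load first
  return names[0];
}

// With a node_redis-style client, the foreman would then publish the
// raw tweet JSON to a channel named after the chosen crew member:
function routeTweet(client, crew, tweetJson) {
  const target = pickLeastBusy(crew);
  if (target !== null) {
    client.publish("tweet:" + target, tweetJson);
  }
  return target;
}
```

Keeping the selection logic in a pure function makes it easy to swap in a different metric later.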

Subscribes

Since the Foreman needs to know the load of each crew member, it subscribes to a pattern (PSUBSCRIBE) that looks like load:processname (“load:*”). The payload of this message is the CPU load of the process. The Foreman sorts the list of available crew members from least to most busy, and internally notes the timestamp of each load report.
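A minimal sketch of that bookkeeping, with illustrative names (the repo’s internals may differ):

```javascript
// Sketch of how the foreman could track load reports (illustrative
// names). Each report updates the crew member's load and the time
// it was last heard from.
function recordLoad(registry, channel, payload, now) {
  const processName = channel.slice("load:".length); // "load:worker1" -> "worker1"
  registry[processName] = { load: parseFloat(payload), lastSeen: now };
  return registry;
}

// Wiring this up with a node_redis-style subscriber would look
// roughly like:
//   sub.psubscribe("load:*");
//   sub.on("pmessage", (pattern, channel, message) => {
//     recordLoad(registry, channel, message, Date.now());
//   });
```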

The Foreman also subscribes to a pattern that indicates completion, response:processname (“response:*”).

Periodically, the foreman cleans out the list of available crew members that haven’t reported back in a certain period of time.
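That clean-up could be sketched as a pure function driven by a timer (names are illustrative):

```javascript
// Sketch of the periodic clean-up (illustrative names). Any crew
// member whose last load report is older than the threshold is
// dropped from the registry, so no new tweets are routed to it.
function pruneStale(registry, now, inactiveThreshold) {
  Object.keys(registry).forEach((name) => {
    if (now - registry[name].lastSeen > inactiveThreshold) {
      delete registry[name];
    }
  });
  return registry;
}

// In the foreman this would run on a timer, e.g.:
//   setInterval(() => pruneStale(registry, Date.now(), 500), 500);
```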

The Crew Members

Each crew member has a unique process name that is set via the command line arguments. You can run multiple crew member processes on one machine or run them on distinct hardware in remote locations.

For our example, I’m simulating the load by reporting the actual CPU load plus a modifier that is increased every time the crew member starts to “process” a tweet. That modifier is reduced after a fixed amount of time (using setTimeout).
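A rough sketch of that simulation (the numbers here are made up; the repo’s actual values differ):

```javascript
// Sketch of the simulated load (illustrative, made-up numbers).
// Each tweet bumps a modifier that is added to the real CPU
// reading; a timer later backs the modifier off again.
let loadModifier = 0;

function startProcessingTweet(workMs) {
  loadModifier += 10; // pretend each tweet costs 10 "percent"
  setTimeout(() => { loadModifier -= 10; }, workMs);
}

function reportedLoad(actualCpuLoad) {
  return actualCpuLoad + loadModifier;
}
```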

Publishes

Crew members read their own PID’s CPU load every few milliseconds (setInterval) and publish it to load:processname, with the payload being the percentage of CPU used.

When they are done processing a tweet, they publish a message to response:processname. The payload is the processed value.
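The two outgoing message types could be sketched like this (names are illustrative; readCpuLoad and client below are stand-ins for the real CPU reader and Redis client, not functions from the repo):

```javascript
// Sketch of the crew member's outgoing messages (illustrative
// names). Load reports go to load:<processname>; finished work
// goes to response:<processname>.
function loadMessage(processName, cpuPercent) {
  return { channel: "load:" + processName, payload: String(cpuPercent) };
}

function responseMessage(processName, processedValue) {
  return { channel: "response:" + processName, payload: processedValue };
}

// On a timer, a real crew member would do something like:
//   setInterval(() => {
//     const { channel, payload } = loadMessage(name, readCpuLoad());
//     client.publish(channel, payload);
//   }, loadFreqMs);
```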

Subscribes

Crew members subscribe to the pattern that looks like tweet:processname. This triggers the crew member to start processing a tweet.

The Manager

The manager is a process that only subscribes. It converts the messages flying back and forth into something a human can understand: a flow of content and an illustrative graph. The graph displays the CPU load of the available crew members (by subscribing to load:* with PSUBSCRIBE), incoming tweets (tweet:*) and completed processing (response:*). This illustrates the beauty of the pub/sub system: instead of using some other client/server messaging (point-to-point, for example), we can effectively listen in on these messages as they fly between the foreman and crew members.

These messages are proxied to a Node.js event emitter. The events are listened to by any number of web socket connections (which are inherently stateful). Having an intermediary event emitter in the application-level code gives us a couple of advantages:

  1. We don’t need to worry about the complexity and congestion of subscribing and unsubscribing at the Redis level.
  2. As web socket connections come and go, we always have a consistent interface for messaging.

On the client side, we’re using Angular and a web socket connection. The connection listens for messages and converts the CPU data into a format that angular-chart/chart.js can easily render. For all the messages, we’re slicing the arrays to only display the last n items, letting the client keep the volume of data manageable.
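The trimming itself is tiny; a sketch:

```javascript
// Sketch of the client-side trimming: keep only the most recent
// n data points so the chart stays light.
function lastN(items, n) {
  return items.slice(-n);
}
```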

Putting it all together

First, grab the repo on GitHub. To get this demo up and running, you’ll need several terminal windows or something like PM2, as you’ll need to keep everything running at the same time.

Start by running the foreman:

$ node foreman.node.js --connection ../path/to/your/connection.json --twittercreds ../path/to/your/twitterapicreds.json --terms terms,separated,by,commas

This example requires access to the Twitter Streaming API, which means credentials. You can sign up for a free developer’s account.

Now, start the manager:

$ node manager.node.js --connection ../path/to/your/connection.json

Then launch a crew member:

$ node crewmember.node.js --processname someonewordstringtoidentify --connection ../path/to/your/connection.json --loadfreq 300

The loadfreq can be any number of milliseconds less than the inactiveThreshold in foreman.node.js (500 by default).

These scripts really shine when you load multiple crew member processes, so load a few instances of crewmember.node.js (each with a different processname).

Once all that is done, you’ll end up with a web server at http://localhost:8859/ . You can use the buttons at the top to change the Y-axis scale. You should end up with something like this:

This was captured using the term “merylstreep” the day after the Golden Globes speech. I picked it because it was trending, so apologies for any of the content; that’s unfiltered Twitter in the screenshot. Anything with high volume on Twitter makes a good demo for this script.

But what about…

This is not a panacea. This implementation glosses over a few things.

Dropped messages
Redis’ pub/sub system does not guarantee delivery. In several scenarios it’s possible for a published message to be dropped:

  • The process dies or is killed while processing a message.
  • The process is assigned a message after it has been killed, but before the foreman notices it has stopped reporting CPU load.
  • The network drops out after a message is published, but before the foreman notices the missing load reports.

In my circumstance, perfect completion isn’t important. If you’re doing something similar, there are many queuing systems with very carefully designed processes to avoid the issues above; re-inventing the wheel is never a good idea.

Blocking and CPU bound tasks
It’s actually fairly unlikely that you’ll get a high-CPU report in this scenario. If Node.js is available to send out the CPU load, then it isn’t being blocked, so the CPU load value will be low. I’m using CPU load more as an example; in a single-threaded Node.js process you’re more likely to run into high memory usage than high CPU, so that might be a more relevant metric. You could, however, use any metric you want (open files, fan speed, whatever). That bit is less important; this is more an example of a topology that spreads the load out to accommodate real-time processing.

Self management
Our example has a manager who is an actual human being watching a graph, but you could sub in a heuristic or algorithm that creates and decommissions processes as needed. Or both. Where you take it is really up to you.

Why go to all this trouble?

The particular problem I was presented with is how to effectively manage processing something in real-time. With data like tweets, you can go from a trickle to a tidal wave in a small span of time. Say you were monitoring a keyword that is normally mundane and slow-moving; “envelopes” comes to mind. Under normal circumstances, this might get a tweet every few days. Now say there is a political scandal involving envelopes: suddenly the tweet volume goes crazy. If you had a fixed infrastructure of processing scripts, you would either over-provision resources (and thus money) or have long queues that take forever to process through during the flood of tweets.

This topology allows you to add or re-arrange your available resources when you need them and shrink back down once the need has passed. Heck, if you’re using a cloud-based Redis provider (like RedisLabs, where you can get a free 30 MB instance), you can even dynamically switch on the processing power of your own laptop, a cluster of Raspberry Pis, a closet full of PS3s running Linux… anything that can run Node.js and has access to the internet.
