Processing a large dataset in less than 100 lines of Node.js with async.queue

Sara Robinson
Jun 12, 2017 · 3 min read

Sara is a Developer Advocate on Google’s Cloud Platform team, focusing on big data and machine learning. She helps developers build awesome apps through demos, online content, and events. When she’s not programming she can be found on a spin bike, listening to the Hamilton soundtrack, or finding the best ice cream in New York.

As a developer advocate at Google, one of my favorite things to do is explore new datasets. Node.js is usually my go-to platform for analyzing this data and building apps on top of it. Last week, I was working with a particularly large dataset of images and needed a way to send each image to an image analysis API and write the JSON response to a local file.

The problem? My first version of the script worked fine for ten images and even for hundreds of images. But when I tried to run it on the entire dataset of 200,000 images, I quickly ran up against the call stack limit. Luckily, I sit next to Myles (resident Node.js expert and core team member), who recommended I try the npm package async, not to be confused with the JavaScript language feature of the same name.

If you’re more of a skip-to-the-code person, check out the gist here.

caolan’s async.queue to the rescue

To fix the call stack issue, I needed to manage my API calls by pushing them into a queue where a limited number of them could be processed in parallel. To be completely honest, having never worked with queues in Node.js before, I was slightly intimidated by the thought of rewriting my script from scratch. But once I started queueing, I had something working in a few minutes (seriously) and my queue fears quickly disappeared.

Pushing items to the queue

My image IDs are in a newline-delimited JSON file. First I read this file with readFileSync and convert its contents into a JavaScript object. The object contains a list of image IDs, and in my queue I want to send each image to the Vision API. The queue’s push method takes a task (in this case my list of image IDs; when you pass an array, each element is queued as its own task) and a callback function, called when the worker has finished processing a task:

q.push(imageIds, function (err) {
  if (err) {
    console.log(err);
  }
});
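
For readers who want to see the step before the push, here is a minimal sketch of turning newline-delimited JSON into the kind of array passed to q.push. It is not the code from the gist: the parseNdjson helper, the image_id field name, and the filename in the comment are all assumptions made for illustration.

```javascript
// Hypothetical helper: parse newline-delimited JSON (one object per line).
function parseNdjson(text) {
  return text
    .split('\n')
    .map((line) => line.trim()) // also strips a trailing \r from Windows files
    .filter((line) => line.length > 0) // skip blank lines, e.g. after the last newline
    .map((line) => JSON.parse(line));
}

// Inline sample so the sketch runs on its own. In practice the text would
// come from something like fs.readFileSync('image-ids.json', 'utf8')
// (hypothetical filename), and the image_id field name is assumed too.
const sample = '{"image_id":"a1"}\n{"image_id":"b2"}\n';
const imageIds = parseNdjson(sample);

console.log(imageIds.length + ' image IDs loaded');
```

Each element of the resulting array would then become one task in the queue.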

Defining the queue

The queue takes a function and a concurrency number as parameters. Let’s start with the function: it receives a task (one image ID from above) and a callback, which it calls when it has finished the task. Inside the function is where I do my image processing. This function should produce some JSON about the image, which I want to write to a local JSON file. I’ll define that in the next step.

Concurrency tells the queue the maximum number of workers that may process tasks in parallel. I played with this number until I found a balance of something that wasn’t too slow, but also didn’t result in API limits or call stack errors. The number will vary depending on what you’re doing, so it’s definitely OK to fine-tune it by hand until you find your “magic number.” Here’s my queue:

let q = async.queue(callVision, 20);
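
To make the concurrency number concrete, here is a hand-rolled sketch of the idea with no packages required. It is not how async.queue is implemented, and runWithConcurrency and fakeWorker are hypothetical names. It only shows what "at most N workers at a time" means for a worker with the same (task, callback) shape, assuming the worker calls back asynchronously.

```javascript
// Hypothetical helper: run worker(task, callback) over every task while
// keeping at most `concurrency` calls in flight at once.
function runWithConcurrency(tasks, worker, concurrency, done) {
  const pending = tasks.slice(); // copy so the caller's array is untouched
  const errors = [];
  let inFlight = 0;
  let peak = 0; // highest number of simultaneous workers observed

  function next() {
    // Nothing left to start and nothing still running: report back.
    if (pending.length === 0 && inFlight === 0) {
      return done(errors, peak);
    }
    // Start tasks until the cap is reached or the list is empty.
    while (inFlight < concurrency && pending.length > 0) {
      const task = pending.shift();
      inFlight += 1;
      peak = Math.max(peak, inFlight);
      worker(task, function (err) {
        if (err) errors.push(err);
        inFlight -= 1;
        next(); // a slot freed up, so try to start another task
      });
    }
  }

  next();
}

// Stand-in worker that pretends each task is a short network call.
function fakeWorker(task, callback) {
  setTimeout(callback, 10);
}

const demoTasks = ['img-1', 'img-2', 'img-3', 'img-4', 'img-5', 'img-6'];
runWithConcurrency(demoTasks, fakeWorker, 2, function (errors, peak) {
  console.log('tasks done: ' + demoTasks.length + ', peak in flight: ' + peak);
});
```

Raising the third argument lets more calls overlap, which is the same trade-off between speed and API limits described above.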

Processing images

Last, it’s time to write the callVision() function referenced above. This part isn’t exactly async.queue specific, but it’s still important because it’s the meat of my queue task. Here I’m using Google’s Cloud Vision API for image analysis, and I use the Google Cloud Node.js module to call it. Once I get a JSON response for each image, I create a JSON string of the response to write to a newline delimited JSON file (I’m using this format because it’s what BigQuery expects, which is where I’ll be storing the data eventually). Once this function completes, the data is sent back to the queue where it is written to my local JSON file. You can find all of the callVision() code in the gist.
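
The real callVision() lives in the gist, so the following is only a sketch of the shape such a worker could take. The Vision API request is replaced by a hypothetical analyzeImage stub, and OUTPUT_FILE, writeResult, and the image_id field are likewise assumptions. The sketch also assumes, as the description above suggests, that the worker hands its JSON line to the callback and that the callback does the file write.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Assumption: an illustrative output location, not the path used in the gist.
const OUTPUT_FILE = path.join(os.tmpdir(), 'vision-results.json');

// Hypothetical stand-in for the real Vision API request. It only mimics
// the asynchronous, error-first callback shape of a network call.
function analyzeImage(imageId, callback) {
  setImmediate(() => callback(null, { labels: ['placeholder'] }));
}

// Worker in the (task, callback) shape that async.queue expects.
function callVision(task, callback) {
  analyzeImage(task.image_id, (err, response) => {
    if (err) {
      return callback(err);
    }
    // One JSON object per line is the newline-delimited JSON format.
    const line = JSON.stringify({ image_id: task.image_id, response }) + '\n';
    callback(null, line);
  });
}

// Callback for a finished task: append the line to the local file.
function writeResult(err, line) {
  if (err) {
    return console.log(err);
  }
  fs.appendFileSync(OUTPUT_FILE, line);
}

// Calling the worker directly, the way the queue would for each task.
callVision({ image_id: 'example-id' }, writeResult);
```

Swapping the stub for a real API call should leave the rest unchanged, as long as the worker still calls its callback exactly once per task.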

That’s it! Let me know if you’ve done something interesting with async.queue in the comments, or find me on Twitter @SRobTweets.

Node.js Collection

Community-curated content for the millions of Node.js users.

