Breadth-First and Depth-First Search with Asynchrony

When I first started really digging into advanced Comp Sci topics, graph theory and recursion were two that I felt like I’d never get. Like most things, though, it was just a matter of time and practice before I felt at least a little more comfortable with them.

Today I worked through an especially tough challenge. Here it is:

Write a Node.js program that, given a relative directory path, will return a list of all files in all folders and subfolders inside of that directory.

For example, imagine a directory tree:

root
|--temp
| |-- foo
| | |-- a
| | | +-- Hello.txt
| | | +-- World.txt
| | |-- b
| | | +-- Dude.js
| | |-- c
| | | +-- Sweet.js
| | +-- hey.js
| |-- bar
| | +-- baz.js
| +-- ho.js
+-- dump.js

If we run the command in root:

node dump.js temp

We’ll get an output of all the file names in all the subdirectories of temp:

[
'Users/Me/root/temp/foo/a/Hello.txt',
'Users/Me/root/temp/foo/a/World.txt',
'Users/Me/root/temp/foo/b/Dude.js',
'Users/Me/root/temp/foo/c/Sweet.js',
'Users/Me/root/temp/foo/hey.js',
'Users/Me/root/temp/bar/baz.js',
'Users/Me/root/temp/ho.js',
]

There were some caveats:

  1. Code should work for large directories with many files
  2. Use ‘fs’ module’s non-blocking variants (e.g. readdir instead of readdirSync)
  3. Use of the caolan/async library is allowed
  4. Order of results does not matter

Async queue and BFS

There are two methods that I know of for approaching a walk of the file system: Breadth-First Search (BFS) and Depth-First Search (DFS). BFS visits every node (a file or directory) at each level of the directory before moving down to the next level. DFS, on the other hand, drills down into a single directory until it reaches the base case (a file) before continuing to the next node at the same level. Both traversals eventually visit every node; the difference is the order. BFS has the extra property of finding the shortest path between nodes (in an unweighted graph), while DFS is the natural fit for recursion.
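To make the difference in visit order concrete, here is a minimal sketch (separate from the fs code below) that runs both traversals over a small in-memory tree modeled on the example directory. The `tree`, `bfs`, and `dfs` names are illustrative, not part of any library.

```javascript
// A tiny in-memory stand-in for a directory tree.
const tree = {
  name: 'temp',
  children: [
    { name: 'foo', children: [{ name: 'hey.js', children: [] }] },
    { name: 'bar', children: [{ name: 'baz.js', children: [] }] },
  ],
}

function bfs (root) {
  const order = []
  const queue = [root]              // FIFO: whole level before going deeper
  while (queue.length) {
    const node = queue.shift()
    order.push(node.name)
    queue.push(...node.children)
  }
  return order
}

function dfs (root) {
  const order = []
  function recurse (node) {         // drill all the way down first
    order.push(node.name)
    node.children.forEach(recurse)
  }
  recurse(root)
  return order
}

console.log(bfs(tree))   // [ 'temp', 'foo', 'bar', 'hey.js', 'baz.js' ]
console.log(dfs(tree))   // [ 'temp', 'foo', 'hey.js', 'bar', 'baz.js' ]
```

Same nodes either way; only the order changes.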

I came across a tricky snag when trying to implement DFS, and it led me to stumble upon a BFS solution first. Here it is:

const path = require('path')
const fs = require('fs')
const async = require('async')
require('colors')

const basepath = process.cwd()
const filepath = process.argv[2]

let log = []

// create a queue object with concurrency 1
let q = async.queue(function (task, callback) {
  let node = path.join(task.basepath, task.filepath)
  console.log('Processing node: ', node)
  fs.stat(node, (err, stat) => {
    if (err) return callback(err)
    let message = stat.isFile()
      ? 'FILE! Adding to the log'.blue
      : 'DIRECTORY! Adding to the queue'.red
    console.log('It is a: ' + message)
    if (stat.isDirectory()) {
      fs.readdir(node, (err, items) => {
        if (err) return callback(err)
        q.push(items.map(item => {
          return { basepath: node, filepath: item }
        }), function (err) {
          if (err) return console.log('There was an error processing an item in:', node, err)
          console.log('Finished processing an item in: ', node + '\n')
        })
        callback()
      })
    } else {
      log.push(node)
      callback()
    }
  })
}, 1)

// assign a callback for when the queue empties
// (drain receives no arguments; per-task errors surface in the push callbacks)
q.drain = function () {
  console.log('All nodes have been processed. Files: \n', log)
}

// Start the search by pushing the first directory into the queue
q.push({ basepath: basepath, filepath: filepath }, function (err) {
  if (err) {
    console.log('There was an error starting up:', err)
    return
  }
  console.log('Finished processing of start node: ', filepath + '\n')
})

There’s quite a bit going on here so let’s break it down. If you’re following along, make sure you npm install the dependencies at the top.

1. Create an async queue object

This is where all the magic happens. BFS is not recursive. It’s implemented by pushing (enqueueing) nodes into the queue for processing, and then removing (dequeueing) them as they are processed. Async queue here helps us by defining when the processing of a node is complete.

The task that we pass to queue for processing contains a base path and the current filepath. For each node that we process, we’ll determine whether it’s a file or a directory with fs.stat. If it’s a file, we’ll go ahead and push the filepath string into our log. If it’s a directory, we need to read it and enqueue the contents for later processing. Finally, no matter what operation occurred, we’ll tell the queue that we’re finished processing the node by invoking the callback, which will dequeue the task.

One interesting thing to note is that we’re calling q.push with an array of new nodes. It’s effectively the same as enqueueing each node individually, but async queue provides a convenient interface here.

2. Make the final callback function

Once all nodes have been processed and the queue is empty, async queue automatically invokes the drain method. When this function fires, we know that all of our nodes and their children have been processed, and we can return the log of files.

3. Kick start the process with the first enqueue

To begin processing our nodes, we’ll pass the argument to dump.js as the first node/task for async.queue to process.


Asynchronous recursion and DFS

While this solved the challenge, I really wanted to figure out the DFS approach. First things first though — implement DFS with synchronous recursion.

// Synchronous
const path = require('path')
const fs = require('fs')

const basepath = process.cwd()
const filepath = process.argv[2]

function printFiles (basepath, filepath) {
  let results = []
  function recurse (base, file) {
    let node = path.join(base, file)
    let stat = fs.statSync(node)
    if (stat.isDirectory()) {
      let items = fs.readdirSync(node)
      items.forEach(item => recurse(node, item))
    } else {
      results.push(node)
    }
  }
  recurse(basepath, filepath)
  return results
}

console.log(printFiles(basepath, filepath))

Pretty straightforward with the synchronous fs methods (statSync, readdirSync). Our base case is a node that is a file. Otherwise, we recurse over the child nodes in the directory. Finally, we return the results array. Since all operations are synchronous, no callbacks are needed.

Once I had this working, I figured it’d just be a matter of replacing synchronous methods with async ones and adding callba… wait, where does the callback go exactly??

// ASYNC
function printFiles (basepath, filepath, cb) {
  let results = []
  function recurse (base, file) {
    let node = path.join(base, file)
    fs.stat(node, (err, stat) => {
      if (stat.isDirectory()) {
        fs.readdir(node, (err, items) => {
          items.forEach(item => recurse(node, item))
        })
      } else {
        results.push(node)
      }
    })
  }
  recurse(basepath, filepath)
  // ...but where does cb(results) go??
}

printFiles(basepath, filepath, function (results) {
  console.log(results)
})

If you can figure out where it should go, please let me know. This took a while and a lot of frustrating console.logs to even see what was going on. Basically, on each call to recurse, we should be adding another call to the stack. The problem is that these asynchronous calls never return anything immediately. In our synchronous version it doesn't matter: undefined just gets swallowed up and we only ever push truthy filenames into our log. Here, though, there's no way to know when our recursion has completed and it's safe to return the log.
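The "nothing returns immediately" problem can be seen in isolation with a two-line experiment. `statLater` is a made-up stand-in for any callback-based fs call, not a real API:

```javascript
// Hypothetical stand-in for fs.stat: delivers its result on a later tick.
function statLater (cb) {
  setImmediate(() => cb('some stat'))
}

const returned = statLater(result => {
  // `result` only ever arrives here, asynchronously
})

console.log(returned)   // undefined: nothing came back synchronously
```

The caller gets undefined back every time; the value exists only inside the callback, which is exactly why the recursive version has no obvious place to `return results`.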

Or is there…

Counters to the rescue! (kind of)

Well it turns out there IS a way to know when recursion has completed with asynchronous operations. If we count the number of calls to recurse, and each time we complete a recursive call decrement the counter, when the counter reaches 0, we’re done.

// ASYNC
function printFiles (basepath, filepath, cb) {
  let count = 0
  let results = []
  function checkDone () {
    if (count === 0) cb(results)
  }
  function recurse (base, file) {
    count++
    let node = path.join(base, file)
    fs.stat(node, (err, stat) => {
      if (stat.isDirectory()) {
        fs.readdir(node, (err, items) => {
          items.forEach(item => recurse(node, item))
          count--
          checkDone()
        })
      } else {
        results.push(node)
        count--
        checkDone()
      }
    })
  }
  recurse(basepath, filepath)
}

printFiles(basepath, filepath, function (results) {
  console.log(results)
})

Now on each call to recurse, we increment the counter. Every time an operation has completed, we decrement the counter and see if it’s reached zero. When it does, we know that we have called recurse as many times as we needed and can safely return the log to the callback.
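To isolate the counting idea from the file system, here is a sketch of the same pattern on a tiny in-memory tree. `fakeStat` is a made-up stand-in that invokes its callback synchronously so the result is observable immediately; with the real fs.stat the callbacks fire later, but the counting logic is identical.

```javascript
function collectLeaves (tree, done) {
  let count = 0
  const results = []
  function checkDone () {
    if (count === 0) done(results)
  }
  function recurse (node) {
    count++                              // one pending operation per node
    fakeStat(node, kind => {
      if (kind === 'dir') {
        node.children.forEach(recurse)   // children increment before we decrement
      } else {
        results.push(node.name)
      }
      count--                            // this node's operation has completed
      checkDone()
    })
  }
  recurse(tree)
}

// Hypothetical stand-in for fs.stat: reports 'dir' or 'file'.
function fakeStat (node, cb) {
  cb(node.children ? 'dir' : 'file')
}

let leaves
collectLeaves(
  {
    name: 'root',
    children: [
      { name: 'a.txt' },
      { name: 'b', children: [{ name: 'c.js' }] },
    ],
  },
  result => { leaves = result }
)
console.log(leaves)   // [ 'a.txt', 'c.js' ]
```

The crucial ordering detail: each child increments the count before its parent decrements, so the count can never falsely hit zero while work remains.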

Wait a minute… this looks really familiar…

Yup. Processing nodes, incrementing a count for each pending call, notifying the calling function when we're done, and flushing the results once all nodes have been processed: it's the same pattern as our async queue. In fact, when we run it, we get the results in the order we'd expect from a BFS.


Final Thoughts

So I still don’t know how to implement DFS on a file directory using the non-blocking fs methods. My attempt to do so ended up producing another implementation of async queue’s BFS. I did learn a ton about recursion, fs methods, queues, trees, and asynchrony though!

If you have a solution for the asynchronous recursive DFS problem please let me know! Also, if you have any other comments or questions leave them below and I’ll do my best to answer them!

Written by Patrick Shaughnessy

Full Stack JavaScript Developer. Student of life.
