How to build a sitemap with a node.js crawler and D3.js (Part 1/2)

When pitching for a new website job or when a company wants to relaunch their website, one of the most important parts of the new concept is the information architecture. A sitemap of the existing site is therefore very helpful to get a complete overview. But instead of extracting all links and hierarchies manually, we can use NodeJS to crawl the site for us.

Babette Landmesser
May 31 · 6 min read

In this part of a two-part series, we’re going to build a NodeJS crawler that saves all internal links found on a webpage into a JSON file.

We want to start the crawler with a command in our terminal, passing the domain as an argument. This makes the crawler usable for any website we want. Next, it should only crawl pages it has not seen before; otherwise it would run forever.

Setup

In your terminal, create a new project folder and navigate into it.

$ mkdir crawler && cd crawler

Run npm init and fill in your values. The entry point index.js is perfect for our case.

Next, install the main dependencies:

$ npm i --save cheerio request

cheerio and request are the only packages the crawler needs; fs ships with Node, so it doesn’t have to be installed.

  • cheerio will help us read the DOM and extract information.
  • request will open the website.
  • fs will store the URLs in a JSON file.

Now, add the following lines to your index.js

// load all our dependencies
const cheerio = require('cheerio');
const request = require('request');
const fs = require('fs');
// get the domain name from our command: $ node . mydomain.com
const domain = process.argv[2];
// initialize the local variables
const crawledPages = [];
const crawledPagesData = [];
let foundPages = [];
let index = 0;

In crawledPages we’re going to push all URLs that the crawler already visited. This will later make sure it only visits every page once. crawledPagesData will represent the JSON which the script will put into the JSON file at the very end. foundPages will hold all URLs that the crawler finds. And finally the index will help to loop through foundPages.

Crawl Function

Add a new function to your index.js

const crawl = async () => {
  // here goes the magic
}

As the starting point for the crawler is the domain given via the arguments, we set the first foundPages entry to domain + '/':

// if it's the first start
if (index === 0) {
  // use / as the first page
  foundPages.push(domain + '/');
}

Next, the page to crawl is the indexed entry of foundPages. As the index is still 0 and we just added the first element to the array, it’s going to crawl the front page first.
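
The lookup itself is a single line, which the snippets here don’t spell out. A minimal sketch with the names from the setup section (self-contained with example values standing in for real crawler state):

```javascript
// sketch: foundPages and index come from the setup section;
// example values stand in for real crawler state
const foundPages = ['example.com/'];
let index = 0;

// the page to crawl is simply the foundPages entry at the current index
const pageToCrawl = foundPages[index];
console.log(pageToCrawl); // example.com/
```

In index.js itself, foundPages and index are already declared at the top, so only the pageToCrawl line goes inside the crawl function.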

Before we start crawling we need to check if this page was already crawled before.

// if pageToCrawl is not yet in the list of crawledPages
if (crawledPages.indexOf(pageToCrawl) === -1) {
  if (pageToCrawl) {
    new Promise(resolve => {
      // visit the page
      visitPage(pageToCrawl, resolve);
    }).then(function() {
      index++;
      crawl();
    });
  } else {
    process.nextTick(crawl);
  }
} else {
  // go to the next crawl
  process.nextTick(crawl);
}

Why do I work with promises? The answer is simple: with promises I can ensure that asynchronous calls have finished before the crawler continues. In particular, the request to the URL is asynchronous, so the crawler should wait for it to complete.

process.nextTick(callbackFunc) is needed to avoid blowing the call stack with deep synchronous recursion. You can try the code without waiting for the nextTick, but for larger sites it may not work.
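
What that deferral means can be seen in a tiny self-contained demo (the order array is just for illustration, not part of the crawler):

```javascript
// demo: process.nextTick defers a callback until the current
// synchronous code has finished running
const order = [];
order.push('before');
process.nextTick(() => order.push('deferred'));
order.push('after');

// this second nextTick runs after the first one, so it sees the full order
process.nextTick(() => console.log(order.join(' -> ')));
// before -> after -> deferred
```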

Visit the page

Create a new function to actually visit the page. You could also put all the code inline, but I’d recommend separating the crawling from the actual visit and DOM extraction.

const visitPage = (url, callback) => {
  // requesting the page and extracting the links
}

Here, we’re going to actually visit the page by requesting it with the request package, whose callback receives error, response and body. We need to check if the page is available and the request returns status code 200. If not, we restart the crawler with the next index.

// Check status code (200 is HTTP OK)
if (!response || response.statusCode !== 200) {
  process.nextTick(callback);
  return;
}
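
To see the guard in isolation, here is a hedged, self-contained sketch with plain objects standing in for request’s response (shouldSkip is a hypothetical helper name, not part of the article’s code):

```javascript
// hypothetical helper mirroring the guard in visitPage:
// skip a page when there is no response or the status is not 200
const shouldSkip = (response) => !response || response.statusCode !== 200;

console.log(shouldSkip(null));                // true: request failed entirely
console.log(shouldSkip({ statusCode: 404 })); // true: page not found
console.log(shouldSkip({ statusCode: 200 })); // false: OK, crawl it
```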

If the response is 200, we add the URL to crawledPages and an object with page information to crawledPagesData.

// Add URL to crawled pages
crawledPages.push(url);
crawledPagesData.push({
  name: url
});

I use an object for crawledPagesData here that only holds the URL for now. But in case you’d like to extend this crawler, for example by also storing the page title, this would be the easiest place to add it. Also, be aware that we reuse this structure in part 2 of this series.

Next, cheerio can load the body so we can collect all internal links. Cheerio works similarly to jQuery, which is why $ is mostly used as its constant name.

// Parse the document body
const $ = cheerio.load(body);
// collect all links
collectInternalLinks($, domain, foundPages).then(
  (newFoundPages) => {
    foundPages = newFoundPages;
    callback();
  });

Again, I am using a dedicated function for that, which returns the adjusted foundPages array with the new entries already added. As this code snippet is part of the visitPage function, it calls the callback function afterwards to signal the end of this process.

Collect Internal links

First, create a new function that accepts cheerio, the domain and the foundPages array as parameters. We expect it to return a promise that resolves with the foundPages array.

const collectInternalLinks = ($, domain, foundPages) => {
  // collecting starts
  return new Promise(resolve => {
    resolve(foundPages);
  });
}

So, how do we find all internal links? Well, the function already gets the domain as a parameter, so we could extract the href of every link tag that includes the domain. But that may also match links to subdomains or to social media profiles whose account name is the same as the domain. Plus, mailto links may be included as well. To summarize: too much information.

Therefore we’re only going to allow links that start directly with http, https or a slash (/). Unfortunately, as not all website admins set up redirects, we also need to match links both with and without www.

Cheerio’s selectors work like CSS selectors, which makes for a selector string like:

const elements = "a[href^='http://" + domain + "'], " +
  "a[href^='https://" + domain + "'], " +
  "a[href^='https://www." + domain + "'], " +
  "a[href^='http://www." + domain + "'], " +
  "a[href^='/']";
const relativeLinks = $(elements);

I think that covers all cases.

Now, loop through all links and extract the href. Also, we need to remove the “http(s)://(www)” stuff again before we can add this url to our foundPages array. Lastly, we need to check if this URL is already there and if so, skip it.

relativeLinks.each(function(i, e) {
  let href = $(this).attr('href');
  if (href.indexOf('www.') !== -1) {
    href = href.substr(href.indexOf('www.') + 4, href.length);
  }
  if (href.indexOf('http') === 0) {
    href = href.substr(href.indexOf('://') + 3, href.length);
  } else if (href.indexOf('/') === 0) {
    href = domain + href;
  }
  // only add the href to foundPages if it's not there yet
  if (foundPages.indexOf(href) === -1) {
    foundPages.push(href);
  }
});

Finally, resolve the promise with the foundPages as return value.

In case you know a prettier way to remove the http(s)://(www) things, let me know 😉
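
One candidate for a prettier way (a sketch, not what this crawler uses) is Node’s built-in WHATWG URL class, which parses the scheme and host for us; stripSchemeAndWww is a hypothetical helper name:

```javascript
// hypothetical alternative using Node's built-in URL class:
// normalize absolute and relative hrefs to "domain + path" in one place
const stripSchemeAndWww = (href, domain) => {
  // relative hrefs need a base URL to be parseable
  const url = new URL(href, 'https://' + domain);
  return url.host.replace(/^www\./, '') + url.pathname;
};

console.log(stripSchemeAndWww('https://www.example.com/about', 'example.com'));
// example.com/about
console.log(stripSchemeAndWww('/contact', 'example.com'));
// example.com/contact
```

Note that this sketch keeps only host and path; whether to preserve query strings is a design choice the loop above also sidesteps.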

Okay, we’re nearly done. The very last part to add is our exit condition. Go back to the crawl function: after initializing pageToCrawl, we need to check whether the foundPages and crawledPages arrays have the same length (comparing the arrays with === would only compare references) or whether pageToCrawl is undefined, which means the index is past the end of foundPages.

Then we write the crawledPagesData into a JSON file and exit the process.

// exit the process if everything found has been crawled
// or the next page is not defined
if (foundPages.length === crawledPages.length || !pageToCrawl) {
  // stop
  fs.writeFileSync(
    'urls_' + domain + '.json',
    JSON.stringify({ data: crawledPagesData })
  );
  process.exit();
}

Perfect, you’re done. 🥳

In the very last line of your index.js call crawl(); to actually start the crawling once the entry point of this little node script is triggered.

In your terminal run

$ node . [YOURDOMAIN]

and see which URLs the crawler finds. If you’re worried about the lack of output, you can add the following line to the very beginning of the visitPage function:

process.stdout.write(crawledPages.length + '/' + foundPages.length + '\n');

This will keep you updated on how many pages were found and how many were already crawled.

Read Part 2.

You can check the complete code on this gist.

Create & Code

UX & IT
