How to build a sitemap with a node.js crawler and D3.js (Part 2/2)

In this second part of the series, we take the crawler output from part 1 and turn it into a tree sitemap that is displayed in the browser via an index.html.

Babette Landmesser
Jun 1, 2020 · 6 min read

This tutorial is based on this d3.js example.

Prerequisites

File Setup

Create an index.html, main.js and a data.js. That’s all that is needed.

Fill the index.html with these few lines of content:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Tree</title>
  <script src="https://d3js.org/d3.v5.min.js"></script>
  <script src="build/main.js"></script>
</head>
<body>
  <img>
</body>
</html>

As you can see, we load D3.js directly from d3js.org. Don’t worry about the path to the main.js file: we compile our ES6 code with webpack, and the compiled file is then stored in the build folder.

Inside the body we only need an image tag. Later we fill it with the SVG source, which also makes it possible to download the sitemap.

For the data.js file, paste the output of the crawler into a javascript object like this:

export default {
data: [
{ name: … }
]
}

In the main.js we first need to import our data and set our final svg width:

import source from './data.js';

const data = source.data;
const width = 1440;

In your console type

$ webpack main.js -o build/main.js -w

and your main.js is being watched for changes and automatically compiled to build/main.js.

Data Preparations

First, we need to prepare the data, find out which elements are child elements of the main domain and which are located below another child. Create a function that will process our data:

const processData = (data) => {
  // assuming the first entry of data is the parent of all
  const base = data[0].name;
  const newData = {
    name: base,
    children: [],
  };
};

This will later form the base of the tree. We assume that the very first entry of our data array is the base URL of the website that we crawled. Looking back at the crawler code, this makes sense because we only inserted the domain and added a slash as the first entry of the output. So if you followed the first part of the tutorial, this assumption should hold for you.

Our newData object will contain the name of the base and — so far — an empty array with children. Next, we’re going to loop through all links in our data array and sort them into children or even grandchildren.

Short explanation of the upcoming steps. A URL consists of “domain — slash — path — slash — path” and so on. So our divider for finding children and grandchildren can easily be the slash. As we assume our base is the “domain + slash”, we’re going to remove this part for finding the children. Hence, the path to check will always start with letters: domain.com/my-products is going to be my-products. domain.com/my-products/awesome-product is going to be my-products/awesome-product. This makes it easier to sort it into children and grandchildren.
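To make the idea concrete, here is a minimal sketch of the stripping step in isolation. The helper name toRelativePath and the example.com URLs are made up for illustration; they are not part of the crawler output.

```javascript
// Hypothetical helper: strips the base ("domain + slash") from a full URL.
const toRelativePath = (url, base) =>
  url.startsWith(base) ? url.substring(base.length) : url;

// Assumed example value, not real crawler output.
const base = 'https://example.com/';

console.log(toRelativePath('https://example.com/my-products', base));
// my-products
console.log(toRelativePath('https://example.com/my-products/awesome-product', base));
// my-products/awesome-product
console.log(toRelativePath('https://example.com/', base) === '');
// true: the base itself leaves an empty path
```

A path without a slash is a direct child of the domain; a path that still contains a slash has a parent in front of it.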

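data.forEach(({ name }) => {
  let path = name;
  // strip the base ("domain + slash") so only the relative path remains
  if (name.startsWith(base)) {
    path = name.substring(base.length);
  }
  // the base itself has no path left, so skip it
  if (path === '') {
    return;
  }
});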

Next, we need to check if there’s still a slash in the path. If not — it’s a direct child of the domain, if yes — it includes grandchildren.

if (path.indexOf('/') === -1) {
  // direct child: only add it if no child with this name exists yet
  if (!newData.children.find((child) => child.name === path)) {
    newData.children.push({ name: path, children: [] });
  }
} else {
  // the grandchildren processing
}

Here, we push an entry for the direct child to the children array. Its name holds only the path instead of the full URL. This is relevant for the tree diagram later, which should show the paths rather than full URLs.

For the grandchildren, we need to cut the path again in parts, finding out which path is the parent path.

const parent = path.substring(0, path.indexOf('/'));

Then we need to check whether this parent path is already stored inside our children array. If not, we need to add it, because we don’t know whether this path will be added later on its own. This happens, for example, if there is no complete product overview on the page but all products are stored as /products/product.

let parentObj = newData.children.find((child) => child.name === parent);
if (!parentObj) {
  // look for the parent's own entry in the crawled data
  const crawledParent = data.find((d) => d.name === base + parent);
  // keep its information if it exists, but store only the path as the name
  parentObj = { ...crawledParent, name: parent, children: [] };
  newData.children.push(parentObj);
}
// drop the parent part so only the grandchild's own path remains
path = path.substring(path.indexOf('/') + 1);
parentObj.children.push({ name: path });

Here we check for the parent of the current path. If the parent was already added to the newData children array, then we’ll use this object. Otherwise we will check in our original data to find an object with the parent path or else create a new parent. Either way, we push the current path to the children of the parent object.

After the whole data loop, simply return the new data.

return newData;
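Since the function was built up in fragments, here is a sketch of how the pieces might fit together, run on a small made-up dataset. The example.com entries are hypothetical, and this version always stores the relative path as the name, so treat it as an illustration rather than the exact code from the Gist.

```javascript
// Sketch of the two-level processing described above.
const processData = (data) => {
  // assumption: the first entry is the base URL ("domain + slash")
  const base = data[0].name;
  const newData = { name: base, children: [] };

  data.forEach(({ name }) => {
    const path = name.startsWith(base) ? name.substring(base.length) : name;
    if (path === '') {
      return; // the base itself
    }

    const slashIndex = path.indexOf('/');
    if (slashIndex === -1) {
      // direct child of the domain
      if (!newData.children.find((child) => child.name === path)) {
        newData.children.push({ name: path, children: [] });
      }
      return;
    }

    // grandchild: make sure its parent exists, then attach it
    const parent = path.substring(0, slashIndex);
    let parentObj = newData.children.find((child) => child.name === parent);
    if (!parentObj) {
      parentObj = { name: parent, children: [] };
      newData.children.push(parentObj);
    }
    parentObj.children.push({ name: path.substring(slashIndex + 1) });
  });

  return newData;
};

// Hypothetical crawler output, for illustration only.
const sample = [
  { name: 'https://example.com/' },
  { name: 'https://example.com/about' },
  { name: 'https://example.com/products/shoe' },
  { name: 'https://example.com/products' },
  { name: 'https://example.com/products/hat' },
];

console.log(JSON.stringify(processData(sample), null, 2));
```

In this sample, products/shoe appears before products itself, so the parent is created on the fly and the later products entry is recognized as already present.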

Generating the tree

Now, we need a function that uses the d3 hierarchy logic and generates the sizes for our svg.

const tree = (treeData) => {
  const root = d3.hierarchy(treeData);
  root.dx = 10;
  root.dy = width / (root.height + 1);
  return d3.tree().nodeSize([root.dx, root.dy])(root);
};
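To see where the numbers come from: root.height is the number of levels below the root, so root.dy divides the width evenly between the levels. The following sketch reproduces that arithmetic without D3; heightOf and exampleTree are hypothetical names used only for this illustration.

```javascript
// Mirrors what d3.hierarchy reports as root.height: 0 for a leaf,
// otherwise 1 + the largest height among the children.
const heightOf = (node) =>
  node.children && node.children.length
    ? 1 + Math.max(...node.children.map(heightOf))
    : 0;

const width = 1440;

// Made-up tree in the shape that processData returns.
const exampleTree = {
  name: 'https://example.com/',
  children: [
    { name: 'about', children: [] },
    { name: 'products', children: [{ name: 'shoe' }, { name: 'hat' }] },
  ],
};

const height = heightOf(exampleTree); // 2 levels below the root
const dy = width / (height + 1);      // 1440 / 3 = 480 per level
console.log(height, dy);
```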

For all of this to work, it’s important to stick to this data structure:

{
  name: …,
  children: [
    {
      name: …,
      children: [],
    }
  ]
}

Otherwise d3 hierarchy won’t work.
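If you want to catch a malformed entry before handing the data to D3, a small check like the following could help. This is only a sketch: isValidNode is a hypothetical helper, and it is stricter than D3 itself, because it also requires the name that the chart later uses as the label.

```javascript
// Hypothetical helper: a node needs a string name and, if present,
// an array of children that are themselves valid nodes.
const isValidNode = (node) =>
  node !== null &&
  typeof node === 'object' &&
  typeof node.name === 'string' &&
  (node.children === undefined ||
    (Array.isArray(node.children) && node.children.every((child) => isValidNode(child))));

console.log(isValidNode({ name: 'about', children: [{ name: 'team' }] })); // true
console.log(isValidNode({ name: 'about', children: 'team' }));             // false
console.log(isValidNode({ url: 'about' }));                                // false
```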

Creating the final svg

The last function is huge; it mainly adjusts and styles the SVG and adds the text labels. I won’t go into much detail here because it’s mainly basic SVG knowledge, although it builds on D3, which brings some special functions such as data, joins, selections and so on.

const chart = () => {
  const processedData = processData(data);
  const root = tree(processedData);

  // find the vertical extent of the tree
  let x0 = Infinity;
  let x1 = -x0;
  root.each(d => {
    if (d.x > x1) x1 = d.x;
    if (d.x < x0) x0 = d.x;
  });

  const svg = d3.create("svg")
    .attr("viewBox", [0, 0, width, x1 - x0 + root.dx * 2]);

  const g = svg.append("g")
    .attr("font-family", "sans-serif")
    .attr("font-size", 8)
    .attr("transform", `translate(${root.dy / 3},${root.dx - x0})`);

  // the lines between parents and children
  const link = g.append("g")
    .attr("fill", "none")
    .attr("stroke", "#555")
    .attr("stroke-opacity", 0.4)
    .attr("stroke-width", 1.5)
    .selectAll("path")
    .data(root.links())
    .join("path")
    .attr("d", d3.linkHorizontal()
      .x(d => d.y)
      .y(d => d.x));

  // one group per node, holding a circle and a label
  const node = g.append("g")
    .attr("stroke-linejoin", "round")
    .attr("stroke-width", 3)
    .selectAll("g")
    .data(root.descendants())
    .join("g")
    .attr("transform", d => `translate(${d.y},${d.x})`);

  node.append("circle")
    .attr("fill", d => d.children ? "#555" : "#999")
    .attr("r", 2.5);

  node.append("text")
    .attr("dy", "0.31em")
    .attr("x", d => d.children ? -6 : 6)
    .attr("text-anchor", d => d.children ? "end" : "start")
    .text(d => d.data.name)
    .clone(true).lower()
    .attr("stroke", "white");

  return svg.node();
}

At the end of our main.js we simply attach the chart function to the window object:

window.treeChart = chart;

Displaying the chart

Now, head back to the index.html and insert a script block below the image. I usually set type="module" because I only open the page in Chrome to display the image and download it. I do not plan to make it work across several browsers.

So, what are we doing here? First of all, we store the tree SVG in a local chart variable. Then we serialize this SVG node to a string and encode it as Base64 to use as the image source. And that’s it. The sitemap will show up.

<script type="module">
  var chart = window.treeChart();
  var img = document.querySelector('img');
  // get svg data
  var xml = new XMLSerializer().serializeToString(chart);
  // make it base64
  var svg64 = btoa(xml);
  var b64Start = 'data:image/svg+xml;base64,';
  // prepend a "header"
  var image64 = b64Start + svg64;
  // set it as the source of the img element
  img.src = image64;
</script>
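Note that btoa only accepts characters in the Latin-1 range and throws on anything else, for example an arrow or an emoji in a path. If your paths might contain such characters, one possible workaround is to encode the string as UTF-8 bytes first. The helper names toBase64Utf8 and svgToDataUrl below are hypothetical and not part of the original code.

```javascript
// Hypothetical helper: Base64-encodes a string via its UTF-8 bytes,
// so characters outside Latin-1 no longer make btoa throw.
const toBase64Utf8 = (text) => {
  const bytes = new TextEncoder().encode(text);
  let binary = '';
  bytes.forEach((byte) => {
    binary += String.fromCharCode(byte);
  });
  return btoa(binary);
};

// Builds the same kind of data URL as the script above.
const svgToDataUrl = (xml) => 'data:image/svg+xml;base64,' + toBase64Utf8(xml);

console.log(svgToDataUrl('<svg xmlns="http://www.w3.org/2000/svg"><text>a → b</text></svg>'));
```

In the script above, this would replace the btoa(xml) call; for plain ASCII input the result is identical.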

Additional notes: I haven’t tried the JavaScript code on trees deeper than two levels. And as always: if you think I made a part too complicated, let me know 😉
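For sites with deeper paths, one possible variation is to walk every path segment instead of only splitting off the first one. The following is an untested-in-production sketch under the same assumption as before (the first entry is the base URL, and all URLs belong to that domain); buildTree and the sample entries are hypothetical.

```javascript
// Hypothetical alternative to the two-level processing: walks every path
// segment, so /a/b/c ends up three levels deep.
const buildTree = (data) => {
  const base = data[0].name; // assumption: first entry is "domain + slash"
  const root = { name: base, children: [] };

  data.forEach(({ name }) => {
    const path = name.startsWith(base) ? name.substring(base.length) : name;
    // filter(Boolean) drops empty segments, e.g. from a trailing slash
    const segments = path.split('/').filter(Boolean);

    let current = root;
    segments.forEach((segment) => {
      let next = current.children.find((child) => child.name === segment);
      if (!next) {
        next = { name: segment, children: [] };
        current.children.push(next);
      }
      current = next;
    });
  });

  return root;
};

// Made-up entries, for illustration only.
const deepSample = [
  { name: 'https://example.com/' },
  { name: 'https://example.com/products/shoes/sneaker' },
  { name: 'https://example.com/products/' },
];

console.log(JSON.stringify(buildTree(deepSample), null, 2));
```

Each node then shows only its own segment as the label, and the result keeps the name/children shape that d3.hierarchy expects.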

The result

So, in case of the website of my employer (mediaman.com), the current sitemap looks like this:

Again, you can find the complete code that I used in this Gist:
