Auditing the First 100 Pages of the Top 100 Websites 🔎

When you’re browsing BBC News or streaming your favourite YouTube videos, I doubt you’re thinking about the site’s accessibility. We all frequent the same sites, but are they setting good standards of accessibility and performance? Does a site’s popularity correlate with these standards? Let’s find out 💪.

I decided to audit 100 pages per site, enough data to give each site a meaningful average. As I’m sure you’ll agree, assessing 10,000 pages manually would be far too time-consuming. Luckily, with a little automation and the power of Lighthouse, we can audit them all at the click of a button instead.

Not interested in the code? Check out the results here.

Disclaimer: This was a fun experiment. The results here are by no means a complete or comprehensive assessment of the audited sites. All views expressed in this article are my own and do not represent the opinions of any entity with which I have been, am now, or will be affiliated.

How I Audited 10,000 Web Pages

Puppeteer

One of my favourite npm packages that I have discovered in the last year is Puppeteer. Puppeteer is a Node library that provides an API to control Chrome over the DevTools Protocol. It can run headless (in the background), so you don’t even see it working away.

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("http://sld.codes/");
  await page.screenshot({ path: "screenshot.png", fullPage: true });
})();

A super simple example of the power of this tool can be seen above. Running this snippet opens a Chromium instance in the background, navigates to my personal website, and takes a full-page screenshot.

Clustering

You can also run a cluster of puppets! The puppeteer-cluster package lets you tackle larger problems in parallel with multiple Puppeteer workers. A simple example of how this works can be found in its README:

const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
  });

  // Each queued URL is handed to this task by one of the two workers
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const screen = await page.screenshot();
    // Store screenshot, do something else
  });

  cluster.queue('http://www.google.com/');
  cluster.queue('http://www.wikipedia.org/');

  await cluster.idle();
  await cluster.close();
})();

Lighthouse

There is no better tool out there for automated auditing. Most people are familiar with its integration into the Chrome DevTools, but you can also run it programmatically from Node. All I needed to do to get it going was point it at the Puppeteer instance:

const lighthouse = require("lighthouse");

// Point Lighthouse at the page Puppeteer has open, reusing its debugging port
const { lhr } = await lighthouse(this.page.url(), {
  port: new URL(this.browser.wsEndpoint()).port,
  output: "json",
  quiet: true,
});

Sourcing the top 100

You can define the top 100 sites in lots of different ways. For this experiment, I settled on the top 100 most visited by search traffic. I sourced the list here. With a little regex and a few multi-cursors, I turned the table on the page into JSON that I could feed to the cluster.
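As a rough illustration, the list could then be queued like so; the file name and the array-of-origins shape here are my assumptions, not the original script:

// top-100.json (hypothetical shape): ["https://google.com", "https://youtube.com", ...]
const sites = require("./top-100.json");

// Seed the cluster with each site's homepage as a crawl starting point
sites.forEach(site => cluster.queue(site));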

The Script

The web crawler I created was fairly basic. Upon landing on a web page, it would scan for <a> tags and add their hrefs to a first-in first-out queue. I used a set to hold visited URLs to ensure I didn’t audit the same page twice.

const psl = require("psl"); // parses public suffixes so we can compare domains

// Collect the hrefs of every <a> tag on the page
let newHrefs = await page.$$eval("a", as => as.map(a => a.href));
// Normalise by stripping trailing slashes
newHrefs = newHrefs.map(url => url.replace(/\/$/, ""));
// Keep only unqueued, unvisited URLs that stay on the same domain
// (extractHostname is a small helper that pulls the hostname out of a URL)
newHrefs = newHrefs.filter(
  url =>
    !toVisit.includes(url) &&
    !visited.has(url) &&
    psl.get(extractHostname(url)) === hostName
);
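Around that snippet sat the queue bookkeeping. The sketch below is my reconstruction from the description above (FIFO queue, visited set, 100-page cap); auditPage is a hypothetical helper standing in for the navigate-and-audit step:

const toVisit = [];        // first-in first-out queue of URLs still to crawl
const visited = new Set(); // URLs already audited, so no page is hit twice

toVisit.push(...newHrefs); // enqueue the freshly discovered links

// Crawl until the queue runs dry or we hit the 100-page cap
while (toVisit.length > 0 && visited.size < 100) {
  const url = toVisit.shift(); // FIFO: oldest link first
  visited.add(url);
  await auditPage(url); // hypothetical: goto(url), gather links, run Lighthouse
}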

It turns out sites don’t always use <a> tags for navigation. This sometimes meant that my crawler would finish a site before reaching 100 pages. I didn’t notice this until the audits had finished. If I were to run this experiment again, I would improve the crawler’s ability to discover links.

Auditing all 10,000 pages took 21 hours and 32 minutes.

That sounds like a crazy long time, but it works out at roughly 7.75 seconds per audit. It should be noted that my internet is nowhere near fibre-optic speeds, and I am pretty sure my provider throttled my connection. But I don’t think this detracts from the results when much of the world is still using 3G.

Visualising The Information

Time to turn 10,000 JSON files into something useful 😅.

Parsing 1.83GB of Audit Data

I ended up with 1.83GB of audit data in JSON. I have been using JSON in React and Gatsby projects for years, but never have I had to import JSON of this magnitude. It was a headache. The first thing I tried was increasing the memory available to Node (via its --max-old-space-size flag). It turns out I couldn’t give it enough.

It suddenly occurred to me that while these reports contained some really useful information, they each also contained some 3,000 lines that I was not interested in. I decided to write a script to prune the data down to just the JSON fields I really needed.
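A minimal sketch of such a pruning pass; the folder names are my assumptions, and I am guessing the four category scores were what was kept, since those are what the GraphQL query below pulls out:

const fs = require("fs");
const path = require("path");

const RAW_DIR = "./audits"; // hypothetical: full Lighthouse reports
const OUT_DIR = "./pruned"; // hypothetical: slimmed-down output

for (const file of fs.readdirSync(RAW_DIR)) {
  const report = JSON.parse(fs.readFileSync(path.join(RAW_DIR, file), "utf8"));
  // Keep only the fields the visualisations need
  const pruned = {
    requestedUrl: report.requestedUrl,
    categories: {
      accessibility: { score: report.categories.accessibility.score },
      performance: { score: report.categories.performance.score },
      best_practices: { score: report.categories["best-practices"].score },
      seo: { score: report.categories.seo.score },
    },
  };
  fs.writeFileSync(path.join(OUT_DIR, file), JSON.stringify(pruned));
}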

This reduced 1.83GB down to 50.9MB. Much more manageable!

Gatsby + GraphQL

I used the gatsby-transformer-json package to pull all the JSON into my Gatsby project, then used this query to source the data:

{
  allAllJson {
    edges {
      node {
        categories {
          accessibility {
            score
          }
          performance {
            score
          }
          best_practices {
            score
          }
          seo {
            score
          }
        }
        id
        requestedUrl
      }
    }
  }
}
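For completeness, the transformer needs a filesystem source pointing at the JSON. The allAllJson field in the query suggests the pruned reports were merged into a single all.json file; assuming that, the wiring in gatsby-config.js looks roughly like this (the directory path is an assumption):

// gatsby-config.js (sketch)
module.exports = {
  plugins: [
    {
      resolve: "gatsby-source-filesystem",
      options: {
        name: "audits",
        path: `${__dirname}/data`, // hypothetical folder holding all.json
      },
    },
    "gatsby-transformer-json",
  ],
};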

The build was a little slow, but it worked!

Visualisations

I used the Recharts library to create scatter graphs showing the correlations. I was worried that it would not be able to handle so many data points, but I was wrong. Recharts smashed it 💪.
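As a flavour of what one of those charts looks like in code, here is a minimal sketch; the component name, data shape, and styling are my own illustration rather than the site’s actual source:

import React from "react";
import {
  ScatterChart,
  Scatter,
  XAxis,
  YAxis,
  CartesianGrid,
  Tooltip,
} from "recharts";

// Hypothetical shape: one point per audited page, Lighthouse scores from 0 to 1
const data = [
  { performance: 0.92, accessibility: 0.88 },
  { performance: 0.45, accessibility: 0.71 },
  // ...one entry per audited page
];

const CorrelationChart = () => (
  <ScatterChart width={600} height={400}>
    <CartesianGrid />
    <XAxis type="number" dataKey="performance" name="Performance" domain={[0, 1]} />
    <YAxis type="number" dataKey="accessibility" name="Accessibility" domain={[0, 1]} />
    <Tooltip cursor={{ strokeDasharray: "3 3" }} />
    <Scatter data={data} fill="#8884d8" />
  </ScatterChart>
);

export default CorrelationChart;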

The Results

In the coming days, I intend to summarise the data I’ve acquired. In the meantime, you can check it out here.

