How to use the browser console to scrape and save data in a file with JavaScript

Scraping Medium story metadata from the browser console

Praveen
The Mighty Programmer
5 min read · Dec 3, 2018


Photo by Lee from Unsplash

A while back, I had to crawl a site for links, and then use those page links to crawl the data on each page with Selenium. The site’s content was set up in a rather odd way, so I couldn’t start directly with Selenium and Node. On top of that, the amount of data on the site was enormous. I had to quickly come up with an approach that first crawled all the links and then passed them on for the detailed crawling of each page.

That’s where I learned this cool stuff with the browser Console API. You can use this on any website without much setup, as it’s just JavaScript.

Let’s jump into the technical details.

High-Level Overview

I wrote a small piece of JavaScript in the Console. It crawls all the links (taking 1–2 hours, since it also handles pagination) and dumps a JSON file with all the crawled data.

Keep in mind that this only works for single-page applications, or web pages that don’t trigger a browser refresh to change content. If the browser refreshes while you are crawling, you may lose your work.

Let’s crawl a Medium story and automatically save the scraped data to a file from the console.

Here’s a demonstration of the final execution:

Demo

Let’s move to execution:

1. Get the console object instance from the browser

// Console API to clear console before logging new data
if (typeof console._commandLineAPI !== 'undefined') {
  console.API = console._commandLineAPI; // Chrome
} else if (typeof console._inspectorCommandLineAPI !== 'undefined') {
  console.API = console._inspectorCommandLineAPI; // Safari
} else if (typeof console.clear !== 'undefined') {
  console.API = console;
}

This code tries to pick the right console.API object instance based on the browser you’re using; you can skip the detection and assign the instance for your browser directly.

For example, if you are using Chrome, the code below should be sufficient.

if (typeof console._commandLineAPI !== 'undefined') {
  console.API = console._commandLineAPI; // Chrome
}

2. Defining the Junior helper function

I’ll assume that you have a Medium story open in your browser by now. The helper function defines the DOM selectors used to extract the story title, clap count, user name, profile image URL, profile description and read time of the story; a sketch of it follows below.
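
Here is a minimal sketch of such a helper. The name metaScrapper and every selector below are assumptions; inspect the story page and substitute whatever matches Medium’s current markup.

// A sketch of the extraction helper. All selectors are placeholders
// and need to be checked against the page's actual markup.
const metaScrapper = () => {
  const text = (selector) => {
    const el = document.querySelector(selector);
    return el ? el.innerText.trim() : null;
  };
  const attr = (selector, name) => {
    const el = document.querySelector(selector);
    return el ? el.getAttribute(name) : null;
  };
  return {
    title: text('h1'),                                        // story title
    claps: text('.js-clapCount'),                             // clap count
    author: text('a[data-action="show-user-card"]'),          // user name
    avatarUrl: attr('img.avatar-image', 'src'),               // profile image URL
    description: attr('meta[name="description"]', 'content'), // profile description
    readTime: text('.readingTime')                            // read time of the story
  };
};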

These are the basic things which I want to show for this story. You can add a few more elements like extracting links from the story, all images, or embed links.

3. Defining our Senior helper function — the beast

As we are crawling the page for different elements, we will save them in a collection. This collection will be passed to one of the main functions.

We have defined a method named console.save which, when called, dumps the collected data as a JSON file.

console.save()

It also triggers a download of the collected data in JSON format, using the <a> link download trick.
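
Here is a minimal sketch of such a console.save, assuming the standard Blob-plus-anchor approach:

// Sketch of console.save: serialize the data to JSON, wrap it in a
// Blob, and trigger a download through a temporary <a download> link.
console.save = (data, filename = 'console.json') => {
  if (!data) {
    console.error('console.save: no data');
    return;
  }
  const json = typeof data === 'object' ? JSON.stringify(data, null, 4) : data;
  const blob = new Blob([json], { type: 'text/json' });
  const a = document.createElement('a');
  a.download = filename;
  a.href = URL.createObjectURL(blob);
  document.body.appendChild(a); // Firefox wants the link in the DOM
  a.click();
  document.body.removeChild(a);
  URL.revokeObjectURL(a.href);
};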

Here is a quick demo of console.save with a small array passed as data.

Putting together all the pieces of the code, this is what we have:

  1. Console API Instance
  2. Helper function to extract elements
  3. Console Save function to create a file

Let’s execute our console.save() in the browser to save the data in a file. For this, you can go to a story on Medium and execute this code in the browser console.
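
Assuming the metaScrapper sketch from step 2, the whole run comes down to a single call:

// Scrape the currently open story and download the result as JSON.
console.save(metaScrapper(), 'medium-story.json');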

I have shown the demo extracting data from a single page, but the same code can be tweaked to crawl multiple stories from a publication’s home page. Take freeCodeCamp as an example: you can navigate from one story to another and back (using the browser’s back button) to the publication’s home page without the page being refreshed.

Below is the bare minimum code you need to extract multiple stories from a publication’s home page.
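
Here is a sketch of that idea. Instead of navigating back and forth, it fetches each story’s HTML directly and parses it off-screen; the a[data-post-id] link selector is an assumption, just like the selectors above.

// Sketch: gather story links from the home page, fetch each story's
// HTML, parse it, and pull the description out of its metadata.
const crawlStories = async () => {
  const links = [...document.querySelectorAll('a[data-post-id]')]
    .map((a) => a.href);
  const stories = [];
  for (const url of links) {
    const html = await fetch(url).then((res) => res.text());
    const doc = new DOMParser().parseFromString(html, 'text/html');
    const meta = doc.querySelector('meta[name="description"]');
    stories.push({ url, description: meta ? meta.content : null });
  }
  return stories;
};

// Crawl every story, then save the collection with console.save.
crawlStories().then((stories) => console.save(stories, 'stories.json'));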

Let’s see the code in action for getting the profile description from multiple stories.

For any application of this sort, once you have scraped the data, you can pass it to our console.save function and store it in a file.

The console.save method can be quickly attached to your console code and helps you dump the data into a file. I’m not saying you have to use the console for scraping data, but sometimes this is a much quicker approach, since we are all very familiar with working with the DOM using CSS selectors.

You can download the code from GitHub.

Thank you for reading this article! I hope it gave you a cool idea for scraping some data quickly without much setup. Hit the clap button if you enjoyed it! If you have any questions, send me an email (praveend806 [at] gmail [dot] com).
