Photo by Sapeksh Singh Siwach on Unsplash

Making a Web Scraper in Node + TypeScript

An exciting way to browse the web

Micaiah Wallace
6 min read · Jan 28, 2022

In my career, there have been times when I was tasked with optimizing or boosting the productivity of an existing legacy business process by implementing smarter technology.

Automation is essential in every business.

I always lean towards automation whenever possible when creating software solutions. This often means consuming various APIs to interact with data sources or data management systems as part of the solution. Quite frequently, however, I’m presented with a source that provides few or even no usable APIs to access the data!

For this reason, I’ve decided to write this web scraping tutorial and hopefully provide some useful insight into the things I’ve learned while developing my own scrapers.

Design Preparation

In order to create effective, reusable and scalable web scraper code, there are a few concepts and details you must consider beforehand to properly design your software architecture.

Photo by Daniel McCullough on Unsplash

Create a reusable module, not a one-off script.

Oftentimes, I’ve caught myself quickly throwing a scraper together to accomplish a task for a given project, only to find myself writing very similar structural code in the next project and wasting time rewriting boilerplate setup. Save yourself some time and generalize your scraper so it can be reused in future projects (also check out my article: why I over-engineer). Who knows, maybe you’ll end up publishing an open-source project out of it!

Scan over the architecture used by the website.

This means looking at the authentication methods being used and whether the content is rendered dynamically in the browser or served statically, perhaps populated from an API. If the web application uses an undocumented API, consuming it directly can save you a ton of time over the alternative of scraping markup. Any details you can gather about the website will help you determine which frameworks and methods you might need to use to extract the data.

Who (or What) will consume your findings? And How?

Sometimes you are building a scraper to simply generate a CSV or PDF report file. Or maybe it is being used as a component in a larger web application which will be consumed by its own web clients over an API. Leaving the choice open-ended will help keep your code generic enough to be used in your next project (or the same one if requirements change a few months in!).

Scraping Node Blog

For our sample scraper, we will be scraping the Node.js website’s blog to receive updates whenever a new post is released. We will be creating this scraper in the form of a command-line utility, with the option to be used as an importable module. In this example, the consumer of the data will be a cell phone receiving SMS messages via the Twilio messaging API.

We start by analyzing the website’s data structure to determine which scraping method we will need to use. After opening https://nodejs.org/en/blog/ and viewing the page source in the browser, we can see that the data received from the server is a plain HTML file that can be parsed without a JavaScript engine: scroll down until you find the <ul class="blog-index"> section, which contains the actual blog posts. This means we can use a library called Cheerio to parse the HTML markup for the data we need. While we are here, we capture the structure of a blog post as well:

<ul class="blog-index">
<li>
<time datetime="2022-01-11T00:50:00+0000">11 Jan</time>
<a href="...">...</a>
<div class="summary">
<p>...</p>
</div>
</li>
...
</ul>

We also capture the structure of the navigation section we will use to follow links to older pages, ensuring we capture all posts.

<nav aria-label="pagination" class="pagination">
  <a href="/en/blog/year-2021/">Older &gt;</a>
</nav>

The Project

I have uploaded the project code to my GitHub at the following link, so feel free to look there to get the full picture, as I will only summarize the key points to keep the article concise.

https://github.com/micaiahwallace/medium-node-blog-scraper

The directory structure looks like this:

src/coordinators — combines logic, services and other coordinators into a single logical action
src/services — accesses resources outside of program memory, such as filesystem IO or network requests
src/logic — business logic: usually synchronous pure functions that make decisions and return modified values
src/run.ts — entry point for the sample application
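
Throughout the sketches below, I’ll assume a small shared type describing a scraped blog post, living in something like a src/types.ts. This is my own sketch; the shape used in the actual repository may differ.

// src/types.ts (hypothetical): shared shape of a scraped blog post.
export interface BlogPost {
  title: string;   // link text of the post
  url: string;     // href of the post link
  date: string;    // ISO timestamp from the <time datetime> attribute
  summary: string; // text of the .summary paragraph
}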

Scrapers

We will create two separate functions to contain the scraping logic: one to extract the blog posts and one to extract the next-page navigation link. This is where we load the HTML source string into Cheerio for markup parsing.

extractNodeBlogPosts.ts

extractNodeNextPage.ts
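
If you’d rather not jump over to the repository right away, here is a rough sketch of what these two logic functions could look like using Cheerio. The selectors mirror the markup we captured above; the exact signatures in the repo may differ.

// Hypothetical sketch of the extraction logic, based on the markup above.
import * as cheerio from "cheerio";
import { BlogPost } from "../types";

// Pull every post out of a blog index page.
export const extractNodeBlogPosts = (html: string): BlogPost[] => {
  const $ = cheerio.load(html);
  return $("ul.blog-index > li")
    .toArray()
    .map((li) => {
      const item = $(li);
      const link = item.find("a");
      return {
        title: link.text().trim(),
        url: link.attr("href") ?? "",
        date: item.find("time").attr("datetime") ?? "",
        summary: item.find(".summary p").text().trim(),
      };
    });
};

// Find the "Older" pagination link, if the page has one.
export const extractNodeNextPage = (html: string): string | undefined => {
  const $ = cheerio.load(html);
  return $("nav.pagination a").attr("href");
};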

We will need a service function to fetch the HTML string data that feeds the scraper functions. For this function, we will perform the HTTP request using a library called Axios.

ioFetchUrlText.ts
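
As a sketch (assuming Axios and the service naming from the article), this can be as small as:

// Hypothetical sketch of the fetch service.
import axios from "axios";

// Fetch a URL and return the raw HTML body as a string.
export const ioFetchUrlText = async (url: string): Promise<string> => {
  const response = await axios.get<string>(url, { responseType: "text" });
  return response.data;
};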

To tie the pieces together, we will create a coordinator function to encapsulate the logic. This function will recursively call itself in order to collect the posts from every page of the blog.

fetchNewBlogPosts.ts
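
A rough sketch of the recursive approach might look like this; the function names follow the article, but the signature and details are assumptions on my part.

// Hypothetical sketch of the recursive page-walking coordinator.
import { ioFetchUrlText } from "../services/ioFetchUrlText";
import { extractNodeBlogPosts } from "../logic/extractNodeBlogPosts";
import { extractNodeNextPage } from "../logic/extractNodeNextPage";
import { BlogPost } from "../types";

// Collect posts from the given page, then follow the "Older" link until
// there are no more pages.
export const fetchNewBlogPosts = async (
  pageUrl: string,
  collected: BlogPost[] = []
): Promise<BlogPost[]> => {
  const html = await ioFetchUrlText(pageUrl);
  const posts = [...collected, ...extractNodeBlogPosts(html)];
  const nextPath = extractNodeNextPage(html);
  if (!nextPath) return posts;
  // Pagination hrefs are relative (e.g. /en/blog/year-2021/), so resolve
  // them against the current page URL.
  return fetchNewBlogPosts(new URL(nextPath, pageUrl).href, posts);
};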

Now that we have a coordinator to fetch new blog posts, we need one to fetch the locally cached blog posts so we know which posts we’ve already seen.

fetchCachedBlogPosts.ts
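
Sketched out (the repository’s version may read differently), it loads the cache file and keeps only the entries that pass validation:

// Hypothetical sketch of the cache-reading coordinator.
import { ioGetFileJsonArray } from "../services/ioGetFileJsonArray";
import { validateBlogPost } from "../logic/validateBlogPost";
import { BlogPost } from "../types";

// Read the cached posts from disk, dropping anything that fails validation.
export const fetchCachedBlogPosts = async (
  cachePath: string
): Promise<BlogPost[]> => {
  const entries = await ioGetFileJsonArray(cachePath);
  return entries.filter(validateBlogPost);
};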

As you can see, the above function references another service function, ioGetFileJsonArray. We will also want a function called ioWriteFileJson to write the posts back to the cache file once we have an updated list. I won’t go into these here since they are fairly simple. However, I will call out the function we use to validate the data we ingest from the cache file, since it comes from an untrusted source.

validateBlogPost.ts
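
A minimal sketch of such a type guard (the real checks may be stricter) could be:

// Hypothetical sketch of the cache validation logic.
import { BlogPost } from "../types";

// Type guard verifying that an untrusted cache entry has the expected shape.
export const validateBlogPost = (value: unknown): value is BlogPost => {
  if (typeof value !== "object" || value === null) return false;
  const post = value as Record<string, unknown>;
  return (
    typeof post.title === "string" &&
    typeof post.url === "string" &&
    typeof post.date === "string" &&
    typeof post.summary === "string"
  );
};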

Now we want a function that compares the old and new lists of blog posts and returns the ones we want to send notifications about.

findNewBlogPosts.ts
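
Conceptually this is just a set difference keyed on something unique about a post. Here is one way it could look; keying on the post URL is my assumption.

// Hypothetical sketch of the diffing logic.
import { BlogPost } from "../types";

// Return only the posts that are not already present in the cached list.
export const findNewBlogPosts = (
  cached: BlogPost[],
  latest: BlogPost[]
): BlogPost[] => {
  const knownUrls = new Set(cached.map((post) => post.url));
  return latest.filter((post) => !knownUrls.has(post.url));
};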

The main coordination

We will now look at the coordinator that ties all of these functions together into our core action: fetching blog posts, comparing them with what we already have, and sending out notifications. We will generalize this function so the pattern can be reused in future projects.

watchListAndNotify.ts
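
One way to generalize it is to inject every dependency (how to fetch, how to cache, how to notify) as a parameter. The sketch below shows that idea; the option names are mine, not necessarily the repository’s.

// Hypothetical sketch of the generic watch-and-notify coordinator.
export interface WatchListOptions<T> {
  fetchLatest: () => Promise<T[]>;            // scrape the current list
  fetchCached: () => Promise<T[]>;            // load the previously seen list
  findNew: (cached: T[], latest: T[]) => T[]; // diff the two lists
  writeCache: (items: T[]) => Promise<void>;  // persist the latest list
  notify: (item: T) => Promise<void>;         // send one notification
  maxNotifications: number;                   // cap notifications per run
}

// Fetch the latest items, notify about anything new, then update the cache.
export const watchListAndNotify = async <T>(
  options: WatchListOptions<T>
): Promise<void> => {
  const [latest, cached] = await Promise.all([
    options.fetchLatest(),
    options.fetchCached(),
  ]);
  const fresh = options.findNew(cached, latest);
  for (const item of fresh.slice(0, options.maxNotifications)) {
    await options.notify(item);
  }
  await options.writeCache(latest);
};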

We can now create a more specific function that plugs in the concrete pieces used by our example application.

scrapeNodeBlog.ts
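
Sketched out, it simply wires the blog-specific pieces into watchListAndNotify, while still letting the caller decide where the cache lives and how notifications go out. As with the other sketches, the exact option names are assumptions.

// Hypothetical sketch of the blog-specific wrapper.
import { watchListAndNotify } from "./watchListAndNotify";
import { fetchNewBlogPosts } from "./fetchNewBlogPosts";
import { fetchCachedBlogPosts } from "./fetchCachedBlogPosts";
import { findNewBlogPosts } from "../logic/findNewBlogPosts";
import { ioWriteFileJson } from "../services/ioWriteFileJson";
import { BlogPost } from "../types";

export interface ScrapeNodeBlogOptions {
  blogUrl: string;                           // entry point of the Node blog
  cachePath: string;                         // where to store seen posts
  maxNotifications: number;                  // cap notifications per run
  notify: (post: BlogPost) => Promise<void>; // how to notify the consumer
}

export const scrapeNodeBlog = (options: ScrapeNodeBlogOptions) =>
  watchListAndNotify<BlogPost>({
    fetchLatest: () => fetchNewBlogPosts(options.blogUrl),
    fetchCached: () => fetchCachedBlogPosts(options.cachePath),
    findNew: findNewBlogPosts,
    writeCache: (posts) => ioWriteFileJson(options.cachePath, posts),
    notify: options.notify,
    maxNotifications: options.maxNotifications,
  });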

We left a few parameters to be resolved by the caller so we can specify them in our main run file, such as how to handle caching and how to send notifications. Configuration constants, such as the entry point for the Node blog and the maximum number of post notifications, will be specified here as well.

run.ts
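
For the entry point, a sketch along these lines would do the job. The cache path, default value and the sendTwilioMessage signature are assumptions on my part, so check the repository and env.sh.template for the real configuration.

// Hypothetical sketch of the entry point.
import { scrapeNodeBlog } from "./coordinators/scrapeNodeBlog";
import { sendTwilioMessage } from "./services/sendTwilioMessage";

const main = async () => {
  await scrapeNodeBlog({
    blogUrl: "https://nodejs.org/en/blog/",
    cachePath: "./blog-cache.json", // assumed cache location
    maxNotifications: Number(process.env.MAX_NOTIFICATIONS ?? 5),
    // Swap this for a console.log if you skip the Twilio setup.
    notify: (post) =>
      sendTwilioMessage(`New Node.js blog post: ${post.title} ${post.url}`),
  });
};

main().catch((error) => {
  console.error(error);
  process.exit(1);
});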

Finally… Run it!

Well, almost. If you would like to see the fancy Twilio integration, you will need to head to twilio.com to set up an account with an auth token and a phone number; otherwise, you can replace the sendTwilioMessage call with a console.log to see how it would function without all the overhead. Configure your environment variables (if using Twilio) by copying the env.sh.template file into a new file called env.sh and loading it into your current shell with source env.sh. Then you can finally run npm run build followed by npm start to see it in action!

After the first run, you will see up to MAX_NOTIFICATIONS posts in your console (and possibly your messages app!). Any subsequent runs will result in no new posts showing up until a new post has actually been published to the blog.

This has been a lot of code and information, so feel free to ask any questions or post any thoughts you have about this post or the general subject of scraping the internet. Oh… and… congrats on making it to the end!

Photo by Jake Ingle on Unsplash
