Gathering data without those pesky databases

Danny Denenberg
Sep 3 · 4 min read

Web scraping is a great way to build dynamic websites without having to query a database for information.

To get started with web scraping, you should know how a website is structured. If you right-click on a page and click Inspect (in Chrome), you can open the developer tools.

This shows you the structure of the website’s HTML/CSS/JavaScript code, as well as network performance, errors, security, and much more.

Now, let’s say I want to grab the first image that you see on Twitter programmatically in the JavaScript console.

Well, I could right-click on the image, click inspect, right-click on the element in the dev tools, and copy the CSS selector.

Then, I could run document.querySelector(<<SELECTOR>>).src to get the URL of the image I want and use it on a web page.
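For example, something like this in the browser console drops the scraped image onto the current page (a rough sketch; <<SELECTOR>> stands in for the selector you copied):

const url = document.querySelector(<<SELECTOR>>).src;
// point a fresh <img> element at the scraped URL and add it to the page
const img = document.createElement("img");
img.src = url;
document.body.appendChild(img);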

This is web scraping! I was able to gather data (an image) from a website without having access to the database. But this is super tedious and long, so to actually web scrape more efficiently, I use Node.js and Puppeteer.

If you don’t already know, Node.js is a runtime environment that allows JavaScript to be run on the server side. And Puppeteer is a ‘headless Chrome Node.js API’ from Google (basically, it lets you run DOM JavaScript code from a server).

Just an FYI, because I love TypeScript, I will be using that language for this project. If you want to use TypeScript, please install it on your system. If running tsc -v works in the terminal, you're good to go!

Okay, to start off, make sure you have Node.js and npm (Node Package Manager) installed on your system. If you get a command not found error or something similar when running one of the following commands, I suggest you look at this article on how to install Node.

$ npm -v # should be 6.0.0 or higher
$ node -v # should be 9.0.0 or higher

Great! Let’s start a new project and install the dependencies:

$ mkdir Web-Scraping-101 && cd Web-Scraping-101 
$ npm init # go through all defaults
$ npm i puppeteer # the google npm scraping package
$ tsc --init # initialize typescript
$ npm i @types/puppeteer # type declarations

Now, open the folder in the text editor of your choice. Edit the outDir option in the tsconfig.json file to be ./build and uncomment the line, so it looks like this:
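"outDir": "./build", /* Redirect output structure to the directory. */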

Create a new file in the root of the folder:

$ touch app.ts

In app.ts add:

console.log("Twitter, here we come");

To run this, write the following in the terminal:

$ tsc && node build/app.js

Note: tsc builds all TypeScript files into the outDir directory defined in the config file and node runs a single JavaScript file.

If you see Twitter, here we come appear in the terminal, you’ve got it working!

Now, we will start to actually scrape using Puppeteer. Add this boilerplate Puppeteer code to the app.ts file:
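(Below is a minimal sketch based on the standard Puppeteer example; the values returned from page.evaluate() are placeholders that we will swap out for scraped data shortly.)

import * as puppeteer from "puppeteer";

(async () => {
  // launch a headless Chrome instance and open a new tab
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // travel to the page we want to scrape
  await page.goto("https://example.com");

  // run DOM code inside the page and bring the result back to the Node.js program
  const dimensions = await page.evaluate(() => {
    return {
      width: document.documentElement.clientWidth,
      height: document.documentElement.clientHeight,
      title: document.title,
    };
  });

  console.log(dimensions);

  // always close the browser when you are done
  await browser.close();
})();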

Please read through the commented code above to get a feel for what is going on.

Now that you can see how we can travel to a web page, gather info using DOM manipulation, and bring that info back to the Node.js program, we are ready to scrape Twitter.

First, edit the await page.goto("https://example.com") to be await page.goto("https://twitter.com").

Next, we need to be able to get the posts from the middle column (the actual Twitter feed). After some investigating, I found that this is the selector that targets the div for the middle-column feed:

document.querySelector("#react-root > div > div > div > main > div > div.css-1dbjc4n.r-aqfbo4.r-1niwhzg.r-16y2uox > div > div.css-1dbjc4n.r-14lw9ot.r-1tlfku8.r-1ljd8xs.r-13l2t4g.r-1phboty.r-1jgb5lz.r-1ye8kvj.r-13qz1uu.r-184en5c > div > div > div.css-1dbjc4n.r-1jgb5lz.r-1ye8kvj.r-6337vo.r-13qz1uu > div > section > div > div > div");
// the above returns the div for the middle column twitter feed

Here is an image of what that represents:

To get all of the images from the middle column, I ended up doing this for the page.evaluate() function:
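(A sketch of how that can look; it assumes the tweet images render as plain <img> tags inside the feed div selected above.)

const dimensions = await page.evaluate(() => {
  // the middle-column feed div (the long selector from above)
  const feed = document.querySelector("#react-root > div > div > div > main > div > div.css-1dbjc4n.r-aqfbo4.r-1niwhzg.r-16y2uox > div > div.css-1dbjc4n.r-14lw9ot.r-1tlfku8.r-1ljd8xs.r-13l2t4g.r-1phboty.r-1jgb5lz.r-1ye8kvj.r-13qz1uu.r-184en5c > div > div > div.css-1dbjc4n.r-1jgb5lz.r-1ye8kvj.r-6337vo.r-13qz1uu > div > section > div > div > div");

  // collect the src of every image inside the feed
  const imgs = feed ? Array.from(feed.querySelectorAll("img")) : [];
  const sources = imgs.map((img) => img.src);

  return { sources };
});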

Now, if I want to compile a list of all of the image sources and print them out to the console, all I have to do is write this outside of the page.evaluate() function:

console.log(dimensions.sources);

There you go! You’ve just scraped image data from a Twitter feed.

A final challenge would be to take this data and integrate it into an Express.js server so that, when a user goes to the root site, they are presented with all of these scraped images.
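If you want to try it, here is a rough sketch (it assumes npm i express @types/express, simplifies the scrape to grab every <img> on the page rather than just the feed, and wraps the Puppeteer code in a hypothetical scrapeTwitterImages helper):

import express from "express"; // default import assumes esModuleInterop (the tsc --init default)
import * as puppeteer from "puppeteer";

// hypothetical helper: the Puppeteer code from above, wrapped in a function
async function scrapeTwitterImages(): Promise<string[]> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://twitter.com");
  // simplified: grab the src of every <img> on the page
  const sources = await page.evaluate(() =>
    Array.from(document.querySelectorAll("img")).map((img) => img.src)
  );
  await browser.close();
  return sources;
}

const app = express();

app.get("/", async (_req, res) => {
  const sources = await scrapeTwitterImages();
  // present every scraped image on the root page
  res.send(sources.map((src) => `<img src="${src}">`).join("\n"));
});

app.listen(3000, () => console.log("Listening on http://localhost:3000"));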



Thanks for reading!
