Data Scraping in Node.js 101
Gathering data without those pesky databases
Web scraping is a great way to create dynamic websites without having to contact a database for information.
To get started with web scraping, you should know how a website is structured. If you right-click on a page and click inspect (on Chrome), you can see the developer tools.
Say I want to grab a particular image from a page. I could right-click on the image, click inspect, right-click on the element in the dev tools, and copy its CSS selector.
Then, in the console, I could run
document.querySelector(<<SELECTOR>>).src and that would give me the URL of the image I want, and I could use that URL on my own web page, for example:
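For instance, the copied URL could be dropped straight into an img tag on my own page (the URL below is a made-up placeholder, not a real scraped address):

```html
<!-- hypothetical URL copied from the dev-tools console -->
<img src="https://example.com/images/photo.jpg" alt="Scraped image" />
```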
This is web scraping! I was able to gather data (an image) from a website without having access to the database. But this is super tedious and long, so to actually web scrape more efficiently, I use Node.js and Puppeteer.
Just an FYI, because I love TypeScript, I will be using that language for this project. If you want to use TypeScript, please install it on your system. If running
tsc -v works in the terminal, you're good to go!
Okay, to start off, make sure you have Node.js and npm (Node Package Manager) installed on your system. If you get a
command not found or something similar when running one of the commands below, I suggest that you look at this article on how to install Node.
$ npm -v # should be 6.0.0 or higher
$ node -v # should be 9.0.0 or higher
Great! Let’s start a new project and install the dependencies:
$ mkdir Web-Scraping-101 && cd Web-Scraping-101
$ npm init # go through all defaults
$ npm i puppeteer # Google's headless Chrome automation library
$ tsc --init # initialize typescript
$ npm i @types/puppeteer # type declarations
Now, open the folder in the text editor of your choice. Edit the
outDir option in the
tsconfig.json file to be
./build and uncomment the line, so it looks like this:
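After the change, the relevant part of tsconfig.json looks roughly like this (the other defaults vary a bit by TypeScript version, so only outDir matters here):

```json
{
  "compilerOptions": {
    "target": "es5",
    "module": "commonjs",
    "strict": true,
    "esModuleInterop": true,
    "outDir": "./build"
  }
}
```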
Create a new file in the root of the folder:
$ touch app.ts
Then add this line to it:
console.log("Twitter, here we come");
To run this, in the terminal, write:
tsc && node build/app.js
tsc builds all TypeScript files into the
outDir directory defined in the config file, and
node build/app.js runs the compiled output.
If you see “Twitter, here we come” appear in the terminal, you’ve got it working!
Now, we will start to actually scrape using Puppeteer. Add this boilerplate Puppeteer code to the
app.ts file:
Please read through the commented code above to get a feel for what is going on.
Now that you can see how we can travel to a web page, gather info using DOM manipulation, and bring that info back to the Node.js program, we are ready to scrape Twitter.
First, edit the
await page.goto("https://example.com") line to be
await page.goto("https://twitter.com")
Next, we need to be able to get the posts from the middle column (the actual Twitter feed). After some investigating, I found this selector is the one that actually selects the
div for the middle column feed:
document.querySelector("#react-root > div > div > div > main > div > div.css-1dbjc4n.r-aqfbo4.r-1niwhzg.r-16y2uox > div > div.css-1dbjc4n.r-14lw9ot.r-1tlfku8.r-1ljd8xs.r-13l2t4g.r-1phboty.r-1jgb5lz.r-1ye8kvj.r-13qz1uu.r-184en5c > div > div > div.css-1dbjc4n.r-1jgb5lz.r-1ye8kvj.r-6337vo.r-13qz1uu > div > section > div > div > div");
// the above returns the div for the middle column twitter feed
Here is an image of what that represents:
To get all of the images from the middle column, I ended up doing this inside the
page.evaluate callback:
Now, if I want to compile a list of all of the image sources and print them out to the console, all I have to do is write this outside of the
page.evaluate callback:
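That last step is just a loop over the returned array. With a hypothetical result standing in for the real scrape:

```typescript
// Hypothetical stand-in for the array returned by page.evaluate
const imageSources: string[] = [
  "https://pbs.twimg.com/media/AAA.jpg",
  "https://pbs.twimg.com/media/BBB.jpg",
];

// Print each scraped image URL on its own line
for (const src of imageSources) {
  console.log(src);
}
```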
There you go! You’ve just scraped image data from a Twitter feed.
A final challenge would be to take this data and integrate it into an Express.js server so that, when a user goes to the root site, they are presented with all of these scraped images.