Facebook scraper bot using Node.js and MongoDB

Devanshani Rasanjika · Published in Nerd For Tech · May 15, 2022


I’m working on an app that needs to render Facebook feeds based on topic channels using the Facebook oEmbed endpoints, so that it can find, filter, and engage with customers across social media. The Facebook oEmbed endpoints allow you to get embed HTML and basic metadata for pages, posts, and videos in order to display them in another website or app. Since the official Facebook API for getting recent media — including feed content, URL, created timestamp, and like/comment/share counts — has been deprecated, I have designed a Facebook scraper to extract this information. Here I’m going to implement my solution using the Node.js Puppeteer library as the automation tool and MongoDB to store the collected feed data.

Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium; most of the things we can do manually in the browser can also be done with Puppeteer.

There are many third-party libraries that I have researched, but they have limitations: since Facebook keeps updating its CSS styles and classes, we cannot rely on dynamically generated class names if we want a stable scraping solution. It is therefore not good practice to depend on a third-party solution, because we cannot ensure it will keep working with Facebook’s latest updates. So I have designed my own Facebook wrapper using Node.js. Below I will explain my implementation step by step.

So let’s get to it.

Pre-requisites:

  • Basic knowledge of Node.js

Step 01 :

As the first step, I added an initPuppeteer() method which imports the puppeteer library, starts the browser, creates a new page, sets the browser width and height, and overrides permissions.
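The step above can be sketched as follows. This is a minimal sketch, not the author’s exact code: the method shape (`this.browser`, `this.page`), the viewport size, and the notifications-permission override are assumptions.

```javascript
// Sketch of initPuppeteer(): launch the browser, override permissions,
// open a page, and set the viewport. Property names are assumptions.
async function initPuppeteer() {
  const puppeteer = require("puppeteer"); // lazy require so the module loads even before puppeteer is installed
  this.browser = await puppeteer.launch({ headless: false });
  // Override the notification permission so Facebook's prompt
  // does not block the automated flow (assumed configuration).
  const context = this.browser.defaultBrowserContext();
  await context.overridePermissions("https://www.facebook.com", ["notifications"]);
  this.page = await this.browser.newPage();
  await this.page.setViewport({ width: 1280, height: 800 });
}
```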

Step 02 :

In this step I have added some common configuration messages in a separate file as below.
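A configuration file along these lines would hold the shared values referenced throughout this article. All keys and message strings here are assumptions for illustration; the real file’s contents are not shown in the article.

```javascript
// config.js — hypothetical sketch of the configuration module.
module.exports = {
  base_url: "https://www.facebook.com/",
  // visible text used to detect the login UI (assumed)
  login_text: "Log in to Facebook",
  // possible "page not found" messages shown by Facebook (assumed)
  page_not_found_messages: [
    "This content isn't available right now",
    "This page isn't available",
  ],
  date_time_format: "YYYY-MM-DD HH:mm:ss",
};
```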

Here we’ll add a loginFacebook() method which visits the Facebook website, waits until the network is idle and the login UI appears, and then performs the login action as below.

I have put random timeouts between actions because if the actions are too fast, our bot will easily be detected by Facebook, which may result in our account being blocked or blacklisted.

Here I have selected the UI elements by evaluating their inner HTML, because selectors change randomly on Facebook.
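The login flow described above can be sketched like this. It assumes initPuppeteer() has set `this.page`; the `email`/`pass` input names are Facebook’s long-standing login field names, and the text-based button lookup follows the inner-HTML approach described above, so the matched text may need adjusting when Facebook changes its copy.

```javascript
// Random pause helper so actions are not suspiciously fast.
function randomDelay(min = 1000, max = 3000) {
  return Math.floor(Math.random() * (max - min + 1)) + min;
}

// Hedged sketch of loginFacebook().
async function loginFacebook(email, password) {
  await this.page.goto("https://www.facebook.com", { waitUntil: "networkidle2" });
  await this.page.waitForSelector('input[name="email"]', { visible: true });
  await this.page.type('input[name="email"]', email, { delay: 100 });
  await new Promise((r) => setTimeout(r, randomDelay()));
  await this.page.type('input[name="pass"]', password, { delay: 100 });
  await new Promise((r) => setTimeout(r, randomDelay()));
  // Click the login button by its visible text, not a generated class name.
  await this.page.evaluate(() => {
    const button = [...document.querySelectorAll('div[role="button"], button')]
      .find((el) => /log in/i.test(el.innerHTML));
    if (button) button.click();
  });
  await this.page.waitForNavigation({ waitUntil: "networkidle2" });
}
```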

Step 03 :

In this step we’re going to start scraping based on filter tags such as @mentions, keywords, and hashtags. So my next step is to navigate to the Facebook page for the given filter tag, as below. I have configured the base URL in my configuration file.

```javascript
const page_url = config.base_url + filter_tag;
await this.page.goto(page_url, { waitUntil: "networkidle2" });
```

Next, I want to check the availability of the page before starting the scraping steps. Here I have used some UI display text contents which indicate to the user whether the page exists, and I have configured these possible messages in the configuration file.
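The availability check above can be sketched as a pure helper plus a thin Puppeteer wrapper. The not-found message strings are assumptions (they would come from the configuration file).

```javascript
// Pure helper: does the page body contain any known "not found" message?
function containsNotFoundMessage(bodyText, messages) {
  return messages.some((msg) => bodyText.includes(msg));
}

// Puppeteer wrapper: read the visible body text and test it.
async function isPageAvailable(page, messages) {
  const bodyText = await page.evaluate(() => document.body.innerText);
  return !containsNotFoundMessage(bodyText, messages);
}
```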

Step 04 :

Next, if a Facebook page is available for the relevant filter tag, I’m going to scroll over the page and render the feeds into the DOM as below. Here we can configure the number of posts to be scraped using the post_count parameter. I have identified that a single post is wrapped in a div[role="article"] tag, so I’m going to count the number of div[role="article"] tags rendered into the DOM while scrolling over the page.
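The scrolling loop above can be sketched like this: keep scrolling until at least `post_count` article nodes are in the DOM. The iteration cap and the delay range are assumptions added so the loop cannot run forever on a short page.

```javascript
// Scroll until post_count div[role="article"] nodes have rendered,
// or until maxScrolls attempts have been made (assumed safety cap).
async function scrollUntilPostsLoaded(page, post_count, maxScrolls = 30) {
  let rendered = 0;
  for (let i = 0; i < maxScrolls && rendered < post_count; i++) {
    rendered = await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
      return document.querySelectorAll('div[role="article"]').length;
    });
    // random pause so the scrolling looks human
    await new Promise((r) => setTimeout(r, 1500 + Math.random() * 1500));
  }
  return rendered;
}
```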

Step 05 :

My next step is to identify the posts to be extracted, because the scrolling logic in the previous step was based on the div[role="article"] tag, and I have identified that it matches some other text contents too, which we need to filter out. As a unique marker, I have identified that for a real post the element’s ariaLabel is null. Based on that, I’m going to add a filter as below, and for each post I’m going to return the text content and inner HTML.
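The filter described above can be sketched as:

```javascript
// Keep only article nodes whose ariaLabel is null (real posts),
// and return each post's text content and inner HTML.
async function collectPosts(page) {
  return page.evaluate(() =>
    [...document.querySelectorAll('div[role="article"]')]
      .filter((el) => el.ariaLabel === null)
      .map((el) => ({ text: el.textContent, html: el.innerHTML }))
  );
}
```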

Step 06 :

As my next step I’m going to loop over my filtered list and extract the post content, reactions count, share count, comments count, and the post created timestamp. To extract these figures we need to parse each post’s inner HTML through an HTML parser and get the root element as below. For that I have used node-html-parser, which generates a simplified DOM tree with element query support.

```javascript
import { parse } from "node-html-parser";

const root_html = parse(filtered_list[i].html);
```

Extracting post content :
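A sketch of the content extraction, assuming the parsed post HTML. The `data-ad-preview="message"` attribute is one marker Facebook has used for the post message container; treat the selector as an assumption that may need maintenance.

```javascript
// Extract the post's text content from its inner HTML
// using node-html-parser (selector is an assumption).
function extractPostText(html) {
  const { parse } = require("node-html-parser"); // lazy require
  const root = parse(html);
  const node = root.querySelector('div[data-ad-preview="message"]');
  return node ? node.text.trim() : "";
}
```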

Extracting Comments count :

Extracting Reactions count :

Extracting Share count :
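One hedged sketch covers the three counters above. Facebook exposes these figures in visible text and aria-labels such as “1.2K comments” or “45 shares”; the exact wording is an assumption and changes between UI versions, so the keyword is a parameter.

```javascript
// Pull a raw counter string such as "1.2K" or "45" out of the
// post text, keyed by the label that follows it (assumed wording).
function extractCountByKeyword(text, keyword) {
  const match = text.match(new RegExp(`([\\d.,]+[KM]?)\\s*${keyword}`, "i"));
  return match ? match[1] : "0";
}
```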

As my next step, I need to format these counts as below.
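The formatting step can be sketched as follows: Facebook renders large figures as “1.2K” or “3.4M”, so the raw strings are converted to plain integers before saving.

```javascript
// Convert a scraped counter string ("1.2K", "3M", "2,345") to an integer.
function formatCount(raw) {
  const value = parseFloat(String(raw).replace(/,/g, ""));
  if (Number.isNaN(value)) return 0;
  if (/k$/i.test(raw)) return Math.round(value * 1000);
  if (/m$/i.test(raw)) return Math.round(value * 1000000);
  return Math.round(value);
}
```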

Extracting post created Timestamp :

Here, the Facebook platform does not allow us to scrape the timestamp directly from the DOM content. Each element of the timestamp string is ordered using style tags, so we need to traverse the span tags and inspect each element’s style to find its position. In my logic I insert these elements into an array, using the style order as the array index. I have also identified {display:none;} style classes injected into these elements alongside the style order, so I need to skip those elements as below.
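The de-obfuscation described above can be sketched as a pure function over the extracted spans: each character sits in a span whose CSS `order` value gives its real position, and decoy spans carry `display:none`. The span extraction and style format are assumptions.

```javascript
// Rebuild the timestamp string from spans of { text, style },
// using the CSS order value as the index and skipping decoys.
function rebuildTimestamp(spans) {
  const chars = [];
  for (const span of spans) {
    if (/display\s*:\s*none/i.test(span.style)) continue; // skip hidden decoys
    const match = span.style.match(/order\s*:\s*(\d+)/i);
    if (match) chars[Number(match[1])] = span.text;
  }
  return chars.join("");
}
```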

Next, we need to format the post created timestamp into a common format to save in the database, so I’m going to return the epoch time by parsing the scraped post created timestamp.

After that I can create the post created time to save in the DB as below.

```javascript
const ymd_timestamp = moment.unix(timestamp / 1000).format(config.date_time_format);
```

Step 07 :

As my next step I need to extract the post URL so I can render the Facebook feeds in the web app using Facebook oEmbed. From the post actions I’m going to extract the feed’s oEmbed string, and from this string I’ll extract the post URL and post ID as below.
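The URL/ID extraction can be sketched as below. The regexes are assumptions about the shape of the embed markup (a facebook.com URL somewhere in the string, with a numeric ID at the end of its path).

```javascript
// Pull the post URL and numeric post ID out of an oEmbed embed string.
function extractPostLink(oembed_html) {
  const urlMatch = oembed_html.match(/https:\/\/www\.facebook\.com\/[^"&\s]+/);
  const post_url = urlMatch ? urlMatch[0] : null;
  const idMatch = post_url ? post_url.match(/(\d+)\/?$/) : null;
  return { post_url, post_id: idMatch ? idMatch[1] : null };
}
```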

Step 08 :

As the final step I’m going to update MongoDB with the extracted Facebook feed record in the below format.

```javascript
let dataObj = {
  post_id: post_id,
  post_text: post_text,
  screen_name: filter_tag,
  post_created_at: ymd_timestamp,
  attributes: {
    share_count: share_count,
    comment_count: comment_count,
    reaction_count: reaction_count,
    page_link: page_url,
    link: post_url,
  },
  time_stamp: Math.round(new Date().getTime() / 1000),
};
```
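The save step can be sketched with the official mongodb driver. The connection string, database, and collection names are assumptions; upserting on post_id keeps re-scraped posts from creating duplicates.

```javascript
// Upsert the scraped record into MongoDB (names are assumptions).
async function savePost(dataObj) {
  const { MongoClient } = require("mongodb"); // lazy require
  const client = new MongoClient("mongodb://localhost:27017");
  try {
    await client.connect();
    const posts = client.db("fb_scraper").collection("posts");
    await posts.updateOne(
      { post_id: dataObj.post_id },
      { $set: dataObj },
      { upsert: true }
    );
  } finally {
    await client.close();
  }
}
```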

There are limitations to this solution:

Facebook changes UI element selectors frequently, which is why I evaluate the page to find visible text content and scrape based on that. But there is a risk of those texts changing too, so the solution needs ongoing maintenance.

There are also rate-limit concerns: if the scraper performs many actions within a short time, Facebook may blacklist the account for a while for exceeding rate limits.

Thank you for reading!
