The Art of Not Getting Blocked: How I Used Selenium & Python to Scrape Facebook and TikTok

Oren Spiegel
Published in Analytics Vidhya · 8 min read · Nov 12, 2019

Not too long ago it was incredibly easy to obtain Facebook user data. The Graph API allowed programmers to access public user and page data with a degree of freedom. Since Cambridge Analytica went down and data privacy became a big concern, however, this reality has changed: Facebook has made it impossible to mine user data at a massive scale.

TikTok (formerly known as Musical.ly) has also elected not to provide any legal means of mining its data. It is heavily concerned with the privacy of its (mostly) teenage users and looks to ensure the platform doesn't create openings for predatory data fetching.

In this tutorial I will use Python to set you up with the Selenium driver. More importantly, I will share valuable ground rules that will help you avoid getting blocked by Facebook/TikTok system administrators.

First, let’s discuss why we need Selenium.

Selenium: A browser simulator

If we tried to make a regular (cURL/urllib/requests) GET request to Facebook or TikTok, we would obtain a partial HTML DOM. Some elements would be missing, mostly from the <body> tag.

Facebook and TikTok are JavaScript-rendered pages, and urllib doesn't know how to execute the JavaScript that builds them.

Selenium will open Chrome (or Firefox if you wish), go to the desired URL, wait for the JavaScript to load, and only then fetch and return the HTML.

Now let’s dive into the code.

Let's start with the imports we will be using (sketched below). They require that you install the following Python dependencies:

  1. selenium==3.141.0
  2. random_user_agent==1.0.1

You will also need to have chrome installed locally on your machine.
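
The original import gist isn't reproduced here, so below is a minimal sketch of the imports the rest of this tutorial relies on, given the two dependencies above (the exact grouping is my own):

import sys

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.proxy import Proxy, ProxyType
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

from random_user_agent.params import OperatingSystem, SoftwareName
from random_user_agent.user_agent import UserAgent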

I am starting a new class called Request; this will serve you well if you have different types of requests in your algorithm.

logger is simply a version of Python's print() that logs to stderr.
You may ignore it!

The selenium_retries variable will help us keep track of the number of times a Selenium request has failed. Trust me: considering we're dealing with Facebook/TikTok, this will be crucial to you.
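
Here is a minimal sketch of that skeleton; the logger shown is just a stand-in for the stderr print described above:

def logger(message):
    # bare-bones stand-in: a print() that writes to stderr
    print(message, file=sys.stderr)


class Request:
    def __init__(self, url):
        self.url = url
        # tracks the number of times a Selenium request has failed
        self.selenium_retries = 0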

Diving into the get_selenium_res() function:

the get_selenium_res() function 1/2 (a sketch follows the argument-by-argument breakdown below)

What you first see is the setup of some user-agent variables. These use the random_user_agent module dependency I imported, to automatically obtain a random user agent for each Selenium call I make.

A User-Agent looks like this:
Mozilla/5.0 (Linux; Android 8.0.0; SM-G960F Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36

Think of it as the means by which you, as the user, access the site: an identifier. Remember, we are not here to get caught while scraping data; thus, we will pretend to be a different user agent upon each new request.

Lastly, I set up chrome_options. These will be used later once we declare a Selenium browser instance.

Chrome-Options: argument by argument

  1. "--headless": The Chrome browser won't physically open on your machine, which reduces the load on your machine's CPU. I suggest playing with this feature, toggling it on and off while running locally. On the one hand, running Selenium headlessly is perfect for keeping your machine "cool"; on the other, it may help get you flagged as a scraper. System administrators can spot a headless request with ease.
  2. "--no-sandbox": The only way to get the chromedriver to open headlessly. If you are using Firefox instead, you may ignore this.
  3. "--disable-gpu": Apparently only needed on Windows machines, but I am showing it just to be safe.
  4. "--window-size=1420,1080": Selenium acts like a normal web browser in that it will only return the DOM it was required to load. What you see is what you get. So imagine a browser opening a tiny window: you wouldn't see much of the DOM, and hence you wouldn't obtain much of the HTML in the response. We therefore must stretch the window size.
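
Since the gist itself isn't shown here, this is a sketch of what that first half might look like inside get_selenium_res(self, class_name, time_to_wait=30), assuming the random_user_agent API from the imports above (the software/OS filters are illustrative choices):

# obtain a random user agent for this call
user_agent_rotator = UserAgent(
    software_names=[SoftwareName.CHROME.value],
    operating_systems=[OperatingSystem.WINDOWS.value,
                       OperatingSystem.LINUX.value],
    limit=100)
user_agent = user_agent_rotator.get_random_user_agent()

# chrome options, argument by argument (see the list above)
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--window-size=1420,1080')
chrome_options.add_argument('user-agent={}'.format(user_agent))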

How to use a proxy

the get_selenium_res() function 2/2 (a sketch follows the proxy walkthrough below)

The above chunk of code deals with proxy use. Later I will shed some insight on whether the use of a proxy will actually be useful in our case. But learning how to incorporate a proxy in Selenium is not well covered online, so let's go over it together.

First, set up a PROXY variable. You may obtain a free one from here. A random proxy looks something like this: "http://123.45.678.21:8080"
("http://IP:PORT"). Make sure you get an active HTTPS residential proxy.

binary_location was set up to point Selenium to the location of the Chrome binary on my local drive. You may encounter a 'bug' if this variable is not declared. It is unrelated to proxy use; I am including it since it may prove troublesome.

I later initiate an instance of a Proxy() object. I set it to Manual, and also flag autodetect to False to make sure we don't declare to the world that we are using a manual proxy. I then take Chrome's default desired capabilities, set the HTTP and SSL proxy settings equal to the PROXY variable we declared earlier, and as a final step add the proxy settings to those capabilities. Setup is complete.
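
A sketch of that proxy block, using Selenium 3's Proxy API (the proxy address and Chrome binary path are placeholders you must replace with your own):

PROXY = 'http://123.45.678.21:8080'  # your 'http://IP:PORT' proxy here

# point Selenium at the Chrome binary (path is machine-specific)
chrome_options.binary_location = '/usr/bin/google-chrome'

prox = Proxy()
prox.proxy_type = ProxyType.MANUAL
prox.autodetect = False
prox.http_proxy = PROXY
prox.ssl_proxy = PROXY

# start from Chrome's default desired capabilities
capabilities = webdriver.DesiredCapabilities.CHROME.copy()
# fold the proxy settings into the capabilities dict
prox.add_to_capabilities(capabilities)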

Final word on the proxy topic:
If you're going to use a proxy provider, use one that allows you to whitelist your local IP and request a random residential proxy without entering a username and password through Selenium. Smartproxy is recommended for this reason. Set the PROXY variable equal to the host:port address of the rotating proxy API.

Initiating a Web Browser

You may easily toggle the use of a proxy by changing the way you initiate a new Browser instance:

# browser = webdriver.Chrome(chrome_options=chrome_options,
# desired_capabilities=capabilities)
browser = webdriver.Chrome(chrome_options=chrome_options)

Note that I will be initiating a browser without a proxy, for reasons I get into later in this tutorial.

Using the Selenium Browser

If you are using a proxy, you may want to test whether it's indeed working.

Use:

# when testing proxies
browser.get('http://lumtest.com/myip.json')

The above address will return a JSON with the location you are mimicking. If the proxy setup was unsuccessful, you will see your computer's current (true) IP address.

We will send a GET request to the URL we created the Request object for:

browser.get(self.url)

If you are not running headlessly, the above line will physically open a browser window on your machine and go to the desired URL.

So far so good? Now it gets a bit more interesting:

In this tutorial we will use an explicit wait, which I consider a "best practice" whenever it can be applied. It means that Chrome will not close or return the DOM to us before a particular element is rendered. presence_of_element_located((By.CLASS_NAME, class_name)) is our way to tell the browser to wait until an element with the specified class finishes rendering.

VERY IMPORTANT: class_name is a variable that must be passed into the request method. In this example you may pass a string with the name of the class you would like to wait on, like so: Request('https://www.facebook.com').get_selenium_res('name_of_class')

Once this element is rendered, I order the browser to maximize the window (we already set the size in the Chrome options, but here I show an additional way of doing this) so that we obtain the entirety of the DOM.

browser.page_source is the command used to fetch the HTML and store it in memory.

Do not forget to close your browser using browser.close().
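
Put together, the happy path of the fetch might look like this sketch, still inside get_selenium_res():

browser.get(self.url)

# explicit wait: don't fetch the dom before the element has rendered
WebDriverWait(browser, time_to_wait).until(
    EC.presence_of_element_located((By.CLASS_NAME, class_name)))

# maximize so we obtain the entirety of the dom
browser.maximize_window()
html = browser.page_source
browser.close()
return html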

Handling Exceptions

A TimeoutException will happen when a Selenium request wasn't able to obtain any response in the time frame set (time_to_wait). This is a classic BLOCK. Unless something is wrong with the URL you have provided, this should be a signal that automatic firewalls were activated to prevent you from mining data.

A WebDriverException mostly showed up on the Unix machine I set up for production. This exception is NOT a result of Facebook/TikTok blocking you.

If there's an exception, I simply return the same get_selenium_res function (creating a potential never-ending loop of Selenium failures). I set it up this way since I know Facebook/TikTok will block me and I would like to retry automatically. Making a repeat attempt at a GET request may sometimes do the trick.
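
A sketch of that retry logic; the wait-and-fetch block from above sits inside the try, and the log messages are illustrative:

try:
    # the wait / maximize / page_source / close sequence shown above
    ...
except TimeoutException:
    # the classic block: no response within time_to_wait
    self.selenium_retries += 1
    logger('Timeout #{} on {}'.format(self.selenium_retries, self.url))
    browser.close()
    return self.get_selenium_res(class_name)
except WebDriverException as e:
    # usually an environment issue, not Facebook/TikTok blocking you
    self.selenium_retries += 1
    logger('WebDriverException: {}'.format(e))
    browser.close()
    return self.get_selenium_res(class_name)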

Now that you have the technical ability to make a Selenium GET request, let me share with you some extremely valuable insights from my experience scraping TikTok and Facebook in the post-Cambridge Analytica era.

Facebook and TikTok: (Avoiding a) Block Party

Facebook and TikTok are both on the lookout for data miners. They don't simply check the user agent and IP address to find us "crawlers". Here are some takeaways:

  1. Proxy + Headless = “Please block me”.
    The best way to get blocked is to use a proxy while running Selenium headlessly. Remember, TikTok/Facebook know something is up the minute you go headless, because they can flag that. If you are not using a proxy, they generally allow a good flow of requests before slowing and eventually shutting down responses. If you run Selenium headlessly with a proxy, I predict that after about 20 requests responses will start slowing down, taking up to a minute each. If you persist in this behaviour despite them slowing you down, they will shut the party down completely.
  2. A headless Selenium browser is smooth! Running Selenium headlessly will not have any impact on your ability to receive a response, AS LONG AS you're not using a proxy. It creates a much better experience for testing locally.
  3. If you scrape with a SYNC worker, making one request at a time, TikTok/Facebook will allow a considerable number of requests to go through before slowing down response time. If you don't exceed several hundred requests a week, you may be able to scrape without interruption. The best strategy may be to pace your requests (see the sketch after this list).
  4. At times when I exceeded (roughly) a hundred requests per hour, response times immediately slowed down. Selenium TimeoutExceptions will be a strong signal of this. Pacing and spacing are key when making these requests. From what I can infer, their firewall has automatic block logic built on hourly quotas.
  5. All GET requests must die. Regardless of the methodology you choose, or how unpredictable your algorithm may be, eventually Facebook/TikTok will slow down the pace at which they return a response. At this point there is no point in fighting their firewall. Do not make a request for a recommended 3-hour period.
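
As referenced in takeaway 3, a minimal pacing sketch might look like this; the target list, class name, and sleep interval are all placeholders to tune against the hourly quotas described above:

import random
import time

urls = ['https://www.tiktok.com/@some_user']  # hypothetical target list

for url in urls:
    html = Request(url).get_selenium_res('name_of_class')
    # pace and space: sleep a randomized interval between requests
    time.sleep(random.uniform(30, 90))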

Feel free to chime in with your Selenium scraping experiences.
