How to Scrape JavaScript Heavy Sites Like a Pro With Python

Jonathan Joyner · Published in The Dev Project · 5 min read · Apr 4, 2022

Photo by Lautaro Andreani on Unsplash

JavaScript is everywhere. If you find a website with no JavaScript on the page, you can bet it’s from the 1990s.

That presents problems for web scraping. Most of the time, the data is right in the HTML of the page, where it can be easily seen and scraped. However, there are times when the data is only available after the JavaScript is rendered.

I’ve written past articles on web scraping that focus on easy-to-use Python libraries.

Unfortunately, that approach breaks when you introduce JavaScript rendering into the mix.

So how do we handle websites that trap their data behind JavaScript rendering? Well, we’re going to have to use some more advanced tools.

The Right Tool for the Job

Usually, I would recommend a couple of go-to libraries for web scraping:

  • Requests
  • BeautifulSoup

These two tools can do a whole lot, even if you’re going through several pages for data. They can’t be used to render JavaScript though.

There is a whole collection of tools built for this type of job. These tools are grouped into a category known as Browser Automation.

The goal of a browser automation tool is to simulate the web browsing experience, but in an automated way, so that it can run at intervals or speeds a person couldn’t achieve.

These tools are mostly touted for their automated website testing capabilities, but they also happen to have everything we need to render JavaScript and scrape the underlying data.

Some of the more popular tools in this category are:

  • Selenium
  • Playwright
  • Puppeteer

In this example, we’ll focus on using Selenium. We’ll also use our trusty library BeautifulSoup to parse the response.

Setting up the Workspace

Since we are fully automating a web browser, a little more setup is required than just a few pip installs. We’ll need a few other things installed:

  • Chrome (or another web browser; we’ll be using Chrome in this example)
  • ChromeDriver (web driver for Chrome)

Go ahead and install Chrome if you would like to follow along. For the ChromeDriver install, we’ll use a handy Python library that will do that for us.

With that said, let’s go ahead and install the libraries we’ll be using:

pip install selenium
pip install bs4
pip install chromedriver-autoinstaller

Once those are all installed, we can start importing:

import chromedriver_autoinstaller
from selenium import webdriver
from bs4 import BeautifulSoup

The chromedriver_autoinstaller library will handle installing ChromeDriver and adding it to PATH if it is not already there, which takes a bit of work off our plate. We can do that with one simple line:

chromedriver_autoinstaller.install()

That’s pretty much it for setting up our environment. Just to recap, we pip installed selenium, bs4, and chromedriver-autoinstaller. Our Python file should now look like this:

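import chromedriver_autoinstaller
from selenium import webdriver
from bs4 import BeautifulSoup

chromedriver_autoinstaller.install()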

Getting The Webpage

With our environment set up, we can begin requesting web pages. To do that, we need to set up the webdriver object that Selenium will use:

driver = webdriver.Chrome()

And we can go ahead and tell the driver to fetch a web page. In this example, we’ll be scraping Rotten Tomatoes Certified Fresh Movies.

The data we are after (movie titles, ratings, etc.) can be found without rendering the JavaScript, but parsing it is much easier once it is rendered.

This page is rendered almost entirely with JavaScript. Here is the site with JavaScript enabled:

JavaScript Enabled

And with JavaScript disabled:

JavaScript Disabled

We can request this web page by using our driver object’s get() method:

driver.get('https://www.rottentomatoes.com/browse/cf-dvd-streaming-all')

And we can get the HTML output using the page_source attribute:

html = driver.page_source

Just to recap, here is where our code stands:
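import chromedriver_autoinstaller
from selenium import webdriver
from bs4 import BeautifulSoup

chromedriver_autoinstaller.install()

driver = webdriver.Chrome()
driver.get('https://www.rottentomatoes.com/browse/cf-dvd-streaming-all')
html = driver.page_source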

Parsing the HTML

Now, Selenium can parse the page on its own, and we’ll use that capability in certain scenarios. However, BeautifulSoup will be our go-to for parsing the HTML. So let’s make a soup out of the page source:

soup = BeautifulSoup(html, 'html.parser')

Now we have a soup object built from the page source. If we were to print it out, we would get the full web page, minus some of the fancy formatting. Luckily, we don’t have to wait for the JavaScript to execute on this page.

In some cases, though, we will have to wait for JavaScript execution, which can be done by either waiting implicitly or waiting explicitly.
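A minimal sketch of both approaches, assuming we want to wait for the movie elements we target later in this article:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Implicit wait: element lookups retry for up to 10 seconds before failing.
driver.implicitly_wait(10)

# Explicit wait: block until a specific condition is met, or raise after 10 seconds.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.mb-movie'))
)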

Since we don’t have to worry about that, let’s find the information we’re looking for:

Rendered Movies

It looks like all the movies we are looking for are located inside their own div with the class “mb-movie”.

Each of these holds the information about an individual movie:

Movie Info

We can grab each one of them and pull out the title, score, and release date easily with BeautifulSoup, as sketched below.
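A minimal sketch of the parsing step. The mb-movie class comes from the page inspection above; the inner selectors (movieTitle, tMeterScore, release-date) are assumptions about the page’s markup and may need adjusting against the live HTML:

movies = soup.find_all('div', class_='mb-movie')

for movie in movies:
    # Child lookups use assumed class names; verify them in your browser's devtools.
    title = movie.find('h3', class_='movieTitle')
    score = movie.find('span', class_='tMeterScore')
    release = movie.find('p', class_='release-date')
    if title and score and release:
        print(title.text.strip(), score.text.strip(), release.text.strip())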

Conclusion

So we have covered quite a lot in a short span of time. Here’s a recap of what we’ve done:

  • Installed Chrome
  • Installed ChromeDriver using a Python library
  • Pulled a JavaScript heavy web page using Selenium
  • Parsed and gathered data using BeautifulSoup

Here’s one final look at where we ended up, with the full script printing the data out in the terminal.
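A consolidated sketch of everything above, with the same caveat that the inner selectors are assumptions:

import chromedriver_autoinstaller
from selenium import webdriver
from bs4 import BeautifulSoup

# Install ChromeDriver (if needed) and add it to PATH.
chromedriver_autoinstaller.install()

# Fetch the rendered page.
driver = webdriver.Chrome()
driver.get('https://www.rottentomatoes.com/browse/cf-dvd-streaming-all')
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# Parse out each movie (inner class names are assumptions; verify against the live page).
for movie in soup.find_all('div', class_='mb-movie'):
    title = movie.find('h3', class_='movieTitle')
    score = movie.find('span', class_='tMeterScore')
    release = movie.find('p', class_='release-date')
    if title and score and release:
        print(title.text.strip(), score.text.strip(), release.text.strip())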

If this helped you out, the best way to support me is by following me on Twitter or here on Medium!

Feedback is my friend, so feel free to reach out and tell me that you liked my story, want some topic covered, or that some part of this could be done better.
