How to scrape film screenplays using Python and Beautiful Soup

Michael Orlando
6 min read · Jul 9, 2022


In this 6-part series, I’ll explain my process of using Natural Language Processing and Machine Learning to classify the genres of screenplays.

For more information, check out my repo.

Part 1: Business Objective (you are here)

Part 2: Data Collection (and here)

Part 3: Data Wrangling

Part 4: Data Preprocessing (not posted yet)

Part 5: Model Building (not posted yet)

Part 6: Model Deployment (not posted yet)

Part 1: Business Objective

“Hey, you check out that new horror movie by M. Night Shyamalan?”

“Yeah, it was terrible. It was so corny and not scary at all. I actually laughed during the movie.”

Horror film enthusiasts don’t want to laugh; they want to shit their pants.

But how did this studio attract the horror enthusiast audience to see the movie in the first place?

Genre is the element that attracts the audience.

Studios use genres to market their product (the film) through their advertising efforts. However, to have a blockbuster, they have to make sure the movie fits the genre that the audience is anticipating.

My goal is to use NLP and machine learning to help studios streamline their screenplay buying process by classifying the genre elements of the screenplays.

Part 2: Scraping Screenplays Using Beautiful Soup

The data comes from The Script Savant, roughly 2,000 screenplays in all. We’re going to use BeautifulSoup for our web scraping; for more information, check out its documentation.

Steps We’ll Take:

  1. Import packages
  2. Scrape the PDF links of screenplays A-M using Beautiful Soup
  3. Scrape the PDF links of screenplays N-Z using Beautiful Soup
  4. Download the PDFs using the requests package
  5. Convert the PDFs to text files using convertapi

For the full source code, follow this link to my repo.

1. Import Python packages
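
A minimal set of imports for the steps below (convertapi isn’t needed until step 5):

```python
import os                      # create output folders
import requests                # fetch pages and PDF files
from bs4 import BeautifulSoup  # parse HTML
import convertapi              # PDF-to-text conversion (step 5)
```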

2. Scrape the PDF links of screenplays from A-M using Beautiful Soup

If you go to this link: https://thescriptsavant.com/free-movie-screenplays-am/, you’ll find about 1,000 films with names from A-M. This is important to note because the screenplays from N-Z have a slightly different HTML structure, so our BeautifulSoup code will look different. (I’ll go into more detail shortly.)

If you click on any of those hyperlinks, it will display the screenplay in PDF format. For our scraping purposes, we want the name of each screenplay and the link to its PDF file.

Ultimately, we want a dictionary object with screenplay names as the keys and PDF links as the values.

For example, the key and value of our dictionary for 12 Angry Men will look like this:
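
The URL below is illustrative; the real value is whatever the page’s ‘href’ attribute holds:

```python
# Illustrative entry: the actual PDF URL comes from the page's href attribute
{'12 Angry Men': 'https://thescriptsavant.com/movies/12_Angry_Men.pdf'}
```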

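Here’s a minimal sketch of this step, assuming the tag layout described in the next few paragraphs (the original code is in my repo):

```python
url = 'https://thescriptsavant.com/free-movie-screenplays-am/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

movie_dct = {}
# The first <tr> holds the screenplay anchors; each <a> wraps the title in a <u> tag
for a in soup.find('tr').find_all('a'):
    u = a.find('u')
    if u is not None and u.text.strip():  # skip anchors with no title
        movie_dct[u.text.strip()] = a['href']
```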

First, use the requests.get method to retrieve the page’s HTML. Then create a BeautifulSoup object to parse it.

Next, we create an empty dictionary; in our case, we’ll call it movie_dct. Then we loop through soup.find('tr').find_all('a') to collect the data we want.

soup.find('tr') returns the first <tr> (table row) element on the page, which contains all of the screenplay anchors we need.

We want the PDF links from the 'href' attribute and the screenplay names from the <u> tag. Calling .find_all('a') on that row creates a list of every <a> tag, each containing an 'href' attribute.

If you inspect the HTML, you’ll see that the name of each screenplay is written inside a <u> tag.

The for-loop then checks whether the <u> tag is empty; if it isn’t, a new dictionary entry is created with the <u> tag’s text as the key and the 'href' link as the value.

Altogether, movie_dct should map every screenplay title on the page to its PDF link, just like the 12 Angry Men example above.

3. Scrape the PDF links of screenplays from N-Z using Beautiful Soup

The HTML structure for the N-Z screenplays is a bit different, so we’ll have to change up our for-loop:
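
Here’s a sketch of that loop; the N-Z page URL is an assumption based on the site’s naming pattern, and the structure follows the explanation below:

```python
url = 'https://thescriptsavant.com/free-movie-screenplays-nz/'  # assumed URL
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

num = 0
# The links sit in <p> tags inside the <div> with class 'fusion-text'
for p in soup.find('div', class_='fusion-text').find_all('p'):
    num += 1
    if num <= 3:  # the first three rows contain no screenplays
        continue
    for a in p.find_all('a'):
        u = a.find('u')
        if u is not None:  # keep only anchors whose title sits in a <u> tag
            movie_dct[u.text.strip()] = a['href']
```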

Like before, we create our requests and BeautifulSoup objects to get the data from the website.

Next, we set num = 0 so we can skip the first three rows in our for-loop, since those rows contain no screenplays.

For this webpage, the PDF hyperlinks are contained in a <div> tag with the class 'fusion-text'. We then call find_all on the <p> tags inside that object to create a list of paragraphs.

Each <p> tag in that list contains multiple <a> tags holding the screenplay names and their PDF links. And like the code before, we use an if-statement to check that the <u> tag exists.

This is definitely more complex than the previous page’s HTML, so I recommend visualizing the structure either with your browser’s inspect element tool or with Python. Also, check out the BeautifulSoup documentation as you follow along for a better understanding.

4. Download the PDFs using the requests package

“Download” is probably the wrong word for this process; really, we’re writing the content from each hyperlink into a PDF file.

First, create a folder to store the PDF files; I created one called new_scripts. Then we call requests.get with stream=True. After that, we write the content from each hyperlink to disk by looping through the r.iter_content() generator, as sketched below.
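
A minimal sketch of the download loop; the file naming here is naive, and titles with characters that are invalid in filenames would need extra cleaning:

```python
os.makedirs('new_scripts', exist_ok=True)

for name, link in movie_dct.items():
    r = requests.get(link, stream=True)
    with open(f'new_scripts/{name}.pdf', 'wb') as f:
        # Stream the response body to disk in 1 KB chunks
        for chunk in r.iter_content(chunk_size=1024):
            f.write(chunk)
```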

Given the scope of this article and the limits of my expertise, I won’t explain why we set stream=True or how .iter_content works.

I’d recommend checking out Aaron S.’s article on the requests package; he does a far better job of explaining the package’s power and limitations than I can. Check out the requests documentation as well to learn more.

At the end, your new_scripts folder should contain one PDF per screenplay.

5. Convert the PDF files to text files using convertapi

ConvertAPI is a service that lets developers convert file formats through its API. For our project, we want to convert our files from PDF to TXT. There are other ways to convert PDFs to text files, but I chose convertapi for its simplicity and clean results.

Check out their service for more information.

First, we want to set our API secret. You can get one for free by signing up on their website.

Next, we create a new folder for the screenplay text files; I named it script_texts. After that, we loop through movie_dct and use the convertapi.convert method to turn the PDFs in our new_scripts folder into text files in our script_texts folder.

Remember, we used the dictionary keys to name the PDF files; that’s why we loop through movie_dct.
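
Here’s a sketch of the conversion loop, following the usage pattern in convertapi’s Python documentation (the secret is a placeholder; use your own):

```python
convertapi.api_secret = 'YOUR_API_SECRET'  # placeholder: get a free secret by signing up

os.makedirs('script_texts', exist_ok=True)

for name in movie_dct:
    # Convert each PDF in new_scripts and save the resulting .txt to script_texts
    convertapi.convert(
        'txt',
        {'File': f'new_scripts/{name}.pdf'},
        from_format='pdf',
    ).save_files('script_texts')
```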

Afterward, the script_texts folder should contain one .txt file per screenplay.

I recommend looking at the new text files in your script_texts folder to see if the conversion was done properly. Also, the conversion process can take a while for this much data; it took me about 8 hours.

In the next part of the series, I’ll discuss how to use The Movie Database API to label the genres of the screenplays we collected, along with other data wrangling steps I completed before my exploratory data analysis and preprocessing.
