It’s Called Show “Business” (1/2) — Scraping Letterboxd and IMDb to Get Movie Info Using Selenium and Requests for Python

Data Science Filmmaker
6 min read · Oct 30, 2023

This is a project I completed about a year ago when I was still on the hunt for producers, production companies, financiers, and movie stars. I needed to get a bunch of information from the Internet Movie Database (IMDb) and put it in a format that allowed me to manipulate and search as needed.

As my movie is a horror/thriller, the first thing I needed was a list of horror and thriller movies. For this, I turned to a different movie site, Letterboxd, which maintains lists of every genre, sorted by popularity.

Unfortunately for me, these lists are generated dynamically at run-time via JavaScript, which meant that I couldn’t just pull the page’s HTML using “requests”. What I needed was a library that could get the HTML after the page had been fully rendered.

Enter “Selenium”. Selenium lets you control a browser from Python, so the page’s JavaScript actually runs before you grab the HTML. To use it, I needed to install Selenium as well as a driver for the particular browser I wanted it to control.

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
browser = webdriver.Chrome(ChromeDriverManager().install())

(The drivers can be installed manually, but the package “webdriver_manager” streamlines the process. This is for Selenium version 3. The syntax is slightly different for Selenium 4.)
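
For readers on Selenium 4, here is a minimal sketch of the equivalent setup. It assumes Selenium 4 and “webdriver_manager” are both installed. The function name open_chrome is made up for this illustration and is not part of the original code. The imports sit inside the function only so that the snippet can be loaded without either package being touched until a browser is actually requested.

```python
def open_chrome():
    """Start Chrome the Selenium 4 way (hypothetical helper, not from the original code)."""
    # Imported lazily so that defining the function does not require the packages.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    # Selenium 4 expects the driver path wrapped in a Service object
    # rather than passed as the first positional argument.
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service)
```

Calling browser = open_chrome() would then give an object that can be used the same way as the browser in the Selenium 3 snippet, and browser.quit() closes it when the scraping is done.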

Letterboxd displays 72 results per page, with over 500 pages of horror alone at the time I did the scraping. Each page has to load, execute JavaScript, and render in the browser before the code can move on to the next, so it took a few hours to download both entire lists. I used lxml to parse the HTML and get the title, id number, link to the Letterboxd page for that movie, and the year it was made. I then wrote this info to a CSV file.

from lxml import html

# pd.set_option('display.max_columns', None)
genres = ['horror','thriller']
n_pages = [509,392] # This was true when I ran the code. Found manually.

### Function to scrape Letterboxd lists by genre and put into a csv
### This function uses Selenium to get the web page data because
### the pages that we are scraping are created dynamically at
### run time via javascript, so we have to wait for them to load
### completely before scraping

def scrape_letterboxd_lists():

    # Open the virtual browser
    browser = webdriver.Chrome(ChromeDriverManager().install())

    # For each genre
    for g in range(len(genres)):

        # Open a file for that genre
        filename = 'data/letterboxd' + genres[g] + '.csv'
        with open(filename, 'w') as file_object:

            page = 1   # Start at page 1
            total = 0  # Running count of movies written so far

            # Get every page
            while page <= n_pages[g]:

                print(f"Starting page number {page}...")
                pagename = f'https://letterboxd.com/films/popular/genre/{genres[g]}/size/small/page/{page}/'
                browser.get(pagename)
                print("Getting innerHTML...")
                innerHTML = browser.execute_script("return document.body.innerHTML")
                print("Creating tree...")
                tree = html.fromstring(innerHTML)

                # Grab the relevant data and put into lists
                print("parsing...")
                newtitles = tree.xpath('//ul[@class="poster-list -p70 -grid"]/li/div/@data-film-name')
                newurls = tree.xpath('//ul[@class="poster-list -p70 -grid"]/li/div/@data-film-link')
                newyears = tree.xpath('//ul[@class="poster-list -p70 -grid"]/li/div/@data-film-release-year')
                newids = tree.xpath('//ul[@class="poster-list -p70 -grid"]/li/div/@data-film-id')

                # Every page except the last should hold exactly 72 movies
                if len(newtitles) != 72 and page < n_pages[g]:
                    print("Glitch")  # Page did not load properly so go back to the
                    continue         # top of the loop without incrementing and try again
                else:
                    page += 1
                    total += len(newtitles)
                    print("writing to file...")
                    print(f"{page} {total}")
                    for i in range(0,len(newtitles)):
                        file_object.write(f'"{newids[i]}",')
                        file_object.write(f'"{newtitles[i]}",')
                        file_object.write(f'"https://letterboxd.com{newurls[i]}",')
                        file_object.write(f'"{newyears[i]}"\n')

        print(f"File {filename} written")

    # Close the browser once every genre has been scraped
    browser.quit()
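
One caveat with writing the quotes by hand as above: a title that itself contains a double quote would produce a malformed row. A minimal sketch of an alternative, using Python's built-in csv module to handle the quoting, is shown here. The helper name write_rows and the sample title are invented for this illustration; the four parallel lists mirror newids, newtitles, newurls, and newyears from the function above.

```python
import csv
import io

def write_rows(file_object, ids, titles, urls, years):
    """Write one CSV row per movie, quoting every field (hypothetical helper)."""
    writer = csv.writer(file_object, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for film_id, title, url, year in zip(ids, titles, urls, years):
        # csv.writer doubles any embedded double quotes automatically
        writer.writerow([film_id, title, f"https://letterboxd.com{url}", year])

# Example with an in-memory file and a made-up title that contains quotes
buffer = io.StringIO()
write_rows(buffer, ["1"], ['The "Best" Movie'], ["/film/example/"], ["2015"])
print(buffer.getvalue())
```

Reading the file back with csv.reader recovers the original title, embedded quotes included.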

I threw away any movies that were older than 2010 (since I was looking for production companies that had produced recent movies only), and newer than 2022 (which was the year that I made this list). I ran a simple “requests” call for the individual Letterboxd page to get the runtime, the star rating, and the IMDb link. Anything shorter than 70 minutes (i.e. not feature length) was discarded, as was anything without an IMDb link or a release year. Since we started with separate lists of “horror” and “thriller”, and there is some overlap, duplicates were discarded. The full code for this is available on my GitHub (the function “get_lb_info” in “scrape_imdb.py”). This gave me a single list with 17,542 movies, formatted like:

id,title,url,year,runtime,imdblink,genre
682547,Nope,https://letterboxd.com/film/nope/,2022,130,http://www.imdb.com/title/tt10954984/maindetails,Horror
680358,X,https://letterboxd.com/film/x-2022/,2022,106,http://www.imdb.com/title/tt13560574/maindetails,Horror
706064,Fresh,https://letterboxd.com/film/fresh-2022/,2022,114,http://www.imdb.com/title/tt13403046/maindetails,Horror
572119,Scream,https://letterboxd.com/film/scream-2022/,2022,114,http://www.imdb.com/title/tt11245972/maindetails,Horror
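
The filtering rules described above can be sketched roughly as follows. This is not the code from “get_lb_info”, just an illustration under a few assumptions: each movie is a dictionary keyed by the column names in the CSV header, duplicates are detected by Letterboxd id, and a movie with no runtime is dropped along with the short ones. The function name filter_movies is hypothetical.

```python
def filter_movies(rows, first_year=2010, last_year=2022, min_runtime=70):
    """Keep feature-length movies in the year range that have an IMDb link,
    dropping duplicates by Letterboxd id (hypothetical helper)."""
    seen_ids = set()
    kept = []
    for row in rows:
        year = row.get("year")
        runtime = row.get("runtime")
        if not year or not runtime or not row.get("imdblink"):
            continue  # missing release year, runtime, or IMDb link
        if not (first_year <= int(year) <= last_year):
            continue  # too old or too new
        if int(runtime) < min_runtime:
            continue  # not feature length
        if row["id"] in seen_ids:
            continue  # already kept from the other genre list
        seen_ids.add(row["id"])
        kept.append(row)
    return kept
```

Rows read with csv.DictReader could be passed in directly, since the function converts the year and runtime strings to integers before comparing them.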

Now that I had the IMDb link for each movie, I could use it to get various pieces of info about the movie, including the names of cast and crew, MPA rating, country of origin, etc. Unfortunately, IMDb doesn’t have an API, so this all had to be scraped from HTML. Fortunately, unlike Letterboxd, the HTML is complete without having to wait for javascript to load. Unfortunately, I needed to use my IMDb Pro account, which means that I needed to tell requests how to log in using my credentials. Fortunately, there is a neat trick for this. Unfortunately, that trick doesn’t work for this particular website. Fortunately, there is a workaround.

The trick involves telling requests to open a session by posting your login data to whatever page that data is normally sent to. In theory, as long as you remain in that session, you can access pages that require login. This trick is well-covered in many places on the web, so I won’t repeat the details. But in the case of IMDb Pro, for reasons that I was never able to fully determine, the site is smart enough to thwart this particular avenue of scrape-itude.

No worries. Instead, I just manually logged in using Chrome, copied the cookies and headers from the Inspector, stuck them into a text file, and then fed them to Requests as Python dictionaries.

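import ast
import requests
from lxml import html

# The file holds two lines: the cookies dictionary, then the headers dictionary
with open('imdb_cookies.txt','r') as f:
    cookies = ast.literal_eval(f.readline())
    headers = ast.literal_eval(f.readline())

# this_movie is the dictionary for the movie currently being processed
pagename = f"https://pro.imdb.com/title/{this_movie['imdb_id']}/details"
page = requests.get(pagename, cookies=cookies, headers=headers)
tree = html.fromstring(page.content)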
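
To make the expected file layout concrete, here is a small sketch that writes a throwaway file in the same two-line format and reads it back. The cookie and header values are placeholders rather than anything IMDb actually uses, and the function name load_cookies_and_headers is invented for this example.

```python
import ast
import os
import tempfile

def load_cookies_and_headers(path):
    """Read two dict literals: cookies on line 1, headers on line 2."""
    with open(path, "r") as f:
        cookies = ast.literal_eval(f.readline())
        headers = ast.literal_eval(f.readline())
    return cookies, headers

# Build a throwaway file with placeholder values to show the layout
example_path = os.path.join(tempfile.mkdtemp(), "imdb_cookies.txt")
with open(example_path, "w") as f:
    f.write("{'example-cookie': 'PLACEHOLDER'}\n")
    f.write("{'User-Agent': 'PLACEHOLDER'}\n")

cookies, headers = load_cookies_and_headers(example_path)
print(cookies)
print(headers)
```

Using ast.literal_eval rather than eval means the file can only contain plain Python literals, which is a sensible precaution for a file that gets pasted together by hand.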

I decided to use lxml to parse the HTML rather than Beautiful Soup. I was able to grab the title, Movie Meter (i.e. the movie’s search frequency rank among all movies on IMDb), overall user star rating, genres, MPA rating, language(s) and country/countries of origin via commands such as

this_movie['imdb_title'] = tree.xpath('//div[@id="title_heading"]/span/span/text()')[0] #title
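
Note that indexing with [0] raises an IndexError whenever the XPath matches nothing, which can happen for fields that a given page lacks. A tiny helper along these lines (hypothetical, not from the original code) is one way to get None instead:

```python
def first_or_none(matches):
    """Return the first XPath match, or None if nothing matched."""
    return matches[0] if matches else None

# Works on any list, such as the one returned by tree.xpath(...)
print(first_or_none(["Nope"]))
print(first_or_none([]))
```

The line above could then be written as this_movie['imdb_title'] = first_or_none(tree.xpath(...)), leaving the caller to decide what to do with a missing value.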

As I was primarily interested in U.S.-based and/or English language companies, I filtered out any movies that didn’t have at least one of these things. I then got the budget (if available), the domestic and world-wide grosses (if available), cast members and their individual Star Meters, director(s), producer(s), writer(s), and any company that produced, distributed, or sold the film. I then went to each individual page for each of those companies and got additional information, such as Company Meter (similar to Movie Meter and Star Meter) and the territories and formats that that company handled.
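
The U.S./English filter can be sketched as below. It assumes each movie is a dictionary with the “country of origin” and “languages” keys that appear in the final data structure, and that the values are spelled the way IMDb spells them in that structure (“United States”, “English”). The function name is_us_or_english is made up for this illustration.

```python
def is_us_or_english(movie):
    """True if the movie lists the United States as a country of origin
    or English as a language (hypothetical helper)."""
    countries = movie.get("country of origin", [])
    languages = movie.get("languages", [])
    return "United States" in countries or "English" in languages

# Made-up minimal records for illustration
sample = [
    {"title": "A", "country of origin": ["United States", "Japan"], "languages": ["English"]},
    {"title": "B", "country of origin": ["France"], "languages": ["French"]},
    {"title": "C", "country of origin": ["United Kingdom"], "languages": ["English"]},
]
print([m["title"] for m in sample if is_us_or_english(m)])
```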

Again, the full code for this can be found on GitHub (“populate_list_details” in “scrape_imdb.py”).

In the end, I had a rather huge list of Python dictionaries, one for every horror and thriller movie that met my various criteria, each including the talent and companies associated with that movie.

[
  {
    "title": "Nope",
    "year": 2022,
    "run_time": 130,
    "lb_id": 682547,
    "lb_url": "https://letterboxd.com/film/nope/",
    "imdb_url": "http://www.imdb.com/title/tt10954984/maindetails",
    "imdb_id": "tt10954984",
    "lb_stars": 3.91,
    "imdb_title": "Nope",
    "movie_meter": 80,
    "imdb_stars": 6.9,
    "rating": "R",
    "country of origin": [
      "United States",
      "Japan",
      "Canada"
    ],
    "languages": [
      "English"
    ],
    "budget": 68000000,
    "domgross": 123277080,
    "worldgross": 170823080,
    "genre": "horror",
    "imdb_genres": [
      "Horror",
      "Mystery",
      "Sci-Fi"
    ],
    "directors": [
      {
        "name": "Jordan Peele",
        "credit": "Director (directed by)",
        "link": "https://pro.imdb.com/name/nm1443502/?ref_=tt_fm_name"
      }
    ],
    "writers": [
      {
        "name": "Jordan Peele",
        "credit": "Writer (written by)",
        "link": "https://pro.imdb.com/name/nm1443502/?ref_=tt_fm_name"
      }
    ],
    "producers": [
      {
        "name": "Ian Cooper",
        "credit": "Producer (produced by) (p.g.a)",
        "link": "https://pro.imdb.com/name/nm9827373/?ref_=tt_fm_name"
      },
      {
        "name": "Karen Ruth Getchell",
        "credit": "Co-Producer",
        "link": "https://pro.imdb.com/name/nm0315202/?ref_=tt_fm_name"
      },
      {
        "name": "Robert Graf",
        "credit": "Executive Producer",
        "link": "https://pro.imdb.com/name/nm0333747/?ref_=tt_fm_name"
      },

      ...

    ],
    "actors": [
      {
        "name": "Daniel Kaluuya",
        "credit": "OJ Haywood",
        "link": "https://pro.imdb.com/name/nm2257207/?ref_=tt_cst_1",
        "starmeter": 1889
      },
      {
        "name": "Keke Palmer",
        "credit": "Emerald Haywood",
        "link": "https://pro.imdb.com/name/nm1551130/?ref_=tt_cst_2",
        "starmeter": 1860
      },
      {
        "name": "Brandon Perea",
        "credit": "Angel Torres",
        "link": "https://pro.imdb.com/name/nm5155952/?ref_=tt_cst_3",
        "starmeter": 13053
      },

      ...

    ],
    "prodcos": [
      {
        "name": "Universal Pictures",
        "link": "https://pro.imdb.com/company/co0005073/?ref_=tt_co_prod_co",
        "company_meter": 69
      },
      {
        "name": "Dentsu",
        "link": "https://pro.imdb.com/company/co0169264/?ref_=tt_co_prod_co",
        "company_meter": 4853
      },
      {
        "name": "Monkeypaw Productions",
        "link": "https://pro.imdb.com/company/co0369235/?ref_=tt_co_prod_co",
        "company_meter": 380
      },
      {
        "name": "The Government of Canada Income Tax Credit Program",
        "link": "https://pro.imdb.com/company/co0031449/?ref_=tt_co_prod_co",
        "company_meter": 563135
      }
    ],
    "distributors": [
      {
        "name": "Tulip Entertainment",
        "country": "Greece",
        "format": [
          "Theatrical"
        ],
        "link": "https://pro.imdb.com/company/co0715686/?ref_=tt_co_dist",
        "company_meter": 313034
      },
      {
        "name": "United International Pictures (UIP)",
        "country": "Argentina",
        "format": [
          "Theatrical"
        ],
        "link": "https://pro.imdb.com/company/co0097402/?ref_=tt_co_dist",
        "company_meter": 28809
      },
      {
        "name": "United International Pictures (UIP)",
        "country": "Singapore",
        "format": [
          "Theatrical"
        ],
        "link": "https://pro.imdb.com/company/co0015307/?ref_=tt_co_dist",
        "company_meter": 551733
      },

      ...

    ],
    "sales": [],
  },

There are a great many things one can do with this data, some of which I will explore in future posts.
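
As one small illustration of working with this structure, the sketch below collects every production company across a list of movies and sorts them by Company Meter. It assumes the “prodcos”, “name”, and “company_meter” keys shown above, and that a lower Company Meter means a more prominent company, since the meter is a rank. The function name rank_prodcos is hypothetical, and the sample records are trimmed copies of the entry above.

```python
def rank_prodcos(movies):
    """Return (name, company_meter) pairs for every production company,
    one per company, sorted from lowest meter to highest (hypothetical helper)."""
    best = {}
    for movie in movies:
        for company in movie.get("prodcos", []):
            meter = company.get("company_meter")
            if meter is None:
                continue  # skip companies with no meter recorded
            name = company["name"]
            if name not in best or meter < best[name]:
                best[name] = meter
    return sorted(best.items(), key=lambda item: item[1])

# Trimmed-down sample using values from the entry above
movies = [
    {
        "title": "Nope",
        "prodcos": [
            {"name": "Universal Pictures", "company_meter": 69},
            {"name": "Dentsu", "company_meter": 4853},
            {"name": "Monkeypaw Productions", "company_meter": 380},
        ],
    }
]
for name, meter in rank_prodcos(movies):
    print(meter, name)
```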

Complete code available at https://github.com/stevendegennaro/datasciencefilmmaker/tree/main/imdb_scrape

Thanks to davidsa03 for his “denumerize” code.
