How to use The Movie Database API for a Data Science Project

Michael Orlando
8 min read · Jul 11, 2022


Photo by Myke Simon on Unsplash

In this 6-part series, I’ll explain my process of using Natural Language Processing and Machine Learning to classify the genres of screenplays.

For more information, check out my repo.

Part 1: Business Objective

Part 2: Data Collection

Part 3: Data Wrangling (you are here)

Part 4: Data Preprocessing (not posted yet)

Part 5: Model Building (not posted yet)

Part 6: Model Deployment (not posted yet)

Welcome data science and movie enthusiasts of Medium. This is part 3 of my 6-part series where we use NLP and Machine Learning to build a multi-label classification model to label the genres of a movie screenplay.

If you have not checked out Parts 1 & 2 of the series, where I discuss how to use BeautifulSoup to scrape for film screenplays, the link is here and above.

Part 3: Data Wrangling — Labelling Genres and Creating our Targets

The genres for our screenplays will be retrieved using the Python wrapper tmdbsimple, which connects us to The Movie Database (TMDB) API. Our targets will then be transformed using one-hot encoding.

Steps We’ll Take:

  1. Importing necessary packages
  2. Loading in screenplays using the os package
  3. Labeling the screenplay genres using the tmdbsimple package
  4. Creating our targets using one-hot-encoding

For the source code, check out data_wrangling and data_wrangling_pt2 in my repo.

1. Importing Necessary Packages
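The import cell is not reproduced in this post. Judging by the names used in the later snippets (os, re, pd, and tmdb), a minimal sketch of it could look like the following. The aliases pd and tmdb are inferred from that later code, and the try/except guard is an optional addition that prints an install hint instead of failing outright.

```python
# Standard-library modules used for file handling and title cleaning
import os
import re

# Third-party packages (assumed install command: pip install pandas tmdbsimple)
try:
    import pandas as pd
    import tmdbsimple as tmdb
    DEPS_OK = True
except ImportError as exc:
    DEPS_OK = False
    print(f"Missing dependency: {exc.name}. Try: pip install pandas tmdbsimple")
```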

2. Loading in screenplays using the os package

In part 2 of this series, we discussed how to save the screenplay txt files to a folder using python. Now, we’re going to load those files and append them to a dictionary object using the os package.

First, we’re going to create a dictionary object.

#initializing dict
screenplays = {'title': [], 'text': []}

Our two keys are title and text, and the respective values are empty lists. This is because we’re going to convert this dictionary to a pandas dataframe later on.

Next, we’re going to create a function called screenplays_loader(dct) that saves our screenplays from the script_texts folder into a dictionary object.

def screenplays_loader(dct):
    """
    Takes a dict with 'title' and 'text' keys and appends each non-empty
    screenplay's title and raw text to the corresponding lists.
    The dict is updated in place.
    """
    directory = os.fsdecode('script_texts/')

    for file in os.listdir(directory):
        filename = os.fsdecode(file)
        #reading the raw bytes; the with-block closes the file afterwards
        with open(os.path.join(directory, filename), 'rb') as f:
            text = f.read()
        if len(text) > 0: #skipping empty files
            #caution: strip('.txt') trims the characters '.', 't', 'x' from both ends
            dct['title'].append(filename.strip('.txt'))
            dct['text'].append(text)

First, we create a variable called directory which is a decoded string of our file location. Then we loop through the os.listdir(directory) object, which is a list of files in the specified path. In our case, it’s our folder with our screenplays. Next, we append the filename as the title to the title list and the file text as the text to the text list.

For more information on how to use the os methods and packages used above, check out these three links: os.fsdecode, os.listdir, and os documentation.
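One detail in the loader is worth a closer look: str.strip('.txt') does not remove the '.txt' suffix. It removes any of the characters '.', 't', and 'x' from both ends of the string. The sketch below uses two hypothetical filenames to show the effect next to os.path.splitext, which drops only the extension. This may be why the cleaning step later matches 'scrip' rather than 'script'; if you switch to os.path.splitext, that pattern would need to change accordingly.

```python
import os

# Hypothetical filenames, chosen only to illustrate the behaviour
names = ['avengers_script.txt', 'the_matrix_script.txt']

# strip('.txt') removes the characters '.', 't', 'x' from BOTH ends
stripped = [name.strip('.txt') for name in names]

# splitext removes only the file extension
stems = [os.path.splitext(name)[0] for name in names]

print(stripped)  # ['avengers_scrip', 'he_matrix_scrip']
print(stems)     # ['avengers_script', 'the_matrix_script']
```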

Now run the function like so:

#running the function
screenplays_loader(screenplays)

And check the length…

#checking the length
print(len(screenplays['title']))
2125

We have 2125 screenplays in a python dictionary object. Now we’ll save the dictionary as a Pandas dataframe.

#converting the dict into a pandas dataframe
data = pd.DataFrame(screenplays)

3. Labeling the screenplay genres using the tmdbsimple package

The movie titles were saved with the script tag, underscores (_), and uneven spacing. It is necessary to clean the title names before using The Movie Database API.

We’ll use pandas and Python’s built-in re (regular expression) module to clean the movie titles.

#cleaning titles
data['title'] = data.title.str.replace('scrip', '')
data['title'] = data.title.str.replace('_', ' ')
data['title'] = data.title.apply(lambda x: re.sub(r"\B([A-Z])", r" \1", x))

Note that to change the text inside a dataframe column, you need the .str accessor. Without it, the column’s replace method only swaps values that match the whole string exactly, so a substring such as 'scrip' would be left untouched.

Also, for the scope of this part of the series, I will not be going into detail about the regex package; that will take place in part 4. Check out this link if you want to understand what the re.sub method is.
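To see what the three cleaning steps do, here is the same sequence applied to single titles with plain Python string methods. The function name clean_title and the sample titles are hypothetical, and the final .strip() is an addition to remove the trailing space left behind by the underscore.

```python
import re

def clean_title(raw):
    """Apply the same three cleaning steps to one title string."""
    title = raw.replace('scrip', '')             # drop the leftover script tag
    title = title.replace('_', ' ')              # underscores become spaces
    title = re.sub(r"\B([A-Z])", r" \1", title)  # space before inner capitals
    return title.strip()                         # addition: trim stray spaces

print(clean_title('TheAvengers_scrip'))  # The Avengers
print(clean_title('WALLE_scrip'))        # W A L L E
```

The second call shows a limitation of the regular expression: every capital letter inside a word gets a space in front of it, so all-caps titles are split letter by letter and may need manual fixing before querying TMDB.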

Now we can connect to The Movie Database API.

First, you’ll need to install the tmdbsimple wrapper and create a TMDB account. Simple instructions to do so are linked here.

tmdb.API_KEY = 'YOUR SECRET CODE' #API keys are available for free when you sign up on the TMDB website

#search object that looks up movie information by title
search = tmdb.Search()

#genre object
genre = tmdb.Genres()

#saving genres and corresponding codes for labelling
genres_lst = genre.movie_list()

The search object lets us send queries to TMDB. For example, if we wanted to request information about The Avengers, our code would look like this:

#querying for The Avengers
search.movie(query='The Avengers')

The results:

As you see, the results are in JSON format. We want to retrieve the data in the genre_ids of the first result.

To do so:

#retrieving the genre ids of the first result for The Avengers
search.movie(query='The Avengers')['results'][0]['genre_ids']

The result:

[878,28,12]

The result is a list of integers. This is why we created the genre and genres_lst variables: TMDB labels genres with integer ids, and genres_lst shows which id corresponds to which genre.

print(genres_lst)
{'genres': [{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 80, 'name': 'Crime'}, {'id': 99, 'name': 'Documentary'}, {'id': 18, 'name': 'Drama'}, {'id': 10751, 'name': 'Family'}, {'id': 14, 'name': 'Fantasy'}, {'id': 36, 'name': 'History'}, {'id': 27, 'name': 'Horror'}, {'id': 10402, 'name': 'Music'}, {'id': 9648, 'name': 'Mystery'}, {'id': 10749, 'name': 'Romance'}, {'id': 878, 'name': 'Science Fiction'}, {'id': 10770, 'name': 'TV Movie'}, {'id': 53, 'name': 'Thriller'}, {'id': 10752, 'name': 'War'}, {'id': 37, 'name': 'Western'}]}

Now we can label the genre of The Avengers movie as Science Fiction, Action, and Adventure. However, querying and labeling each screenplay individually will take too long, so we’ll write a simple function to do it for us.

The meat of this function is located in the for-loops. First, we loop through the query results and then loop through the list of the genres. The function checks if the genre_id from the query results matches the id from the list of genres. If it does, then the name of the genre (i.e. Action, Comedy, etc) is appended to the lst variable.
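The function itself is not reproduced in this post, so here is a minimal sketch based on the description above. It is an approximation, not the original code. Two things are assumptions: search and genres_lst are passed in as arguments (the original is called with only a title, so it presumably reads them from the notebook's scope), and FakeSearch is a hypothetical offline stand-in for tmdb.Search() so the example runs without an API key.

```python
def genre_labeller(title, search, genres_lst):
    """Return the genre names of the first TMDB search hit for `title`."""
    lst = []
    results = search.movie(query=title)['results']
    if not results:  # no match on TMDB: leave the genre list empty
        return lst
    # loop through the genre ids of the first hit, then through all genres
    for genre_id in results[0]['genre_ids']:
        for genre in genres_lst['genres']:
            if genre['id'] == genre_id:
                lst.append(genre['name'])
    return lst


class FakeSearch:
    """Hypothetical stand-in for tmdb.Search() with a canned response."""

    def movie(self, query):
        if query == 'The Avengers':
            return {'results': [{'genre_ids': [878, 28, 12]}]}
        return {'results': []}


# Shortened genre table; the full one comes from genre.movie_list()
sample_genres = {'genres': [{'id': 28, 'name': 'Action'},
                            {'id': 12, 'name': 'Adventure'},
                            {'id': 878, 'name': 'Science Fiction'}]}

labels = genre_labeller('The Avengers', FakeSearch(), sample_genres)
print(labels)  # ['Science Fiction', 'Action', 'Adventure']
```

Returning an empty list when a title has no match keeps the apply call from failing on screenplays that TMDB cannot find.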

To run this function, we’re going to use the Pandas apply method to create a new column in our existing dataframe.

#applying function on all titles in dataset
data['genre'] = data.title.apply(lambda x: genre_labeller(x))

The result will look like:

print(data.loc[:, ['title', 'genre']])

As you see, each row has a list of genres under the genre column.

For more information on how to use the Pandas apply method, check out its documentation.

4. Creating our targets using one-hot-encoding

To train a multi-label classification model, the targets have to be in binary format. More specifically, each genre will be its own column and if the screenplay matches that genre, there will be a 1, and if not then a 0.

For example, the movie Knocked Up is labeled as a Comedy, Romance, and Drama movie. The columns for those genres for this movie will look like this:

However, Knocked Up isn’t labeled as any of the other 15 genres, therefore, the entire row will look like this:

Altogether, we want the entire dataframe to look like this:

To achieve this goal, we’re first going to clean the genre column.

Because of how the data was queried and saved from TMDB, the values in the genre column are strings rather than lists. Therefore, it’s necessary to clean the column before one-hot encoding it.

Next, we’re going to change Science Fiction to SciFi and delete the TV Movie genre. Alternatively, you can keep the TV Movie genre; I deleted it because I did not consider it a genre.
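The lst_breaker helper is not reproduced in this post. A minimal sketch of what it might look like is below, assuming each value in the genre column is the string form of a Python list, such as "['Action', 'Science Fiction']". The use of ast.literal_eval is my choice for this sketch and may differ from the original implementation.

```python
import ast

def lst_breaker(value):
    """Turn a stringified genre list into a cleaned Python list."""
    if isinstance(value, str):
        try:
            parsed = ast.literal_eval(value)  # safely parse "['A', 'B']"
        except (ValueError, SyntaxError):
            parsed = []                       # unparseable text: no genres
    else:
        parsed = value
    genres = parsed if isinstance(parsed, (list, tuple)) else []

    cleaned = []
    for name in genres:
        if name == 'TV Movie':                # dropped, as described above
            continue
        cleaned.append('SciFi' if name == 'Science Fiction' else name)
    return cleaned

print(lst_breaker("['Science Fiction', 'Action', 'TV Movie']"))
# ['SciFi', 'Action']
```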

Run the function using the pandas .apply method.

#applying lst_breaker function
data['genre'] = data.genre.apply(lambda x: lst_breaker(x)).copy()

Then we’re going to create a list of genres.

#creating a list of genres to one-hot encode

genre_lst = []

for i in data.genre:
    for x in i: #loops through genre column and appends genre names to list
        genre_lst.append(x)

genre_lst = list(set(genre_lst)) #converts to a set to remove duplicates
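A set has no guaranteed order, so the genre list (and later the column order) can differ between runs. If a stable order matters, one option is to sort the unique names. The sketch below uses a small hypothetical genre column in place of data.genre.

```python
from itertools import chain

# Hypothetical stand-in for data.genre: one list of genres per screenplay
genre_column = [['Comedy', 'Romance', 'Drama'],
                ['SciFi', 'Action', 'Adventure'],
                ['Action', 'Comedy']]

# Flatten the nested lists, drop duplicates, and sort alphabetically
unique_genres = sorted(set(chain.from_iterable(genre_column)))

print(unique_genres)
# ['Action', 'Adventure', 'Comedy', 'Drama', 'Romance', 'SciFi']
```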

Print the list:

print(genre_lst)
['Crime',
'Romance',
'Animation',
'SciFi',
'Fantasy',
'History',
'Action',
'Drama',
'War',
'Thriller',
'Mystery',
'Documentary',
'Horror',
'Family',
'Adventure',
'Music',
'Comedy',
'Western']

As you see, we have a list of all 18 genres we wanted. Now, we’re going to write a function to one-hot-encode each row in our dataframe.

def genre_encoding(movie_genres, genre):
    """
    Takes a list of genres and a genre name. Returns 1 if the genre name
    is in the list, else 0. This is the function that one-hot encodes our targets.
    """
    if genre in movie_genres:
        return 1
    else:
        return 0

The function has two parameters, movie_genres and genre. The movie_genres variable is the list of genres associated with a specific movie, and genre is the single genre name being checked. For example, if we were one-hot encoding Knocked Up, then the movie_genres variable would be set to ['Comedy', 'Drama', 'Romance'].
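As a quick illustration, the sketch below calls genre_encoding once per genre to build the one-hot row for Knocked Up. It uses a plain dictionary rather than a dataframe so it runs on its own. The commented lines show one way the same call could be applied per column in pandas, which is an assumption about the step rather than the original code.

```python
def genre_encoding(movie_genres, genre):
    """Return 1 if `genre` is in `movie_genres`, else 0."""
    return 1 if genre in movie_genres else 0

# The 18 genres from the list printed earlier
genre_lst = ['Crime', 'Romance', 'Animation', 'SciFi', 'Fantasy', 'History',
             'Action', 'Drama', 'War', 'Thriller', 'Mystery', 'Documentary',
             'Horror', 'Family', 'Adventure', 'Music', 'Comedy', 'Western']

knocked_up = ['Comedy', 'Drama', 'Romance']

# One 0/1 entry per genre for this movie
row = {g: genre_encoding(knocked_up, g) for g in genre_lst}

print([g for g, flag in row.items() if flag == 1])
# ['Romance', 'Drama', 'Comedy']

# Possible pandas version, assuming `data` has a list-valued `genre` column:
# for g in genre_lst:
#     data[g] = data.genre.apply(lambda genres: genre_encoding(genres, g))
```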

Our new dataframe:

We’re finally done labeling our data. I’d recommend saving the dataframe as a new CSV file.

In the next part of the series, I will be demonstrating how to use the wordcloud package to illustrate the counts of words by genre.
