A couple of days ago my wife asked me if there was any way to find out the average rating of the movies in which an actress had taken part. She also wanted to know if it was possible to see how that actress's ratings had evolved over the years.
That information probably already exists on IMDb or similar sites, but I thought it was a good opportunity to play a little with Python, Pandas, Matplotlib and NumPy. Here are some of the tests I did.
Obtaining the IMDb ID
The first thing we needed to know was the IMDb ID of the actor/actress.
I found that IMDb has a kind of undocumented API that lets you retrieve the ID from the actor/actress name, so I put together a few lines of Python to get it.
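A minimal sketch of what such a lookup might look like. The endpoint URL and the shape of the JSON it returns (a "d" list whose entries carry an "id" field, with person IDs starting with "nm") are assumptions on my side: the API is undocumented and may change or disappear. The function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the undocumented suggestion API; not guaranteed to be stable.
SUGGEST_URL = "https://v2.sg.media-imdb.com/suggestion/{first}/{query}.json"


def build_suggest_url(name):
    """Build the lookup URL for a name (the first letter is part of the path)."""
    query = name.strip().lower()
    return SUGGEST_URL.format(first=query[0], query=urllib.parse.quote(query))


def extract_person_id(payload):
    """Return the first person ID (prefix 'nm') found in the payload, or None."""
    for entry in payload.get("d", []):
        if entry.get("id", "").startswith("nm"):
            return entry["id"]
    return None


def find_person_id(name):
    """Query the suggestion endpoint and return the best-matching person ID."""
    with urllib.request.urlopen(build_suggest_url(name), timeout=10) as resp:
        return extract_person_id(json.load(resp))
```

Keeping the parsing in its own function means it can be checked against a saved response without hitting the network.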
Getting the movies of a specific ID
Next, I needed to know the movies in which the actress had participated. I didn’t find an API for this, so I did a bit of web scraping.
I thought about using BeautifulSoup, but using lxml to find the class “filmo-row” and extract the list of movie IDs was quite straightforward.
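The scraping step could be sketched roughly like this with lxml. The markup is an assumption: I assume each “filmo-row” is a div containing a link of the form /title/tt.../, which may no longer match the live page. The function name is hypothetical.

```python
import re

from lxml import html


def extract_movie_ids(page_source):
    """Return the movie IDs found in the 'filmo-row' blocks, in page order."""
    tree = html.fromstring(page_source)
    # Match 'filmo-row' as a whole class token, since rows also carry 'odd'/'even'.
    rows = tree.xpath(
        '//div[contains(concat(" ", normalize-space(@class), " "), " filmo-row ")]'
    )
    movie_ids = []
    for row in rows:
        for href in row.xpath(".//a/@href"):
            match = re.search(r"/title/(tt\d+)", href)
            if match:
                if match.group(1) not in movie_ids:
                    movie_ids.append(match.group(1))
                break  # only the first title link in each row is used
    return movie_ids
```

The page source itself can be downloaded with any HTTP client and passed in as a string.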
Then, I needed to process all the “filmo-row”s in the list in order to get the movie data. I wanted to know at least the ID, title, year and imdbRating. This time I used the OMDb API to get the data I wanted.
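A sketch of the OMDb step, assuming the lookup-by-ID request (the "i" and "apikey" query parameters) and the string-valued "Title", "Year" and "imdbRating" fields of its JSON response. The helper names are hypothetical, and you need your own API key.

```python
import json
import urllib.parse
import urllib.request

OMDB_URL = "https://www.omdbapi.com/"


def parse_movie(payload):
    """Keep only the fields of interest, converting year and rating to numbers."""
    if payload.get("Response") != "True":
        return None  # OMDb reports failed lookups with Response == "False"
    try:
        rating = float(payload.get("imdbRating"))
    except (TypeError, ValueError):
        rating = None  # unrated titles come back as "N/A"
    year_text = str(payload.get("Year", ""))[:4]  # series use ranges like "2010-2015"
    year = int(year_text) if year_text.isdigit() else None
    return {
        "imdbID": payload.get("imdbID"),
        "Title": payload.get("Title"),
        "Year": year,
        "imdbRating": rating,
    }


def fetch_movie(movie_id, api_key):
    """Download one movie record from OMDb and reduce it with parse_movie."""
    query = urllib.parse.urlencode({"i": movie_id, "apikey": api_key})
    with urllib.request.urlopen(f"{OMDB_URL}?{query}", timeout=10) as resp:
        return parse_movie(json.load(resp))
```

Calling fetch_movie for every ID in the list and dropping the None results gives the array of movie data used in the next section.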
Using Pandas
Once I had the array with the movie data, I converted it to a Pandas DataFrame and used pandas.DataFrame.describe to get summary statistics (count, mean, standard deviation, minimum, quartiles and maximum).
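For illustration, this is roughly what that step looks like. The movie records below are made-up placeholders, not real data, and the column names are just the ones I assume from the previous step.

```python
import pandas as pd

# Placeholder records standing in for the data collected from OMDb.
movies = [
    {"imdbID": "tt0000001", "Title": "Example A", "Year": 2001, "imdbRating": 6.0},
    {"imdbID": "tt0000002", "Title": "Example B", "Year": 2005, "imdbRating": 8.0},
    {"imdbID": "tt0000003", "Title": "Example C", "Year": 2010, "imdbRating": None},
]

df = pd.DataFrame(movies)

# describe() skips missing ratings, so unrated movies do not distort the mean.
summary = df["imdbRating"].describe()
print(summary)

# Average rating per year, as a starting point for plotting the evolution over time.
by_year = df.dropna(subset=["imdbRating"]).groupby("Year")["imdbRating"].mean()
print(by_year)
```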
Next, to see the best and the worst movies of the actor/actress, I used Pandas to sort the movies by imdbRating.
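A small sketch of the sorting, again with made-up placeholder records rather than real results:

```python
import pandas as pd

# Placeholder records; in practice this is the DataFrame built from the OMDb data.
df = pd.DataFrame(
    [
        {"Title": "Example A", "Year": 2001, "imdbRating": 6.0},
        {"Title": "Example B", "Year": 2005, "imdbRating": 8.0},
        {"Title": "Example C", "Year": 2010, "imdbRating": 4.5},
        {"Title": "Example D", "Year": 2014, "imdbRating": 7.2},
    ]
)

# Drop unrated movies first so they do not end up at either end of the ranking.
rated = df.dropna(subset=["imdbRating"])

best = rated.sort_values("imdbRating", ascending=False).head(2)
worst = rated.sort_values("imdbRating", ascending=True).head(2)

print(best[["Title", "Year", "imdbRating"]])
print(worst[["Title", "Year", "imdbRating"]])
```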
Results
These are some of the results:
Leonardo DiCaprio
Angelina Jolie