Exploratory Data Analysis on IMDb movies.

5 min read · Jul 27, 2021


Part 1
By: Thomas Cornett
Files Location: GitHub

Introduction:

This is going to be a two-part tutorial: first we scrape the data off of the IMDb website, and then we explore the scraped data so that we can pull insightful information out of it.

Structure of project parts:

  • Part One: Web scraper
  • Part Two: Exploratory Data Analysis

Web Scraping

Web scraping is the practice of going onto a website and pulling information from it by parsing the site's HTML. This allows us to access and extract large amounts of data that would otherwise be impractical to collect by hand. Make sure to check the Terms and Conditions of the website before attempting this, though, as scraping can get your IP address blocked by the site. We will be using a simple web scraper built to get the full information about movies from the IMDb website. I was given the movie budgets from Flatiron School's Data Science course, but the information will be available on my GitHub page for this series. Web scraping can be intimidating, but hopefully after this you will have a better understanding of it to help with future projects as well.
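One quick way to see what a site permits is its robots.txt file. Here is a minimal sketch using Python's standard-library urllib.robotparser; note that the rules below are made up for illustration — always check the real file (e.g. https://www.imdb.com/robots.txt) before scraping.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration only -- check the site's
# real robots.txt before scraping.
example_rules = """
User-agent: *
Disallow: /private/
Allow: /search/
""".splitlines()

rp = RobotFileParser()
rp.parse(example_rules)

print(rp.can_fetch('*', 'https://example.com/search/title'))   # True  (allowed path)
print(rp.can_fetch('*', 'https://example.com/private/data'))   # False (disallowed path)
```

robots.txt is advisory, not a substitute for the Terms and Conditions, but it is a fast first check that your scraper's paths are not explicitly disallowed.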

Website Inspection

The first thing we need to do is figure out where the values that we want are located. Go to www.imdb.com/search/title and select the search parameters for the movies we want to see. For simplicity's sake we will search based only on the release date.

Editing of search functions.

We are doing it this way because it will get us a wider range of movies. If you want, you can select other sort and filter options as well. After running the search we will see this, or something similar to it:

Example of the search by title

With that, we are ready to start the web scrape.
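Before we do, it is worth noticing that the search URL itself encodes the parameters we picked, which makes it easy to build follow-up requests in code. As a sketch (the release_date parameter is taken from the URL we request below; the page size of 50 and the start parameter reflect how IMDb's advanced search paginates, so treat them as assumptions to verify against the live site):

```python
base_url = 'https://www.imdb.com/search/title/?release_date=,2020-12-31'

# IMDb's advanced search shows 50 results per page; later pages are
# reached by bumping a "start" parameter (1, 51, 101, ...).
page_urls = [f'{base_url}&start={start}' for start in range(1, 151, 50)]

for url in page_urls:
    print(url)
```

This tutorial only scrapes the first page, but the same loop structure (with a polite pause between requests) is how you would extend it to more.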

Scraping the website using Python and BeautifulSoup

We will be using Python with BeautifulSoup as our web scraping package, so we have some dependencies to download. If you don't already have a Python IDE, I would recommend either PyCharm Community Edition or Jupyter Notebook; any IDE will work, those two are just what I use. With Jupyter Notebook opened up, we will start off by installing the BeautifulSoup packages.

!pip install beautifulsoup4
!pip install requests
import requests                # for downloading the search results page
from random import randint     # random wait times between requests
from time import sleep         # pause so we don't hammer the server
from bs4 import BeautifulSoup  # the HTML parser

That should allow you to run the BeautifulSoup scraper. If not, just add extra lines with !pip install and whichever package is missing. I would recommend commenting the install lines out after they have been run (Ctrl-/ toggles comments in Jupyter). With that, the basic requirements are done. Next is pulling information from the website.

Inspecting the website

Once we have the installs done, we go back to the webpage and either press F12 or right-click and select Inspect.

inspect page

Once you click Inspect or press F12, you will see the page's HTML in a panel on the right of the screen. Alternatively, if you right-click on the poster of one of the movies and inspect it, you will be taken straight to that movie's container. That container is what we want. It will look like this:

This is the HTML code for each of the movie containers. If you look below the selected element, you can see multiple <div class="lister-item mode-advanced">…</div> elements; closer inspection shows that they hold all the information we need about the movies. This is what we need, so keep that in mind.
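To see how we will grab those containers in code, here is a minimal sketch using a hand-written HTML fragment in the same shape as IMDb's listing (the real pages have much more inside each container; the titles here are made up):

```python
from bs4 import BeautifulSoup

# A tiny hand-written fragment shaped like IMDb's search results page.
html = '''
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header"><a href="/title/tt0000001/">Movie One</a></h3>
</div>
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header"><a href="/title/tt0000002/">Movie Two</a></h3>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
containers = soup.find_all('div', class_='lister-item mode-advanced')

print(len(containers))          # 2
print(containers[0].h3.a.text)  # Movie One
```

find_all returns every matching div, and each result is itself navigable, which is exactly how the scraper below walks into each container to pull out the individual fields.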

Back to coding

With the above in mind, we will set up the rest of the scraper. First, make eight lists; it doesn't matter what you name them, as long as you remember what is going in each one.

title = []
movie_rating = []
year = []
star_rating = []
gross = []
movie_genre = []
movie_time = []
votes = []

After that we use the requests library to fetch the page holding those containers and assign the response to a variable.

page = requests.get('https://www.imdb.com/search/title/?release_date=,2020-12-31')

Once we have the response, we pass it to BeautifulSoup to parse the HTML.

soup = BeautifulSoup(page.text, 'html.parser')

Once it is parsed with BeautifulSoup, we loop through the containers and pull out the tag that each piece of data we want lives in.

container = soup.find_all('div', class_='lister-item mode-advanced')

# loop over each movie container, pulling out each of the categories
for name in container:
    # nv holds the <span name="nv"> tags so we can separate
    # the vote count from the gross income
    nv = name.find_all('span', attrs={'name': 'nv'})

    title.append(name.h3.a.text)
    movie_rating.append(name.p.span.text)
    year.append(name.h3.find('span', class_='lister-item-year').text)

    # not every movie has a star rating yet
    if name.strong is not None:
        star_rating.append(float(name.strong.text))
    else:
        star_rating.append(0)

    # gross is the second "nv" span when present, e.g. '$123.45M'
    gross_income = nv[1].text.strip('$\n\tM') if len(nv) > 1 else '0'
    gross.append(gross_income)
    # votes are the first "nv" span when any are present
    votes_total = nv[0].text if len(nv) > 0 else 0
    votes.append(votes_total)

    if name.p.find('span', class_='genre') is not None:
        movie_genre.append(name.p.find('span', class_='genre').text)
    else:
        movie_genre.append('N/A')

    if name.p.find('span', class_='runtime') is None:
        movie_time.append(0)
    else:
        movie_time.append(name.p.find('span', class_='runtime').text)

sleep(randint(3, 5))  # pause before making any further requests

With that, we have everything we need from the HTML. All that is left is to clean the data and export it to a CSV file.

Thank you for reading, and stay tuned for Part 2: Exploratory Data Analysis, where we take the scraped data and look for trends and statistics within it.
