Machine Learning 101 — Web Scraping using BeautifulSoup

Dhruv Kapoor
May 23 · 5 min read

Web scraping is a technique for accessing and extracting large amounts of data from websites. Although this can be done manually, the term usually refers to using a bot or web crawler to automatically pull the data we need from specific web pages, which saves users a great deal of time and effort. Today we'll work through a small example in which we scrape reviews from IMDb.

Photo by Pankaj Patel on Unsplash

Web Scraping in Python

Python offers a few libraries and frameworks that can be used for web scraping:

  • Requests: An HTTP library that lets us download the raw HTML of a page with a single function call. It does no parsing itself, so it is usually paired with a parser.
  • BeautifulSoup: A library for parsing HTML and XML documents. It builds a parse tree from the page source and makes it easy to search for and navigate to the elements we need.
  • Scrapy: This framework is the complete package in the sense that it automates all tasks, from downloading our pages in HTML to storing the results in the format we want. Since it is a full-fledged framework, the learning curve is steep and it may be too complex for simple tasks.

For our purposes, we will make use of BeautifulSoup and Requests to extract reviews from IMDb. Our focus is going to be on Indian shows on Netflix and Amazon Prime Video. Our list of shows includes:

Netflix:

  1. Sacred Games

A plethora of Indian TV shows have made their mark on OTT platforms

Amazon Prime Video:

  1. Mirzapur

A Word of Caution

Many websites do not allow web scraping, often because their pages contain sensitive or proprietary information. To check whether a web scraper may be allowed, first visit the site's /robots.txt page, which lists the paths that crawlers may and may not access. The IMDb /robots.txt page displays the following:

https://www.imdb.com/robots.txt
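Python's standard library can interpret a robots.txt file for us. The snippet below parses a fabricated robots.txt (not IMDb's actual file) and checks whether a given path may be fetched:

```python
from urllib.robotparser import RobotFileParser

# fabricated robots.txt rules for illustration -- not IMDb's actual file
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# check whether a generic crawler ('*') may fetch these paths
print(rp.can_fetch('*', 'https://example.com/title/tt0000001/reviews'))  # True
print(rp.can_fetch('*', 'https://example.com/private/secret'))           # False
```

Running the same check against a site's real robots.txt (via `rp.set_url(...)` and `rp.read()`) tells you before scraping whether your target path is off limits.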

Also, it is important not to overuse a web scraper by making a huge number of requests in a short amount of time. This could lead to your IP address being flagged and subsequently blocked, so always be careful. Many websites, such as Yahoo! Finance and Twitter, offer well-developed APIs for extracting data nowadays, so consider using those instead.
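One simple way to stay polite is to wrap our requests in a small helper that pauses between calls and identifies the scraper via a User-Agent header. This is only a sketch; the scraper name and contact address below are placeholders you should replace with your own:

```python
import time
import requests

# identify your scraper honestly; the name and contact here are placeholders
HEADERS = {'User-Agent': 'review-scraper/0.1 (contact: you@example.com)'}

def polite_get(url, delay=2.0):
    """Fetch a URL, pausing first so repeated calls stay well spaced out."""
    time.sleep(delay)
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response
```

Calling `polite_get` in a loop over many URLs then keeps at least `delay` seconds between requests.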


Let the Web Scraping begin!

The first task is to open our web page and inspect it by right-clicking on the element we wish to extract information from. We are currently extracting reviews for Sacred Games, an Indian show which took Netflix by storm in 2018.

Once we click on Inspect, the DevTools panel opens on the right side of the page (since I'm using Google Chrome). After this, we have to find the various divs that contain the information we're looking for.

from bs4 import BeautifulSoup as bs
from requests import get
url = 'https://www.imdb.com/title/tt6077448/reviews?ref_=tt_urv'
response = get(url)
bs4object = bs(response.text, features='html.parser')

The code cell above opens up the URL we specify and uses the BeautifulSoup HTML parser to store our data as a BS4 object. We will extract the reviews, review titles, usernames, and the date on which the review was posted.

The DevTools Tab opens up once we inspect our page

As we can see from the screenshot above, all our reviews are stored in a div whose class is content. Upon further inspection, we observe that the data we require is stored in various containers as follows:

# container which has all usernames
user = bs4object.find_all('span', attrs={'class':'display-name-link'})
# container which has all review dates
review_dates = bs4object.find_all('span', attrs={'class':'review-date'})
# container which has all review titles
review_titles = bs4object.find_all('a', attrs={'class':'title'})
# container which has all reviews
review_tags = bs4object.find_all('div', attrs={'class':'text show-more__control'})
# name of TV show
name = bs4object.find('meta', property='og:title')

The code above extracts all the information we mentioned earlier and stores it in these containers. For now, we've only managed to extract reviews for Sacred Games; to get reviews for other shows, all we need to do is change the URL. Thus, we can define a function as follows:

# lists that accumulate data across shows
username, review_date, review_title, review, show_name = [], [], [], [], []

def get_review(url):
    response = get(url)
    bs4object = bs(response.text, features='html.parser')
    # container which has all usernames
    user = bs4object.find_all('span', attrs={'class':'display-name-link'})
    # container which has all review dates
    review_dates = bs4object.find_all('span', attrs={'class':'review-date'})
    # container which has all review titles
    review_titles = bs4object.find_all('a', attrs={'class':'title'})
    # container which has all reviews
    review_tags = bs4object.find_all('div', attrs={'class':'text show-more__control'})
    # name of TV show
    name = bs4object.find('meta', property='og:title')
    for i in range(len(user)):
        username.append(user[i].text)
        review_date.append(review_dates[i].text)
        review_title.append(review_titles[i].text)
        review.append(review_tags[i].text)
        show_name.append(name['content'])
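To sanity-check these selectors without hitting IMDb, we can run them on a tiny inline HTML snippet. The markup below is fabricated to mirror the class names used above; it is not real IMDb output:

```python
from bs4 import BeautifulSoup as bs

# fabricated markup that mirrors the class names on IMDb's review page
html = '''
<div class="lister-item-content">
  <span class="display-name-link"><a href="#">alice</a></span>
  <span class="review-date">1 January 2020</span>
  <a class="title">Great show</a>
  <div class="text show-more__control">Loved every episode.</div>
</div>
'''

soup = bs(html, features='html.parser')
users = soup.find_all('span', attrs={'class': 'display-name-link'})
dates = soup.find_all('span', attrs={'class': 'review-date'})
titles = soup.find_all('a', attrs={'class': 'title'})
reviews = soup.find_all('div', attrs={'class': 'text show-more__control'})

print(users[0].text, '|', dates[0].text, '|', titles[0].text)
# alice | 1 January 2020 | Great show
```

If IMDb ever changes its class names, a quick test like this makes the breakage obvious before you debug the full scraper.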

Since we put all of our data into lists, we can combine it into a pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'show': show_name, 'username': username,
                   'date': review_date, 'title': review_title, 'review': review})

Now let’s save our data to a CSV file so that we can access it any time we want. Since I prefer to use a Google Colab environment, the code looks like this:

from google.colab import files
df.to_csv('indian_shows.csv')
files.download('indian_shows.csv')
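Outside Colab, plain pandas is enough to write the file and load it back. Here is a minimal round trip using a stand-in DataFrame with fabricated rows but the same columns our scraper produces:

```python
import pandas as pd

# a stand-in DataFrame with the same columns our scraper produces
df = pd.DataFrame({'show': ['Sacred Games'],
                   'username': ['alice'],
                   'date': ['1 January 2020'],
                   'title': ['Great show'],
                   'review': ['Loved every episode.']})

df.to_csv('indian_shows.csv', index=False)  # index=False keeps the row index out of the file
back = pd.read_csv('indian_shows.csv')

print(back.shape)  # (1, 5)
```

Passing `index=False` avoids an extra unnamed column appearing when the CSV is read back later.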

Our CSV file now contains all our information as shown below:

Our CSV File after Web Scraping

Congrats, you’ve created your first web scraper! Thanks for reading and stay tuned for more!

Note: The code in this article only extracts the reviews that load initially on our IMDb page; the rest are loaded dynamically. Now that you know where to start, try to extract all the reviews from this page. (Hint: check out Selenium!)

