Machine Learning 101 — Web Scraping using BeautifulSoup
Web scraping is a basic technique for accessing and extracting large amounts of data from websites. Although it can be done manually, the term usually refers to using a bot or web crawler to automatically extract the data we need from specific web pages, which saves users a lot of time and effort. Today we'll go through a small example in which we scrape reviews from IMDb.
Web Scraping in Python
Python offers a few libraries and frameworks that can be used for web scraping:
- Scrapy: This library is the complete package in the sense that it automates all tasks, including downloading our pages in HTML and storing them in the format we want. Since it is a full-fledged framework, the learning curve is very steep and it may be too complex for simple tasks.
- BeautifulSoup: A very beginner-friendly library which has a small learning curve. It is essentially a parsing library that creates a parse tree and helps us find our data on web pages easily.
- Selenium: A library for automation that accesses browsers and extracts data from the HTML that is rendered via JavaScript. It mimics human actions by performing clicks, selection, and scrolling.
- Urllib: A package with several modules for working with uniform resource locators (URLs). It is part of the Python standard library and defines simple functions and classes for performing URL actions.
- Requests: Similar to the urllib library, except it provides a much simpler interface and is extremely useful for straightforward tasks (see the short comparison below).
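To see the difference in ergonomics, here is a minimal sketch that fetches the same page with urllib and with Requests; the URL is just an example, any public page works:

from urllib.request import urlopen
from requests import get

url = 'https://www.imdb.com/robots.txt'  # example page

# urllib: returns raw bytes that we must decode ourselves
with urlopen(url) as response:
    text_urllib = response.read().decode('utf-8')

# Requests: decodes the body for us and exposes a friendlier interface
text_requests = get(url).text

print(text_urllib[:60])
print(text_requests[:60])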
For our purposes, we will make use of BeautifulSoup and Requests to extract reviews from IMDb. Our focus is going to be on Indian shows on Netflix and Amazon Prime Video. Our list of shows includes:
Netflix:
- Sacred Games
- Little Things
- Jamtara: Sabka Number Aayega
- Bard of Blood
- Leila
Amazon Prime Video:
- Mirzapur
- Made in Heaven
- Inside Edge
- Panchayat
- The Family Man
A Word of Caution
Many websites do not allow web scraping of some or all of their pages because they contain sensitive information. To check whether a web scraper may be allowed, first visit the /robots.txt page of that website. IMDb's robots.txt page (https://www.imdb.com/robots.txt) lists which paths crawlers may and may not access.
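You can also run this check programmatically. Here is a minimal sketch using Python's built-in urllib.robotparser; the generic user agent '*' and the reviews path are just examples:

from urllib.robotparser import RobotFileParser

# download and parse IMDb's robots.txt
rp = RobotFileParser()
rp.set_url('https://www.imdb.com/robots.txt')
rp.read()

# ask whether a generic crawler ('*') may fetch the reviews page
print(rp.can_fetch('*', 'https://www.imdb.com/title/tt6077448/reviews'))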
Also, it is important to remember not to overuse a web scraper by making a huge number of requests in a short amount of time. This could get your IP address flagged and subsequently blocked, so always be careful; a simple courtesy is to pause between requests, as sketched below. Many websites, such as Yahoo! Finance and Twitter, now offer well-developed APIs for extracting data, so use those where available.
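A minimal sketch of such a pause, where the one-second delay is an arbitrary choice rather than an IMDb requirement:

from time import sleep
from requests import get

urls = ['https://www.imdb.com/title/tt6077448/reviews']  # pages we plan to fetch

for url in urls:
    response = get(url)
    # ...process the response here...
    sleep(1)  # wait a second before firing the next request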
Let the Web Scraping begin!
The first task is to open our web page and inspect it by right-clicking on the element we wish to extract information from. We are currently extracting reviews for Sacred Games, an Indian show which took Netflix by storm in 2018.
Once we click on Inspect, the DevTools panel opens up on the right side of the page (since I'm using Google Chrome). After this, we have to find the various divs that contain the information we're looking for.
from bs4 import BeautifulSoup as bs
from requests import get
url = 'https://www.imdb.com/title/tt6077448/reviews?ref_=tt_urv'
response = get(url)
bs4object = bs(response.text, features='html.parser')
The code cell above fetches the URL we specify and parses the response with the BeautifulSoup HTML parser, storing our data as a BS4 object. We will extract the reviews, review titles, usernames, and the dates on which the reviews were posted.
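Before parsing, it is also worth confirming that the request actually succeeded; a quick check on the response object we already have:

# a 200 status code means the page was fetched successfully
if response.status_code != 200:
    raise RuntimeError(f'Request failed with status {response.status_code}')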
As we can see in DevTools, all our reviews are stored in a div whose class attribute is content. Upon further inspection, we observe that the data we require is stored in various containers, as follows:
# container which has all usernames
user = bs4object.find_all('span', attrs={'class':'display-name-link'})

# container which has all review dates
review_dates = bs4object.find_all('span', attrs={'class':'review-date'})

# container which has all review titles
review_titles = bs4object.find_all('a', attrs={'class':'title'})

# container which has all reviews
review_tags = bs4object.find_all('div', attrs={'class':'text show-more__control'})

# name of TV show
name = bs4object.find('meta', property='og:title')
The code above extracts all the information we mentioned earlier and stores it in containers. For now, we've only managed to extract reviews for Sacred Games. To get reviews for other shows, all we need to do is replace the URL. Thus, we can define a function as follows:
# lists that will accumulate data across every show we scrape
show_name, username, review_date, review_title, review = [], [], [], [], []

def get_review(url):
    response = get(url)
    bs4object = bs(response.text, features='html.parser')

    # container which has all usernames
    user = bs4object.find_all('span', attrs={'class':'display-name-link'})

    # container which has all review dates
    review_dates = bs4object.find_all('span', attrs={'class':'review-date'})

    # container which has all review titles
    review_titles = bs4object.find_all('a', attrs={'class':'title'})

    # container which has all reviews
    review_tags = bs4object.find_all('div', attrs={'class':'text show-more__control'})

    # name of TV show
    name = bs4object.find('meta', property='og:title')

    for i in range(len(user)):
        username.append(user[i].text)
        review_date.append(review_dates[i].text)
        review_title.append(review_titles[i].text)
        review.append(review_tags[i].text)
        show_name.append(name['content'])
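We can now call get_review once per show. Only the Sacred Games URL comes from the example above; the other reviews URLs are left as placeholders for you to look up on IMDb:

show_urls = [
    'https://www.imdb.com/title/tt6077448/reviews?ref_=tt_urv',  # Sacred Games
    # ...add the reviews URLs for the remaining nine shows here...
]

for url in show_urls:
    get_review(url)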
Since we put all of our data into lists, we can combine it into a pandas DataFrame:
import pandas as pd

df = pd.DataFrame({'show': show_name, 'username': username,
                   'date': review_date, 'title': review_title, 'review': review})
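As a quick sanity check, you can preview the DataFrame in a Colab cell; the exact row count depends on how many reviews IMDb serves per page:

print(df.shape)  # (number of reviews collected, 5 columns)
df.head()        # preview the first few rows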
Now let's save our data to a CSV file so that we can access it any time we want. Since I prefer to work in a Google Colab environment, the code looks like this:
from google.colab import files
df.to_csv('indian_shows.csv')
files.download('indian_shows.csv')
Our CSV file now contains all of our information: one row per review, with the show name, username, date, review title, and review text.
Congrats, you've created your first web scraper! Thanks for reading, and stay tuned for more!
Note: The code in this article only extracts the limited set of reviews that the IMDb page serves initially. Now that you know where to start, try to extract all the reviews from this page. (Hint: Check out Selenium!)
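As a starting point, here is a minimal sketch of the Selenium approach: drive a real browser, repeatedly click the page's load-more control, and then hand the fully rendered HTML to BeautifulSoup. The CSS selector below is a placeholder; confirm the actual button's selector in DevTools before relying on it.

from time import sleep
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.imdb.com/title/tt6077448/reviews?ref_=tt_urv')

# keep clicking the load-more control until it disappears;
# 'button.load-more' is a placeholder selector, check DevTools for the real one
while True:
    try:
        button = driver.find_element(By.CSS_SELECTOR, 'button.load-more')
        button.click()
        sleep(2)  # give the new reviews time to load
    except Exception:
        break  # no more button: all reviews are on the page

# parse the fully rendered page exactly as before
bs4object = bs(driver.page_source, features='html.parser')
driver.quit()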