Web scraping, also known as web data extraction, web harvesting, or screen scraping, is a vital technique in today’s world. Through web scraping you can extract useful public information from your target websites and put it together for data analysis, product comparison, statistical reports, and much more. Python is one of the most popular languages for web scraping, and today I am going to give an example of extracting data from IMDb’s website. We are going to get the top 250 movies of all time and display 10 random movies from that list to the user.
So, let's dive in without spending any more time! At the end, I am going to explain why I chose this coding structure. I am assuming you have a basic understanding of Python and HTML. We need the BeautifulSoup package (bs4) and the requests package in Python to follow this tutorial.
First, write the following command in the terminal and press Enter to install the BeautifulSoup and requests packages:
pip install bs4 requests
Then import the following modules at the top of the file:
from bs4 import BeautifulSoup
import requests
import re
import random
Now we are going to write a class named ExtractMovies. You can, of course, choose any other name if you want to!
# Python class for declaring movie attributes.
class ExtractMovies:
    def __init__(self, title, year, star, ratings, position=None):
        self.position = position
        self.title = title
        self.year = year
        self.star = star
        self.ratings = ratings

# Function to cut the rating string down to two decimal places.
def first2(s):
    return s[:4]
Here, we are declaring the attributes of a single movie and storing them in an object. Later on, we are going to populate each movie object with its own attributes. We will see what the function first2 is for later on, so chill for now!
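To make the idea concrete, here is a minimal sketch of building one movie object by hand. The sample values are made up for illustration, and the class is repeated with position as an explicit optional argument (my assumption) so that the snippet runs on its own.

```python
class ExtractMovies:
    """Holds the attributes of a single movie."""

    def __init__(self, title, year, star, ratings, position=None):
        self.position = position
        self.title = title
        self.year = year
        self.star = star
        self.ratings = ratings


def first2(s):
    # Keep only the first four characters of the rating string.
    return s[:4]


# Illustrative values only; the real values come from the scraped page.
sample = ExtractMovies(
    title="The Shawshank Redemption",
    year="(1994)",
    star="Frank Darabont (dir.), Tim Robbins, Morgan Freeman",
    ratings=first2("9.234567"),
    position=1,
)

print(sample.position, "|", sample.title, sample.year, sample.ratings)
```

Every attribute is then available with dot notation, for example sample.title or sample.ratings.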
url = 'https://www.imdb.com/chart/top/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name=ir]')]
years = soup.select('span.secondaryInfo')
# Temporary list to store class instances
_temp_ = []
In the above part:
First line: We declare the URL as a variable. This is the URL of the IMDb top movies chart: https://www.imdb.com/chart/top/
Second line: We send an HTTP request to the given URL and store the response, whose HTML we can read as text.
Third line: Beautifulsouping the elements! That is, we parse the response text into a soup object, which we will use to select and process the elements.
Fourth line onwards: With “soup.select” we select elements of the HTML document at the requested URL using CSS selectors.
One more thing: are you wondering what “td.titleColumn”, “href”, “title”, or “posterColumn” are doing? These are the tag, class, and attribute names of the elements of the HTML page we are working with. You can open the URL and inspect the page with your browser's developer tools to understand more. You can also follow this link to view the detailed documentation on the different ways of using BeautifulSoup.
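If you want to try the selectors without hitting the network, here is a small sketch that runs them against a hand-written HTML fragment. The fragment is hypothetical: it only mimics the table structure that the selectors in this tutorial expect, and the rating values in it are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the chart rows the selectors target.
html = """
<table>
  <tr>
    <td class="posterColumn"><span name="ir" data-value="9.234567"></span></td>
    <td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins">The Shawshank Redemption</a>
      <span class="secondaryInfo">(1994)</span>
    </td>
  </tr>
  <tr>
    <td class="posterColumn"><span name="ir" data-value="9.156789"></span></td>
    <td class="titleColumn">
      2.
      <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando">The Godfather</a>
      <span class="secondaryInfo">(1972)</span>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# "td.titleColumn" means: every <td> whose class is titleColumn.
cells = soup.select("td.titleColumn")
# "td.titleColumn a" means: every <a> nested inside such a cell.
links = [a.attrs.get("href") for a in soup.select("td.titleColumn a")]
crew = [a.attrs.get("title") for a in soup.select("td.titleColumn a")]
# "span[name=ir]" means: every <span> whose name attribute equals "ir".
ratings = [s.attrs.get("data-value") for s in soup.select("td.posterColumn span[name=ir]")]
years = [s.get_text() for s in soup.select("span.secondaryInfo")]

print(links)
print(crew)
print(ratings)
print(years)
```

The part before the dot is the tag name, the part after it is the class, and attrs.get reads one attribute from a matched element.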
for index in range(0, len(movies)):
    movie_string = movies[index].get_text()
    movie = (' '.join(movie_string.split()).replace('.', ''))
    position = index + 1
    # Skip the rank digits and the space after them; drop the trailing " (YYYY)".
    movie_title = movie[len(str(position))+1:-7]
    year = years[index].get_text()
    movie_instance = ExtractMovies(movie_title, year, crew[index], first2(ratings[index]))
    movie_instance.position = position
    _temp_.append(movie_instance)
Here we loop over the movies list that we got earlier and extract each piece of data into its own variable. We then pass those values to a class instance and append it to the _temp_ list that we created earlier. And now the first2 function: we use it to cut the rating down to two decimal places by keeping only the first four characters of the string, which truncates rather than rounds. The rating here is a string; you may convert it to a float by other means if required.
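The slicing in the loop is easier to follow on a single cell. The sketch below wraps it in a helper and also shows one possible way of turning the rating into a rounded float. Both function names, split_title and rating_to_float, are hypothetical and are not part of the tutorial's code, and the sample strings are made up.

```python
def split_title(cell_text, position):
    """Strip the leading rank and the trailing year from a chart cell's text."""
    # Collapse all runs of whitespace, then drop the dots, as in the loop above.
    flat = " ".join(cell_text.split()).replace(".", "")
    # Skip the rank digits plus one space; drop the 7-character " (YYYY)" tail.
    return flat[len(str(position)) + 1:-7]


def rating_to_float(raw):
    """Round a rating string to two decimal places instead of slicing it."""
    return round(float(raw), 2)


print(split_title("\n 1.\n The Godfather\n (1972)\n", 1))
print(split_title("10.\n Fight Club\n (1999)", 10))
print(rating_to_float("9.156789"))
```

Note the difference between the two approaches to ratings: slicing "9.156789" to four characters gives the string "9.15", while rounding gives the number 9.16.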
random.shuffle(_temp_)
i = 1
for obj in _temp_:
    print(i, "|", obj.title, '\n', obj.year, '\n', obj.star, '\n', obj.ratings, '\n')
    i = i + 1
    if i == 11:
        break
In this last part, we first shuffle the list to get random movies, and then we print the output in a decorated format. The counter i is incremented after every printed movie, and once it reaches 11, meaning 10 movies have been printed, we break out of the for loop.
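As an alternative to shuffling and counting, the standard library's random.sample can pick 10 distinct items directly. This is a swapped-in technique, not the approach used in this tutorial, and the list of placeholder titles below stands in for the list of movie objects.

```python
import random

# Stand-in list; in the tutorial this would be the list of movie objects.
movies = [f"Movie {n}" for n in range(1, 251)]

# random.sample returns k distinct items and leaves the original list untouched.
picks = random.sample(movies, k=10)

# enumerate supplies the running number, so no manual counter is needed.
for i, title in enumerate(picks, start=1):
    print(i, "|", title)
```

The original list keeps its order this way, which is handy if you still need the chart ranking afterwards.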
I chose to use class instances because they give you more freedom: you can easily reuse the class anywhere in your code if you want to extend it further! You can also do this by putting the movies in a dictionary. I am going to explain the differences between dictionaries, lists, and class objects in one of my future blogs.
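As a rough illustration of the two options, here is the same movie stored once as a dictionary and once as an object. The Movie class and its label method are hypothetical names made up for this sketch, and the values are illustrative only.

```python
# Option 1: a plain dictionary, accessed by key.
movie_as_dict = {
    "position": 1,
    "title": "The Shawshank Redemption",
    "year": "(1994)",
    "ratings": "9.23",
}


# Option 2: a class, which can also carry behaviour next to the data.
class Movie:
    def __init__(self, position, title, year, ratings):
        self.position = position
        self.title = title
        self.year = year
        self.ratings = ratings

    def label(self):
        # A small helper method; a dictionary has no place for this.
        return f"{self.position} | {self.title} {self.year}"


movie_as_obj = Movie(**movie_as_dict)

print(movie_as_dict["title"])
print(movie_as_obj.title)
print(movie_as_obj.label())
```

The dictionary is quicker to write, while the class lets you add methods such as label later without touching the code that creates the objects.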
Oh! I forgot to mention, this is my first ever blog online!😊 I am so excited to write this article and publish it here on Medium! I appreciate your reviews and feedback, and any suggestions on what I should write about next! 🤞🤞
The entire code of this tutorial is as follows: