Web Scraping Using Python

Akshita Chugh
Jan 23 · 2 min read

We can extract information from a website by copying it and pasting it into a file by hand. However, this manual process becomes cumbersome when we want to obtain large amounts of information as quickly as possible. In such a situation, web scraping helps: it is a method of downloading large amounts of data from websites programmatically, using code or an API.

Python is one of the most popular languages for web scraping because libraries like Scrapy, Beautiful Soup, and Selenium make scraping websites a cakewalk. Web scraping involves inspecting the web page, finding the HTML tags associated with the information we need, and using a library like Beautiful Soup or Selenium to scrape the HTML page. After this step, we reshape the scraped data into the form we need using the pandas library.

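The packages we need for this task are requests, pandas, and Beautiful Soup (installed as beautifulsoup4). The code below also passes 'lxml' as the parser, so the lxml package must be installed as well.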

  • The requests package allows us to connect to the site of our choice. In this example, we want to connect to the IMDB top 1000 movies webpage.
  • The beautifulsoup4 package allows us to parse the HTML of the site and convert it to a BeautifulSoup object, which represents the HTML as a nested data structure.
  • The Pandas package allows dataset manipulation.
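
To make the "nested data structure" idea concrete, here is a minimal sketch that parses a small, made-up HTML snippet instead of a live page. The snippet and its contents are assumptions for illustration only; the class names simply mirror the ones used later in this article, and the built-in "html.parser" is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; not copied from any real page.
html = """
<div class="lister-item">
  <h3 class="lister-item-header">
    <a href="/title/tt0000001/">Example Movie</a>
    <span class="lister-item-year">(1999)</span>
  </h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Tags nest: the <a> and <span> live inside the <h3>, so we search from the <h3>.
header = soup.find("h3", class_="lister-item-header")
title = header.find("a").text
year = header.find("span", class_="lister-item-year").text

print(title, year)
```

Searching from a parent tag, rather than from the whole document, is what keeps related fields (a title and its year) tied together.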

The code below illustrates the concept.

import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating"
# makes a request to the web page and gets its HTML
r = requests.get(url)
# stores the HTML page in 'soup', a BeautifulSoup object
soup = BeautifulSoup(r.content,'lxml')
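
The request above assumes the page always loads. As a hedged sketch, the helper below adds a timeout, an explicit User-Agent header, and a status check, since some sites respond differently to, or reject, requests that carry the default client headers. The function name fetch_html, the header values, and the optional session argument are all assumptions introduced here for illustration; they are not part of the original code.

```python
import requests

# Hypothetical header values; adjust them to describe your own client.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper/0.1)",
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch_html(url, session=None, timeout=10):
    """Return the raw HTML bytes of `url`, raising on HTTP error codes."""
    # Use the caller's session if one is given, else the requests module itself.
    client = session if session is not None else requests
    response = client.get(url, headers=HEADERS, timeout=timeout)
    # Turn 4xx/5xx responses into an exception instead of parsing an error page.
    response.raise_for_status()
    return response.content
```

With this helper, the parsing step could become soup = BeautifulSoup(fetch_html(url), 'lxml'). Before scraping any site, it is also worth checking its terms of use and robots.txt.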

We wish to extract the movie title, the release year of the movie, IMDB rating, runtime, and genre from the IMDB webpage. To check the required tags that contain the information we need, right-click on the browser and select Inspect.
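
Alongside the browser inspector, a quick programmatic check can confirm which tag and class combinations actually appear in the HTML that was downloaded. The sketch below is one possible way to do that; the helper name count_classes and the sample HTML are assumptions made for illustration.

```python
from collections import Counter

from bs4 import BeautifulSoup


def count_classes(html):
    """Count how often each (tag name, class name) pair occurs in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    # class_=True matches every tag that has a class attribute.
    for tag in soup.find_all(class_=True):
        for class_name in tag.get("class", []):
            counts[(tag.name, class_name)] += 1
    return counts


# Made-up sample; on a real page, pass the downloaded HTML instead.
sample = (
    '<h3 class="lister-item-header"><a>A</a></h3>'
    '<span class="runtime">142 min</span>'
    '<span class="runtime">175 min</span>'
)
counts = count_classes(sample)
print(counts.most_common(5))
```

If a class seen in the inspector has a count of zero here, the downloaded HTML probably differs from what the browser renders, and the selectors need adjusting.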

Using the code below, we can extract the required information from the parsed page.

# Extracting text: movie titles and release years
Titleandreleaseyear = soup.find_all('h3', {"class": 'lister-item-header'})
titles = [movie.find('a').text for movie in Titleandreleaseyear]
release = [movie.find('span', class_='lister-item-year text-muted unbold').text for movie in Titleandreleaseyear]
# Extracting audience rating
Rating = soup.find_all('div', {"class": 'inline-block ratings-imdb-rating'})
imdbrating = [i.find('strong').text for i in Rating]
# Extracting runtime
Runtime = soup.find_all('span', {"class": 'runtime'})
Movieruntime = [tag.text for tag in Runtime]
# Extracting genre (strip the newlines and padding around the text)
Genre = soup.find_all('span', {"class": 'genre'})
Genres = [tag.text.replace('\n', '').strip() for tag in Genre]
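
One caveat with collecting each field into its own list: if a single movie lacks, say, a runtime, the lists end up with different lengths and no longer line up row by row. A possible alternative, sketched below, is to loop over one container per movie and read every field from inside it. The container class "lister-item-content", the helper names, and the sample HTML are assumptions for illustration; check the real page in the inspector before relying on them.

```python
from bs4 import BeautifulSoup

# Made-up sample: the second movie deliberately has no runtime.
SAMPLE_HTML = """
<div class="lister-item-content">
  <h3 class="lister-item-header"><a href="#">First Movie</a>
    <span class="lister-item-year text-muted unbold">(1994)</span></h3>
  <span class="runtime">142 min</span>
  <span class="genre">
Drama            </span>
</div>
<div class="lister-item-content">
  <h3 class="lister-item-header"><a href="#">Second Movie</a>
    <span class="lister-item-year text-muted unbold">(1972)</span></h3>
  <span class="genre">
Crime, Drama            </span>
</div>
"""


def text_or_none(parent, name, class_name):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = parent.find(name, class_=class_name)
    return tag.text.strip() if tag is not None else None


def extract_rows(soup):
    """Build one dict per movie so that fields of the same movie stay together."""
    rows = []
    for item in soup.find_all("div", class_="lister-item-content"):
        header = item.find("h3", class_="lister-item-header")
        rows.append({
            "movie": header.find("a").text,
            "year": text_or_none(header, "span", "lister-item-year"),
            "Runtime": text_or_none(item, "span", "runtime"),
            "genre": text_or_none(item, "span", "genre"),
        })
    return rows


rows = extract_rows(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(rows)
```

A list of dicts like this can be passed straight to pd.DataFrame(rows), and a missing field shows up as an empty cell instead of shifting every later value.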

The data extracted from the HTML sits in separate Python lists; hence it needs to be combined into a structured form, a pandas DataFrame, using the code below.

# pandas DataFrame
moviesdata = pd.DataFrame({
    'movie': titles,
    'year': release,
    'Runtime': Movieruntime,  # the list of runtime strings, not the raw tags in Runtime
    'imdb': imdbrating,
    'genre': Genres})
# Create a CSV file named 'movies.csv'
moviesdata.to_csv('movies.csv')
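
At this point every column still holds text, such as "(1994)" or "142 min". As a minimal sketch of the manipulation step, the code below converts those strings to numbers. The sample rows are made up for illustration, and the regular expressions assume that each year string contains a four-digit year and each runtime string begins with the number of minutes.

```python
import pandas as pd

# Made-up rows in the same shape as the scraped data.
raw = pd.DataFrame({
    "movie": ["First Movie", "Second Movie"],
    "year": ["(1994)", "(I) (1972)"],
    "Runtime": ["142 min", "175 min"],
    "imdb": ["9.3", "9.2"],
})

clean = raw.copy()
# Pull the four-digit year out of strings like "(1994)" or "(I) (1972)".
clean["year"] = pd.to_numeric(clean["year"].str.extract(r"(\d{4})", expand=False))
# Keep only the leading number of minutes from strings like "142 min".
clean["Runtime"] = pd.to_numeric(clean["Runtime"].str.extract(r"(\d+)", expand=False))
# Ratings are plain decimal strings.
clean["imdb"] = pd.to_numeric(clean["imdb"])

print(clean.dtypes)
```

With numeric columns, sorting and filtering behave as expected, for example clean[clean["imdb"] >= 9.0]. When writing the file, to_csv also accepts index=False to leave out the row index column.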

We explored the basics of web scraping in this article, and we got a high-level understanding of the concept. Click 💚 if you like the article. If you have any questions, you can write them in the comments section below, and I will do my best to answer them.

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data…

Akshita Chugh

Written by

I am a Data Analyst and Consultant who likes to write articles.
