Web Scraping IMDb using R

I hate static data!

prakshaal jain
Analytics Vidhya
3 min read · Dec 5, 2021

In this blog, we will learn how to scrape data from IMDb using R. We will use the package “rvest” for scraping. The main things to focus on are how the URL logic is written and how multiple pages are handled. I will be extracting the title, IMDb rating, and genre.

Building the URL

Step 1: Visit the website — https://www.imdb.com/search/title/

Step 2: Select the search criteria you need.

I have chosen the following:

Title Type: feature, TV movie, TV series, documentary
Number of Votes: minimum 1000
Click on Search

Step 3: After clicking the Search button, note down how many titles were returned. In my case I got 45,092 titles.

Step 4: Copy your URL.

My URL is

https://www.imdb.com/search/title/?title_type=feature,tv_movie,tv_series,documentary&num_votes=1000,

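When we go to the next page, carefully observe what happens to the URL.

The URL now ends with “,&start=51&ref_=adv_nxt”. So we have to write logic that iterates over the pages by changing the number after “start=” (51 here). The second page starts at title 51, which suggests that each page lists 50 titles.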
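As a minimal sketch of this URL logic, the snippet below appends the start number to the search URL copied earlier. The helper name page_url is hypothetical and only used for illustration.

```r
# Base search URL copied from the browser (note the trailing comma).
base_url <- "https://www.imdb.com/search/title/?title_type=feature,tv_movie,tv_series,documentary&num_votes=1000,"

# Hypothetical helper: URL of the results page whose first title is number `start`.
# sprintf("%d") avoids scientific notation for large start values.
page_url <- function(start) {
  sprintf("%s&start=%d&ref_=adv_nxt", base_url, as.integer(start))
}

# URL of the second page
print(page_url(51))
```

Changing the argument (1, 51, 101, and so on) moves through the result pages.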

Let’s Code

Step 1: Install and load the packages rvest and xml2
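A minimal setup sketch for this step. The install line is commented out on the assumption that the packages only need to be installed once.

```r
# Install once if the packages are not yet available (uncomment to run):
# install.packages(c("rvest", "xml2"))

# Load the packages for this session.
library(rvest)  # HTML scraping helpers
library(xml2)   # underlying HTML/XML parser
```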

Step 2: Generate a list of pages
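One way to sketch this step is shown below. It assumes 50 titles per page (inferred from the second page starting at 51) and uses the 45,092 titles noted earlier. The variable names are illustrative.

```r
# Base search URL copied from the browser (note the trailing comma).
base_url <- "https://www.imdb.com/search/title/?title_type=feature,tv_movie,tv_series,documentary&num_votes=1000,"

total_titles <- 45092  # number of titles noted after running the search
per_page     <- 50     # assumption: each results page lists 50 titles

# Start numbers 1, 51, 101, ... up to the last page that still has titles.
starts <- seq(from = 1, to = total_titles, by = per_page)

# One URL per results page.
page_urls <- sprintf("%s&start=%d&ref_=adv_nxt", base_url, as.integer(starts))

print(length(page_urls))
print(head(page_urls, 3))
```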

Printing the list shows the pages that our scraper will visit.

Step 3: Looping over all the pages, and extracting data.
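The loop could look roughly like the sketch below. The CSS selectors (.lister-item, .lister-item-header a, .ratings-imdb-rating strong, .genre) are assumptions about IMDb's page markup and may need updating if the site layout changes. The function names parse_listing and scrape_pages and the one-second delay are also illustrative choices, not the original code.

```r
library(rvest)
library(xml2)

# Turn one parsed results page into a data frame.
# Selecting per item keeps the columns aligned: a title without a rating
# gets NA instead of shifting the ratings of later titles.
parse_listing <- function(page) {
  items <- html_elements(page, ".lister-item")
  data.frame(
    title  = html_text(html_element(items, ".lister-item-header a"), trim = TRUE),
    rating = as.numeric(html_text(html_element(items, ".ratings-imdb-rating strong"), trim = TRUE)),
    genre  = html_text(html_element(items, ".genre"), trim = TRUE),
    stringsAsFactors = FALSE
  )
}

# Visit every URL in `page_urls`, parse it, and stack the results.
scrape_pages <- function(page_urls, delay = 1) {
  results <- vector("list", length(page_urls))
  for (i in seq_along(page_urls)) {
    page <- read_html(page_urls[i])   # download and parse the page
    results[[i]] <- parse_listing(page)
    Sys.sleep(delay)                  # pause between requests
  }
  do.call(rbind, results)
}
```

Calling scrape_pages() on the vector of page URLs from Step 2 returns a single data frame. Trying it on the first two or three URLs first is a cheap way to check that the selectors still match.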

After running the code, our data frame will have one row per title, with columns for the title, IMDb rating, and genre.

Note: To understand how the internal scraping logic is written, refer to the brilliant article in the references.

Our focus in this blog is scraping multiple pages using URL manipulation.

Conclusion

We have extracted the titles, IMDb ratings, and genres from the website. This code will take a while to run because it visits every results page. I hope you liked it. I would be very happy to connect with you.

You can reach out to me over LinkedIn or E-mail

References:

prakshaal jain
Analytics Vidhya

MBA Business Analytics, NMIMS, Mumbai (21–23), Former Data Science Engineer at Utopia Global, Inc.