Web Scraping IMDb using R

I hate static data!

Published in Analytics Vidhya · 3 min read · Dec 5, 2021

In this blog, we will learn how to scrape data from IMDb using R. We will use the “rvest” package for scraping. The parts to focus on are how the URL logic is written and how multiple pages are handled. I will be extracting the title, IMDb rating, and genre.

Building the URL

Step 1: Visit the website — https://www.imdb.com/search/title/

Step 2: Select the search criteria that match your requirements.

I have chosen the following: the title types Feature Film, TV Movie, TV Series, and Documentary, with a minimum of 1,000 votes.

Step 3: After clicking the search button, note down how many titles you got. In my case I got 45,092 titles.

Step 4: Copy your URL.

My URL is

https://www.imdb.com/search/title/?title_type=feature,tv_movie,tv_series,documentary&num_votes=1000,

When we go to the next page, carefully observe what happens to the URL.

Feature Film/TV Movie/TV Series/Documentary (Sorted by Popularity Ascending) - IMDb (www.imdb.com)

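The next page appends “,&start=51&ref_=adv_nxt” to the URL, so we have to write logic that iterates over the pages by changing the start number (51 here). Since the second page starts at title 51, each page holds 50 titles, and start takes the values 1, 51, 101, and so on.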

Let’s Code

Step 1: Install and load the packages rvest and xml2
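
A minimal sketch of this step, assuming both packages are installed from CRAN. The install line is commented out so the script can be re-run without reinstalling.

```r
# Install the packages once (uncomment on first use)
# install.packages(c("rvest", "xml2"))

library(rvest)  # read_html(), html_elements(), html_text()
library(xml2)   # HTML/XML parser that rvest builds on
```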

Step 2: Generate a list of pages
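
One way to generate the list, sketched under two assumptions: each page holds 50 titles (inferred from start=51 on the second page), and the total is the 45,092 titles noted earlier. The names base_url, total_titles, per_page, starts, and pages are illustrative.

```r
# Search URL copied from the browser (Step 4 of "Building the URL")
base_url <- "https://www.imdb.com/search/title/?title_type=feature,tv_movie,tv_series,documentary&num_votes=1000,"

total_titles <- 45092  # number of titles reported by the search
per_page     <- 50     # assumption: inferred from start=51 on page 2

# Start offsets: 1, 51, 101, ... up to the last page
starts <- seq(from = 1, to = total_titles, by = per_page)

# One URL per results page
pages <- paste0(base_url, "&start=", starts, "&ref_=adv_nxt")

length(pages)   # number of pages the scraper will visit
head(pages, 3)  # first few URLs
```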

You will see the pages that our scraper will visit.

Step 3: Looping over all the pages, and extracting data.
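
The loop could look roughly like the sketch below. The CSS selectors (.lister-item, .lister-item-header a, .ratings-imdb-rating strong, .genre) are assumptions based on IMDb's search results layout at the time of writing and may need adjusting if the page markup changes. The helper names parse_page and scrape_pages are hypothetical, and the one-second delay is an arbitrary choice.

```r
library(rvest)

# Turn one parsed results page into a data frame.
# Selectors are assumptions about IMDb's markup, not a documented API.
parse_page <- function(doc) {
  items <- html_elements(doc, ".lister-item")  # one node per title
  data.frame(
    title  = html_text(html_element(items, ".lister-item-header a"), trim = TRUE),
    rating = as.numeric(html_text(html_element(items, ".ratings-imdb-rating strong"), trim = TRUE)),
    genre  = html_text(html_element(items, ".genre"), trim = TRUE),
    stringsAsFactors = FALSE
  )
}

# Visit every URL in turn, parse it, and stack the results.
scrape_pages <- function(urls, delay = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    results[[i]] <- parse_page(read_html(urls[i]))
    Sys.sleep(delay)  # pause between requests
  }
  do.call(rbind, results)
}

# Usage (makes network requests), where `pages` is the vector of page URLs:
# movies <- scrape_pages(pages)
```

Extracting fields per item with html_element() keeps the columns aligned: a title with no rating or genre gets NA in that column instead of shifting the rows below it.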

After running the code, our data frame will have one row per title, with columns for the title, IMDb rating, and genre.

Note: To understand how the internal scraping logic is written, refer to the brilliant article in the References.

Our focus in this blog is scraping multiple pages using URL manipulation.

Conclusion

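We have extracted the title, IMDb rating, and genre from the website. The code will take a while to run, because it requests every results page in turn (about 902 pages for 45,092 titles at 50 per page). I hope you liked it. I would be very happy to connect with you.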

You can reach out to me over LinkedIn or e-mail.

References:

IMDb Scraping and Visualization with RStudio (medium.com)