Web Scraping IMDb using R
I hate static data!
In this blog, we will learn how to scrape data from IMDb using R. We will use the package “rvest” for scraping. The main things to focus on are how the URL logic is written and how multiple pages are handled. I will be extracting the title, IMDb rating, and genre.
Building the URL
Step 1: Visit the website — https://www.imdb.com/search/title/
Step 2: Based on your requirements, select the filters you want in your search.
I have chosen the following: feature films, TV movies, TV series, and documentaries with at least 1,000 votes.
Step 3: After clicking the search button, note down how many titles you got. In my case, I got 45,092 titles.
Step 4: Copy your URL.
My URL is
https://www.imdb.com/search/title/?title_type=feature,tv_movie,tv_series,documentary&num_votes=1000,
When we go to the next page, carefully observe what happens to the URL: “&start=51&ref_=adv_nxt” is appended to it.
So we have to write logic that iterates over the pages by changing the number after “start=”, as in “,&start=51&ref_=adv_nxt”.
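The page arithmetic can be sketched in R as below. It assumes 50 titles per page, which I am inferring from the jump to start=51 on the second page. The helper name page_start is made up for illustration and is not part of any package.

```r
# Assumption: IMDb lists 50 titles per page (page 2 begins at start=51).
titles_per_page <- 50

# Hypothetical helper: index of the first title shown on a given page.
page_start <- function(page) (page - 1) * titles_per_page + 1

page_start(1:4)
# [1]   1  51 101 151
```

So page 2 gives 51, matching the number we saw in the URL.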
Let’s Code
Step 1: Install and load the packages rvest and xml2
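A minimal sketch of this step is shown below. The install check is my own addition so the packages are not reinstalled on every run. It assumes a recent rvest (1.0.0 or later).

```r
# Install once; skip any package that is already available.
for (pkg in c("rvest", "xml2")) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}

library(rvest)  # html_elements(), html_element(), html_text2()
library(xml2)   # read_html()
```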
Step 2: Generate a list of pages
You will see the pages that our scraper will visit.
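One way to build that list of pages is sketched below. It uses my URL and my title count from above. The 50 titles per page figure is an assumption inferred from start=51, and I am also assuming that the first page accepts start=1 like the others.

```r
# Search URL copied in Step 4 (it ends with a trailing comma).
base_url <- "https://www.imdb.com/search/title/?title_type=feature,tv_movie,tv_series,documentary&num_votes=1000,"

total_titles    <- 45092L  # result count noted in Step 3
titles_per_page <- 50L     # assumption, inferred from start=51 on page 2

# Index of the first title on every page: 1, 51, 101, ...
# Integers are used so large values are never printed as 1e+05 in the URL.
starts <- seq(from = 1L, to = total_titles, by = titles_per_page)

# One URL per page, using the suffix seen after clicking "next".
pages <- paste0(base_url, "&start=", starts, "&ref_=adv_nxt")

length(pages)   # 902 pages for 45,092 titles
head(pages, 3)  # the first three URLs the scraper will visit
```

Replace base_url and total_titles with the values from your own search.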
Step 3: Looping over all the pages, and extracting data.
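The loop can be sketched as follows. The CSS selectors are an assumption based on the older IMDb search-results layout, so inspect the live page and adjust them if the markup differs. The function names extract_titles and scrape_pages, the one-second pause, and the sample HTML are all made up for illustration; the sample is not real IMDb data.

```r
library(rvest)

# Pull title, rating, and genre out of one parsed results page.
# Assumption: these selectors match the page you are scraping.
extract_titles <- function(doc) {
  items <- html_elements(doc, ".lister-item-content")
  data.frame(
    title  = html_text2(html_element(items, ".lister-item-header a")),
    rating = as.numeric(html_text2(html_element(items, ".ratings-imdb-rating strong"))),
    genre  = html_text2(html_element(items, ".genre")),
    stringsAsFactors = FALSE
  )
}

# Visit every page URL in turn and stack the results into one data frame.
scrape_pages <- function(pages, pause = 1) {
  results <- vector("list", length(pages))
  for (i in seq_along(pages)) {
    results[[i]] <- extract_titles(read_html(pages[i]))
    Sys.sleep(pause)  # be polite: wait between requests
  }
  do.call(rbind, results)
}

# Offline check on a tiny invented snippet before hitting the website.
sample_doc <- minimal_html('
  <div class="lister-item-content">
    <h3 class="lister-item-header"><a href="#">Example Title</a></h3>
    <span class="genre">Drama, Crime</span>
    <div class="ratings-imdb-rating"><strong>8.1</strong></div>
  </div>')
extract_titles(sample_doc)

# Full run over the page URLs from Step 2 (slow, one request per page):
# imdb <- scrape_pages(pages)
```

Using html_element() on each item keeps the three columns aligned: a title with no rating gets NA instead of shifting the rows.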
After running the code, our data frame will have one row per title, with columns for the title, IMDb rating, and genre.
Note: To understand how the internal scraping logic is written, refer to the brilliant article in the references.
Our focus in this blog is scraping multiple pages using URL manipulation.
Conclusion
We have extracted the titles, IMDb ratings, and genres from the website. The code will take a while to run. I hope you liked it. I will be very happy to connect with you.
You can reach out to me over LinkedIn or email.
References: