Scraping from a website with infinite scrolling.
Suppose you are scraping products from Flipkart and you want, say, 100 products from each category, but you cannot, because this method only grabs the first 15 products on a page.
Flipkart uses infinite scrolling, so there is no pagination (like ?page=2, ?page=3) in the URL. If there were, we could put the page number in a “while loop” and increment it, as shown below.
page_count = 0
while page_count < 5:
    url = "http://example.com?page=%d" % page_count
    # scraping code...
    page_count += 1
So back to infinite scrolling.
“Ajax” is what makes infinite scrolling possible. But each Ajax request also has its own URL, and that is where the new products come from when you scroll.
To see that URL:
- Open the page in Google Chrome
- Then open the Console, right-click inside it and enable “Log XMLHttpRequests”.
- Now reload the page and scroll slowly. When new products are loaded, you will see URLs listed after “XHR finished loading: GET”. Click on them. Flipkart makes several kinds of such requests. The one you are looking for starts with “flipkart.com/lc/pr/pv1/spotList1/spot1/productList?p=blahblahblah&lots_of_crap”
- Left-click that URL and it will be highlighted in the Network tab of the Chrome dev tools. From there you can copy the URL or open it in a new tab. (see image below)
- When you open the link in the new tab you will see something like this with around 15 to 20 products per page.
- Now you will say “Oh crap! only 15 products again but I want all the products”.
Don’t worry. Check the URL: there is a GET parameter named start=(some number).
Now for the first 20 products set the number to 0, for the next 20 set it to 20, and so on; if there are 15 products per page, use 0, 15, 30 and so on. (This assumes start is an offset counted from 0; check that consecutive pages neither repeat nor skip a product.) Iterate over this URL in a while loop like I showed you before and you are done.
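Here is a rough sketch of that loop. It only builds the URLs, so you can plug in whatever scraping code you already have. The URL, the helper names (with_start, page_urls) and the idea that start is an offset counted from 0 are all my assumptions for illustration; copy the real URL from the Network tab and check how start behaves on your site.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_start(url, start):
    """Return `url` with its `start` query parameter set to `start`."""
    parts = urlparse(url)
    # Keep every other parameter untouched and drop any old `start` value.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "start"]
    query.append(("start", str(start)))
    return urlunparse(parts._replace(query=urlencode(query)))


def page_urls(url, per_page, total):
    """Yield one URL per page until `total` products are covered."""
    # Assumes `start` is an offset counted from 0.
    for start in range(0, total, per_page):
        yield with_start(url, start)


# Hypothetical Ajax URL, not a real Flipkart one.
AJAX_URL = "http://example.com/productList?p=abc&start=0"

for page_url in page_urls(AJAX_URL, per_page=20, total=100):
    print(page_url)  # fetch and scrape each page here
```

With per_page=20 and total=100 this gives five URLs, so five requests get you the 100 products.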
- Again a problem!! Where are the images dude??
Right-click and view the page source of that URL. You will see an <img> tag with a data-src="" attribute; that’s your product image.
This example covers Flipkart.com only. Other websites may use different Ajax URLs and different GET parameters.
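If you want to pull those image URLs out in code, something like this should do it. It uses only the standard library HTMLParser; the class and function names and the sample HTML are made up for the example, and the real response may be structured differently.

```python
from html.parser import HTMLParser


class ImageSrcParser(HTMLParser):
    """Collect the data-src value of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) pairs.
            src = dict(attrs).get("data-src")
            if src:
                self.image_urls.append(src)


def extract_image_urls(html):
    """Return all data-src values found on <img> tags in `html`."""
    parser = ImageSrcParser()
    parser.feed(html)
    return parser.image_urls


# Made-up markup standing in for the real Ajax response.
sample = ('<div><img data-src="http://example.com/a.jpg" alt="A">'
          '<img src="spacer.gif"></div>')
print(extract_image_urls(sample))  # ['http://example.com/a.jpg']
```

The second <img> is skipped because it has no data-src attribute.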
Some websites may also return “JSON” from their Ajax URLs. If you find one, you won’t need to scrape HTML at all; just read the JSON response like any JSON API you have used before.
If you have any doubts, please comment below, and please share if you like it.
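For example, reading a JSON response could look like the sketch below. The keys "products" and "title" and the sample body are invented; a real site will use its own structure, so print the response first and adjust.

```python
import json


def product_titles(body):
    """Parse a JSON response body and return the product titles."""
    data = json.loads(body)
    # "products" and "title" are hypothetical keys for this example.
    return [item["title"] for item in data.get("products", [])]


# Made-up response body standing in for a real Ajax response.
sample_body = ('{"products": [{"title": "Phone", "price": 9999}, '
               '{"title": "Case", "price": 299}]}')
print(product_titles(sample_body))  # ['Phone', 'Case']
```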