How to scrape API requests?
Extract data directly from network traffic using XHRs.
I was trying to help someone with a web scraping task yesterday, and stumbled upon an interesting way to use APIs to scrape data from certain websites.
Many sites nowadays use frontend frameworks which render dynamic content by loading a JSON or XML file from their backend. I’m going to show you how to find the URL to access that dynamic content so you can easily scrape data from a site without BeautifulSoup, CSS selectors or XPath. Just clear, simple JSON.
Why scrape network traffic?
You’re on a social media platform and you make a post. Let’s say a tweet. Then your friend comments on it. Rather than making a request to reload the entire page in order to display your friend’s recent comment, your browser makes a request for only the data related to your post.
This decreases the amount of data that has to be requested, and the end user doesn’t see the whole page reload. It matters for scraping because the data you’re after can sometimes only be accessed through one of these requests.
Here we go. You open the URL of the website you want to scrape.
You look at the HTML. But nothing. No data is openly displayed.
Before getting into XHR, let’s first talk about a common mistake people make when scraping a page. Check that you’re not stuck there before going to a deeper level.
Data in the script is not the same as in the HTML
In some cases you may realize that the data returned to you is different from the data you see in your browser’s “View Source” option. Remember that Inspect Element and View Source are not the same thing:
- “View Source” shows you the HTML exactly as the website returns it in the initial request.
- “Inspect Element” shows you all the rendered content, including content added through JavaScript and subsequent AJAX requests. Think of web apps.
Take a simple example, say Twitter. If you go to “View Source” you’ll see a fairly small document that mostly loads the required JavaScript files. But if you use “Inspect Element”, you’ll see much more HTML, because it has been fetched and rendered.
Well, in this case there is a chance that what you’re missing is a User-Agent request header. The User-Agent header tells the server which browser is making the request. When it’s missing, the website may assume an automated request is being made. But it’s very simple to fix.
Just add a modern browser’s user agent in your requests.
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
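For example, with Python’s requests library (the URL below is just a placeholder for the site you want to scrape):

```python
import requests

# User-Agent copied from a real browser; any modern one will do
headers = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
    )
}

# https://example.com is a placeholder for the site you want to scrape
response = requests.get("https://example.com", headers=headers)
print(response.status_code)
```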
But it will not always be as easy. Most of the time, there will be no data displayed inside the page structure. Let’s have a closer look.
The data does not even exist inside the HTML
We will use Chrome DevTools to identify which request is responsible for bringing in the data we’re looking for.
I will take Airtable as an example as my friend was working on it. Basically this app allows teams to create and share tasks.
I am looking for the request that brings new records to the dashboard page. Here are the steps I followed to find the right request:
- Opened the Network tab in Chrome DevTools and clicked XHR to see only this kind of request (not CSS, images, etc.)
- Triggered a new load. Sometimes it’s Ctrl+R, sometimes a “Load More” button. In this case, it’s caused by scrolling down.
- Looked through the requests (Ctrl+F helps) until I found the right data within them.
Now you can see that there is some data that comes across the network. If the website loads data dynamically, it typically uses XMLHttpRequests (XHRs).
The browser uses an API to get data from the server. And it’s exactly what we are looking for. Here, the XHR is a GET request containing all the data we want to scrape.
Sometimes you can just copy that URL and paste it elsewhere, and the data will show up. But here the system is a bit more complex: a unique URL is generated for every request to prevent scraping. We will see later how to handle this.
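When the copied URL does work standalone, replaying the XHR is trivial. Here’s a minimal sketch with Python’s requests library; the endpoint and field names are hypothetical placeholders, not a real Airtable URL:

```python
import requests

headers = {"User-Agent": "Mozilla/5.0 ..."}  # same header as earlier

# Hypothetical endpoint: replace with the XHR URL copied from the Network tab
url = "https://example.com/api/v2/rows?page=1"

data = requests.get(url, headers=headers).json()  # clear, simple JSON
for row in data.get("rows", []):
    print(row)
```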
Note: a headless browser renders everything, including the results of XHR requests and content stored within <script> tags, because the page behaves exactly as it does in a real browser. You can use Selenium, Puppeteer, Splash or any other JS / headless browser rendering tool. But there are some downsides:
- Much slower than single HTTP requests for scraping
- Hard to scale and run parallel requests.
It is an option when nothing else works, or when user input is required.
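For reference, here’s what a minimal headless setup looks like with Selenium, assuming Chrome and a matching chromedriver are installed (the URL is a placeholder):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")  # placeholder URL

# page_source now contains the fully rendered HTML, XHR results included
html = driver.page_source
driver.quit()
```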
Working with POST requests
In our example, we worked with a GET request.
But what if the website uses POST requests that only work within the context of the loaded page (secured by cookies, headers, tokens…)? Then you can use a headless browser, load the first page, and send the POST requests from there.
This approach can be used to get app reviews from the Google Play Store. A good example is the POST request, along with its form data, that is sent for every 40 reviews you want to view.
To send a POST request we can use jQuery.ajax() and call it from the context of the page loaded in a headless browser. To get more than 40 reviews, you can iterate over the page parameter (which you can identify by analyzing the URL structure).
The XHR in this case returns HTML instead of JSON, so we have to parse it with selectors, as in the sketch below.
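Here’s a rough sketch of that flow, reusing the Selenium driver from above with BeautifulSoup for parsing. The endpoint, parameters and selector are illustrative placeholders, not the real Play Store API, and it assumes the loaded page ships jQuery:

```python
from bs4 import BeautifulSoup

# `driver` is the Selenium session from earlier, with the app page loaded.
# execute_async_script exposes a callback as the script's last argument.
driver.set_script_timeout(10)  # give the AJAX call time to finish

script = """
var done = arguments[arguments.length - 1];
jQuery.ajax({
    url: '/store/getreviews',                             // hypothetical endpoint
    method: 'POST',
    data: {id: 'com.example.app', pageNum: arguments[0]}, // hypothetical params
    success: function (html) { done(html); }
});
"""

reviews = []
for page in range(3):  # iterate the page parameter
    html = driver.execute_async_script(script, page)
    soup = BeautifulSoup(html, "html.parser")
    # placeholder selector: find the real one with DevTools
    reviews += [div.get_text(strip=True) for div in soup.select("div.review-text")]

print(len(reviews), "reviews scraped")
```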
This way you can crawl and extract 4,000 reviews for a given app on the Google Play Store in a few seconds.
How to handle 429 HTTP responses
In some cases you’ll receive a 429 or another rate-limiting response. It’s quite likely that the website has figured out that you’re scraping it and is trying to stop you.
Scraping at a slower pace may be the solution, as in the sketch below. Or you can use proxies (an interesting topic that I will cover in a later article).
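A simple way to slow down is exponential backoff: wait a little after each 429, doubling the delay every time. A minimal sketch with requests:

```python
import time
import requests

def get_with_backoff(url, headers=None, max_retries=5):
    """GET a URL, backing off exponentially whenever we get a 429."""
    delay = 1
    for _ in range(max_retries):
        response = requests.get(url, headers=headers)
        if response.status_code != 429:
            return response
        time.sleep(delay)  # wait before retrying
        delay *= 2         # double the wait each time
    raise RuntimeError(f"Still rate-limited after {max_retries} retries")
```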
Conclusion
Congrats if you made it here. By tapping directly into a website’s network traffic, we managed to extract data that is hard to get the classical way. Hope you learned some stuff and enjoyed reading this!
Any questions?
Feel free to message me at amine.melbx@gmail.com,
Amine.