Scraping Infinite List Pagination

Eyasu Kifle
7 min read · Apr 17, 2018
An example of Infinite Scrolling, taken from here: https://dev-blog.apollodata.com/pagination-and-infinite-scrolling-in-apollo-client-59ff064aac61

Scraping pages that use the infinite scroll pattern can be challenging. This guide shows one approach to tackling the problem using Puppeteer.

Intro

Web scraping is a popular (sometimes controversial) option for fetching structured data from web sites that don’t offer a public API.

In the case of traditional web applications, server-side rendered HTML can be fetched with an HTTP client (e.g. cURL, Wget or an HTTP library) and parsed with a DOM parser. Pagination is generally handled by following links or incrementing GET parameters, and the same logic can be applied at scale. Because such scrapers consume little CPU and transfer only the lightweight initial-render HTML, these apps can be scraped quickly and cheaply.
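As an illustration, here is a minimal sketch of that pattern. The /items?page=N path and the a.item selector are hypothetical stand-ins for a server-rendered site; it assumes the cheerio parser and Node 18+ for the built-in fetch:

// Follow ?page=N until a page comes back empty
const cheerio = require('cheerio'); // npm install cheerio

async function scrapeAllPages(baseUrl) {
  const results = [];
  for (let page = 1; ; page++) {
    const res = await fetch(`${baseUrl}/items?page=${page}`);
    const $ = cheerio.load(await res.text());
    const items = $('a.item'); // hypothetical selector for one entry
    if (items.length === 0) break; // no more pages
    items.each((_, el) => {
      results.push({ title: $(el).text().trim(), url: $(el).attr('href') });
    });
  }
  return results;
}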

Modern web apps that fetch data dynamically on the client side typically make paginated AJAX requests to a public API endpoint. In such scenarios, replaying the HTTP calls (inspected, for example, via the DevTools Network tab) can make the task trivial. In most cases, this is the preferred approach.
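Replaying such an endpoint might look like the sketch below. The /api/search path and its offset/limit parameters are hypothetical stand-ins for whatever the Network tab reveals (Node 18+ for the built-in fetch):

// Page through a JSON endpoint until it stops returning results
async function fetchAllResults(query) {
  const all = [];
  for (let offset = 0; ; offset += 20) {
    const url = `https://example.com/api/search?q=${encodeURIComponent(query)}&offset=${offset}&limit=20`;
    const res = await fetch(url);
    const { results } = await res.json();
    if (!results || results.length === 0) break; // exhausted
    all.push(...results);
  }
  return all;
}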

However, some web apps require authenticated sessions, use alternative protocols (e.g. WebSockets), or make nonced API calls that are hard to replicate. In these cases, you can drive an actual browser (Selenium, PhantomJS, headless Chrome) and scrape the DOM from the console. By automating user behavior, complex flows can be reproduced with good reliability (full web-standards support, low risk of detection).

An Example

For this example, I will use Quora’s search result page for “What is the meaning of life?”. It should yield enough results for our purposes. The end result will be a JSON array containing the following data for each entry:

  • Title
  • Excerpt
  • Link

Note: This is strictly for educational purposes; please respect Quora’s TOS regarding scrapers (https://www.quora.com/about/tos)

This is what the page looks like:

Entries automatically appended as you scroll down

It looks like the page makes paginated AJAX requests as you scroll. For the purposes of this article, I will assume the requests are nonced and can’t easily be reproduced server-side.

New AJAX calls added as you scroll

The Strategy

  1. Navigate to a search result page with an actual browser
  2. Identify the selectors for the desired recurring DOM elements
  3. Loop through visible elements
  4. Scrape the data into an ECMAScript Set
  5. Empty the screen contents and scroll the full viewport
  6. Repeat 3–5 until there are no more elements
  7. Serialize the results as JSON

Parts

  1. Identify Selectors and Extract Entries
  2. Emulate scroll behavior and lazy load list
  3. Loop until all entries are fetched and return JSON
  4. Complete Script
  5. Headless Automation

Helpers

Alias the console functions up front. This is useful when automating via a headless browser, because the aliases can be redirected to custom loggers in the script context (Node.js, Python, Lua, etc.) instead of the browser console.

const _log = console.info,
      _warn = console.warn,
      _error = console.error,
      _time = console.time,
      _timeEnd = console.timeEnd;

// Current page, incremented on every pagination
let page = 1;

// Global Set to store all entries (prevents dupes)
let threads = new Set();

// Pause between paginations; fine-tune according to load times
const PAUSE = 4000;
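For instance, when driving the page with Puppeteer (covered fully in Part 5), everything logged through these aliases can be piped back to the Node terminal. A minimal sketch:

// Sketch: forward the page's console output (including _log/_warn/_error)
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  page.on('console', msg => console.log(`[page] ${msg.text()}`));
  // ...navigate and evaluate the scraper here
  await browser.close();
})();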

Part 1. Identify Selectors and Extract Entries

At the time of writing, I was able to infer these selectors from this URL: https://www.quora.com/search?q=meaning%20of%20life%3F&type=answer . Since most feeds / lazy-loaded lists follow a similar DOM structure, you may be able to re-use this script by simply modifying the selectors.

// Class for an individual thread
const C_THREAD = '.pagedlist_item:not(.pagedlist_hidden)';
// Class prefix for threads marked for deletion on a subsequent loop
const C_THREAD_TO_REMOVE = '.TO_REMOVE';
// Class for the title
const C_THREAD_TITLE = '.title';
// Class for the description
const C_THREAD_DESCRIPTION = '.search_result_snippet .search_result_snippet .rendered_qtext';
// Class for the ID
const C_THREAD_ID = '.question_link';
// DOM attribute for the link
const A_THREAD_URL = 'href';
// DOM attribute for the ID
const A_THREAD_ID = 'id';

Scrape a single entry

// Accepts a parent DOM element and extracts the title and URL
function scrapeSingleThread(elThread) {
  try {
    const elTitle = elThread.querySelector(C_THREAD_TITLE),
          elLink = elThread.querySelector(C_THREAD_ID),
          elDescription = elThread.querySelector(C_THREAD_DESCRIPTION);
    if (elTitle) {
      const title = elTitle.innerText.trim(),
            description = elDescription.innerText.trim(),
            id = elLink.getAttribute(A_THREAD_ID),
            url = elLink.getAttribute(A_THREAD_URL);

      threads.add({
        title,
        description,
        url,
        id
      });
    }
  } catch (e) {
    _error("Error capturing individual thread", e);
  }
}
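To spot-check the extractor before wiring up the loop, run it against the first visible thread in the console (assuming the selectors above still match Quora’s markup):

// Scrape one thread and inspect the accumulated Set
scrapeSingleThread(document.querySelector(C_THREAD));
console.log(Array.from(threads));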

Scrape all visible threads. This loops through each thread and parses the details, returning the visible thread count.

// Get all threads in the visible context
function scrapeThreads() {
  _log("Scraping page %d", page);
  const visibleThreads = document.querySelectorAll(C_THREAD);
  if (visibleThreads.length > 0) {
    _log("Scraping page %d... found %d threads", page, visibleThreads.length);
    Array.from(visibleThreads).forEach(scrapeSingleThread);
  } else {
    _warn("Scraping page %d... found no threads", page);
  }
  // Return the visible thread count (used as the loop condition)
  return visibleThreads.length;
}

Execute the above two snippets in your browser console. If you then run scrapeThreads() at this stage, it should return a count and the global Set should populate.

Part 2. Emulate scroll behavior and lazy load list

We can use JS to scroll to the bottom of the screen. This function is executed after every successful `scrapeThreads`.

// Scrolls to the bottom of the viewport
function loadMore() {
  _log("Load more... page %d", page);
  window.scrollTo(0, document.body.scrollHeight);
}
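Quora reacts to the scroll position itself, but some lazy loaders listen for scroll events instead, so a variant that also dispatches the event can help. A sketch (not needed for this example):

// Variant: also fire a scroll event for listeners that ignore position
function loadMoreWithEvent() {
  window.scrollTo(0, document.body.scrollHeight);
  window.dispatchEvent(new Event('scroll'));
}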

Clear the DOM of entries that have already been processed:

// Clears the list between paginations to preserve memory
// Otherwise, the browser starts to lag after about 1,000 threads
function clearList() {
  _log("Clearing list page %d", page);
  const toRemove = `${C_THREAD_TO_REMOVE}_${page - 1}`,
        toMark = `${C_THREAD_TO_REMOVE}_${page}`;
  try {
    // Remove threads previously marked for removal
    document.querySelectorAll(toRemove)
      .forEach(e => e.parentNode.removeChild(e));
    // Mark visible threads for removal on the next iteration
    document.querySelectorAll(C_THREAD)
      .forEach(e => e.className = toMark.replace(/\./g, ''));
  } catch (e) {
    _error("Unable to remove elements", e.message);
  }
}

clearList() is called before every loadMore(). This keeps the DOM’s memory usage under control (important when scraping thousands of pages) and also eliminates the need to keep a cursor.

Part 3. Loop until all entries are fetched and return JSON

The flow of the script is tied together here. loop() calls itself until the visible threads are exhausted. Note that resolve and reject come from the enclosing Promise, shown in the complete script below.

// Recursive loop that ends when there are no more threads
function loop() {
  _log("Looping... %d entries added", threads.size);
  if (scrapeThreads()) {
    try {
      clearList();
      loadMore();
      page++;
      setTimeout(loop, PAUSE);
    } catch (e) {
      reject(e);
    }
  } else {
    _timeEnd("Scrape");
    resolve(Array.from(threads));
  }
}

Part 4. Complete Script

You can run and tweak this script in your browser console. It returns a promise that resolves to a JS array of entry objects.

(function() {
  return new Promise((resolve, reject) => {
    // Class for an individual thread
    const C_THREAD = '.pagedlist_item:not(.pagedlist_hidden)';
    // Class prefix for threads marked for deletion on a subsequent loop
    const C_THREAD_TO_REMOVE = '.TO_REMOVE';
    // Class for the title
    const C_THREAD_TITLE = '.title';
    // Class for the description
    const C_THREAD_DESCRIPTION = '.search_result_snippet .search_result_snippet .rendered_qtext';
    // Class for the ID
    const C_THREAD_ID = '.question_link';
    // DOM attribute for the link
    const A_THREAD_URL = 'href';
    // DOM attribute for the ID
    const A_THREAD_ID = 'id';
    const _log = console.info,
          _warn = console.warn,
          _error = console.error,
          _time = console.time,
          _timeEnd = console.timeEnd;
    _time("Scrape");
    // Current page, incremented on every pagination
    let page = 1;
    // Global Set to store all entries (eliminates dupes)
    let threads = new Set();
    // Pause between paginations
    const PAUSE = 4000;
    // Accepts a parent DOM element and extracts the title and URL
    function scrapeSingleThread(elThread) {
      try {
        const elTitle = elThread.querySelector(C_THREAD_TITLE),
              elLink = elThread.querySelector(C_THREAD_ID),
              elDescription = elThread.querySelector(C_THREAD_DESCRIPTION);
        if (elTitle) {
          const title = elTitle.innerText.trim(),
                description = elDescription.innerText.trim(),
                id = elLink.getAttribute(A_THREAD_ID),
                url = elLink.getAttribute(A_THREAD_URL);
          threads.add({
            title,
            description,
            url,
            id
          });
        }
      } catch (e) {
        _error("Error capturing individual thread", e);
      }
    }
    // Get all threads in the visible context
    function scrapeThreads() {
      _log("Scraping page %d", page);
      const visibleThreads = document.querySelectorAll(C_THREAD);
      if (visibleThreads.length > 0) {
        _log("Scraping page %d... found %d threads", page, visibleThreads.length);
        Array.from(visibleThreads).forEach(scrapeSingleThread);
      } else {
        _warn("Scraping page %d... found no threads", page);
      }
      // Return the visible thread count (used as the loop condition)
      return visibleThreads.length;
    }
    // Clears the list between paginations to preserve memory
    // Otherwise, the browser starts to lag after about 1,000 threads
    function clearList() {
      _log("Clearing list page %d", page);
      const toRemove = `${C_THREAD_TO_REMOVE}_${page - 1}`,
            toMark = `${C_THREAD_TO_REMOVE}_${page}`;
      try {
        // Remove threads previously marked for removal
        document.querySelectorAll(toRemove)
          .forEach(e => e.parentNode.removeChild(e));
        // Mark visible threads for removal on the next iteration
        document.querySelectorAll(C_THREAD)
          .forEach(e => e.className = toMark.replace(/\./g, ''));
      } catch (e) {
        _error("Unable to remove elements", e.message);
      }
    }
    // Scrolls to the bottom of the viewport
    function loadMore() {
      _log("Load more... page %d", page);
      window.scrollTo(0, document.body.scrollHeight);
    }
    // Recursive loop that ends when there are no more threads
    function loop() {
      _log("Looping... %d entries added", threads.size);
      if (scrapeThreads()) {
        try {
          clearList();
          loadMore();
          page++;
          setTimeout(loop, PAUSE);
        } catch (e) {
          reject(e);
        }
      } else {
        _timeEnd("Scrape");
        resolve(Array.from(threads));
      }
    }
    loop();
  });
})().then(console.log);

Part 5. Headless Automation

Since the script runs in the browser context, it should work with any modern browser-automation framework that allows custom JS execution. For this example, I will use Puppeteer (headless Chrome) with Node.js 8.

Save the script as a Node module named script.js, in CommonJS format:

module.exports = function() {
  // ...the complete script from Part 4; it must return the Promise
  // so that page.evaluate can await the resolved results
  return new Promise((resolve, reject) => {
    // ...
  });
};

Install Puppeteer (npm install puppeteer) and create the runner:

const puppeteer = require('puppeteer');
const script = require('./script');
const { writeFileSync } = require('fs');

function save(raw) {
  writeFileSync('results.json', JSON.stringify(raw));
}

const URL = 'https://www.quora.com/search?q=meaning%20of%20life&type=answer';

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  page.on('console', msg => console.log(msg.text()));
  await page.goto(URL);
  const threads = await page.evaluate(script);
  save(threads);
  await browser.close();
})();
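Run it with Node (assuming the runner above is saved as index.js next to script.js):

node index.js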

The script should produce output similar to this:

[
  {
    "title": "Does life have a purpose or not? If not, does that give us the chance to make up any purpose we choose?",
    "description": "Dad to son \"Son, do you know that I have been thinking about the meaning of life since I was a little kid of your age.\" His son keeps on licking his ice cream. … \"And you kno...",
    "url": "/Does-life-have-a-purpose-or-not-If-not-does-that-give-us-the-chance-to-make-up-any-purpose-we-choose",
    "id": "__w2_JaoJDz0_link"
  },
  {
    "title": "What is the meaning of life?",
    "description": "We don't know. We can't know. But... … Every religion and every philosophy builds itself around attempting to answer this question. And they do it on faith because life d...",
    "url": "/What-is-the-meaning-of-life-66",
    "id": "__w2_Qov8B7u_link"
  },
  ...
]
