Scaling Page Scraping with Apache Flink

Steve Whelan
JW Player Engineering
Feb 19, 2020

JW Player’s Article Matching product allows publishers to automatically embed contextually-relevant videos from their library into their articles. When a viewer visits a page, our proprietary Article Matching algorithm matches metadata we have scraped from that page against videos in the publisher’s catalog. With the results of the search, a playlist of videos most relevant to that article is displayed.

With hundreds of thousands of domains in our network, the question becomes: how do we scale the scraping of webpages in order to retrieve the necessary page metadata to drive our Article Matching algorithm?

In this post, we will discuss how we utilize Apache Flink to crawl and scrape the web.

Scraping with Flink

JW Player is not the first to use Flink for web crawling. At Flink Forward SF 2018, Ken Krugler presented an open source, continuous, and scalable web crawler called flink-crawler.

For our implementation, we had a few requirements:

1) A Time To Live (“TTL”) with Exponential Backoff — in order to capture article updates after initial publication, we wanted to institute a TTL that allowed us to scrape a page multiple times over a given window.

2) Targeted Crawling — we wanted to crawl only within a publisher’s domain and not start scraping the entire web.

3) Respect robots.txt — as with any web scraper, we wanted to respect a site’s robots.txt, which gives instructions about their site to web bots, including which pages not to scrape.

We decided to break up the platform into two Flink jobs:

1) A page URL filter job that determines when to scrape a page; and

2) A scrape-and-crawl job that scrapes pages and crawls the domain (“scrawl”)

The Page URL Filter job

To start, we need page URLs to seed the platform with. Fortunately, JW’s Video Player emits a wide range of useful data including the page URL that it’s embedded on, which can serve this purpose. We already had a Kafka topic storing this data, and so our first Flink job used this topic as its source. Data from this topic is fed into a FlatMapFunction (1) that contains a MapState. The MapState is as follows:

  • Key: the normalized page URL. We use a modified version of the RFC 3986 URL normalization standard.
  • Value: a POJO containing metadata such as the total number of scrapes and timestamp of the last scrape.

Each normalized page URL passed into the FlatMapFunction is checked against the MapState (2). If it doesn’t exist, we add it to the state. If it does exist, we use our TTL logic with exponential backoff to determine whether it’s time to re-scrape the page. We do this by comparing the total scrape count and timestamp of the last scrape against a predetermined schedule.

An example: if the current scrape count is 1, we wait one hour (measured from the timestamp of the last scrape) before scraping again.

Any page URL that makes it through the filter is passed to the output Collector. The output is sent to a Kafka sink to populate a topic (3).
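To make the mechanics concrete, here is a minimal sketch of what that filter could look like, written as a RichFlatMapFunction (keyed state needs access to the runtime context) applied to a stream keyed by domain. The class names, the ScrapeInfo POJO, and the doubling backoff schedule are illustrative, not our exact implementation:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Applied to a keyed stream (e.g. keyed by domain), since MapState is keyed state.
public class PageUrlFilter extends RichFlatMapFunction<String, String> {

    // Illustrative POJO holding the scrape count and last-scrape timestamp.
    public static class ScrapeInfo {
        public int scrapeCount;
        public long lastScrapeMillis;

        public ScrapeInfo() {}

        public ScrapeInfo(int scrapeCount, long lastScrapeMillis) {
            this.scrapeCount = scrapeCount;
            this.lastScrapeMillis = lastScrapeMillis;
        }
    }

    private transient MapState<String, ScrapeInfo> scrapes;

    @Override
    public void open(Configuration parameters) {
        scrapes = getRuntimeContext().getMapState(
            new MapStateDescriptor<>("scrapes", String.class, ScrapeInfo.class));
    }

    @Override
    public void flatMap(String normalizedUrl, Collector<String> out) throws Exception {
        long now = System.currentTimeMillis();
        ScrapeInfo info = scrapes.get(normalizedUrl);

        if (info == null) {
            // First sighting of this URL: record it and scrape immediately.
            scrapes.put(normalizedUrl, new ScrapeInfo(1, now));
            out.collect(normalizedUrl);
        } else if (now - info.lastScrapeMillis >= backoffMillis(info.scrapeCount)) {
            // The TTL with exponential backoff says it is time to re-scrape.
            scrapes.put(normalizedUrl, new ScrapeInfo(info.scrapeCount + 1, now));
            out.collect(normalizedUrl);
        }
        // Otherwise the URL is filtered out and nothing is emitted.
    }

    // Illustrative schedule: one hour after the first scrape, doubling each time.
    private static long backoffMillis(int scrapeCount) {
        return (1L << (scrapeCount - 1)) * 60 * 60 * 1000;
    }
}
```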

A possible alternative to our TTL solution would be to schedule Timers for future scrapes. Within a ProcessFunction, you can use the TimerService to register a callback for future processing. So in the above example, after the first scrape we could register a Timer to trigger another scrape 1 hour later.
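A rough sketch of that alternative, as a KeyedProcessFunction applied to a stream keyed by page URL (the fixed one-hour delay and the class name are just for illustration):

```java
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keyed by page URL; each key schedules its own re-scrape callback.
public class TimerBasedScheduler extends KeyedProcessFunction<String, String, String> {

    @Override
    public void processElement(String url, Context ctx, Collector<String> out) {
        // Emit the URL for an immediate scrape...
        out.collect(url);
        // ...and register a processing-time callback one hour from now.
        long oneHourFromNow = ctx.timerService().currentProcessingTime() + 60 * 60 * 1000;
        ctx.timerService().registerProcessingTimeTimer(oneHourFromNow);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        // The timer fires per key (per page URL): re-emit it for another scrape.
        out.collect(ctx.getCurrentKey());
        // A real implementation would also consult the backoff schedule here
        // (using the scrape count kept in keyed state) before registering the next timer.
    }
}
```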

The Page Scrawler Job

Our second Flink job consumes the output topic of the Page URL Filter job (4). First, we check against the site’s robots.txt (7) to make sure we are permitted to scrape the page. Rather than making a request to download it every single time, we pull the robots.txt (5) and cache it in a MapState of a KeyedProcessFunction (6). The state has a 24-hour TTL, meaning we re-download the robots.txt once a day.
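Here is a hedged sketch of what that cache could look like using Flink’s built-in state TTL; the fetchRobotsTxt and isAllowed helpers are hypothetical stand-ins for the actual download and rule-evaluation logic:

```java
import java.net.URI;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keyed by domain; a cached robots.txt expires 24 hours after it was written,
// so each domain's file is downloaded at most once per day.
public class RobotsTxtFilter extends KeyedProcessFunction<String, String, String> {

    private transient MapState<String, String> robotsCache;

    @Override
    public void open(Configuration parameters) {
        StateTtlConfig ttl = StateTtlConfig
            .newBuilder(Time.hours(24))
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .build();

        MapStateDescriptor<String, String> descriptor =
            new MapStateDescriptor<>("robots-txt", String.class, String.class);
        descriptor.enableTimeToLive(ttl);
        robotsCache = getRuntimeContext().getMapState(descriptor);
    }

    @Override
    public void processElement(String pageUrl, Context ctx, Collector<String> out) throws Exception {
        String domain = URI.create(pageUrl).getHost();
        String robotsTxt = robotsCache.get(domain);
        if (robotsTxt == null) {
            // Cache miss (or the 24-hour TTL expired): fetch and cache the file.
            robotsTxt = fetchRobotsTxt(domain);
            robotsCache.put(domain, robotsTxt);
        }
        if (isAllowed(robotsTxt, pageUrl)) {
            out.collect(pageUrl);
        }
    }

    // Hypothetical helpers: download https://<domain>/robots.txt and evaluate its rules.
    private static String fetchRobotsTxt(String domain) { return ""; }

    private static boolean isAllowed(String robotsTxt, String pageUrl) { return true; }
}
```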

Page URLs that make it past our robots.txt filter are set to be scraped (8). Rather than including the scraping logic inside our Flink code, we split it out into a separate API service. The primary motivation was to let other applications besides Flink reuse the scraper logic.

We utilize a RichAsyncFunction to make non-blocking API requests (9–11). The returned response contains the page’s metadata, including Open Graph protocol tags, as well as keywords retrieved using Newspaper’s NLP feature. Also included in the response is a list of hyperlinks that were on the page (“outlinks”). These links are pre-filtered to include only links within the same domain as the page they were sourced from, which prevents our crawler from wandering outside the publisher’s domain.
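A simplified sketch of such an async call is below. The scraper endpoint is hypothetical, and for brevity the raw response body is emitted as a string rather than deserialized into a result object:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Collections;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class ScrapeAsyncFunction extends RichAsyncFunction<String, String> {

    // Hypothetical internal scraper service endpoint.
    private static final String SCRAPER_ENDPOINT = "http://scraper-api.internal/scrape?url=";

    private transient HttpClient httpClient;

    @Override
    public void open(Configuration parameters) {
        httpClient = HttpClient.newHttpClient();
    }

    @Override
    public void asyncInvoke(String pageUrl, ResultFuture<String> resultFuture) {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(SCRAPER_ENDPOINT + URLEncoder.encode(pageUrl, StandardCharsets.UTF_8)))
            .GET()
            .build();

        // Fire the request without blocking the Flink task thread and complete
        // the ResultFuture when the response (page metadata plus outlinks) arrives.
        httpClient.sendAsync(request, HttpResponse.BodyHandlers.ofString())
            .thenAccept(response -> resultFuture.complete(
                Collections.singletonList(response.body())))
            .exceptionally(error -> {
                // Surface failures to Flink instead of silently dropping the record.
                resultFuture.completeExceptionally(error);
                return null;
            });
    }
}
```

The function is then attached to the stream with AsyncDataStream.unorderedWait, which lets Flink cap the number of in-flight requests and time out slow ones.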

The page metadata results are sent to a Kafka sink topic (12). This topic is then consumed by our Article Matching algorithm’s pipeline to update its results (14). We send the retrieved outlinks to a different Kafka topic, which closes the feedback loop (13).
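One way to route a single scraper response to both topics is to emit the outlinks through a side output; the ScrapeResult type and its field names below are illustrative:

```java
import java.util.List;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class ScrapeResultRouter {

    // Hypothetical parsed form of the scraper API response.
    public static class ScrapeResult {
        public String metadataJson;      // Open Graph tags, keywords, etc.
        public List<String> outlinks;    // same-domain links found on the page
    }

    // Side output tag for the outlinks; the main output carries the page metadata.
    public static final OutputTag<String> OUTLINKS_TAG = new OutputTag<String>("outlinks") {};

    public static SingleOutputStreamOperator<String> route(DataStream<ScrapeResult> scrapeResults) {
        return scrapeResults.process(new ProcessFunction<ScrapeResult, String>() {
            @Override
            public void processElement(ScrapeResult result, Context ctx, Collector<String> out) {
                out.collect(result.metadataJson);          // main output -> metadata topic sink
                for (String outlink : result.outlinks) {
                    ctx.output(OUTLINKS_TAG, outlink);     // side output -> outlinks topic sink
                }
            }
        });
    }
}
```

The main output and route(…).getSideOutput(OUTLINKS_TAG) then each get their own Kafka sink.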

The Feedback Mechanism

One limitation of using data sourced from the Video Player is that we are only scraping pages that already have a video on them. The real power of the Article Matching product is that we can automatically place videos on pages that don’t already have one, saving editorial time and resources in the process. Our publishers report that 60%+ of their articles do not have video content. This is a missed opportunity to monetize.

Our solution is to take the outlinks and feed them back into the platform. In addition to the page URLs ingested from the Video Player-sourced Kafka topic, the Page URL Filter job also consumes the outlinks topic. Once consumed, the URLs are cached and subject to the same TTL. This way, we can now scrape a publisher’s entire site instead of just the pages with a JW Video Player on them.
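Sketched as job wiring, the feedback loop amounts to unioning the two Kafka sources in front of the filter function shown earlier; the topic names, Kafka properties, and domainOf helper are placeholders:

```java
import java.net.URI;
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class PageUrlFilterJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "kafka:9092");   // placeholder
        kafkaProps.setProperty("group.id", "page-url-filter");       // placeholder

        // Page URLs reported by the Video Player (topic names are illustrative).
        DataStream<String> playerUrls = env.addSource(
            new FlinkKafkaConsumer<>("player-page-urls", new SimpleStringSchema(), kafkaProps));

        // Outlinks emitted by the scrawler job; consuming them closes the feedback loop.
        DataStream<String> outlinkUrls = env.addSource(
            new FlinkKafkaConsumer<>("outlinks", new SimpleStringSchema(), kafkaProps));

        // Both streams flow through the same normalization, caching, and TTL logic.
        playerUrls.union(outlinkUrls)
            .keyBy(PageUrlFilterJob::domainOf)
            .flatMap(new PageUrlFilter());    // the filter function sketched earlier
            // .addSink(...)                  // Kafka sink feeding the scrawler job

        env.execute("page-url-filter");
    }

    private static String domainOf(String url) {
        return URI.create(url).getHost();
    }
}
```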

Conclusion

Flink’s statefulness and async capabilities make it the ideal solution for scaling our web crawler. With the scraped page metadata, our Article Matching product has been helping our publishers deliver more relevant content to viewers. Additionally, we can identify which pages don’t have a video on them and help our publishers add contextually relevant media to those pages.
