Tech Wrench

Tech tales, anecdotes and musings…


System Design Interview: Web Crawler


Don’t forget to get your copy of Designing Data-Intensive Applications, the single most important book to read for system design interview prep!

Check out ByteByteGo’s popular System Design Interview Course


Get a leg up on your competition with the Grokking the Advanced System Design Interview course and land that dream job! Or check out the starter system design interview prep course.

Web Crawler Design For Interview Questions

If you have a major software engineering interview coming up, one of the most popular system design questions you should be preparing for is how to build a web crawler. Also known as a spider or spiderbot, a web crawler is the preliminary step in most applications that need to pull data from many sources on the World Wide Web. No wonder Google and every other search engine use web crawlers to collect data from websites.

If you are interviewing, consider buying our #1 course for Java Multithreading Interviews.

What Is A Web Crawler

A web crawler (the process is also called web crawling or web indexing) is a program that collects webpages from the internet and stores them, making them easier to access. Once it is fed the initial reference pages, or “seed URLs”, it indexes the web links on those pages. The indexed web pages are then traversed and the web links within them are extracted for traversal. The crawler discovers new web links by recursively visiting and indexing the links found on already indexed pages.
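To make the crawl loop concrete, here is a minimal, single-threaded sketch in Python. It assumes the third-party requests and beautifulsoup4 packages and ignores politeness, robots.txt, and error handling, all of which are covered later in the post.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl(seed_urls, max_pages=100):
    """Toy breadth-first crawl starting from the seed URLs."""
    frontier = deque(seed_urls)   # URLs waiting to be fetched
    seen = set(seed_urls)         # URLs already queued (naive de-dup)
    pages = {}                    # url -> raw HTML

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue              # skip unreachable pages
        pages[url] = html

        # Extract links and enqueue the ones we have not seen yet.
        for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

The rest of the post breaks this loop into dedicated components so that each piece can scale independently.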

Most Popular Applications

Search engines, including Google, DuckDuckGo, Bing, and Yahoo, have their own web crawlers to find and display pages. Other applications include price comparison portals, data mining, malware detection, and web analysis tools. Automated website maintenance also relies on web crawlers.

Grokking Dynamic Programming Patterns for Coding Interviews

Learn patterns to dynamic programming interview questions to land your next Big Tech job!

What Features Are Involved?

Before actually building a crawler, there are some considerations involved. Let’s take a look:

  • Crawling Frequency: Also known as the crawl rate, this is how often you want to crawl a website. You can have different crawl rates for different websites; news websites, for example, might need to be crawled more often.
  • Dedup: Where multiple crawlers are used, they may add duplicate links to the same URL pool. Dedup, or duplicate detection, involves a space-efficient structure, such as a Bloom filter, to detect duplicate links so your crawler isn’t fetching the same sites repeatedly.
  • Protocols: Think about the protocols that your crawler will cater to. A basic crawler can handle HTTP links, but you can also extend the application to work over SMTP or FTP.
  • Capacity: Each page that is crawled will carry several URLs to index. Assume an estimate of around 50 billion pages and an average page size of 100 KB:
    50 billion x 100 KB = 5 petabytes
    You would need around 5 petabytes of storage, give or take a couple of petabytes, to hold the information on the web (see the quick calculation after this list).
    You can compress the documents to save storage, since you won’t need to read them every time. For certain applications, such as search engines, you may only need to extract the metadata before compressing a page. When you do need the entire content of the page, you can access it through the cached file.
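The capacity figure is easy to sanity-check with a quick back-of-the-envelope script (the page count and page size are assumptions; your interviewer may hand you different numbers):

```python
pages = 50_000_000_000        # assumed: roughly 50 billion pages on the web
avg_page_size = 100 * 1_000   # assumed: roughly 100 KB per page

total_bytes = pages * avg_page_size
print(total_bytes / 10**15, "PB")   # -> 5.0 PB of raw page content
```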

Don’t waste hours on Leetcode. Learn patterns with the course Grokking the Coding Interview: Patterns for Coding Questions to beat the interview. Or, if you prefer video-based courses check out Udacity.

Design Diagram


This story is sponsored by Educative.io. Check out their system design interview prep course.

Overview

As you can see in the system design diagram, the loop is initiated through a set of ‘seed URLs’ that is created and fed into the URL frontier. The URL frontier applies algorithms to build URL queues based on two constraints, prioritization and politeness, which we’ll discuss in detail later in the post.

Module 3, the HTML fetcher, receives the URLs waiting in the queue one by one, obtains the corresponding IP address from the DNS resolver, and downloads the content of each page. The content is cached by module 5 for easier access by the processors. It is also compressed and stored after going through the document de-dup test at module 6, which checks whether the content has already been crawled.

The cached pages are processed, passing through modules 8a, 8b, and 8c. All the links on a page are extracted, filtered based on certain rules, and passed through the URL de-dup test, which checks whether multiple URLs point to the same document and discards repetitions. The unique set of URLs that comes out of the processing modules is fed back to the URL frontier for the next crawl cycle.

Design Components

1. Input Seed URLs

Firstly, your crawler will need ‘seed URLs’. Once it has the initial input, it will continue extracting and storing data recursively. This list of seed URLs or absolute URLs is fed to the ‘URL frontier’.

2. URL Frontier

The job of module 2, the URL frontier, is to build and store a list of URLs to be downloaded from the internet. For focused or topical web crawlers, the URL frontier will also prioritize the URLs in the queue.

3. Fetching Data

Whenever the URL frontier is asked for a URL, it sends the next URL in the priority queue to module 3, the HTML fetcher. Once the DNS resolver gives it the IP address (details under the next heading), the HTML fetcher downloads the document at that URL. The crawler downloads the file based on the network protocol the URL uses, and your crawler can have multiple protocol modules to download different file types. The fetcher, also called the worker, invokes the appropriate protocol module to download the page behind the URL.
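A sketch of such a fetcher that dispatches on the URL scheme might look like this. It assumes the requests package for HTTP(S) and uses the standard library’s urlopen for FTP; the function name is illustrative.

```python
from urllib.parse import urlparse
from urllib.request import urlopen

import requests


def fetch(url, timeout=5):
    """Download the raw bytes behind a URL, choosing a protocol module by scheme."""
    scheme = urlparse(url).scheme
    if scheme in ("http", "https"):
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        return response.content
    if scheme == "ftp":
        # urlopen handles ftp:// URLs out of the box.
        with urlopen(url, timeout=timeout) as handle:
            return handle.read()
    raise ValueError(f"unsupported protocol: {scheme}")
```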

4. DNS Resolver

Before the HTML fetcher can actually download the page content, an additional step is required. This is where the role of a DNS resolver comes in. A DNS resolver, or a DNS lookup tool, component 4 in the diagram, maps a hostname to its IP address.

Though DNS resolution could be requested from an external DNS server every time, that step would take a lot of time to complete, given the large number of URLs to be crawled. The better option is to build a customized, caching DNS resolver, as you can see in the diagram, to complement the basic crawler design. Your custom DNS resolver gives the HTML fetcher the IP address of the hostname to be fetched. Once it has the IP address, the fetcher downloads the content of the page available at that address.
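A caching resolver can be sketched in a few lines (illustrative only; a production resolver would also honor DNS TTLs and refresh stale entries):

```python
import socket
import threading


class CachingDnsResolver:
    """Resolve hostnames to IP addresses, caching answers for reuse."""

    def __init__(self):
        self._cache = {}               # hostname -> IP address
        self._lock = threading.Lock()  # fetcher threads share the resolver

    def resolve(self, hostname):
        with self._lock:
            if hostname in self._cache:
                return self._cache[hostname]
        ip = socket.gethostbyname(hostname)   # falls back to a real DNS lookup
        with self._lock:
            self._cache[hostname] = ip
        return ip


# resolver = CachingDnsResolver()
# resolver.resolve("example.com")   # -> an IP address string
```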

Grokking Comp Negotiation in Tech

Land a higher salary with Grokking Comp Negotiation in Tech.

5. Caching

Next, the content downloaded from the internet by the fetcher is cached. Since the data is typically stored after being compressed and can be time-consuming to retrieve, an open-source data structure store, such as Redis, can be used to cache the document. This makes it easier for the other processors in your web crawler design to fetch the data and re-read it without wasting time.
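With the redis-py client, the caching step is only a few lines (a sketch that assumes a Redis server on localhost and an arbitrary one-hour expiry):

```python
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)


def cache_page(url, html, ttl_seconds=3600):
    """Keep the raw page in Redis so downstream processors can re-read it cheaply."""
    cache.set(url, html, ex=ttl_seconds)


def read_cached_page(url):
    return cache.get(url)   # None if the page expired or was never cached
```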


6. Content Seen Module

Another aspect to consider is whether the URL’s content has already been seen by the crawler. Sometimes, multiple URLs carry the same content. If the document is already in the crawler’s database, you’re going to discard it here without sending it to storage.

We’ll use module 6, the content seen module or the document De-Dup test, so that the crawler doesn’t download the same document multiple times. Fingerprint mechanisms, like checksum or shingles, can be used to detect duplication. The checksum of the current document is compared to all the checksums present in a store called ‘Doc FPs’ to see if the file has already been crawled. If the checksum already exists, the document is discarded at this point.
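A checksum-based content-seen test fits in a handful of lines (an in-memory sketch; the real ‘Doc FPs’ store would be persistent and shared across crawler machines):

```python
import hashlib

doc_fingerprints = set()   # stands in for the 'Doc FPs' store


def content_already_seen(document: bytes) -> bool:
    """Return True if an identical document has been crawled before."""
    fingerprint = hashlib.sha256(document).hexdigest()
    if fingerprint in doc_fingerprints:
        return True
    doc_fingerprints.add(fingerprint)
    return False
```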

7. Storage

If the document passed the content seen test in the previous module, it’s saved to persistent storage.


8. Processing Data

You can have multiple processors in your customized web crawler design, depending on what you plan to do with the crawler. All the processing is carried out on the cached document rather than on the copy in persistent storage, since the cache is faster to read. The three processors that are almost always present are:

Grokking the Coding Interview: Patterns for Coding Questions

Work smart: learn coding patterns to solve interview questions rather than working hard and wasting endless hours on LeetCode.

8a. Link Extractor

The URL extractor, or link extractor, can be considered the default processor. A copy of the crawled page is fed into the link extractor from Redis or any other in-memory data store. The extractor parses the page and extracts all the links on it. A link could be pointing to

  • a specific location on the same page,
  • a different page on the same website,
  • or a different website.

A set of normalization techniques needs to be incorporated to make the link list more manageable. The links in the list should follow a standard format so the crawler modules can handle them consistently; a minimal sketch follows this list. You can:

  • Map all child domains to the main domain. For example, links from mail.yahoo.com and music.yahoo.com can all be tagged under www.yahoo.com.
  • If components of the link are uppercased, convert them to lowercase
  • Add the network protocol to the beginning of the link, if it’s missing
  • Add or remove the trailing slash at the end of the link
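A rough normalization helper covering these steps could look like the following. The child-domain mapping is shown only for the hosts used in the example above, and the default protocol is an assumption.

```python
from urllib.parse import urlparse, urlunparse

# Illustrative child-domain -> main-domain mapping from the example above.
DOMAIN_ALIASES = {"mail.yahoo.com": "www.yahoo.com", "music.yahoo.com": "www.yahoo.com"}


def normalize_url(link: str) -> str:
    """Bring a link into the standard form the crawler modules expect."""
    if "://" not in link:
        link = "https://" + link                 # add a protocol if it is missing (assumed default)
    parts = urlparse(link)
    host = DOMAIN_ALIASES.get(parts.netloc.lower(), parts.netloc.lower())
    path = parts.path.rstrip("/") or "/"         # drop the trailing slash, keep the root
    return urlunparse((parts.scheme.lower(), host, path, "", parts.query, ""))


# normalize_url("HTTP://Mail.Yahoo.com/inbox/") -> "http://www.yahoo.com/inbox"
```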

8b. URL Filtering

The URL filter receives the set of unique, standardized URLs from the link extractor module. Next, depending on how you’re using the web crawler, the URL filter keeps the links that are required and discards the rest.


You can design a URL filter that filters by filetype. For example, a web crawler that crawls only jpg files will keep all the links that end with ‘.jpg’ and discard the rest. Other than the filetype, you can also filter links by their prefix or domain name. For example, if you don’t want to crawl Wikipedia links, you can design your URL filter to ignore the links pointing to Wikipedia.

It is at this point that we can implement the robot exclusion protocol. Since the crawler will already have fetched a file called robots.txt and mapped the off-limit pages against the URL list, the URL filter discards all the links that the website does not permit downloading. We’ll discuss the need for the robot exclusion protocol later in the post.

The output of the URL filter is all the URLs that we want to keep and pass on to the URL frontier after some further processing.
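As an illustration, a filter that keeps only a few file types and skips one blocked domain might be sketched like this (the extensions and the blocked host are arbitrary examples):

```python
from urllib.parse import urlparse

ALLOWED_EXTENSIONS = (".html", ".htm", ".jpg")   # example: what this crawler cares about
BLOCKED_DOMAINS = {"en.wikipedia.org"}           # example: hosts we never want to crawl


def keep_url(url: str) -> bool:
    """Return True if the URL should be passed on toward the frontier."""
    parts = urlparse(url)
    if parts.netloc in BLOCKED_DOMAINS:
        return False
    last_segment = parts.path.rsplit("/", 1)[-1]
    # Treat extension-less paths (e.g. "/" or "/about") as regular HTML pages.
    return "." not in last_segment or last_segment.lower().endswith(ALLOWED_EXTENSIONS)
```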

Big-O Notation For Coding Interviews and Beyond

Brush up on Big-O notation for tech interviews with: Big-O Notation For Coding Interviews and Beyond

8c. URL De-Dup

URL De-Dup is typically implemented after the URL filter module. The stream of URLs coming out of the URL filter might not be unique. You may have multiple URLs in the stream that point to the same document. We wouldn’t want to crawl the same document twice, so a De-Dup test is performed on each filtered link before passing it ahead.

Your crawler should ideally store a database of all the crawled URLs; let’s call it the URL set. Each URL to be tested is checked against the URLs in this set to detect a repetition.

Breeze through your coding interviews with Hacking the Coding Interview.

The Bloom Filter

The Bloom filter is a space-efficient option for checking whether a URL is already present in the database. If a URL is already in the crawled database, it’s discarded; if not, it’s passed on to the next module. However, the Bloom filter isn’t exact. A false negative isn’t possible, but there is a chance of a false positive: the filter may falsely decide that a URL is already present in the set, in which case the URL will not be passed to the URL frontier. That isn’t a major issue, because the same URL is likely to be encountered again in a later crawl iteration and can be added then.
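A toy Bloom filter can be built with nothing but the standard library (a sketch; a real deployment would size the bit array and hash count for the expected URL volume, or use an off-the-shelf library):

```python
import hashlib


class BloomFilter:
    """Space-efficient membership test: no false negatives, occasional false positives."""

    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


# crawled = BloomFilter()
# if "https://example.com/" not in crawled:
#     crawled.add("https://example.com/")   # and pass the URL on to the frontier
```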


Cycle Repeats

After being passed through the URL De-Dup test, the unique URLs are saved to the URL set and also pushed to the URL frontier to repeat the cycle.

Design Considerations

Now that we understand the system components, there are certain considerations you also need to look into to build an efficient and functional crawler. These considerations decide the order in which URLs are returned by the URL frontier. Both functions, prioritization and politeness, are implemented in the frontier, since that’s the module that positions the URLs in the queue.

How To Prioritize

Since there are countless web pages on the internet, your web crawler should be able to prioritize which pages to download in order to operate efficiently. Priority also decides how often a webpage is recrawled: high-quality pages that change their content frequently need to be crawled more often. The priority of a crawled webpage, therefore, is based on two factors:

  1. The quality of a page
  2. The change rate of the page

If you’re only going to prioritize by change rate, your crawler will encounter several spam pages that change their content completely each time they’re fetched. It’s, therefore, essential to put some form of quality check as well.

For example, for a news page that updates content frequently, your crawler will need to recrawl the link every few minutes to incorporate any changes.
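One simple way to turn the two factors into a front-queue number is to score each URL and bucket the score. The weighting below is purely illustrative, not a standard formula; F is the number of front queues discussed in the next section.

```python
def assign_front_queue(quality: float, change_rate: float, num_front_queues: int) -> int:
    """Map page quality and change rate (both in [0, 1]) to a front-queue number 1..F.

    Queue 1 is the highest priority; spammy pages that score only on change
    rate are held back by the quality term.
    """
    score = 0.7 * quality + 0.3 * change_rate          # assumed weighting
    score = max(0.0, min(1.0, score))
    return num_front_queues - int(score * (num_front_queues - 1))


# assign_front_queue(quality=0.9, change_rate=0.8, num_front_queues=10) -> 3 (high priority)
```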


How To Implement Politeness

The second consideration, also implemented at the URL frontier, is politeness. Your crawler should ensure that it never overwhelms a host’s server. There needs to be a delay between repeated fetch requests to a particular server, and the delay should be greater than the time it took to fetch the most recent page from that host. Also, make sure that at any given time only a single connection is open between the crawler and a host.

URL Frontier Architecture Diagram

URL Frontier Architecture Components

The diagram explains how these considerations are implemented by the URL frontier.

The URL frontier architecture makes sure:

  1. Pages are crawled according to their priority.
  2. Only one connection is established to a host at any given time.
  3. A wait time is maintained between successive requests to a host.

Two sets of queues are maintained — F number of front queues, as shown in the upper portion of the diagram, and B number of back queues, shown in the lower portion of the diagram. All the queues follow the FIFO protocol.

Grokking Machine Learning Design

Start your journey in machine learning with Grokking Machine Learning Design.

The front queue implements the prioritization, while the back queue implements the politeness. Back queues will ensure that only a single connection is open to a host and also that there is a wait time between successive requests to a host.

Front Queues

When a URL is received by the prioritizer, it is assigned a priority number between 1 and F based on its crawl history. For example, if the URL receives priority i, it’s added to the i-th front queue. Each of the F queues can hold multiple URLs to crawl, and queues can also be empty.

Biased Front Queues Selector

Based on their queue number, URLs are pushed by the biased front queue selector to the back queues. The selector is biased because it favors links from higher-priority queues.

Back Queues

There are two properties that the back queues maintain:

  1. The back queues cannot be empty while the crawl is in progress. As soon as a queue becomes empty, it asks the front queue selector for the next URL, which could be from a different host, from one of the front queues.
  2. Each back queue holds URLs from a single host only. This ensures that no more than a single connection is established between the crawler and any host.

Note: We can only open multiple connections to a host if the host website agrees to it.


Back Queue Table

A table is maintained that maps each back queue number to a host, such as the one below. With this table we know, for example, that back queue number 4 is reserved for pages from the host wikipedia.com only.

Whenever a back queue becomes empty, the biased front queue selector pushes a URL from one of the front queues, belonging to a different host, into that back queue and the table is updated.

Back Queue Heap

In addition to the table, there is also a heap that carries one entry per back queue: the earliest time at which the corresponding host can be contacted again. When the fetcher contacts the back queue selector for a URL, the heap tells it which back queue to pick the link from. Once a URL is crawled, the time against its back queue is updated in the heap.

Back Queue Selector

When a crawling thread is free, the fetcher asks the back queue selector for a link to crawl. The back queue selector checks the heap to see which back queue the next URL can be fetched from and at what time; the crawler may have to wait for the amount of time recorded against that queue in the heap. The selector then picks the URL from that queue in FIFO order and hands it to the fetcher.
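Putting the pieces together, here is a compact, single-threaded sketch of such a frontier. It uses a fixed politeness delay instead of one proportional to the last fetch time, and a simplified biased selector that always drains higher-priority front queues first; both are assumptions made for brevity.

```python
import heapq
import time
from collections import deque
from urllib.parse import urlparse


class UrlFrontier:
    """Toy Mercator-style frontier: front queues for priority, one back queue
    per host plus a heap of next-contact times for politeness."""

    def __init__(self, num_front_queues=3, politeness_delay=1.0):
        self.front = [deque() for _ in range(num_front_queues)]  # index 0 = highest priority
        self.back = {}    # host -> deque of URLs (one back queue per host)
        self.heap = []    # (earliest next-contact time, host)
        self.delay = politeness_delay

    def add(self, url, priority=0):
        self.front[min(priority, len(self.front) - 1)].append(url)

    def _refill_back_queues(self):
        # Biased selection: drain higher-priority front queues first (simplified).
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlparse(url).netloc
                if host not in self.back:
                    self.back[host] = deque([url])
                    heapq.heappush(self.heap, (time.time(), host))
                    return
                self.back[host].append(url)   # host already has a back queue

    def next_url(self):
        """Return the next URL to fetch, honoring the per-host wait time."""
        if not self.heap:
            self._refill_back_queues()
        if not self.heap:
            return None
        ready_at, host = heapq.heappop(self.heap)
        time.sleep(max(0.0, ready_at - time.time()))     # politeness wait
        url = self.back[host].popleft()
        if self.back[host]:
            heapq.heappush(self.heap, (time.time() + self.delay, host))
        else:
            del self.back[host]                          # empty back queue: free the host slot
            self._refill_back_queues()
        return url
```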

Ace the machine learning engineer interview with Grokking the Machine Learning Interview.

Robot Exclusion Protocol

Most web crawlers implement the robot exclusion protocol, which allows webmasters to restrict which parts of their website can be accessed by a crawler. The host carries a special file called ‘robots.txt’ that the crawler is required to fetch. It contains information about the website pages that are off-limits to the crawler, and it can also specify how frequently the host may be contacted.

The crawler uses ‘robots.txt’ to check whether each URL fetched from the host passes the robot restrictions. This check takes place at the URL filter module.
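Python’s standard library already knows how to parse this file, so the check can be sketched with urllib.robotparser (the user-agent string and host below are placeholders):

```python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()   # fetch and parse the host's robots.txt


def allowed_by_robots(url: str) -> bool:
    """Check, at URL-filter time, whether our crawler may fetch this URL."""
    return robots.can_fetch("MyCrawlerBot", url)   # "MyCrawlerBot" is an assumed user agent


# robots.crawl_delay("MyCrawlerBot") returns the host's requested delay, if any.
```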


Detecting Updates

If our crawler knows how frequently a website updates (or changes content), it can adjust the crawl rate accordingly. One way to do so is this:

Instead of downloading an entire page, our crawler can simply make a HEAD request. The response carries the last modified time for the page. If this time is the same as the page’s last crawl time, the crawler won’t download it again.
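With the requests package, that check is only a few lines (a sketch; many servers omit the Last-Modified header, in which case the safe fallback is to download the page):

```python
import requests


def page_changed_since(url, last_seen_modified):
    """HEAD the page and compare its Last-Modified header with the value stored at the last crawl."""
    response = requests.head(url, timeout=5, allow_redirects=True)
    last_modified = response.headers.get("Last-Modified")
    if last_modified is None:
        return True   # no header: assume the page changed and re-crawl it
    return last_modified != last_seen_modified
```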

Detecting Duplicates

Plenty of duplicate content exists on the internet. Web pages may not be exactly the same, but it’s very likely that your crawler will encounter multiple sites with almost the same content. Spammers who copy content from one blog, make a few changes, and publish it on a different blog or page are one such example. To save storage space and time, we need our crawler to be intelligent enough to detect such duplicates.

Ace the machine learning engineer interview with Grokking Machine Learning Design.

Now, with millions of pages on the internet, it isn’t feasible to check pages word by word to see whether one is a copy of another. Our content seen module already calculates and compares a checksum for every page, but since the content may not be exactly the same, checksums or hashing won’t detect pages that are almost duplicates. A more suitable technique for detecting near-duplicate pages is SimHash, which is what Google’s search engine uses.
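A bare-bones SimHash is enough to show the idea. This sketch uses word-level features for brevity; production systems hash weighted shingles instead.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint: near-duplicate texts get nearby fingerprints."""
    counts = [0] * bits
    for word in text.lower().split():   # simple word features; real systems use shingles
        digest = hashlib.md5(word.encode("utf-8")).digest()
        word_hash = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (word_hash >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def near_duplicates(a: str, b: str, max_distance: int = 3) -> bool:
    """Two pages are near-duplicates if their fingerprints differ in only a few bits."""
    return bin(simhash(a) ^ simhash(b)).count("1") <= max_distance
```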

Crawling VS Scraping

Crawling and scraping are often confused, so it’s good to understand the difference before you head to a major interview. A crawler scans a page, indexes the URLs on it and then goes to the next page to crawl.

Web scrapers, often used in conjunction with crawlers, collect specific information from web pages or other sources in order to do some processing on it. For example, a web scraper can extract price lists from different sources and save them into an Excel spreadsheet.

A major feature that sets web scrapers apart from web crawlers is their focused approach. Rather than scanning the entire content, a web scraper is after a specific portion of the document from which information needs to be pulled. A crawler, on the other hand, extracts all the data from pages and indexes the links.

Java Multithreading for Senior Engineering Interviews

Learn the ins and out of Java concurrency and multithreading for senior engineering interviews.

Conclusion

Now that you have a fair idea of what a web crawler is and how it’s designed, hopefully you’ll be able to give a good answer to any interview question related to web crawlers. You also have the top 5 videos below to skim through for a better understanding of the subject.

Top 5 Videos for Web Crawler System Design Interview

1. System Design distributed web crawler to crawl Billions of web pages (https://www.youtube.com/watch?v=BKZxZwUgL3Y) by Tech Dummies Narendra L

Comments: 115
Likes: 1.3K
With 83.4K subscribers on the channel, it’s evident that Narendra L has quite a following among software engineers. The 46-minute video provides an extensive explanation of designing a web crawler from the perspective of interview questions at Google, Facebook, and Amazon. Narendra has put in the effort and time to go in depth and create one of the most detailed and comprehensible explanations of web crawlers, especially of the high-level components.


2. Design a Web crawler for indexing pages on the internet (SDE — 2 Interview) (https://www.youtube.com/watch?v=shp63TTI1ok) by The Stupid CS Guy

Comments: 7
Likes: 51
Though the video doesn’t have as large a following as the first one on the list, it does a good job of preparing viewers for a system design interview. It’s a mock interview session in which the interviewee is expected to design a basic web crawler with some additional constraints. You’ll find some key questions you can expect in interviews, along with detailed explanations.

3. Build your own web crawler (https://www.youtube.com/watch?v=rPjt2ilVMzs) by Ahmed Elfakharany

Comments: 7
Likes: 50

Though not created from the perspective of interview questions, the video does give a detailed explanation of creating an application that indexes content on the web to feed into further applications.

Master multi-threading in Python with: Python Concurrency for Senior Engineering Interviews.

4. Python Scrapy Tutorial | Web Scraping and Crawling Using Scrapy | Edureka (https://www.youtube.com/watch?v=5o9lucMaQLc) by edureka!

Comments: 5
Likes: 316

Though this video is more implementation-specific, it is useful for gaining a technical overview of designing web crawlers. The video introduces Scrapy as a general-purpose web crawling framework, shows how to use it to build a basic web crawler, and explains how to store the extracted information in a file. The detailed tutorial walks viewers through the process of building and running the bot.

5. Web crawling (https://www.youtube.com/watch?v=B994RSzlCww) by Techno Spark

Comments: 1
Likes: 11

Though the channel has very little following and the video barely has 170 views, it’s a detailed explanation of the operation and design of a basic web crawler. It goes in depth to explain the web crawler architecture based on a block system diagram. Additional constraints like the robot exclusion protocol are also explained well. Though it might not be a great idea to prepare for a software engineering interview solely from this video, it can be a helpful extra.


Teach kids to code and have fun at the same time — CodeMonkey!

Your Comprehensive Interview Kit for Big Tech Jobs

0. Grokking the Machine Learning Interview
This course helps you build that skill, and goes over some of the most popularly asked interview problems at big tech companies.

1. Grokking the System Design Interview
Learn how to prepare for system design interviews and practice common system design interview questions.

2. Grokking Dynamic Programming Patterns for Coding Interviews
Faster preparation for coding interviews.

3. Grokking the Advanced System Design Interview
Learn system design through architectural review of real systems.

4. Grokking the Coding Interview: Patterns for Coding Questions
Faster preparation for coding interviews.

5. Grokking the Object Oriented Design Interview
Learn how to prepare for object oriented design interviews and practice common object oriented design interview questions

6. Machine Learning System Design

7. System Design Course Bundle

8. Coding Interviews Bundle

9. Tech Design Bundle

10. All Courses Bundle


Written by SystemDesign

The ultimate Poor man’s system design interview prep guide -- https://systemdesign.medium.com/membership
