How to build a search engine?

DeepSearch at https://deepsearch.tsignal.io

TSignal has been building a search engine for the masses (available at DeepSearch), and today we will share our experience of how to build a search engine. This is an ongoing process and we are learning more every single day.

Every search engine performs the following four major activities:

  1. Crawling
  2. Indexing
  3. Ranking
  4. Query serving

Crawling

Crawling, or web crawling, is the systematic browsing of the web using bots. The process seems easy, but it can take down the whole project if it is not state-of-the-art. The following parameters are taken into consideration while crawling (in no particular order):

  • URL normalization: Xyz.com = xyz.com (the domain is case-insensitive), while xyz.com/A != xyz.com/a (the URL path is case-sensitive). A minimal sketch follows this list.
  • Content age/freshness, i.e. when to crawl: News needs to be crawled every few minutes, while Wikipedia can be crawled once a week.
  • Static or dynamic website: It is easy to create a dynamic website which can feed random content forever. If the crawler gets stuck in a single website, it is not a good sign.
  • How much of it changes over time: Again, a blog home page may not change for a day/week/month, whereas the home page of a news website changes completely every few hours.
  • How much to crawl: The crawler need not crawl 10 million web pages of a single website unless the website is extremely popular.
  • Discard old data: Any page which is no longer accessible should be discarded and removed from the index.
  • What to store: It depends on the use case. For example, archive.org needs to store images, JavaScript, etc., whereas a search engine may skip JavaScript. A page in which only 5 or 10 words have changed can be skipped until the next crawl. A search engine is typically only interested in the text sections.
  • How to store: Again, it depends on the use case whether the project needs to keep historical archives; a search engine may or may not need them. The store also needs to be capable of handling millions of reads, writes, and updates.
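
To make the URL-normalization point concrete, here is a minimal Python sketch. It is illustrative only (not TSignal's actual crawler code) and assumes fully qualified URLs with a scheme; a production normalizer also handles default ports, trailing slashes, percent-encoding, and tracking parameters.

    # Minimal URL normalization sketch: fold the case-insensitive parts
    # (scheme, host) to lowercase, keep the case-sensitive path, drop fragments.
    from urllib.parse import urlsplit, urlunsplit

    def normalize_url(url):
        parts = urlsplit(url)
        scheme = parts.scheme.lower()
        netloc = parts.netloc.lower()        # domain is case-insensitive
        path = parts.path or "/"             # path IS case-sensitive: /A != /a
        # Fragments (#section) never reach the server, so drop them.
        return urlunsplit((scheme, netloc, path, parts.query, ""))

    assert normalize_url("HTTP://Xyz.com/A#top") == "http://xyz.com/A"
    assert normalize_url("http://xyz.com/A") != normalize_url("http://xyz.com/a")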

A project need NOT invest the crucial initial months getting a crawler up, as the Common Crawl project provides enough data to test the remaining stages.
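
As a rough illustration of how one might consume such data, the sketch below iterates over a locally downloaded WARC file using the warcio library and yields the HTML responses it contains. The file name is a placeholder, and this is just one possible way to read the archives, not the pipeline TSignal uses.

    # Sketch: read HTML pages out of a (locally downloaded) WARC file.
    # The file name below is a placeholder, not a real Common Crawl artifact.
    from warcio.archiveiterator import ArchiveIterator

    def iter_html_pages(warc_path):
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                # WARC files mix request, response, and metadata records;
                # only the 'response' records carry the fetched page bodies.
                if record.rec_type != "response":
                    continue
                url = record.rec_headers.get_header("WARC-Target-URI")
                ctype = record.http_headers.get_header("Content-Type", "")
                if "text/html" in ctype:
                    yield url, record.content_stream().read()

    for url, html in iter_html_pages("CC-MAIN-sample.warc.gz"):
        print(url, len(html))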

Indexing

Indexing means collecting, parsing, and storing data in a form that facilitates fast and accurate information retrieval. This may sound easy, but it is extremely complex to do correctly. The following steps are involved in building an index:

  1. Data type detection: Detect whether the downloaded data is an HTML page, text file, PDF, or image, and pass it to a suitable decoder.
  2. Extraction and cleansing: This step involves stripping HTML tags, JavaScript, and other unnecessary regions.
  3. Detection of language: The language of the page must be identified to invoke a suitable token extractor. An English token extractor will not work for Japanese, as there are no spaces between words. Language detection is itself complex: there is no standard, a wide variety of encodings is in use (UTF-8, UTF-16, JIS), and there are no page headers that reliably declare the page language. However, the excellent CLD2 library is good enough.
  4. Extraction of sentences/phrases/words: Content at the top of the page should be marked differently from the footer content, as this information helps in ranking.
  5. Extraction of text tokens (n-grams): Extract tokens using language-specific techniques.
  6. Stemming and lemmatization: Natural language processing that reduces word forms to a common base, e.g. swimming = swim.
  7. Building the index: The index can now be built from the text tokens (a toy sketch follows this list).
  8. Index update: How the index is updated and how old data is purged. Due to the massive size of the index, updates are done in batches. Some search engines maintain two indexes, one old and one fresh, and the results are merged at query time. The old index is typically updated every 3 months, while the fresh index is updated every few minutes.
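
Here is a deliberately tiny Python sketch of steps 5 and 7: extracting unigrams and bigrams from cleaned page text and building posting lists. A real index is compressed, sharded, and built in batches; this only shows the shape of the data.

    # Toy sketch of steps 5 and 7: tokenize cleaned page text and build
    # posting lists mapping each unigram/bigram to (doc_id, position) pairs.
    import re
    from collections import defaultdict

    def tokenize(text):
        # Naive English tokenizer; real systems use language-specific extractors.
        return re.findall(r"[a-z0-9]+", text.lower())

    def build_index(docs):
        unigrams = defaultdict(list)   # token  -> [(doc_id, position), ...]
        bigrams = defaultdict(list)    # bigram -> [(doc_id, position), ...]
        for doc_id, text in docs.items():
            tokens = tokenize(text)
            for pos, tok in enumerate(tokens):
                unigrams[tok].append((doc_id, pos))
                if pos + 1 < len(tokens):
                    bigrams[tok + " " + tokens[pos + 1]].append((doc_id, pos))
        return unigrams, bigrams

    docs = {1: "Swimming lessons for beginners", 2: "Learn to swim in a week"}
    unigrams, bigrams = build_index(docs)
    print(unigrams["swimming"])   # [(1, 0)]
    print(bigrams["to swim"])     # [(2, 1)]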

Let's give a rough estimate of data sizes and numbers from our experience.

A billion web pages, on average, contain:

  • more than 40 billion URLs (including duplicates)
  • more than 100 TB of data
  • more than 5 billion unique unigrams
  • more than 17 billion unique bigrams
  • more than 12 trillion tokens!

If we encode each token position in, say, a 32-bit word (4 bytes), the token position data alone would take ~48 TB. Throw in data protection and file-system overheads, and it amounts to ~100 TB just for token position data.
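
A quick back-of-the-envelope check of those numbers (the roughly 2x overhead factor below is our assumption for replication and file-system bookkeeping):

    # Back-of-the-envelope check of the storage estimate above.
    tokens = 12e12                     # ~12 trillion token positions
    bytes_per_position = 4             # one 32-bit word per position
    raw_tb = tokens * bytes_per_position / 1e12
    print(raw_tb)                      # ~48 TB of raw position data
    print(raw_tb * 2)                  # ~100 TB once replication and
                                       # file-system overheads are added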

The index needs to cover the domain name, URL, page title, unigrams, bigrams, freshness of content, domain age, domain popularity, and a large number of other parameters for successful ranking. The index generates posting lists, which are simply lists recording that a given word appears in a given URL, and so on. The posting lists can be used to uniquely identify each and every instance of a text token.

Ranking

The index is incomplete until ranking is done for all the unigrams, bigrams, etc. Ranking is an extremely compute-intensive task, but it can be easily parallelized. There are a number of industry-standard algorithms such as TF-IDF and BM25, but in our case we use our own ranking algorithms. Ranking is basically a sorting of posting lists based on a number of factors, some of which are given below (in no particular order):

  1. Domain age: How old is the domain?
  2. Domain expiry time: Estimated time until the domain is up for renewal.
  3. Unigram appearing in the domain name or sub-domain name.
  4. Unigram appearing in the page title.
  5. Number of pages on a single domain in which the unigram appears in the page title.
  6. Number of pages on a single domain in which the unigram appears in the page text.
  7. Number of pages in the entire dataset in which the unigram appears.
  8. Unigram appearing in a header (H1/H2) versus only in the page footer.
  9. Unigram position on the page.
  10. Page length.
  11. Number and quality of inbound and outbound links.

We use some of the above.
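
Our own ranking algorithm is not public, but to illustrate what a standard baseline such as BM25 (mentioned above) looks like, here is a minimal Python sketch that scores a tiny corpus for a query. It is purely illustrative and is not how TSignal ranks.

    # Minimal BM25 sketch: a standard baseline mentioned above,
    # NOT TSignal's proprietary ranking algorithm.
    import math

    def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
        tokenized = [doc.lower().split() for doc in docs]
        n = len(tokenized)
        avgdl = sum(len(d) for d in tokenized) / n
        scores = [0.0] * n
        for term in query_terms:
            df = sum(1 for d in tokenized if term in d)       # document frequency
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))   # BM25 idf
            for i, d in enumerate(tokenized):
                tf = d.count(term)                            # term frequency
                denom = tf + k1 * (1 - b + b * len(d) / avgdl)
                scores[i] += idf * tf * (k1 + 1) / denom
        return scores

    docs = ["fresh news from a popular domain",
            "an old blog post about swimming",
            "news news news"]
    print(bm25_scores(["news"], docs))   # the last document scores highest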

Query Serving

This is where you serve the results to your users. It comprises the following processing steps:

  1. Query language identification
  2. Extraction of text tokens
  3. Analysis of the query, detecting phrases where possible
  4. Extraction and ranking of results: Thanks to our state-of-the-art algorithms, TSignal is able to rank the results live every single time. This includes finding and loading posting lists, intersecting them, ranking the intersection results, loading the page/URL/title, highlighting matches, and delivering the results (a minimal sketch of the intersection step follows this list).
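
For step 4, the core operation is intersecting sorted posting lists. The Python sketch below shows the textbook merge intersection over toy doc-id lists; production posting lists are compressed and spread across shards, and TSignal's actual implementation differs.

    # Sketch of the intersection step: find documents that contain every
    # query token by merge-intersecting sorted posting lists of doc ids.
    def intersect(a, b):
        # Classic merge intersection of two sorted lists, O(len(a) + len(b)).
        i, j, out = 0, 0, []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    def docs_matching_all(postings, query_tokens):
        # Start with the shortest (rarest) list so intermediates stay small.
        lists = sorted((postings[t] for t in query_tokens), key=len)
        result = lists[0]
        for plist in lists[1:]:
            result = intersect(result, plist)
        return result

    postings = {"search": [1, 3, 7, 9], "engine": [3, 4, 9], "privacy": [2, 9]}
    print(docs_matching_all(postings, ["search", "engine", "privacy"]))   # [9]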

TSignal currently ignores factors such as the user's IP address and location while ranking, as using them would breach user privacy. TSignal does NOT use human curation at any stage, so the results are unbiased. Since TSignal is privacy focused, we do not use any user-identification techniques, which makes the stack much simpler. TSignal is able to serve the 99th percentile of queries in less than 200 ms. This article is just an introduction; we will take a deep dive into each of the steps individually and explain how we have done it.
