Demystifying a Typical Search Problem Using Inverted Index
If you have only heard of inverted indexes and don't know much about them, this is for you
An index makes it possible to search for words or text across documents quickly. Indexes are widely used in:
→ Search engines
→ Low-latency big data analytics systems such as Druid and Pinot
A typical web search problem can be decomposed into 3 major components:
- Crawling: This step comprises gathering necessary web content
- Indexing: This involves building the index. In this document, we will review the high-level steps involved in building an inverted index
- Retrieval: This involves fetching the required information from the documents. This usually demands a sub-second response time
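To make the indexing and retrieval steps concrete, here is a minimal sketch of an in-memory inverted index in Python. It assumes a tiny, made-up document collection and a naive tokenizer; the function names (`tokenize`, `build_inverted_index`, `search`) are hypothetical and chosen only for illustration. Real search engines add much more on top of this, such as ranking, compression, and distribution across machines.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample collection, standing in for crawled content.
docs = {
    1: "Druid is a low latency analytics store",
    2: "Search engines build an inverted index",
    3: "An inverted index gives low latency search",
}

index = build_inverted_index(docs)
print(index["inverted"])             # document ids containing "inverted"
print(search(index, "low latency"))  # documents containing both terms
```

The key idea is that a query never scans the documents themselves: it looks up each term's list of document ids and intersects those lists, which is what makes sub-second retrieval feasible.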
Crawling
This is the first step in any indexing process: scanning the sources for content. Before building inverted indexes, we must first gather the document collection over which they will be built. A few points need to be kept in mind:
- The crawler should not load the web servers so heavily that it degrades the actual application
- Many crawlers are distributed systems…