How Do Search Engines Process Queries?
This article is part of a series covering the fundamentals of search engines. Read the other articles here: Why do we need mobile search? and Deep linking — A basic understanding
Background
Search engines rely on internet bots known as web crawlers, which browse the World Wide Web. A crawler starts with a list of URLs called seeds. When it visits these URLs, it downloads each page, hands it off to the indexer, and adds all the links on that page to a queue for subsequent crawling. The crawler also handles other tasks, such as deciding how often to revisit pages and making sure the queue contains no duplicates.
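To make the seed-and-queue idea concrete, here is a minimal sketch of a crawler's queue handling. It is an illustration under stated assumptions, not how any particular engine works: the "web" is a hypothetical in-memory dictionary (TOY_WEB), the names crawl and fetch_links are made up for this example, and nothing is downloaded.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "a.example"],
    "c.example": ["d.example"],
    "d.example": [],
}

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: return URLs in the order they were visited."""
    frontier = deque(dict.fromkeys(seeds))  # queue of URLs waiting to be crawled
    seen = set(frontier)                    # every URL ever queued, to avoid duplicates
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)                 # a real crawler would download the page here
        for link in fetch_links(url):
            if link not in seen:            # skip links that are already queued or visited
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl(["a.example"], lambda url: TOY_WEB.get(url, []))
print(order)
```

The seen set is what keeps a page from entering the queue twice, even when several pages link to it. Revisit scheduling, which the paragraph above also mentions, is left out of this sketch.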
The pages downloaded by the crawler are then indexed in the search database. The indexer records every word on every page and stores the resulting index of words. Each entry generally consists of the set of documents in which the word appears, along with the locations within the text where it occurs. Think of the index as a dictionary, except that instead of a meaning, each word is followed by a list of the documents where it can be found; these list entries are called postings. This makes it easy for the engine to break a query into terms and look up the documents that contain them. Very common words, such as the articles in English, and punctuation are sometimes left out of the index because they do little to help filter search results.
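The dictionary analogy can be sketched as a small inverted index that maps each word to its postings (document id plus word positions). This is a simplified illustration: the stop-word list is an assumed, deliberately tiny one, and the function names tokenize and build_index are invented for this example.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "the"}  # assumed, deliberately tiny list of common words

def tokenize(text):
    """Lowercase the text and keep only runs of letters and digits (drops punctuation)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to its postings: {doc_id: [word positions in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            if term in STOP_WORDS:
                continue  # very common words are not indexed
            index[term].setdefault(doc_id, []).append(position)
    return index

docs = {
    1: "The crawler downloads web pages.",
    2: "The indexer stores an index of the pages.",
}
index = build_index(docs)
print(index["pages"])  # postings for the word "pages"
```

Looking up a query term is then a single dictionary access, which is why the engine does not need to scan every document at query time.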

Now comes the important question: how does the engine process the query and display relevant results?
We learned in the previous article that indexed web pages are presented to the user on search engine results pages (SERPs). The step-by-step query process is as follows:
1. The query is generally taken from a search box in the UI, where the user types it.
2. The query may be preprocessed or transformed using techniques such as spell checking, query suggestions, and query expansion (adding more terms to the query).
3. The query is then parsed and sent to the index servers.
4. The query processor compares the search query to the index and retrieves the documents that it considers relevant.
5. The results are displayed to the user as snippets that show how the query matches each document, with the matching terms highlighted.
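The steps above can be sketched end to end in a few lines. Everything here is a toy under stated assumptions: the documents, the SPELLING and SYNONYMS lookup tables (stand-ins for real spell checking and query expansion), and the function names are all hypothetical, and matching is a plain count of shared words rather than a real relevance model.

```python
import re

DOCS = {
    1: "Search engines crawl the web and build an index.",
    2: "A query is parsed and matched against the index.",
    3: "Cooking pasta takes about ten minutes.",
}
# Hypothetical lookup tables standing in for spell checking and query expansion.
SPELLING = {"qeury": "query", "serach": "search"}
SYNONYMS = {"query": ["queries"]}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def preprocess(raw_query):
    """Step 2: correct known misspellings, then append expansion terms."""
    terms = [SPELLING.get(t, t) for t in tokenize(raw_query)]
    expanded = list(terms)
    for t in terms:
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

def retrieve(terms, docs):
    """Steps 3-4: return ids of documents sharing a term with the query, best first."""
    scores = {}
    for doc_id, text in docs.items():
        words = set(tokenize(text))
        hits = sum(1 for t in set(terms) if t in words)
        if hits:
            scores[doc_id] = hits
    return sorted(scores, key=lambda d: (-scores[d], d))

def snippet(text, terms):
    """Step 5: highlight matched terms by wrapping them in asterisks."""
    if not terms:
        return text
    pattern = r"\b(" + "|".join(map(re.escape, set(terms))) + r")\b"
    return re.sub(pattern, r"*\1*", text, flags=re.IGNORECASE)

terms = preprocess("serach qeury index")
results = retrieve(terms, DOCS)
for doc_id in results:
    print(doc_id, snippet(DOCS[doc_id], terms))
```

A production engine would look the terms up in the index instead of scanning each document, but the order of the stages is the same.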
Sounds simple, right? But here is the tricky part.
Imagine you searched for “How do search engines query?” and the engine returned pages containing one or more of the words “search”, “engine” and “query” placed randomly in the text of the document. Would that help answer your question? You want results that are relevant and that answer the question, not millions of documents that happen to contain some or most of the words in the query.
Since there are millions of documents, it is the job of the query processor to display the results that are most relevant to the user, ideally in context: a document that merely contains the query words does not necessarily serve the user's purpose. The success of a search engine depends on how quickly and how effectively it can answer queries.
Traditionally, search was plain word matching; it has since evolved into semantic search, where the intended meaning of the query is interpreted to build the SERPs. Each search engine has its own secret sauce for displaying relevant results. Google uses the PageRank algorithm, which is defined as follows.
PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
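The link-counting idea in that definition can be sketched with the textbook power-iteration form of PageRank. This is a simplified illustration, not Google's production ranking: the damping factor of 0.85 and the iteration count are conventional assumed values, the four-page link graph is made up, and the sketch assumes every linked page also appears as a key in links.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of its links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its rank equally among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page site used only for illustration.
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph "home" ends up with the highest score because every other page links to it, while "orphan", which nothing links to, keeps only the baseline share. A link from a high-ranking page passes on more rank than a link from a low-ranking one, which is the "quality of links" part of the definition.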
Google considers a number of factors alongside PageRank when determining which documents are most relevant to a given query. Some of them could be the frequency and location of the keywords, how long the web page has existed, the number of web pages that link to the page in question, and many more such attributes.
PageRank is just one of the many algorithmic metrics that influence your page's ranking in the SERPs. Pages that contain the query terms close to each other may also be more relevant than pages where the terms are scattered across the text. Web pages can be optimized for popular search words (for example, in the URL, in the title of the page, in the body, and in links to the page) so that they appear near the top of the search results. Many other strategies and techniques can be used to raise the ranking of a particular page in the SERPs; this practice is popularly known as search engine optimization. Google also tries to understand the relationships and associations in the stored data. It uses Google Instant to predict your searches and queries before you finish typing. It has also built a “knowledge graph”, a store of semantic search information gathered from a wide variety of sources to improve the SERPs. I will write more about the knowledge graph/entity graph in subsequent articles.
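The proximity point made above (query terms close together versus scattered across the page) can be illustrated with a small sliding-window measure: the length, in words, of the shortest stretch of text that contains every query term. This is one possible way to quantify closeness, assumed here for illustration; the function name smallest_window is made up, and the source does not say which measure any search engine actually uses.

```python
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def smallest_window(text, terms):
    """Length in words of the shortest span containing every term, or None."""
    wanted = set(terms)
    if not wanted:
        return None
    words = tokenize(text)
    counts = {}   # how often each wanted term occurs in the current window
    best = None
    left = 0
    for right, word in enumerate(words):
        if word in wanted:
            counts[word] = counts.get(word, 0) + 1
        # While the window holds every term, record it and shrink from the left.
        while len(counts) == len(wanted):
            width = right - left + 1
            if best is None or width < best:
                best = width
            leftmost = words[left]
            if leftmost in counts:
                counts[leftmost] -= 1
                if counts[leftmost] == 0:
                    del counts[leftmost]
            left += 1
    return best

terms = ["search", "query"]
close = "how search engines answer a search query"
scattered = "search for recipes online and then refine your query later"
print(smallest_window(close, terms), smallest_window(scattered, terms))
```

A smaller window suggests the terms are used together, so a ranking function could give such a page a boost over one where the same words are far apart.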
Other search engines, like Bing and Yahoo!, have their own algorithms for building the SERPs that serve the user's query, but Google leads the search market with a whopping market share of more than 50%.