Hacking the System Design: How Search Engines Understand and Deliver Results

Sneh Patel
5 min read · Dec 4, 2024

Search engines are the backbone of the digital world, instantly connecting users to the information they seek. But how do they interpret queries and deliver precise results in milliseconds? This article unpacks the sophisticated systems behind search engines, covering query rewriting, ranking algorithms, and the critical role of distance metrics.

What Is a Search Engine?

A search engine is a system that indexes vast collections of data, enabling users to find relevant content by submitting queries. It processes these queries and displays the most relevant results on the Search Engine Results Page (SERP).

What Powers a Search Engine? Exploring the Core and Types

Online searching involves retrieving specific information through computers or networked devices, leveraging massive databases accessible via the Internet. This capability emerged in the 1980s with advancements in database speed and terminal technology. As the Internet flourished, search engines became indispensable tools for navigating vast volumes of data.

Google, the dominant player, uses robust algorithms to aggregate data from various sources, streamlining the process of finding information quickly. By offering targeted results, search engines ensure users can retrieve relevant data even when its exact location is unknown.

According to WebNots Editorial Staff (2016), search engines can be categorized into three types:

  1. Crawler-based search engines: These use a crawler (also called a bot or spider) to crawl new content and index it in the search database.
  2. Human-powered directories: These depend on human editors to create listings.
  3. Hybrid search engines: These use both crawler-based and manual indexing to list sites in search results.

Further Categorization:

  1. Generic search engines: Handle all types of results (web pages, images, videos). Examples: Google, Yahoo.
  2. Specialized search engines: Focus on specific data types. Examples: Amazon (product searches), YouTube (video searches).

High-Level Overview of a Search Engine

First, the user submits a query in the search box.

The search engine then retrieves and ranks relevant data before presenting results on the Search Engine Results Page (SERP). But there’s much more happening behind the scenes.

How does the search engine determine relevance? The answer lies in distance metrics and intelligent algorithms.

Cracking the Relevance Code: How Distance Metrics Deliver the Most Accurate Search Results

Source: research paper “A Comparative Study on Distance Measuring Approaches for Clustering”

The comparison above gives a clear, crisp overview of the different distance measures.

The key principle of distance metrics is that distance is inversely proportional to similarity.
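To make the principle concrete, here is a minimal Python sketch of two common distance measures, Euclidean and cosine. The vectors `query`, `doc_close`, and `doc_far` are invented for illustration; a real engine would work with far larger vectors.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 minus cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat an all-zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical vectors: doc_close resembles the query, doc_far does not.
query = [1.0, 1.0, 0.0]
doc_close = [1.0, 0.9, 0.1]
doc_far = [0.0, 0.1, 1.0]

for name, doc in [("doc_close", doc_close), ("doc_far", doc_far)]:
    print(name, euclidean_distance(query, doc), cosine_distance(query, doc))
```

Under both measures the more similar document ends up at the smaller distance, which is exactly the property a search engine relies on.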

However, all distance metrics operate on numerical data.

So the search engine has to convert every query, as well as the data in its database, into numerical form.

The first hack: converting the entire database into numerical data each time a query arrives would be far too slow. Instead, the search engine does this conversion ahead of time and stores the database in numerical form.

The second hack: the search engine converts the query with the same method it used to convert the database, so the two are directly comparable.
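As a rough illustration of both hacks, the sketch below uses a simple bag-of-words count vector. This is an assumption chosen for brevity (production engines typically use learned embeddings), and the helper names `build_vocabulary` and `vectorize` are hypothetical. The point is that documents are vectorized once, ahead of time, and the query goes through the very same function.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()

def build_vocabulary(documents):
    """Map every distinct token in the corpus to a fixed vector position."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(text, vocabulary):
    """Turn text into a count vector; tokens outside the vocabulary are ignored."""
    vec = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocabulary:
            vec[vocabulary[tok]] = count
    return vec

documents = ["data science courses", "data structure courses", "football news today"]

# Pre-work: done once, offline, and stored.
vocabulary = build_vocabulary(documents)
doc_vectors = [vectorize(doc, vocabulary) for doc in documents]

# Query time: the same vectorize() call, so the dimensions line up.
query_vector = vectorize("top data science courses", vocabulary)
print(query_vector)
```

Because the query and the documents share one vocabulary, every position in `query_vector` means the same thing as the matching position in each document vector.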

Pre-Work Optimization:

To speed up the process:

  1. Pre-Conversion: Databases are converted into numerical formats offline and stored in vector databases (e.g., ChromaDB).
  2. Real-Time Query Conversion: User queries are converted into numerical representations during runtime.
  3. Efficient Matching: Results are tagged with predefined intents, narrowing down the data that needs to be searched.

Next, the search engine calculates the distance between the query and each item in the database, sorts the items by distance, and returns the top results.
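A minimal sketch of that step, assuming the vectors already exist: compute the distance from the query to every stored vector and keep the `k` closest. The `index` contents and the `top_k` helper are hypothetical, and Euclidean distance is used only as one possible choice.

```python
import heapq
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, doc_vectors, k):
    """Return the k (distance, doc_id) pairs closest to the query, nearest first."""
    scored = (
        (euclidean_distance(query_vec, vec), doc_id)
        for doc_id, vec in doc_vectors.items()
    )
    # nsmallest avoids fully sorting the collection when k is small.
    return heapq.nsmallest(k, scored)

# Hypothetical pre-computed document vectors.
index = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.0, 1.0],
}

results = top_k([1.0, 0.0], index, k=2)
print(results)
```

Note that this still visits every vector in the index once per query, so the cost grows with the size of the database.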

However, sorting the whole database for every query is time-consuming. Two earlier stages help reduce the amount of data to search:

Query rewriting: spelling correction, query expansion, and query contraction are applied to the query.

Query understanding: the engine works out the intent of the query, such as news, sports, or education.

Scaling Search Engine Architecture: The Game-Changing Role of Query Rewriting and Understanding

Search engine architecture

This is how a typical search engine works under the hood.

Query Rewriting

Search engines ensure accuracy through:

  • Spelling Correction: Automatically correcting misspelled words.
    Example: Searching for “data scince” shows results for “data science.”
  • Query Expansion: Including related terms.
    Example: “Top DS courses” might return results for both “data science courses” and “data structure courses,” depending on user behavior.
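Spelling correction can be sketched with fuzzy string matching against a known vocabulary. The example below uses Python's standard `difflib` module; the tiny `VOCABULARY` list and the `cutoff` value are assumptions for illustration, and real engines rely on much richer signals such as query logs.

```python
import difflib

# Hypothetical vocabulary of known-good terms.
VOCABULARY = ["data", "science", "structure", "courses", "top"]

def correct_spelling(query, vocabulary=VOCABULARY, cutoff=0.8):
    """Replace each word with its closest vocabulary term, if one is close enough."""
    corrected = []
    for word in query.lower().split():
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        # Keep the original word when nothing in the vocabulary is similar.
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

print(correct_spelling("data scince"))
```

Words with no sufficiently close match are left untouched, so rare but valid terms are not silently rewritten.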

“Ever wondered how Google knows what you’re searching for even if you spell it wrong?”

Let’s look at an example query: “Top DS courses”. The results may show top data structure courses or top data science courses. Which one appears depends on the user: the search engine tracks the topics each user generally searches for and uses that history to make its results more relevant.
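One way to picture this personalization is to expand an ambiguous abbreviation into whichever candidate shows up most often in the user's past searches. The `EXPANSIONS` table, the `expand_query` helper, and the sample histories are all hypothetical; this is a sketch of the idea, not how any particular search engine implements it.

```python
# Hypothetical table of ambiguous abbreviations and their possible meanings.
EXPANSIONS = {"ds": ["data science", "data structure"]}

def expand_query(query, user_history, expansions=EXPANSIONS):
    """Expand abbreviations using the candidate seen most in the user's history."""
    history_text = " ".join(user_history).lower()
    words = []
    for word in query.lower().split():
        candidates = expansions.get(word)
        if not candidates:
            words.append(word)
            continue
        # On a tie (including no history at all), the first candidate wins.
        best = max(candidates, key=lambda c: history_text.count(c))
        words.append(best)
    return " ".join(words)

print(expand_query("Top DS courses", ["data structure interview questions"]))
print(expand_query("Top DS courses", ["intro to data science", "pandas tutorial"]))
```

The same query therefore produces different rewrites for different users, which matches the behavior described above.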

Query Understanding

This step involves determining the query’s intent (e.g., educational, news-related, shopping) to filter irrelevant data. Techniques like classification algorithms and probability models aid this process.

Metrics used:

  • Accuracy
  • Precision and Recall
  • Root Mean Squared Error (RMSE)
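For an intent classifier, precision and recall can be computed per intent from true and predicted labels. The labels below are invented purely to show the arithmetic, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(true_labels, predicted_labels, positive):
    """Precision and recall for one intent, treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical evaluation data for four queries.
true_intents = ["news", "news", "sports", "news"]
predicted_intents = ["news", "sports", "news", "news"]

precision, recall = precision_recall(true_intents, predicted_intents, positive="news")
print(precision, recall)
```

Here two of the three queries predicted as "news" really are news (precision 2/3), and two of the three true news queries were found (recall 2/3).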

Intent detection eliminates most of the data before the next step, finding relevant documents. The search engine only has to check the data or pages tagged in the database with the matching intent.
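The filtering idea can be sketched as follows. A keyword lookup stands in for the classification model here, which is a deliberate simplification; the `INTENT_KEYWORDS` table and the sample documents are hypothetical.

```python
# Hypothetical keyword lists; a real system would use a trained classifier.
INTENT_KEYWORDS = {
    "education": {"course", "courses", "tutorial", "learn"},
    "news": {"news", "headline", "headlines", "breaking"},
    "shopping": {"buy", "price", "cheap", "deal"},
}

def classify_intent(query):
    """Pick the intent whose keywords overlap most with the query, or None."""
    tokens = set(query.lower().split())
    scores = {intent: len(tokens & kws) for intent, kws in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def filter_by_intent(documents, intent):
    """Keep only documents tagged with the given intent (all of them if None)."""
    if intent is None:
        return documents
    return [doc for doc in documents if doc["intent"] == intent]

documents = [
    {"title": "Data science bootcamp", "intent": "education"},
    {"title": "Election results", "intent": "news"},
    {"title": "Laptop deals", "intent": "shopping"},
]

intent = classify_intent("top data science courses")
candidates = filter_by_intent(documents, intent)
print(intent, candidates)
```

Only the documents that survive the filter need to go through the more expensive distance calculation.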

Ranking: Sorting Results for Maximum Relevance

Ranking algorithms, such as PageRank and dynamic graph models, determine the order of results on the SERP. Metrics like Normalized Discounted Cumulative Gain (NDCG) evaluate the performance of these ranking systems.
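Both ideas fit in a few lines of Python. The sketch below shows a basic power-iteration PageRank and a standard NDCG calculation. The link graph, the damping factor of 0.85, the iteration count, and the relevance grades are assumptions for illustration, and the PageRank function assumes every linked page also appears as a key in `links`.

```python
import math

def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

def dcg(relevances):
    """Discounted cumulative gain: later positions contribute less."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical three-page link graph.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(ranks)

# Hypothetical relevance grades, listed in the order the results were shown.
print(ndcg([3, 2, 1]), ndcg([1, 2, 3]))
```

A perfectly ordered result list scores an NDCG of 1.0, and any ordering that pushes highly relevant results down the page scores lower.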

Evaluation metrics used alongside NDCG:

  1. CTR (Click-Through Rate): Measures the ratio of clicks to impressions.
  2. SSR (Successful Session Rate): Tracks sessions exceeding a predefined dwell time.
  3. DAU (Daily Active Users): Reflects user engagement.

Additional metrics:

  • Dwell Time: Time spent viewing a result page.
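The first two metrics reduce to simple ratios. In the sketch below the 30-second dwell threshold and the sample numbers are assumptions; each product would choose its own threshold.

```python
def click_through_rate(clicks, impressions):
    """Ratio of clicks to impressions; 0.0 when there were no impressions."""
    return clicks / impressions if impressions else 0.0

def successful_session_rate(dwell_times_seconds, threshold_seconds=30.0):
    """Share of sessions whose dwell time exceeds the threshold."""
    if not dwell_times_seconds:
        return 0.0
    successful = sum(1 for t in dwell_times_seconds if t > threshold_seconds)
    return successful / len(dwell_times_seconds)

# Hypothetical numbers: 25 clicks on 1000 impressions, four sessions.
ctr = click_through_rate(25, 1000)
ssr = successful_session_rate([5, 45, 120, 10])
print(ctr, ssr)
```

With these sample values the CTR is 2.5% and half of the sessions count as successful.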

Conclusion

Search engines are marvels of modern technology, seamlessly combining algorithms, data science, and user behavior analysis to deliver relevant results. From converting data into numerical formats to employing advanced ranking algorithms, the journey from query to results involves extraordinary computational effort.

As AI evolves, the future of search engines promises even greater personalization, deeper query understanding, and faster, more accurate results. Behind the simplicity of a search bar lies a testament to the incredible power of computational systems.

References

  • Pandit, S., & Gupta, S. (2011). A Comparative Study on Distance Measuring Approaches for Clustering. International Journal of Research in Computer Science, 2(1), 29–31.
  • Kumar, K., & Abhaya, F. D. M. (2013). PageRank algorithm and its variations: A Survey report. IOSR Journal of Computer Engineering, 14(1), 1–8.
  • Sahu, S. (2024). DF PageRank: Improved Incrementally Expanding Approaches for Updating PageRank on Dynamic Graphs. IIIT Hyderabad, India.
  • Yang, M., Wang, H., Wei, Z., Wang, S., & Wen, J.-R. (2024). Exploring Dynamic Graph Models for Search Ranking.
  • Haris, A. R., Hashim, H., & Sarijan, S. (2024). A Study on Online Search Behavior Using Search Engines. Universiti Teknologi MARA, Malaysia.
