The hidden magic behind search engines, and how they work.

Alex Kennedy
12 min read · Sep 26, 2019

Among his achievements in science fiction (most notably in 2001: A Space Odyssey), Arthur C. Clarke has remained celebrated due to the three laws he formulated to try and understand the future. His third law, conceived in 1973, posits that “any sufficiently advanced technology is indistinguishable from magic”.

An example of one of these advanced technologies can be found in tools that are used daily: search engines. At a basic level, search engines take a query and use it to return records from an index, ordered from the most relevant to least.

When searching, users expect human-like relevance alongside machine-like speed, across many potential results. Search engines have to try to deliver on all of these expectations. However, unlike database queries, search queries are long-tailed: a large share of them are rare or phrased in ways that have not been seen before. This makes them poorly suited to “caching”, the practice of serving pre-prepared responses to save time.

By seemingly reading the user’s mind and turning a few words into a set of relevant web pages, search engines are a clear case of Clarke’s law. At first glance, it might as well be supernatural forces at work behind the scenes.

The magic of search engines lies in how they redefine relevance, time after time, and extremely quickly.

A perfect search engine would mimic person-to-person interaction by making suggestions that a human would deem relevant, after analyzing every single potential result. Thanks to advances in cloud computing, at Sajari we have been able to improve how large index volumes can be queried at speed while remaining accurate.

This convergence of processes needed to make search engines work sounds huge, but when you break down a search query into a series of steps, the operation becomes easier to comprehend.

But first…

What is a search query?

A search query is a piece of information used as an input to find a set of records that are most relevant to it. The most common search queries are just a string of characters typed into a search bar, but search queries can take many forms.

Matching candidate resumes to open job positions, searching with latitude and longitude coordinates for local events and pairing profiles on dating sites are all examples of queries used to find relevant data. You can then boost or restrict variables in the search algorithm to influence how records are ordered.

The queries above are examples of structured and unstructured queries that are used to search within massive datasets. Structured queries use the variables present in the underlying data structure to search within it, for example by filtering on the available fields in a database or on the colors of products on a fashion site.

Unstructured queries are not directly mapped to the searchable data’s attributes and need to create structure from the query input. One method of unstructured query understanding is natural language processing, as it can extract useful information from a search query to search beyond raw text relevance.

Typing into a search bar — a well-known example of an unstructured query.

For example, a web search query for “sporting matches near me next week” is unstructured and a search engine needs to add structure to it to return more accurate results. By classifying “sporting matches” as an event, “near me” as a geolocation filter and “next week” as a timeframe, these newly structured filters can search through the indexed data to find the best results.
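As a rough illustration of adding structure to that query, the Python sketch below uses a handful of hand-written phrase rules. The rule table, the field names and the `structure_query` function are assumptions made for this example only; a real engine would rely on natural language processing rather than fixed patterns.

```python
import re

# Hypothetical phrase-to-filter rules, invented for this illustration.
RULES = [
    (re.compile(r"\bsporting matches\b"), ("category", "event")),
    (re.compile(r"\bnear me\b"), ("geo", "user_location")),
    (re.compile(r"\bnext week\b"), ("timeframe", "next_week")),
]

def structure_query(query):
    """Turn free text into structured filters plus any leftover keywords."""
    text = query.lower()
    filters = {}
    for pattern, (field, value) in RULES:
        if pattern.search(text):
            filters[field] = value
            # Remove the recognized phrase so it is not also treated as a keyword.
            text = pattern.sub(" ", text)
    return {"filters": filters, "keywords": text.split()}

result = structure_query("sporting matches near me next week")
print(result)
```

Anything the rules do not recognize stays behind as plain keywords, so the structured filters narrow the search without discarding the rest of the query.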

As 80% of data is expected to be unstructured in the next five years, search queries must evolve beyond trying to create relevance from simple keyword matching, to incorporating more contextual business-side data in search algorithms and a more robust level of query understanding.

So, how then do search engines take an unstructured query and use this to return the most relevant search results?

To show how search engines work, and when different processes come into play, we will demonstrate the beginning to end of a search query in the Sajari engine, from the moment a user begins typing their query right up until the most relevant search results are returned. Quite unbelievably, this entire process happens in 0.009 seconds on average.

We’re going to take an example query “the big appel cheap hotels” and search for this misspelled query on the fictitious website www.americancities.com.

As a bit of context, the search query we demonstrate below will be a query in a standard Sajari Website Search Integration, using the Sajari crawler to index content. Complex search queries that use more data points to influence search result ranking are indeed possible, but for the sake of clarity (and the length of this blog post), we’ll demonstrate our simplest query process.

What happens during a search query?

Step 0: Content is indexed

Before it can even start determining result relevance, a search engine needs to be able to analyze each record you want to search through. You might think that search engines work by heading out and analyzing live pages to return the best results, but this is not correct. A copy of each record is needed to perform incredibly fast search queries, and to customize how search algorithms determine relevance.

Have you ever wondered why Google has so many data centers around the world? You guessed it, these store multiple copies of every accessible page online to determine relevance for search queries. If this information wasn’t stored and optimized to be searched at speed, simple Google queries would take years to process.

For most site search applications, the Sajari crawler will index all accessible pages from a website. The advantage of crawler-based indexing is that you don’t need to worry about connecting to an API and writing code to index your content.

After removing all unnecessary page elements (like headers, footers, ad-blocks, navigation, etc.) the crawler stores each page’s HTML in the index to become the raw text that will be used to determine relevance to the terms of a search query. When indexing, the crawler will also run multiple algorithms over each indexed record and add extra fields and information that are helpful when querying.

If you have boost rules active, the crawler will also mark each of your results with how much it should be boosted by, and store this with each record. These are called “index time boosts” as they’re calculated and stored in the index with each record, but these come back into the mix in a few steps.

When setting up their Sajari account, the marketing manager for American Cities added their domain americancities.com to their Sajari Console. The Sajari crawler indexed their site and analyzed a copy of the HTML from each page. The marketing manager then added a boost rule where every result that contained “/cheap-hotels/new-york/” in the URL was to be boosted by 20% in search queries.
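To make the idea of index-time boosts concrete, here is a minimal Python sketch of an index that stores a boost multiplier alongside each page. It is not how the Sajari crawler is implemented; the `build_index` function, the page URLs and the page text are hypothetical, and only the 20% rule on “/cheap-hotels/new-york/” comes from the example above.

```python
from collections import defaultdict

def build_index(pages, boost_rules):
    """pages: {url: text}; boost_rules: [(url_substring, multiplier)]."""
    postings = defaultdict(set)  # term -> set of URLs containing that term
    boosts = {}                  # url -> boost multiplier stored at index time
    for url, text in pages.items():
        for term in text.lower().split():
            postings[term].add(url)
        boost = 1.0
        for substring, multiplier in boost_rules:
            if substring in url:
                boost *= multiplier
        boosts[url] = boost
    return postings, boosts

# Hypothetical pages standing in for crawled content.
pages = {
    "/cheap-hotels/new-york/deals": "the best new york city cheap hotel deals",
    "/restaurants/new-york/": "new york city restaurants",
}
# A 20% boost is expressed here as a multiplier of 1.2.
postings, boosts = build_index(pages, [("/cheap-hotels/new-york/", 1.2)])
print(boosts)
```

Because the multiplier is computed once and stored with the record, nothing about the rule needs to be re-evaluated while a query is running.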

Step 1: User types query

The first step is fairly simple: the user types their query into the search bar. When you display autocomplete suggestions in a drop-down menu, each keystroke runs a mini query to display the best suggestions in terms of relevance. A user can select one of these suggestions to prefill their query and speed up the process, but otherwise, users can type out their entire query as normal and hit return or click ‘search’.

Jane heads to the website www.americancities.com to conduct some research for her upcoming vacation to New York City. She types in her query “the big appel cheap hotels” but doesn’t realize she has misspelled the keyword “apple”.
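The autocomplete “mini query” can be pictured as a prefix lookup. The Python sketch below is a simplification under stated assumptions: the list of stored queries and the `suggest` function are invented for illustration, and suggestions come back in alphabetical order rather than ranked by relevance as a real engine would do.

```python
import bisect

def suggest(prefix, sorted_queries, limit=5):
    """Return up to `limit` stored queries that start with `prefix`."""
    prefix = prefix.lower()
    # Binary search finds the first entry that could share the prefix.
    start = bisect.bisect_left(sorted_queries, prefix)
    suggestions = []
    for candidate in sorted_queries[start:]:
        if not candidate.startswith(prefix):
            break
        suggestions.append(candidate)
        if len(suggestions) == limit:
            break
    return suggestions

# Hypothetical stored queries; the list must be sorted for bisect to work.
past_queries = sorted([
    "new york city cheap hotels",
    "new york museums",
    "the big apple cheap hotels",
    "things to do in boston",
])
print(suggest("the b", past_queries))
```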

Step 2: Query is spellchecked

Once a user has typed their query, this string of characters is sent to a query pipeline. The pipeline is where the search algorithm is built that determines how results are to be made relevant. Using a series of configurable steps, the pipeline dynamically constructs an engine query using data variables in a search algorithm. Once this is ready, it gets sent to the engine to extract the best results.

The pipeline is where the magic really happens. These dynamically generated algorithms are highly complicated queries that are executed on the contents of an index, often with thousands (or millions) of records. The magic is how quickly these algorithms are created and seem to read the user’s mind to return the best search results.
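One way to picture a pipeline is as a list of small functions that each take the engine query and return an updated copy. The Python sketch below is an assumption about the general shape only, not Sajari’s actual pipeline API; `run_pipeline`, the step names and the dictionary layout are all hypothetical.

```python
def run_pipeline(text, steps):
    """Pass an engine query (a plain dict here) through each step in order."""
    engine_query = {"input": text, "terms": text.split(), "boosts": []}
    for step in steps:
        engine_query = step(engine_query)
    return engine_query

def lowercase_terms(query):
    """Hypothetical step: normalize the case of every term."""
    return {**query, "terms": [term.lower() for term in query["terms"]]}

def add_url_boost(query):
    """Hypothetical step: attach a boost rule for the engine to apply."""
    rule = {"url_contains": "/cheap-hotels/new-york/", "multiplier": 1.2}
    return {**query, "boosts": query["boosts"] + [rule]}

engine_query = run_pipeline(
    "The Big Apple cheap hotels",
    [lowercase_terms, add_url_boost],
)
print(engine_query)
```

Keeping each step independent is what makes the pipeline configurable: steps can be reordered, removed or swapped without touching the others.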

The pipeline can also call external services to improve search relevance — our Spellcheck service is one of these. Spellcheck will analyze any of the words in a search query to see if they have been misspelled, and will also account for the context they are used in (i.e. words surrounding the misspelled keyword).

This process uses a probability matrix that runs different variations of a misspelled keyword concurrently to determine which is more likely to lead to more useful results. This is calculated based on the content of your search index and more generic language models. So if you have a brand name on your site that isn’t known anywhere else, misspellings of it will still be corrected.

The Sajari pipeline receives the query “the big appel cheap hotels” and sees that the word “apple” has been misspelled as “appel”. The pipeline creates an alternate query that is spelled correctly, and this is determined to be more likely to lead to useful results. This alternate query is given higher weighting when determining search results.
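A much simpler stand-in for this step can be sketched with Python’s standard `difflib` module, which finds the closest known word by string similarity. This is not the probability matrix described above and it ignores surrounding context; the vocabulary list and the `correct_query` function are assumptions for illustration, with the vocabulary imagined as the set of words found in the search index.

```python
import difflib

def correct_query(query, vocabulary):
    """Replace each unknown word with its closest match from the vocabulary."""
    corrected = []
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        # get_close_matches returns the best candidates above a similarity cutoff.
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.6)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

# Hypothetical vocabulary standing in for words seen in the index.
vocabulary = ["the", "big", "apple", "cheap", "hotels", "new", "york", "city"]
corrected = correct_query("the big appel cheap hotels", vocabulary)
print(corrected)  # the big apple cheap hotels
```

Because the vocabulary is drawn from the index itself, a brand name that appears on the site would be a valid correction target even if no general dictionary contains it.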

Step 3: Synonyms are applied

After each keyword has been corrected, the pipeline then checks to see if synonyms have been set for any terms in the query. Synonyms help query keywords to reflect the content of an index, so the words on web pages can be matched to the search patterns of your users. Synonyms are particularly useful when people search for colloquial brand names or nicknames, and the content of the site is spelled or phrased differently — for example, “pants” = “trousers”, “ATM” = “automated teller machine”.

When the pipeline applies each synonym, this creates an alternate query that runs in tandem with the original query. Another probability matrix then determines which query is more likely to lead to the most useful results. This ensures that any useful results that contain the original query, without the applied synonyms, won’t be discarded when relevance is determined.

In the pipeline, a synonym has been set so that every query for “the big apple” maps exactly to the keywords “new york city”. This creates the query “new york city cheap hotels” and keeps the existing query “the big apple cheap hotels”. The query “new york city cheap hotels” is determined to be more likely to lead to a positive outcome.
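The way a synonym produces an alternate query while keeping the original can be sketched in a few lines of Python. The `expand_synonyms` function and the synonym table format are hypothetical, and the sketch uses plain substring replacement, which a real engine would refine to respect word boundaries.

```python
def expand_synonyms(query, synonyms):
    """Return the original query followed by one alternate per matching synonym."""
    alternates = [query]
    for phrase, replacement in synonyms.items():
        if phrase in query:
            alternates.append(query.replace(phrase, replacement))
    return alternates

# Synonym taken from the example; the dict format is an assumption.
synonyms = {"the big apple": "new york city"}
alternates = expand_synonyms("the big apple cheap hotels", synonyms)
print(alternates)
# ['the big apple cheap hotels', 'new york city cheap hotels']
```

Both queries are kept so that a later weighting step can decide how much each contributes to the final ranking.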

Step 4: Machine learning optimization

The next step of the pipeline is to determine how much of the relevance algorithm is to be allocated to machine learning optimization. Sajari uses reinforcement learning to continually promote useful records and ensure the best result order. Reinforcement learning learns from historical data and factors this into the ranking, giving preference to results that have led to a positive outcome.

When a user is presented with a list of results and clicks one, feedback to the underlying index data will allow the algorithm to learn this is a good result. The engine might then reward this result with a higher ranking if the same query is searched for again in the future.

Machine learning has learned that there is one particular result, titled “The Best New York City Cheap Hotel Deals”, that a lot of users have clicked on previously from the query “new york city cheap hotels”. When this query leaves the pipeline and the best results are returned, this result will be boosted.
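The feedback loop can be illustrated with a deliberately simple stand-in: a smoothed click-through rate per query and result. This is not reinforcement learning as Sajari implements it, only a sketch of the idea that clicks feed back into ranking; the `ClickFeedback` class and its method names are invented for this example.

```python
from collections import defaultdict

class ClickFeedback:
    """Tracks how often each result is shown and clicked for each query."""

    def __init__(self):
        self.shown = defaultdict(int)
        self.clicked = defaultdict(int)

    def record(self, query, url, was_clicked):
        self.shown[(query, url)] += 1
        if was_clicked:
            self.clicked[(query, url)] += 1

    def signal(self, query, url):
        """Smoothed click-through rate; 0.5 when nothing has been recorded."""
        key = (query, url)
        return (self.clicked.get(key, 0) + 1) / (self.shown.get(key, 0) + 2)

feedback = ClickFeedback()
query = "new york city cheap hotels"
# Hypothetical history: the result was shown four times and clicked three times.
for was_clicked in [True, True, True, False]:
    feedback.record(query, "/cheap-hotels/new-york/deals", was_clicked)

popular = feedback.signal(query, "/cheap-hotels/new-york/deals")
unseen = feedback.signal(query, "/restaurants/new-york/")
print(popular, unseen)
```

The smoothing keeps a result with one lucky click from jumping ahead of results with a long, consistent history.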

Step 5: Apply boost rules

As the last step of this example pipeline, editable weighting allows you to choose how much of the search algorithm should be weighted towards boosts that were set when pages were indexed in Step 0. Boosting is configurable, which means you have the option to boost on query matches in the title, exact matches in the description, pages that are recently published — anything that you can think of. Boost rules are a great way to influence an entire section of your results, without needing to adjust individual queries and results on a micro-level.

In the americancities.com search index, there is a boost rule where results that contain /cheap-hotels/new-york/ in the URL are to be boosted by 20%. When the search query leaves the pipeline and goes to the engine, the results that match this boost rule will be lifted higher in the results set.
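A minimal sketch of how an editable weighting might scale a stored boost is shown below. The formula, the function names and the example scores are assumptions rather than Sajari’s implementation: a weight of 1.0 applies the full 20% boost and a weight of 0.0 ignores it.

```python
def effective_boost(index_boost, boost_weight):
    """Scale an index-time boost by a pipeline weight between 0 and 1."""
    return 1.0 + boost_weight * (index_boost - 1.0)

def apply_boosts(scored_results, boosts, boost_weight=1.0):
    """scored_results: {url: relevance}; boosts: {url: index-time multiplier}."""
    return {
        url: score * effective_boost(boosts.get(url, 1.0), boost_weight)
        for url, score in scored_results.items()
    }

# Hypothetical relevance scores before boosting; both pages start level.
scores = {"/cheap-hotels/new-york/deals": 0.5, "/restaurants/new-york/": 0.5}
boosts = {"/cheap-hotels/new-york/deals": 1.2}
boosted = apply_boosts(scores, boosts)
print(boosted)
```

With equal starting scores, the page matching the URL rule ends up ahead, which is the lift the boost rule is meant to provide.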

Step 6: Results returned from the engine

The pipeline has now built a search algorithm that will be used to query the index and return the best results. It’s in the engine service where the search algorithm traverses through the index’s stored content to determine the most relevant results. Indexes in the millions can be sorted through to find the best results from a search query — all in the blink of an eye.

Using the search algorithm created in the pipeline, results that more closely match this algorithm will receive a higher ranking in the results set. So if there is a result that has an exact match to the query, has a boost rule active, and has previously outperformed other results using machine learning, this result will be very high in the list.

The search query “new york city cheap hotels” is executed, and relevant results are boosted with machine learning optimization and user-defined boost rules. The engine knows that the result “The Best New York City Cheap Hotel Deals” has been clicked on by a lot of users in the past. As this result contains an exact match to the query in the title AND has been boosted by 20% with a matching URL rule, this result will be ordered first. Following on from this are the next most relevant results.
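Putting the signals together, the final ordering can be sketched as a single score per result. The scoring formula below is an assumption chosen for readability (term overlap in the title, multiplied by the index-time boost and a click signal), and the result records are invented; the engine’s real relevance calculation is considerably more involved.

```python
def rank(results, query, ml_weight=1.0):
    """Order results by title overlap times index boost times click signal."""
    terms = query.lower().split()

    def score(result):
        title_terms = set(result["title"].lower().split())
        overlap = sum(term in title_terms for term in terms) / len(terms)
        return overlap * result["index_boost"] * (1.0 + ml_weight * result["click_signal"])

    return sorted(results, key=score, reverse=True)

# Hypothetical records with made-up boost and click values.
results = [
    {"title": "Cheap Hotels in Boston", "index_boost": 1.0, "click_signal": 0.1},
    {"title": "New York City Restaurants", "index_boost": 1.0, "click_signal": 0.1},
    {"title": "The Best New York City Cheap Hotel Deals", "index_boost": 1.2, "click_signal": 0.8},
]
ranked = rank(results, "new york city cheap hotels")
for result in ranked:
    print(result["title"])
```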

Step 7: Results sent back to the user

As the final step, each of these search results is sent back to the end user in the results page. Each of these results is displayed and given a tracking token to learn which result a user found most useful. Usually this is by tracking which result is clicked, but this can also record which result leads to another positive outcome like a sale, sign-up or article share.

The end-user is given a list of results, ordered from most relevant to least relevant. Jane clicks on the top result, “The Best New York City Cheap Hotel Deals”, and successfully books her trip to The Big Apple.
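Tracking tokens can be sketched as random identifiers that map back to the query and result they were issued for. The Python below is illustrative only: `attach_tokens`, `record_click` and the in-memory `token_store` are hypothetical names, and a real system would persist this mapping rather than keep it in a dictionary.

```python
import secrets

def attach_tokens(query, urls, token_store):
    """Give each result a random token that maps back to (query, url)."""
    tagged = []
    for url in urls:
        token = secrets.token_urlsafe(8)
        token_store[token] = (query, url)
        tagged.append({"url": url, "token": token})
    return tagged

def record_click(token, token_store, clicks):
    """Look up which query and result a clicked token belongs to."""
    query, url = token_store[token]
    clicks.append((query, url))

token_store, clicks = {}, []
tagged = attach_tokens(
    "new york city cheap hotels",
    ["/cheap-hotels/new-york/deals", "/restaurants/new-york/"],
    token_store,
)
# Simulate the user clicking the top result.
record_click(tagged[0]["token"], token_store, clicks)
print(clicks)
```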

Conclusion

So there you have it — that was a very simple example of how Sajari turns a search query into an ordered list of relevant results. Remember, what is most impressive about this process is how quickly it happens — all of this usually only takes 0.009 seconds. Depending on the user’s proximity to our data centers, this process can sometimes happen even faster.

Keep in mind that this was a search query for a standard Sajari Website Search Integration. With Sajari you can customize and build your own unique search algorithm using a library of pre-built steps and can decide how important each step is in your pipeline.

For example, you can run a live A/B test on your ecommerce store where one pipeline boosts pages with a higher profit margin and more stock on the floor, and another pipeline boosts products with a high customer review rating. You can live test your search pipelines and can see which algorithm has a greater impact on your bottom line.
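One common way to split traffic for a test like this is to hash a stable user identifier, so that each user always sees the same pipeline. The sketch below assumes that approach; the pipeline names and the `assign_pipeline` function are hypothetical and not part of any Sajari API.

```python
import hashlib

def assign_pipeline(user_id, pipelines=("margin_and_stock", "review_rating")):
    """Deterministically map a user to one pipeline variant."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # The first byte of the hash picks the variant.
    return pipelines[digest[0] % len(pipelines)]

variant = assign_pipeline("user-42")
print(variant)
```

Hashing avoids storing an assignment table, and the same user identifier always lands on the same variant across sessions.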

You could even use conditional logic in your algorithm, so that certain steps or weightings are applied only when other conditions have been met.
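Conditional steps can be pictured as a wrapper that runs a step only when a test on the query passes. This is a sketch of the concept under assumed names (`when`, `boost_hotels`, `mentions_hotels`), not the syntax of Sajari’s pipeline configuration.

```python
def when(condition, step):
    """Wrap a pipeline step so it only runs if condition(query) is true."""
    def wrapped(query):
        return step(query) if condition(query) else query
    return wrapped

def boost_hotels(query):
    """Hypothetical step: flag hotel pages for boosting."""
    return {**query, "boosts": query.get("boosts", []) + ["hotel_pages"]}

def mentions_hotels(query):
    return "hotels" in query["terms"]

step = when(mentions_hotels, boost_hotels)
with_hotels = step({"terms": ["cheap", "hotels"]})
without_hotels = step({"terms": ["museums"]})
print(with_hotels, without_hotels)
```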

Hopefully, we’ve broken it down and helped explain what happens during a search query. It’s quite a difficult process to get your head around, but once you know how the magic happens it can guide you to building and maintaining the most relevant search possible.

By weighting certain fields to be more or less important in the pipeline, you can adjust how search results are made relevant and ordered to each of your users. The magic is now in your hands.

Looking to improve your site’s search?

Sajari is a fully-featured search platform for your site, ecommerce store or app that includes machine learning-powered results, powerful analytics and fully flexible interface options. Sign up for a free 14-day trial today or contact us at sales@sajari.com for more information.

Originally published at https://www.sajari.com on September 26, 2019.
