Search Re-ranking in non-scoring Search Engines
Problem Statement:
A search engine's relevance is driven by a scoring model of some kind. For a given query, the relevance of each retrieved document can be represented as a document-level score. Search engines such as Solr, Elasticsearch, and Vespa follow a weighted scoring model (the more widely followed and generally preferred approach) to retrieve documents by relevance. In a few search engines, such as Endeca and Algolia, search relevance is instead based on a tie-breaking model. Re-ranking documents in these engines based on external attributes is a challenge.
This blog discusses the following problems.
- How to re-rank search results based on external signals in non-scoring search engines?
- How to make the solution generic, irrespective of the search engine?
How do non-scoring search engines work?
Scoring search engines generally use TF-IDF or BM25 to retrieve relevant documents for a query. Retrieval algorithms have since evolved, and vector-based similarity search is now also used to retrieve relevant documents. This blog focuses on search engines that don't offer a scoring model.
Below is the set of features that determine the relevancy of a query in a non-scoring search engine.
- Words: When the query contains more than one term, an object matching more of the words typed by the user ranks higher.
- Attribute: A match in a more important attribute (Title, Description) ranks higher.
- Proximity: Words close to each other rank higher ("Skimmed milk" is a better match than "Skimmed Semi milk").
- Typo: A word that doesn't contain a typing mistake ranks higher.
- Position: Words at the beginning of an attribute rank higher than words in the middle or at the end.
- Exact: Words that match exactly, without any suffix, rank higher.
A non-scoring search engine applies the above set of features to rank results by relevancy. The pipeline varies from one search engine to another, and the tie-breaking model is usually unique to each engine. The next sections cover the tie-breaking model and then re-ranking based on external signals.
Search Relevancy — Tie Breaking Model
- For a search query, the search engine matches against the relevant fields and retrieves a corpus (C) of results from the complete collection/index (D).
- The documents are then ranked and classified according to the relevancy model defined in the search engine. After the first criterion is applied, the corpus (C) is separated into subsets (C1, C2, …) based on the matches.
- The next relevancy criterion is applied within each subset (for example C1), splitting it into finer groups (C11, C12). The relative order established between C1 and C2 by the first tie-breaker is always preserved while the remaining criteria are applied.
- After all the criteria are applied, the final relevant results are returned. The order in which the criteria are applied varies from one search engine to another.
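The steps above can be sketched as a sort over an ordered list of criteria. This is a minimal illustration, not how any particular engine implements it: the `Match` fields, the `CRITERIA` order, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    # Hypothetical per-document match features for a single query.
    doc_id: str
    words_matched: int   # more matched query words is better
    typos: int           # fewer typos is better
    attribute_rank: int  # 0 = most important attribute (lower is better)
    proximity: int       # distance between matched words (lower is better)


# Ordered tie-breaking criteria (assumed order). Each returns a key where
# a smaller value ranks first; later criteria only break earlier ties.
CRITERIA = [
    lambda m: -m.words_matched,
    lambda m: m.typos,
    lambda m: m.attribute_rank,
    lambda m: m.proximity,
]


def tie_break_rank(matches):
    """Rank matches by comparing criteria in order, like nested buckets."""
    return sorted(matches, key=lambda m: tuple(c(m) for c in CRITERIA))


matches = [
    Match("a", words_matched=2, typos=0, attribute_rank=1, proximity=3),
    Match("b", words_matched=2, typos=0, attribute_rank=0, proximity=5),
    Match("c", words_matched=1, typos=0, attribute_rank=0, proximity=0),
    Match("d", words_matched=2, typos=1, attribute_rank=0, proximity=1),
]

ranked_ids = [m.doc_id for m in tie_break_rank(matches)]
print(ranked_ids)  # ['b', 'a', 'd', 'c']
```

Note that "c" stays last even though it is the best on proximity: it lost on the first criterion, and no later criterion can move it across that boundary.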
How to influence Relevancy using external signals?
In modern search engines (Solr/Elasticsearch), integrating external signals, ranking models, or attributes to make search more relevant is much easier and simpler. Influencing relevance in weighted scoring models is not covered here. The question is how to integrate external signals in a tie-breaking model.
Search relevancy can be influenced by a number of external signals.
- Customer behaviour: Clickstream data, session data, and user behaviour/purchases can be factored in when the customer is looking for relevant results.
- Product attributes: Dynamic product attributes can be included as a major signal that influences the customer's shopping pattern.
- Current trends: Relevance can be adjusted based on seasonal/regional influences from different sources.
Add external signals as an attribute in existing relevancy
One option is to use these signals as an attribute in the existing search relevancy: one of the tie-breaking layers can include the signals to re-rank within the existing relevancy.
The disadvantages of this approach are listed below.
- Calculations done on the external signals lose their accuracy/value once they are fitted into the tie-breaking model.
- With multiple signals it is hard to define their priority, and the search engine would need to choose dynamically among hundreds of these signals, which is a limitation.
- Including these ranking algorithms within the search engine creates a dependency on it, and a few search engines have limitations on including run-time signals.
- Many open-source ranking models are available, so ideally the ranking solution should be agnostic of the search engine.
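One way to read the first disadvantage is shown in the toy example below. The documents, the `popularity` signal, and the position of the signal layer are all made up for illustration. Once a continuous signal becomes a tie-breaking layer, only its order within an already tied group matters, not its magnitude.

```python
# Hypothetical documents: (doc_id, words_matched, popularity_signal)
docs = [
    ("a", 2, 0.10),
    ("b", 2, 0.11),
    ("c", 1, 0.99),
]


def rank_with_signal_layer(docs):
    """words_matched is the first tie-breaker; the external signal is the
    second layer and only orders documents tied on words_matched."""
    ordered = sorted(docs, key=lambda d: (-d[1], -d[2]))
    return [doc_id for doc_id, _, _ in ordered]


order = rank_with_signal_layer(docs)
print(order)  # ['b', 'a', 'c']
```

Here "b" beats "a" on a 0.01 difference exactly as it would on a 0.9 difference, while the very strong signal on "c" cannot lift it past the first layer.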
Ranking outside search engine
Ranking algorithms can be built using various signals such as add-to-cart count, impressions of the search term, position, sales score of the item, promotions on the item, etc. Algorithms can also include the hotness of clicks and the freshness of products.
- When a query is served by the search engine, it processes the query and retrieves the relevant documents. The results are ranked by the relevance algorithm defined within the search engine and sent back as the response.
- After the results are retrieved, the documents need to be ranked or scored.
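A minimal sketch of this flow follows, assuming the search engine is hidden behind a plain function that returns an ordered list of document ids. `SearchFn`, `ScoreFn`, `search_and_rerank`, and the sample data are hypothetical names for illustration, not the API of any engine.

```python
from typing import Callable, List, Sequence

# Any backend that maps a query to an ordered list of document ids.
SearchFn = Callable[[str], Sequence[str]]
# Maps (query, doc_id, original position) to a score; higher is better.
ScoreFn = Callable[[str, str, int], float]


def search_and_rerank(query: str, search: SearchFn, score: ScoreFn) -> List[str]:
    """Retrieve results from the engine, then re-rank them outside of it."""
    results = list(search(query))
    scored = [(score(query, doc_id, pos), pos, doc_id)
              for pos, doc_id in enumerate(results)]
    # Highest score first; ties keep the engine's original order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _, _, doc_id in scored]


# Stand-in for a real engine call (assumed data).
def fake_search(query: str) -> Sequence[str]:
    return ["p1", "p2", "p3"]


add_to_cart_counts = {"p1": 5, "p2": 20, "p3": 5}

reranked = search_and_rerank(
    "milk",
    fake_search,
    lambda query, doc_id, pos: float(add_to_cart_counts.get(doc_id, 0)),
)
print(reranked)  # ['p2', 'p1', 'p3']
```

Because the re-ranker only sees an ordered list of ids, swapping the engine means swapping `fake_search` for another function with the same shape.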
Scoring Algorithms
- For each position (i) of a product (p), a positional score (s) needs to be computed.
- Identify the range over which the scores are to be distributed. This range can change for each category of queries.
- Based on the position of the product for a search term and the identified range, compute the score of the product for each query's results.
- In the scoring expression, the alpha and beta values need to be computed for each category to distribute the scores across the positions of the search result.
- This scoring algorithm doesn't fairly capture the difference between a relevant and an irrelevant product. It is purely a distribution based on the position of the items returned by the search engine.
- In this approach, the external signals play a major role in ranking after relevancy.
- The distributed positional score is used along with the clickstream/session scores (ci) and the product scores based on dynamic attributes (pi).
- These cumulative scores are finally used to rank the products.
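The exact scoring expression is not reproduced in this text, so the sketch below rests on two assumptions: positional scores are spread linearly between a lower bound `alpha` and an upper bound `beta`, and the cumulative score is a plain sum of the positional score, the clickstream score (ci), and the product score (pi). The function names and sample values are hypothetical.

```python
def positional_scores(n, alpha, beta):
    """Spread n scores linearly from beta (first position) to alpha (last).

    The linear form is an assumption; alpha and beta would be chosen per
    query category.
    """
    if n == 0:
        return []
    if n == 1:
        return [beta]
    step = (beta - alpha) / (n - 1)
    return [beta - i * step for i in range(n)]


def rerank(results, click_scores, product_scores, alpha=0.0, beta=1.0):
    """Re-rank engine results (unique doc ids) by s_i + c_i + p_i."""
    s = positional_scores(len(results), alpha, beta)
    total = {
        doc_id: s[i] + click_scores.get(doc_id, 0.0) + product_scores.get(doc_id, 0.0)
        for i, doc_id in enumerate(results)
    }
    # sorted() is stable, so equal totals keep the engine's order.
    return sorted(results, key=lambda doc_id: -total[doc_id])


engine_results = ["p1", "p2", "p3"]     # order returned by the engine
click_scores = {"p2": 0.4, "p3": 1.2}   # c_i from clickstream/session data
product_scores = {"p1": 0.1, "p3": 0.1} # p_i from dynamic attributes

final_order = rerank(engine_results, click_scores, product_scores)
print(final_order)  # ['p3', 'p1', 'p2']
```

With the positional scores bounded to [0, 1], a click score of 1.2 is enough to move "p3" from last to first, which shows how strongly the external signals can dominate when the positional range is narrow.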
In this approach, models or external ranking algorithms can be integrated without building a dependency on the search engine. Extending the ranking algorithms also doesn't heavily affect core search relevance. Because the positional scores are distributed within bounded limits, the bias toward the external ranking is very high in this approach. On the positive side, if there is a plan to migrate to a scoring engine (the most followed and preferred kind), the changes to the ranking algorithms are minimal.
As part of this series on search scoring, the next set of blogs will discuss different re-ranking approaches in scoring-based engines (Solr/Elasticsearch).