Building Search v1.0 at Inventa

Weber Matias
Published in Building Inventa
Jan 10, 2023

Building search for a marketplace is not a simple task. Shoppers expect the speed and quality delivered by Google and Amazon. The bar is set so high that we don’t talk anymore about how to build search. Instead, companies like Inventa must craft a “search experience”.

Search is a core driver of our business value. That’s why we chose to build it ourselves. This gives us the flexibility we need to achieve the best search experience for our users.

Where to start?

There are so many features around the search bar and within the search results that it can be hard to decide what to work on first.

First, choose your engine. We chose OpenSearch as our inverted index search engine. The classic approach of searching over an inverted index of your catalogue's products is still very powerful. Vector search using embeddings and nearest-neighbor algorithms is getting better by the day, but search engines such as OpenSearch, Elasticsearch and Solr still provide a lot of functionality designed to address most of the problems that search needs to solve.

Start Fast

We built a proof-of-concept application for search so we could start crafting the search experience without waiting on engineering to write production app code. We used a Python Flask application with a minimal UI connected to an OpenSearch instance. By doing this, we could start tuning the experience even before deploying the full architecture of the Marketplace. We could focus on user experience right from the beginning while our product teams worked on migrating to our new platform.

With this simplified setup, we indexed our products directly from the data warehouse for development purposes. This enabled us to iterate over the index field mappings before connecting to the production catalogue service.
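As a rough sketch, the field mappings we iterated on could look something like the following. The field names here are illustrative assumptions, not Inventa's actual schema:

```python
# Illustrative OpenSearch index mapping for a product catalogue.
# Field names (title, brand, product_category, sales_rank) are assumptions,
# not the production schema from the post.
product_index_body = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},                # free-text matching
            "brand": {"type": "text"},
            "product_category": {"type": "keyword"},  # exact filtering / boosting
            "sales_rank": {"type": "integer"},        # query-independent factor
        }
    }
}

# With the opensearch-py client, this body would be passed to something like
# client.indices.create(index="products", body=product_index_body).
```

Iterating here is cheap: change a field type or analyzer, reindex from the warehouse, and re-run the test queries.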

Although simple, this application contains all the main functionality that a search experience should have including autocomplete, filtering and sorting. We also run different versions of search in the same application so we can experiment and see the improvements with each iteration.
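A request body combining matching, filtering and sorting can be assembled from a plain Python dictionary, roughly as in this hedged sketch (the helper and field names are ours, for illustration):

```python
def build_search_body(text, category=None, sort_field=None):
    """Assemble an OpenSearch request body with optional filter and sort.

    Field names (title, product_category) are illustrative assumptions.
    """
    body = {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
            }
        }
    }
    if category:
        # Filters narrow the result set without affecting relevance scores.
        body["query"]["bool"]["filter"] = [{"term": {"product_category": category}}]
    if sort_field:
        body["sort"] = [{sort_field: {"order": "asc"}}]
    return body
```

Building the body programmatically like this is also what lets us run different versions of search side by side: each variant is just a different function producing a request body.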

When text matching is not enough

In order to demonstrate the work we did and the impact it has, we’ll walk through a simple example of searching for “cafe” in our development search app. We’ll start with basic string matching and work our way up to something much more powerful.

As you can see, we get some coffee products in our results, but we get non-relevant results too. When querying only over text we might match every title containing the word “cafe”, but we can’t tell whether the product is relevant to the query. When our retailers search for “cafe” they typically aren’t looking for bags; they are looking for coffee products. But this kind of mismatch is typical of text-only retrieval, as variants, accessories and unrelated items can show up simply by matching text over the index fields. The issue gets worse as you index and search over longer text fields such as product and brand descriptions.

Another issue that comes up with text-only matching is that we might not get the most relevant result at the top. The exception would be searching for the exact full title of a product or brand name. In any other case, our retailer’s exploratory query will match only a part of the title or brand name, and with text-matching only there’s no way to know which result is more relevant.
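For reference, a text-only retrieval query of this kind might look like the following sketch (the field names and weights are assumptions on our part):

```python
# A plain text-matching query: any document whose title, brand or description
# contains "cafe" is returned, with no notion of product relevance beyond
# term statistics. The "^2" weights title matches higher than the other fields.
text_only_query = {
    "query": {
        "multi_match": {
            "query": "cafe",
            "fields": ["title^2", "brand", "description"],
        }
    }
}
```

Everything that follows in this post is about layering extra signals on top of a query like this one.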

Using function scores

Text matching is clearly not enough to get the best out of our search engine. That’s why we need to add extra information to our product index. This is where “query-independent factors” such as sales rank, popularity and clicks can help.

With this information in the index, we can use OpenSearch function scores over these new fields to boost the products that are more relevant to our users. Using the function_score query we can modify the score of the retrieved documents that were matched by the query based on functions of the query-independent factors.

For our “cafe” query, we can use a script_score function to boost the most popular items while keeping the score almost untouched for unpopular items.

"script_score": {
  "script": {
    "params": {
      "baseline": 0.9,
      "max_multiplier": 10
    },
    "source": "params.baseline + params.max_multiplier / doc['sales_rank'].value"
  }
}
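Putting it together, a script_score like the one above sits inside a function_score query. A sketch of the full request body follows; the match clause and the boost_mode choice are our assumptions, not taken from the post:

```python
# Sketch of a full function_score request body. The text query and
# boost_mode are illustrative assumptions.
function_score_query = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "cafe"}},  # the text-matching part
            "script_score": {
                "script": {
                    "params": {"baseline": 0.9, "max_multiplier": 10},
                    # Popular items (low sales_rank) get a large multiplier;
                    # unpopular items stay close to the 0.9 baseline.
                    "source": (
                        "params.baseline + "
                        "params.max_multiplier / doc['sales_rank'].value"
                    ),
                }
            },
            "boost_mode": "multiply",  # combine text score with function score
        }
    }
}
```

With "multiply" as the boost_mode, the function output scales the text-matching score rather than replacing it, so text relevance still matters.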

When boosting our popular items for “cafe” we found out that the best-selling item for this query is a hair dye product. This is not what we were expecting. Our retailers expect to see the best coffee products at the top.

Query understanding

Function scores can boost our most important products to the top for each query, but they aren’t great at addressing the issue of unrelated items showing up in our search results. To address this problem we need to understand the user’s intent when they arrive at the search bar. If we knew what the user was thinking, we could see that the query was about coffee products.

This is why query understanding is so important, not just for search but for the overall marketplace experience: understanding what our users are looking for is crucial to serving them well.

Our first query understanding model classifies queries into our product categories. When a query is entered in the search bar, we classify it first and then boost and/or filter the results based on the classification. By doing this, all the products from the “Coffee” product category are now boosted to the top of our search results.

That’s it! No more non-relevant results at the top of our search results.

Query understanding is a very powerful tool, but it cuts both ways: if we filter products into the wrong category, our users will never find what they are looking for, so we must be confident before boosting or filtering. Many strategies can be used to tackle this problem. The simplest approach is to apply the boosting or filtering only for classifications whose probability exceeds a certain threshold. Another strategy is to boost by different values depending on the certainty that you have for each category. In this scenario, we would take the top k classifications from the model and boost the categories whose probabilities sum up to, for example, 90% of the total probability.
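The second strategy can be sketched as a small helper that keeps the top categories covering, say, 90% of the probability mass. The function and its names are ours, for illustration only:

```python
def categories_to_boost(probs, mass=0.9):
    """Pick the top categories whose probabilities cover `mass` of the total.

    `probs` is a mapping from category label to predicted probability.
    Illustrative helper, not the implementation from the post.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for label, p in ranked:
        if total >= mass:
            break  # already covered enough probability mass
        chosen.append(label)
        total += p
    return chosen
```

For a prediction like `{"Coffee": 0.7, "Tea": 0.25, "Bags": 0.05}`, this keeps “Coffee” and “Tea” and drops the long tail, and each kept category could then be boosted in proportion to its probability.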

We started with the simplest strategy, as follows: first we use the query understanding model to predict the category we want to boost, and then we add a function to our query that boosts that category. Since we build the query as a Python dictionary, we can append this clause only when we are confident about the predicted category:

if label_value >= QUERY_UNDERSTANDING_THRESHOLD:
    query_object["query"]["function_score"]["query"]["bool"]["should"].append({
        "query_string": {
            "fields": ["product_category"],
            "query": label,
            "boost": 1000,
        }
    })

What’s next?

Early on, we focused on the projects that could most impact the user experience: integrating query-independent factors and building a query understanding model.

With all these features in production, we have built the retrieval part of the search architecture and an initial global ranking layer. But which are the best products to show our users? Today we use our own judgment to assess the quality of the search results, but we plan to take a machine learning approach in the future: using Learning to Rank to train a model that ranks our search results.

Next steps include working on more ranking layers after retrieval. With these layers we aim to optimize and personalize the search results. We will develop offline batch rankers first, then work our way up to real-time rankers to achieve the best user experience.

If you want to dive deep into search concepts, check out the CoRise courses Search Fundamentals and Search with Machine Learning. A very big shoutout to Daniel Tunkelang, Grant Ingersoll and İlkcan Keleş for their support and feedback on this post!
