GCP Retail Search Onboarding: Best Practices for Catalog Data (Part 1/4)

Shrish marnad
Google Cloud - Community
8 min read · Aug 17, 2023

When getting started with Retail Search, the primary driver of search results and performance is the data: the catalog (product information) and the user events. Retail Search performance (relevance, ranking and revenue optimisation) is extremely sensitive to the data uploaded to it. For that reason there are multiple dashboards and data quality checks in place to flag issues or potential flaws in the data and its formatting. If these are overlooked, the AI model will not be trained accurately, and if an A/B test starts while data quality issues persist, Retail Search is not guaranteed to perform optimally and can produce unexpected outcomes. It can then seem that Retail Search is not working as expected, when in fact the issue is almost always with the catalog or user events data.

To make this process easier, I have put together a list of best practices for onboarding data to Retail Search. They fall into four sections.

  1. Product catalog best practices.
  2. Integration and configuration best practices.
  3. User events best practices.
  4. A/B Experiments best practices.

This post covers only the best practices for Product Catalog data. (Please check out part 2, part 3 and part 4 for the other sections.)

Product hierarchy structure:

There are three main product classification hierarchies (or product types): PRIMARY-VARIANT, PRIMARY-only and COLLECTION-type products.

In the PRIMARY-VARIANT structure, the primary is usually only a placeholder for common information, and the variants are the actual SKUs that can be purchased. Examples:

  1. A T-shirt primary product can hold the brand, common attributes, description etc., and the variants hold the differentiating info like color, size, price etc.
  2. A phone primary product can hold the common information about the phone's features, brand etc., and the variants can be the phone with a specific RAM, screen size, battery capacity etc.

In PRIMARY-only products, the primary product itself is the sellable SKU. Example: a unique pair of headphones that comes in only one color, or a specific design of jewelry item. These items cannot be part of any PRIMARY-VARIANT structure.
For products that have variants, it is recommended to structure them in a PRIMARY-VARIANT hierarchy, as there are multiple advantages. Some are:

  • The search results page will show a diverse result set to end users; if the variants were treated as primary products, the page would fill up with copies of the same product.
  • Ranking: products get a richer ranking scheme, as a primary with variants is ranked better when a particular variant gets more engagement. This helps re-ranking and revenue optimisation.
  • Ease of maintaining the catalog: if an attribute changes for a group of products that differ only by size, the change can be made once at the PRIMARY level instead of updating multiple PRIMARY products.
  • API features and search response fields based on variant rollup keys are supported only for variants.
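As a rough illustration of the hierarchy described above, here is a minimal sketch of one primary and two variants as plain Python dicts, plus a sanity check for variants that point at a missing primary. The field names (`type`, `primaryProductId`, `colorInfo`, `sizes`) follow my reading of the Product schema and the values are made up; `orphan_variants` is a hypothetical helper, not part of any Retail Search client library.

```python
# Illustrative catalog records; ids, titles and brand are invented.
primary = {
    "id": "tshirt-123",
    "type": "PRIMARY",
    "title": "Crew-neck cotton T-shirt",
    "brands": ["ExampleBrand"],
}

variants = [
    {
        "id": "tshirt-123-red-m",
        "type": "VARIANT",
        "primaryProductId": "tshirt-123",  # links the SKU to its primary
        "colorInfo": {"colors": ["Red"]},
        "sizes": ["M"],
    },
    {
        "id": "tshirt-123-blue-l",
        "type": "VARIANT",
        "primaryProductId": "tshirt-123",
        "colorInfo": {"colors": ["Blue"]},
        "sizes": ["L"],
    },
]


def orphan_variants(products):
    """Return ids of VARIANT products whose primaryProductId matches no PRIMARY."""
    primary_ids = {p["id"] for p in products if p.get("type") == "PRIMARY"}
    return [
        p["id"]
        for p in products
        if p.get("type") == "VARIANT"
        and p.get("primaryProductId") not in primary_ids
    ]


print(orphan_variants([primary] + variants))
```

Running a check like this before an import is one way to catch variants that would otherwise have no primary to roll up to.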

Product availability correctness.

The Product availability field is typically set by the inventory update system as the product stock state changes. It is highly recommended to keep track of all products in the IN_STOCK and OUT_OF_STOCK states. If the majority of products are OUT_OF_STOCK, the search response will contain many out-of-stock (OOS) products, and adding an IN_STOCK filter will reduce recall dramatically. If a product has gone out of stock but its catalog state is still IN_STOCK, users will see the product as available but will probably run into issues at purchase or add-to-cart time. This affects the user experience more than the AI model training. It is recommended to keep the Product.availability field as up to date as possible using the patch APIs or the import APIs with a readMask.
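To make the two checks above concrete, here is a small sketch that works on exported catalog records as plain dicts. It computes the share of each availability state and lists products whose catalog state disagrees with the inventory system. Both function names and the `inventory_states` mapping are assumptions for illustration; no Retail Search API call is made here.

```python
from collections import Counter


def availability_share(products):
    """Fraction of products in each availability state."""
    counts = Counter(
        p.get("availability", "AVAILABILITY_UNSPECIFIED") for p in products
    )
    total = sum(counts.values())
    return {state: n / total for state, n in counts.items()}


def stale_availability(catalog, inventory_states):
    """Ids whose catalog availability differs from the inventory system.

    inventory_states is a hypothetical {product_id: state} mapping taken
    from the source-of-truth inventory system.
    """
    return sorted(
        p["id"]
        for p in catalog
        if p["id"] in inventory_states
        and p.get("availability") != inventory_states[p["id"]]
    )


catalog = [
    {"id": "a", "availability": "IN_STOCK"},
    {"id": "b", "availability": "IN_STOCK"},
    {"id": "c", "availability": "OUT_OF_STOCK"},
    {"id": "d", "availability": "IN_STOCK"},
]
print(availability_share(catalog))
print(stale_availability(catalog, {"b": "OUT_OF_STOCK", "c": "OUT_OF_STOCK"}))
```

The ids returned by `stale_availability` are the ones that would need an availability-only update.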

Use native fields whenever possible instead of custom attributes.

The Product info schema can be referenced here. The schema is quite extensive and covers a wide range of commonly used fields such as brands, audience, materials and sizes.

For all other product attributes that are not part of the Product info schema, we recommend using Product.attributes (custom attributes). The native Product fields like title, description and brands have a bigger impact on searchability, indexability and other properties than the custom attributes do. In other words, the backend understands the native fields much better than the custom attributes, and it takes the native fields into account when optimizing for relevance. Therefore it is highly recommended to map your product information to native fields as much as possible, and to use custom attributes only when no native field fits. For example, setting the brand in the Product.brands field has a much higher impact on search and recall than setting the same info in a custom attribute. For an attribute like “sleeve length”, which is not natively supported, you should use a custom attribute and set the searchable, indexable and facet-able flags appropriately.
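A minimal sketch of this mapping, assuming source records arrive as flat dicts: keys that match a native field are copied as-is, and everything else is moved under `attributes` as text values. `NATIVE_FIELDS` here is a deliberately small, illustrative subset of the schema, and `to_product` is a hypothetical helper; the searchable/indexable flags are left out and would still need to be configured.

```python
# Illustrative subset of native Product fields; the real schema has many more.
NATIVE_FIELDS = {"title", "description", "brands", "materials", "sizes"}


def to_product(raw):
    """Map a flat source record to native fields first, custom attributes last."""
    product = {}
    attributes = {}
    for key, value in raw.items():
        if key in NATIVE_FIELDS:
            product[key] = value
        else:
            values = value if isinstance(value, list) else [value]
            # Custom attributes carry their values as a list of text strings.
            attributes[key] = {"text": [str(v) for v in values]}
    if attributes:
        product["attributes"] = attributes
    return product


print(to_product({"title": "Shirt", "brands": ["Acme"], "sleeve_length": "long"}))
```

Here the brand lands in the native `brands` field, while `sleeve_length` becomes a custom attribute because no native field covers it.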

Importance of Brands field.

The brands field in the product info, which is searchable, indexable and facet-able by default, is a strong signal for ranking and relevance. A good percentage of search queries take the form “brand query” or “query brand”, like “adidas t-shirt”. Brand is arguably one of the most heavily used facets. Click and purchase conversion ratios are heavily affected if a product has an incorrect brand. So it is important to populate the brand field with the correct info and, if possible, never leave it blank. What is even more detrimental is to put random fillers in the brand field like “NA”, “Not available” or “Miscellaneous”. This strongly associates the product with the filler text in the brands field, which can lead to wrong product understanding and bad recall. If a particular product is genuinely not associated with any brand, it is recommended to keep the field empty, but take care that such empty-brand products remain a small fraction of the catalog.
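One way to guard against filler brands is a cleanup pass before import. The sketch below drops placeholder values and reports what fraction of products end up with no brand at all. The placeholder list is my own assumption and should be adapted to whatever fillers appear in your catalog; both helpers are hypothetical.

```python
# Assumed filler values, compared case-insensitively.
PLACEHOLDER_BRANDS = {"na", "n/a", "not available", "miscellaneous", "none", "unknown"}


def clean_brands(brands):
    """Drop blank entries and known filler values; keep real brand names."""
    cleaned = []
    for brand in brands:
        name = (brand or "").strip()
        if name and name.lower() not in PLACEHOLDER_BRANDS:
            cleaned.append(name)
    return cleaned


def empty_brand_ratio(products):
    """Fraction of products left without any brand after cleaning."""
    if not products:
        return 0.0
    empty = sum(1 for p in products if not clean_brands(p.get("brands", [])))
    return empty / len(products)


print(clean_brands(["NA", " Adidas ", "Not available"]))
print(empty_brand_ratio([{"brands": ["Adidas"]}, {"brands": ["Miscellaneous"]}]))
```

If the ratio is high, that is a sign the brand data needs fixing at the source rather than being left empty.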

Importance of Audience field.

There are two subfields in the Audience info field of the Product schema: Audience.genders and Audience.ageGroups. It is highly recommended to fill these with the appropriate data, as this helps the AI model understand the intended audience of the product. This plays a big part when personalisation is enabled: having gender and age group helps segment the products better and helps the AI model recall the right product for the right user. The Audience data is also helpful outside of personalisation, for queries like “shirts for women” or “mens socks”. With the audience info populated, product understanding is much better and the AI model can better recall the right products for gender-cued queries. Apart from this, the audience field can be used for faceting, and it is a popular facet in many product categories.
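A quick coverage check can show which products are missing audience data. This sketch assumes the plural sub-field names `genders` and `ageGroups`, as I understand the Product schema; `missing_audience` is a hypothetical helper and the records are made up.

```python
def missing_audience(products):
    """Map product id -> list of audience sub-fields that are empty or absent."""
    missing = {}
    for p in products:
        audience = p.get("audience", {})
        gaps = [f for f in ("genders", "ageGroups") if not audience.get(f)]
        if gaps:
            missing[p["id"]] = gaps
    return missing


products = [
    {"id": "p1", "audience": {"genders": ["female"], "ageGroups": ["adult"]}},
    {"id": "p2", "audience": {"genders": ["male"]}},
    {"id": "p3"},
]
print(missing_audience(products))
```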

Look out for products with duplicate titles.

The Product.title is probably the most important field, as most search queries overlap heavily with what is set as the Product.title. It is probably the first piece of information that end users see and interact with in the detail page view. So it is good practice to keep the Product.title unique and to fill it with the text most relevant to the product. Having two primary products with the same title affects the searchability and relevance of the returned results. Two separate primary products will differ in at least a few attributes, so it is recommended to keep their titles different. If the products are the same and differ only in a few aspects like color or size, it is highly recommended to structure them as PRIMARY and VARIANT types.
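Duplicate titles are easy to detect before import. The sketch below groups primary products by a normalised title (lower-cased, whitespace collapsed) and returns the groups with more than one product. The normalisation rule is my own assumption, and records without a `type` are treated as PRIMARY for the purpose of this check.

```python
from collections import defaultdict


def duplicate_titles(products):
    """Return {normalised title: [ids]} for titles shared by 2+ primary products."""
    by_title = defaultdict(list)
    for p in products:
        if p.get("type", "PRIMARY") != "PRIMARY":
            continue  # variants legitimately share their primary's title
        key = " ".join(p.get("title", "").lower().split())
        by_title[key].append(p["id"])
    return {title: ids for title, ids in by_title.items() if len(ids) > 1}


products = [
    {"id": "1", "type": "PRIMARY", "title": "Blue Denim Jacket"},
    {"id": "2", "type": "PRIMARY", "title": "blue  denim jacket"},
    {"id": "3", "type": "PRIMARY", "title": "Red Denim Jacket"},
    {"id": "4", "type": "VARIANT", "title": "Blue Denim Jacket"},
]
print(duplicate_titles(products))
```

Each group in the result is a candidate either for a more specific title or for restructuring as one primary with variants.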

Language configuration.

Retail Search supports multiple languages. More info here.
The main thing to note is that the catalog and the search query need to be in the same language. There is no cross-language translation of the query or the catalog info (more language features are on the roadmap). For example, if your catalog is in Spanish, the search query also needs to be in Spanish. So it is important to set the language code in the product info accordingly; otherwise it defaults to English (en-US). This matters for search controls like spellCorrectionSpec, where an unset language leads to unexpected behaviour and in turn hurts search performance. It is also extremely important for query intent understanding.
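A simple pre-import check for this, assuming each record carries a `languageCode` field and that a missing code falls back to en-US as described above. `language_mismatches` is a hypothetical helper and the comparison is a plain case-insensitive string match.

```python
def language_mismatches(products, catalog_language, default_language="en-US"):
    """Ids of products whose effective language differs from the catalog's."""
    expected = catalog_language.lower()
    return [
        p["id"]
        for p in products
        if p.get("languageCode", default_language).lower() != expected
    ]


products = [
    {"id": "p1", "languageCode": "es-ES"},
    {"id": "p2"},  # no code set, so it would be treated as en-US
    {"id": "p3", "languageCode": "ES-es"},
]
print(language_mismatches(products, "es-ES"))
```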

Price info settings.

The Product.priceInfo field needs to be populated as accurately as possible. It is recommended to populate all the fields of PriceInfo. The price info is used to derive discount-related signals and is used in revenue optimisation. This is particularly important for Browse queries.
For a PRIMARY-VARIANT product structure, it is recommended to populate the price of at least one of the variants.

For a product that doesn’t have product-level pricing, where all the pricing sits in the local inventory (i.e. the search is always tied to a local inventory), it is recommended to set the product-level price info to the median of all the inventory-level prices.
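The median fallback can be computed with the standard library. This sketch assumes each local inventory entry is a dict with a `priceInfo.price` value and that all entries share one currency; `product_level_price` is a hypothetical helper and the place ids and prices are made up.

```python
from statistics import median


def product_level_price(local_inventories, currency_code):
    """Build a product-level priceInfo from the median local inventory price."""
    prices = [
        inv["priceInfo"]["price"]
        for inv in local_inventories
        if "price" in inv.get("priceInfo", {})
    ]
    if not prices:
        return None  # nothing to derive a price from
    return {"currencyCode": currency_code, "price": median(prices)}


local_inventories = [
    {"placeId": "store-1", "priceInfo": {"price": 10.0}},
    {"placeId": "store-2", "priceInfo": {"price": 20.0}},
    {"placeId": "store-3", "priceInfo": {"price": 12.0}},
    {"placeId": "store-4"},  # no local price; ignored
]
print(product_level_price(local_inventories, "USD"))
```

The median is less sensitive than the mean to a single store with an unusually high or low price.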

Product uri(url) correctness.

The Product.uri field is typically the link to the product description page. It should be a publicly crawlable url that is not behind any login/auth wall. This is because the backend crawls the webpage and derives as much information as possible, which is used for relevance and popularity scoring. The backend also determines how the url was interacted with on the web (backlinks etc.). It is recommended to keep the top-level domain name the same across all product uris.

  • If you happen to have the same product listed on multiple banner sites, please consider using the multi-entity feature. Please contact the account team for more info about this.
  • If you use one url in the product catalog and a different url on the actual site, please make sure the two urls refer to the same product and carry almost identical information.
  • It is also strongly recommended not to re-use a product url once a product is deleted. Instead, have a unique url for each product.
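Some of the uri recommendations above can be checked offline. The sketch below counts the hostnames used across the catalog, lists uris shared by more than one product, and lists products with no uri. It does not test whether a url is publicly crawlable, which needs a real fetch from outside your network. `uri_report` is a hypothetical helper and the urls are made up.

```python
from collections import Counter
from urllib.parse import urlparse


def uri_report(products):
    """Summarise hostnames, reused uris and missing uris in a catalog export."""
    with_uri = [p for p in products if p.get("uri")]
    hosts = Counter(urlparse(p["uri"]).hostname for p in with_uri)
    uri_counts = Counter(p["uri"] for p in with_uri)
    return {
        "hosts": dict(hosts),
        "reused_uris": sorted(u for u, n in uri_counts.items() if n > 1),
        "missing_uri": [p["id"] for p in products if not p.get("uri")],
    }


products = [
    {"id": "1", "uri": "https://www.example.com/p/1"},
    {"id": "2", "uri": "https://www.example.com/p/1"},
    {"id": "3", "uri": "https://shop.example.org/p/3"},
    {"id": "4"},
]
print(uri_report(products))
```

More than one hostname in `hosts`, or any entry in `reused_uris`, is worth a closer look before import.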

In conclusion, it is important to understand that the AI model performance is extremely sensitive to the catalog data, and it is imperative to have accurate data in the appropriate fields.
