Evaluating how product rankings from a newly proposed search strategy differ from the legacy

Putri Wikie
Published in Bukalapak Data · Jan 3, 2020

A simple yet powerful statistical test for judging whether the product rankings from two strategies are significantly different

Background

As one of the most important funnels for the growth of an e-commerce business, the search engine receives constant attention and refinement to improve the user experience. Once a new search strategy is proposed, a natural next step is to test it on a subset of users and compare its impact to the currently applied strategy (referred to as the legacy). This experiment (also known as an A/B test) is conducted by randomly splitting sampled users into two groups, one experiencing the legacy and the other the new strategy, and then observing the difference in performance between the two strategies on the metrics of interest.

An offline assessment, however, can serve as a preliminary judgment of whether a new algorithm would bring a significant change compared to the legacy. This article shows how to run such an assessment prior to an experiment, so that the chance of testing unimpactful changes is minimized. In the context of search-strategy improvement, the impact of a new search strategy can be preliminarily assessed by comparing the product rankings produced by the two strategies. The premise is that an experiment is worth initiating only if the order of searched products differs between the legacy and the newly proposed strategy.

The proposed offline assessment

Suppose that, for a given keyword k, the ranks of N products under both the legacy and the new strategy are recorded, as illustrated in Table 1. For each product, the rank difference D between the two strategies is also computed.

Table 1. Rank comparison of products for a given keyword

Next, to quantify whether the expected value of D differs from zero, a t-test is carried out on the following hypotheses:
H0 : there is no difference in ranking between the legacy and the new strategy, i.e. E[D] = 0
H1 : there is a difference in ranking between the legacy and the new strategy, i.e. E[D] ≠ 0
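As a concrete sketch, this paired t-test is a one-liner in R with the built-in t.test function. The vectors legacy_rank and new_rank below are hypothetical toy ranks of the same products under each strategy, not data from the article's example:

```r
# Toy example: paired t-test on the rank differences for one keyword.
# legacy_rank and new_rank are hypothetical ranks of the same N products
# under the legacy and the new strategy, respectively.
legacy_rank <- c(1, 2, 3, 4, 5, 6, 7, 8)
new_rank    <- c(4, 1, 9, 2, 12, 3, 15, 6)

d <- legacy_rank - new_rank  # rank difference D per product
t.test(d, mu = 0)            # tests H0: E[D] = 0 against H1: E[D] != 0
# Equivalently: t.test(legacy_rank, new_rank, paired = TRUE)
```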

The analysis is then repeated for the other top keywords. To adjust for the false discovery rate arising from these multiple comparisons, the p-values from the tests are corrected. To see why this correction is necessary, consider the following example. Suppose 10,000 statistical tests are performed simultaneously with a critical value of 5%, and that all of the null hypotheses are actually true. Since each test has a 5% chance of a false rejection, approximately 500 of the 10,000 tests would yield false positive findings. A multiple testing correction controls the false discovery rate, i.e. the rate of finding significant results by chance. Here, the relatively less conservative Benjamini-Hochberg (BH) procedure is chosen for the correction. The BH procedure works as follows.

Figure 1. Pseudocode for the Benjamini-Hochberg correction. Note that the pseudocode is modified so that its output can be compared directly to the critical value; see the p.adjust documentation on CRAN for further details. order returns a vector whose elements are the indexes of the sorted input values (ascending if decreasing=FALSE, descending if decreasing=TRUE); pmin returns the parallel minima of the input values; cummin returns a vector whose elements are the cumulative minima of the elements of its argument.
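For reference, the procedure in Figure 1 translates to only a few lines of R. The sketch below mirrors the implementation behind p.adjust(p, method = "BH"), so its output matches what R itself produces:

```r
# Benjamini-Hochberg adjustment, following the steps in Figure 1.
# The output matches p.adjust(p, method = "BH") and can be compared
# directly to the critical value.
bh_adjust <- function(p) {
  n  <- length(p)
  i  <- n:1                          # ranks, from n (largest p-value) down to 1
  o  <- order(p, decreasing = TRUE)  # indexes of the p-values, largest first
  ro <- order(o)                     # mapping back to the original input order
  pmin(1, cummin(n / i * p[o]))[ro]  # cumulative minima keep the result monotone; cap at 1
}
```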

For a given keyword, we may conclude that the new algorithm yields a product ranking different from the legacy when the adjusted p-value is lower than a predetermined critical value.

A practical example

To better illustrate the proposed pre-experiment assessment, “shoes” was used as the keyword, and two strategies, the legacy and a new strategy, were used to rank the products. The products and their ranks under each strategy are partly depicted in Table 2.

Table 2. Rank comparison of searched-products when “shoes” was used as a keyword

Visually, the new strategy ranked the searched products differently from the legacy (Figure 2).

Figure 2. A heatmap of Table 2, visualizing the difference in ranking between the legacy and the new strategy.

A more formal test was then conducted for the difference in ranking between the two strategies. In this example, the resulting p-value is 0.01515. The test was further applied to ten selected keywords, and the p-value for each keyword was recorded. The adjusted p-values, following the procedure in Figure 1, are presented in Table 3.

Table 3. The p-values and adjusted p-values from the t-tests on the ten selected keywords
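Putting the pieces together, the per-keyword pipeline can be sketched as below. The data frame ranks and its columns keyword, legacy_rank and new_rank are a hypothetical schema with toy values, not the actual data behind Table 3:

```r
# Hypothetical toy data: one row per (keyword, product) pair, with the
# product's rank under each strategy.
ranks <- data.frame(
  keyword     = rep(c("shoes", "bag"), each = 5),
  legacy_rank = c(1, 2, 3, 4, 5,  1, 2, 3, 4, 5),
  new_rank    = c(3, 7, 1, 9, 6,  2, 1, 4, 3, 5)
)

# One paired t-test on the rank differences per keyword.
p_values <- sapply(split(ranks, ranks$keyword), function(df) {
  t.test(df$legacy_rank - df$new_rank, mu = 0)$p.value
})

# Benjamini-Hochberg correction across keywords, as in Figure 1 and Table 3.
adjusted <- p.adjust(p_values, method = "BH")
names(adjusted)[adjusted < 0.05]  # keywords whose rankings differ significantly
```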

Concluding remark

This article demonstrates the use of a simple yet powerful statistical test, a t-test with the Benjamini-Hochberg correction, to judge whether the product rankings from two strategies are significantly different. The pre-assessment helps prevent running unnecessary experiments, thereby reducing production costs. It is worth noting that this approach is not meant to evaluate the quality of a new ranking, but rather to evaluate whether the rankings produced by the two settings differ significantly. The approach can also be applied to other ranking-refinement problems, such as recommendation systems.
