How to evaluate modifications of a technological product with the help of data

Why might we want to evaluate the modifications made to a product? To make sure they’re good and not bad. But what do we mean by good and bad here? To answer that, we establish a baseline: this is what we’re going to compare the modifications to. The baseline will likely be the same product, but before the modifications. By comparing the data of the two variants, we can assess the modifications we have made. I’d like to talk about this in a little more detail.

Nikolai Neustroev
inDrive.Tech
15 min read · May 16, 2022

--

Introduction

What’s good and what’s bad from the company’s perspective? First of all, it’s necessary to understand that for yourself and devise some business metrics: user coverage, DAU, MAU, profits and so on.

What’s good and what’s bad from the point of view of machine learning models? To get a sense of this, you need to set a target, and choose some machine learning metrics and loss functions.

What’s good and what’s bad from the point of view of the service as a technological package? On top of the business metrics and the ML metrics, you also need to draw up some infrastructural metrics: RPS, memory usage etc.

We have drawn up all these metrics, run our experiments and obtained some figures. What we now need to work out is why we’re doing all this and what’s going to happen next. There’s a little magic trick to help us with this, known by the acronym OBAMA. It stands for Objective (the business goals), Baseline (our reference point for comparison), Metrics (what we’re measuring) and Action (what we’re going to do next).

Experiments and quasi-experiments

Now for the most important point: how we’re going to evaluate all this. One option is a true experiment: we deliberately exert, or record, an external impact on the process that interests us.

By conducting experiments, we achieve a high degree of control over internal and external factors. The researcher decides when the measurements are taken, the composition of the observation units, and the order in which the independent variables are applied. They also randomly distribute observation units and independent factors, presenting them to groups of respondents.

The presence of a control group is mandatory in an experiment. This is a group of people who are not exposed to the effect that we plan to measure.

To conduct experiments, serious infrastructure is required. The ability to split users into at least two groups must be built into the product. The tech giants can afford to do this, but startups often can’t.

For these situations, there are quasi-experiments. They, too, measure impact: we exert, or record, an external impact on the process that interests us.

What’s different about quasi-experiments is the low level of control involved. If you cannot determine when the measurements are taken, the composition of the observation units, which persons are exposed to the influence, or the order in which the independent variables are applied, you will not be able to run a true experiment, and you should look into quasi-experiments. In these studies, there is a control group, but it is not equivalent to the treatment group.

Quasi-experiments are a good alternative in cases where it’s difficult to conduct a true experiment. The historical data that has already been accumulated is often sufficient.

A/B-test

Let’s say a few things about experimental methods. The first of them is the A/B-test. In an A/B-test, each group in the experiment sees its own variant. The groups are independent and are assigned at random. The groups must not be formed on the basis of a shared trait: date of registration, gender, profession, etc.

The groups must all be taken from the same population. This is necessary so that the results of the experiment can be extrapolated across the whole target audience. The groups don’t all have to be the same size. The audience can be divided in different proportions, though this could cause difficulties in interpretation: it might widen the confidence intervals, for instance.

If you wish to conduct an A/B-test, first define the objective, and decide whether an A/B-test suits it, rather than customer development interviews, a focus group or something else. Think about whether you want an answer to the question of quality (have things gotten better or worse) or the question of quantity (things are better or worse by X percent), how change B ought to affect the strategic goals, and whether change B contravenes the company’s values.

If everything’s OK, you can move on to the second stage — defining the metrics. First, define the levels of metrics and the indicators at which the measurements are aimed. After that, develop your hypothesis. Start with the null hypothesis: the assumption that your change has not had an impact on anything.

The null hypothesis states that the distributions of the two samples do not differ. It can be rejected if the data shows a difference between the samples at the chosen level of significance. If you have recorded such a difference, you can reject the null hypothesis.

For instance, you are working at a company that helps car drivers earn money. You have built a calculator that estimates the price of a ride, and you assume it is going to increase the number of rides. The null hypothesis will be that there is no statistically significant difference in the number of rides between the audience that sees the new calculator and the audience that sees the old one.

We can reject this null hypothesis if the probability of observing such a difference under it, the p-value, falls below the permissible level of type I error. Usually, if the p-value is less than 5%, you can reject the null hypothesis and take it as read that your changes are having an impact on the product.

In the next step, define the type of distribution and the statistical test. For example, if your metric follows a binomial distribution (each observation takes the value of either zero or one), you could use a Bayesian approach, for instance. Examples of such metrics are conversion and retention.
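
A minimal sketch of such a Bayesian comparison of two conversion rates (the counts and variable names below are made up for illustration): we sample from the Beta posteriors of each group and estimate the probability that variant B converts better than A.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical counts: visitors and conversions in each group
visitors_a, conversions_a = 10_000, 510
visitors_b, conversions_b = 10_000, 562

# Beta(1, 1) prior + binomial likelihood gives a Beta posterior for each rate
posterior_a = rng.beta(1 + conversions_a, 1 + visitors_a - conversions_a, size=100_000)
posterior_b = rng.beta(1 + conversions_b, 1 + visitors_b - conversions_b, size=100_000)

# Probability that B's true conversion rate exceeds A's
print("P(B > A) =", (posterior_b > posterior_a).mean())
```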

Your distribution may be close to normal. Normality can be checked with the help of a Q-Q plot; in that case, apply tests such as Student’s t-test or the Mann-Whitney U-test.
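
As a hedged sketch, assuming per-user metric values that look roughly normal, a Welch’s t-test with SciPy might look like this (the numbers are synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-user metric values for control (A) and treatment (B)
group_a = rng.normal(loc=100.0, scale=15.0, size=2_000)
group_b = rng.normal(loc=102.0, scale=15.0, size=2_000)

# A Q-Q plot for a visual normality check could be drawn with
# stats.probplot(group_a, dist="norm", plot=plt) given matplotlib.

# Welch's t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
```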

A skewed distribution is the most common type, where your values pile up on one side. This might be revenue, number of purchases, or prices, for instance. It is worth applying the Mann-Whitney U-test or the bootstrapping method here.
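
For a skewed metric such as revenue, a sketch of the Mann-Whitney U-test plus a bootstrap confidence interval for the difference in means (synthetic log-normal data, made-up parameters) could look like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic skewed revenue-like data for the two groups
revenue_a = rng.lognormal(mean=3.0, sigma=1.0, size=2_000)
revenue_b = rng.lognormal(mean=3.05, sigma=1.0, size=2_000)

# Non-parametric rank test
_, p_value = stats.mannwhitneyu(revenue_a, revenue_b, alternative="two-sided")
print(f"Mann-Whitney p-value = {p_value:.4f}")

# Bootstrap confidence interval for the difference in means
n_boot = 5_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    sample_a = rng.choice(revenue_a, size=revenue_a.size, replace=True)
    sample_b = rng.choice(revenue_b, size=revenue_b.size, replace=True)
    diffs[i] = sample_b.mean() - sample_a.mean()

low, high = np.percentile(diffs, [2.5, 97.5])
print(f"95% bootstrap CI for mean(B) - mean(A): [{low:.2f}, {high:.2f}]")
```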

Once you have got to grips with the distributions and tests, define the level of statistical significance, i.e. the level of risk you accept of making a type I error. These errors are also known as false positives: you record a positive result, but it is a random one and was not caused by any impact.

The ability to detect a genuine difference in the metric is known as power. The higher the power, the fewer good experiments will be erroneously overlooked. Power is one minus beta; for beta, 20% (0.2) is usually taken, giving a power of 80%.

It is also important to decide whether this is going to be a one-tailed test or a two-tailed test. A one-tailed test detects changes in one direction only, either an improvement or a deterioration. A two-tailed test detects both positive and negative changes.

Next, define the sample size. To do this, you need to know the type of metric, the significance level, the power, and the effect size you wish to detect. When you know these, you can work out how many participants are required for the experiment.
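
As an illustration (the baseline conversion and hoped-for lift below are made up), the group size for a two-sided test at a 5% significance level and 80% power can be estimated with statsmodels:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical baseline conversion of 10% and a hoped-for lift to 11%
effect_size = proportion_effectsize(0.10, 0.11)

analysis = NormalIndPower()
n_per_group = analysis.solve_power(
    effect_size=effect_size,
    alpha=0.05,          # permissible type I error
    power=0.80,          # 1 - beta
    ratio=1.0,           # equal group sizes
    alternative="two-sided",
)
print(f"Required users per group: {n_per_group:.0f}")
```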

You can then set about dividing up the users. Split the respondents into random, equally sized groups. Next, collect data on the metrics that interest you prior to the start of the experiment. This step is optional; it is needed, for instance, to speed up an experiment with the CUPED method or to evaluate the splitting system.
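
One common way to split users deterministically is to hash a user identifier together with an experiment name. This is only a sketch; the identifiers and experiment name here are hypothetical.

```python
import hashlib

def assign_group(user_id: str, experiment: str, groups=("control", "treatment")) -> str:
    """Deterministically assign a user to a group based on a hash of id + experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(groups)
    return groups[bucket]

print(assign_group("user_12345", "new_price_calculator"))
```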

Finally, launch the experiment, and no peeking while it’s running. While the experiment is ongoing, it is important to check that the samples remain roughly the same size. If the sample sizes diverge noticeably, the experiment has failed: you will need to stop the A/B-test and run an A/A-test.
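
A simple way to monitor this is a chi-squared check of the observed group counts against the intended 50/50 split, often called a sample ratio mismatch check (the counts below are made up):

```python
from scipy.stats import chisquare

observed = [50_112, 49_888]          # users actually seen in A and B
expected = [sum(observed) / 2] * 2   # intended 50/50 split

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:
    print("Possible sample ratio mismatch, investigate before trusting results")
else:
    print("Split looks consistent with 50/50")
```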

At the end of the experiment, analyze the data thoroughly and accept or reject the null hypothesis. It is important to do so with the help of the test you selected in advance. Please don’t pick the test to fit the result after the fact; define it beforehand.

A/A-test

The next method is the A/A-test. This test is used to check the splitting system or to select homogeneous groups. The A/A-test is a variation on the A/B-test, and its distinctive feature is clear from its name: while the A/B-test compares different variants, the A/A-test juxtaposes the original against itself. The main goal of an A/A-test is to show whether or not we can trust the results of an experiment launched under the same conditions but with different variants.

If no winner was identified during the A/A-test, the A/B-test can be initiated. Otherwise, the settings for the service and the uniformity of the sample will need to be checked. The A/A-test provides control data for inspecting the accuracy of the A/B-test.
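
One way to check the splitting system is to simulate many A/A comparisons on the same population and verify that "significant" differences appear at roughly the chosen alpha rate. A sketch on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
population = rng.lognormal(mean=3.0, sigma=1.0, size=20_000)

alpha, n_runs = 0.05, 1_000
false_positives = 0
for _ in range(n_runs):
    shuffled = rng.permutation(population)
    a, b = shuffled[:10_000], shuffled[10_000:]
    _, p_value = stats.ttest_ind(a, b, equal_var=False)
    false_positives += p_value < alpha

# With a healthy split, this should be close to alpha (about 5%)
print(f"False positive rate: {false_positives / n_runs:.3f}")
```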

A/B/C/n-test

In this test we have more than two variants. What is the problem here? Experiments like this are beset by the multiple comparisons problem. Every statistical hypothesis test carries a built-in chance of a type I error, i.e. of rejecting a true null hypothesis. The more hypotheses we test on the same set of data, the higher the likelihood that we’ll make at least one such error. This phenomenon is known as the multiple comparisons effect.

For example, you run a Student’s t-test and test the null hypothesis that there is no difference between the population means of the two groups being compared. If we compare groups A and B, we risk making an error with a probability of 5%. But the same probability of making an error exists when comparing B with C and A with C. Accordingly, the probability of making an error in at least one of the three comparisons is over 14%, which is much higher than 5%. Any further increase in the number of hypotheses tested will inevitably be accompanied by an increase in the number of type I errors.
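
To see the arithmetic and a standard remedy: with three comparisons at alpha = 0.05 the family-wise error rate is 1 - 0.95^3 ≈ 14.3%, and the p-values can be corrected, for instance with the Bonferroni method in statsmodels (the p-values below are made up):

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from comparing A vs B, B vs C, and A vs C
raw_p_values = [0.04, 0.20, 0.03]

reject, corrected, _, _ = multipletests(raw_p_values, alpha=0.05, method="bonferroni")
print("Corrected p-values:", corrected)
print("Reject null hypotheses:", reject)
```

Note that the two comparisons that looked "significant" on their own no longer clear the bar once the correction is applied.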

Team-Draft Interleaving

The next type of experiment is Team-Draft Interleaving. The idea of interleaving is not to break the sample down into two groups and rack your brain over all the conditions required for representativeness, but to use the same sample of people and show them a mixture of answers from the two systems. This is relevant for recommendations, ranking and advertising. Crucially, we must remember which system produced each item the user selected.

This is where the difference lies. Whereas in the A/B-test we divide the audience into two groups and show them different variants, in interleaving we combine two variants on one page and infer the difference from what users click.
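
A minimal sketch of the team-draft idea (a simplified version: the two rankings take turns drafting their best not-yet-used result, with a coin flip deciding who goes first in each round):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, length=10):
    """Simplified team-draft interleaving: rankers A and B take turns drafting
    their highest not-yet-used result; each shown item remembers its team."""
    interleaved = []          # list of (item, team) pairs shown to the user
    used = set()
    picks_a = picks_b = 0
    while len(interleaved) < length:
        # The team with fewer picks drafts next; a coin flip breaks ties
        a_drafts = picks_a < picks_b or (picks_a == picks_b and random.random() < 0.5)
        source = ranking_a if a_drafts else ranking_b
        candidate = next((item for item in source if item not in used), None)
        if candidate is None:
            break             # this ranker has nothing new to contribute
        used.add(candidate)
        interleaved.append((candidate, "A" if a_drafts else "B"))
        if a_drafts:
            picks_a += 1
        else:
            picks_b += 1
    return interleaved

# Clicks on items credited to team A vs team B indicate which ranker users prefer.
print(team_draft_interleave(["d1", "d2", "d3", "d4"], ["d3", "d1", "d5", "d6"]))
```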

Difference-in-differences (diff-in-diff)

Let’s talk about the second group of methods for evaluating changes: quasi-experimental methods. To start with, I’m going to describe difference-in-differences (diff-in-diff). This is a quasi-experimental method that compares the change in results over time between the population included in the program, the impact group, and a population that is not participating.

The strong point of this method is its intuitive interpretation. If its assumptions hold on the historical data, you can obtain an estimate of a causal effect. You can use data both at the individual level and at the group level. The groups being compared can start at different result levels, because diff-in-diff focuses not on the absolute levels, but on the changes in them; not on the figures themselves, but on the trends.

It does have its limitations, however. The method can’t be used if the very fact of intervention depends on the group: you can’t, for instance, divide the groups up simply by whether or not an action was taken. It’s impossible to use if the groups you are comparing follow different pre-intervention trends. And it can’t be used if the make-up of the groups is unstable.

You can see in this graph where the essence of the method lies. We find the trends in the two groups and collect data on the trends that existed prior to the intervention. The intervention is the blue vertical line down the middle. After the intervention, if the effect exists, we record the change in the trend in the group affected; in the control group, nothing should change. Thus, we can record the very fact of a change and estimate its size.
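
The standard way to estimate this is a regression with an interaction between a treatment-group indicator and a post-intervention indicator; the coefficient on the interaction is the diff-in-diff effect. A sketch on synthetic data (all numbers are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 4_000

# Synthetic panel: 'treated' marks the impact group, 'post' the period after intervention
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "post": rng.integers(0, 2, n),
})
true_effect = 2.0
df["outcome"] = (
    10
    + 1.5 * df["treated"]                       # pre-existing level difference
    + 0.8 * df["post"]                          # common time trend
    + true_effect * df["treated"] * df["post"]  # the causal effect we want to recover
    + rng.normal(0, 1, n)
)

model = smf.ols("outcome ~ treated * post", data=df).fit()
print(model.params["treated:post"])  # should be close to 2.0
```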

Regression Discontinuity Design

The idea at the basis of regression discontinuity design is to compare individuals located in the immediate vicinity of a certain threshold that defines eligibility for a particular program or exposure to an influence.

Let’s take, for example, drivers whose rating drops below a threshold, at which point the system bans them. In such a case, the individuals just above and just below the threshold have similar characteristics, except for the fact of the ban itself. In other words, conclusions drawn on a sample of individuals located immediately above or below the known threshold may be just as reliable as those from a randomized experiment.

Let’s look at the graph. On the Y axis we see the driver’s income, and on the X axis, their rating. The vertical line is the threshold: when the rating falls below it, a ban is imposed. We can compare, on average, how much income a driver loses when they are banned. Another convenient thing is that we can not only work out the difference, but also model, with the help of the regression, what would happen if the threshold were set at a different value.
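
As a rough sketch of the idea (not the author’s exact setup; all numbers are synthetic), one can fit separate linear regressions just below and just above the threshold and compare their predictions at the cutoff:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)

rating = rng.uniform(3.0, 5.0, 5_000)          # running variable
threshold = 4.0                                # below this, a ban is imposed
banned = rating < threshold
# Synthetic income: smooth in rating, with a drop for banned drivers
income = 200 + 50 * rating - 80 * banned + rng.normal(0, 20, rating.size)

def fit_at_threshold(mask):
    """Local linear fit on one side of the cutoff, evaluated at the cutoff."""
    x = sm.add_constant(rating[mask] - threshold)
    model = sm.OLS(income[mask], x).fit()
    return model.params[0]   # intercept = predicted income exactly at the threshold

window = 0.3  # only use observations close to the cutoff
below = banned & (rating > threshold - window)
above = ~banned & (rating < threshold + window)
effect = fit_at_threshold(above) - fit_at_threshold(below)
print(f"Estimated income lost due to the ban: {effect:.1f}")
```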

Instrumental variables

Let’s say you’re interested in how X influences Y. But you know that Y is also influenced by a certain factor, E. How can you find out the exact impact that X is having? One solution is to find an instrumental variable, Z: a variable that you are sure X depends on, and that you are sure has no connection with E. Unfortunately, you cannot simply replace the regressor that is causing the problem with the instrumental variable.

Let’s say we have two interconnected variables: education and salary level. We would like to explore whether education leads to a higher salary, whether X leads to Y. Intuitively, it seems it should.

Now for the problem associated with the theory. Does X really lead to Y? Yes, getting an education leads to a salary. But what if the people who strive to go through higher education are also going to get a higher salary because they are the more energetic, ambitious and driven part of the population? That poses a problem, because it’s not just X that leads to Y; something else does, too. That “something else” is currently swallowed up by the error term, because ambition is something we cannot measure. This violates a core assumption of linear regression, that the regressors are uncorrelated with the error. Situations like this are known as endogeneity.

We want to use the dependency of Y on X. But it soon becomes clear that education and salary are also dependent on some other factor, E — ambition, striving, all manner of qualities that can’t be measured.

What are we to do, if we can’t represent ambition in numerical form? Use something else that’s measurable, that correlates with education, but has nothing in common with ambition. This is called an instrumental variable. It cannot correlate with the error, with ambition, but really does correlate with education, with X.

Smoking at an early age is an excellent instrument, in our example. Because smoking at an early age and the number of years students spend at school are interconnected. On the other hand, taking up smoking early and ambition are not connected, because a lot of successful people had difficult childhoods, during which they smoked.

Thus, using instrumental variables, we can rule out the impact of an unmeasurable factor on the metric that interests us.
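
The usual estimator behind this is two-stage least squares: first regress education on the instrument, then regress salary on the predicted education. A sketch on synthetic data follows; note that the naive second-stage standard errors here are not valid, and dedicated IV routines correct them.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 10_000

ambition = rng.normal(size=n)     # unobserved confounder E
instrument = rng.normal(size=n)   # Z: correlated with education, not with ambition
education = 1.0 * instrument + 0.8 * ambition + rng.normal(size=n)
true_effect = 2.0
salary = true_effect * education + 1.5 * ambition + rng.normal(size=n)

# Naive OLS is biased because ambition drives both education and salary
print("OLS: ", sm.OLS(salary, sm.add_constant(education)).fit().params[1])

# Stage 1: predict education from the instrument
stage1 = sm.OLS(education, sm.add_constant(instrument)).fit()
education_hat = stage1.fittedvalues

# Stage 2: regress salary on the predicted education
stage2 = sm.OLS(salary, sm.add_constant(education_hat)).fit()
print("2SLS:", stage2.params[1])   # should be close to 2.0
```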

Propensity Score Matching (PSM)

The next method for conducting quasi-experiments is Propensity Score Matching (PSM). This is a method in which the researcher uses statistical techniques to create an artificial control group: each treated unit is matched with an untreated one, i.e. each participant on whom an impact was exerted is paired with a similar participant from the control group.

PSM estimates the probability of an observation being included in the treatment based on its observed characteristics. This is called the propensity score. PSM then matches treated units with untreated units on the basis of this propensity score. The method rests on the supposition that, conditional on the observed characteristics, the untreated units can be compared with the treated ones as though the impact had been fully randomized.

It can be seen in the image that certain ducks have been impacted. Regrettably, though, we were not able to divide the participants into groups randomly: the treatment group was known to us in advance, but there is no ready-made control group.

In such cases, for each participant in the experiment we can select someone who did not participate but is similar to them in terms of their characteristics. By pairing the participants in the experiment with people who did not take part, we can assemble an artificial control sample and use it to interpret the results of the impact.
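
A minimal sketch of this matching with scikit-learn, on synthetic data: a logistic regression estimates the propensity score, and each treated unit is matched to its nearest untreated neighbor by that score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(21)
n = 5_000

# Observed covariates and a treatment that depends on them (no randomization)
covariates = rng.normal(size=(n, 3))
treat_prob = 1 / (1 + np.exp(-(covariates @ np.array([0.8, -0.5, 0.3]))))
treated = rng.random(n) < treat_prob
outcome = covariates @ np.array([1.0, 2.0, -1.0]) + 1.5 * treated + rng.normal(size=n)

# Propensity score: probability of being treated given the covariates
propensity = LogisticRegression().fit(covariates, treated).predict_proba(covariates)[:, 1]

# Match each treated unit to the nearest untreated unit by propensity score
untreated_scores = propensity[~treated].reshape(-1, 1)
nn = NearestNeighbors(n_neighbors=1).fit(untreated_scores)
_, idx = nn.kneighbors(propensity[treated].reshape(-1, 1))
matched_control_outcomes = outcome[~treated][idx.ravel()]

att = outcome[treated].mean() - matched_control_outcomes.mean()
print(f"Estimated treatment effect on the treated: {att:.2f}")  # should be near 1.5
```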

Synthetic control method

In the synthetic control method, the impact occurs at an aggregated level, for example, a city, country or region. Also, there is only one impacted unit and several control units.

There are three benefits to using a synthetic control. The participants in the control sample enter it with weights between zero and one. Explicit selection criteria are imposed on the synthetic control, and the relative importance of each donor in the control sample is made clear. And the selection of the synthetic control does not depend on the post-intervention results, meaning it is impossible to choose a study design that predetermines the conclusions.

The classic example is the passing of a law restricting smoking in California. What’s the problem here? There is only one California, so by definition we can’t run an A/B-test by looking at a bunch of other Californias.

We therefore select states which, on the basis of various social, economic, cultural, and climatic parameters, are similar to California. Let’s say we’re interested in the number of cigarette packets sold. We work out the number of cigarettes sold in these control states, whether any restrictions on smoking were introduced, and how the restrictions affected cigarette sales.

We take each control state’s sales, multiply them by its weight and add them up. In this way, we can create a synthetic California. The dotted vertical line is the moment when the restrictions were brought in. Prior to that moment, our synthetic California looks like the real California. After the impact has taken effect, we can see that fewer cigarettes are sold in the real California than in the hypothetical, synthetic California.
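
The weights are typically chosen so that the weighted combination of donor units tracks the treated unit before the intervention, with the weights constrained to be non-negative and to sum to one. A sketch with SciPy’s SLSQP optimizer on synthetic data (the donor paths and true weights are made up):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(33)
n_pre, n_donors = 20, 5

# Pre-intervention outcome paths: several donor units and one treated unit
donors_pre = rng.normal(size=(n_pre, n_donors)).cumsum(axis=0) + 100
true_weights = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
treated_pre = donors_pre @ true_weights + rng.normal(0, 0.1, n_pre)

def loss(w):
    # How badly the weighted donors miss the treated unit before the intervention
    return np.sum((treated_pre - donors_pre @ w) ** 2)

constraints = {"type": "eq", "fun": lambda w: w.sum() - 1.0}  # weights sum to one
bounds = [(0.0, 1.0)] * n_donors                              # each weight in [0, 1]
result = minimize(loss, x0=np.full(n_donors, 1 / n_donors),
                  bounds=bounds, constraints=constraints, method="SLSQP")

weights = result.x
print("Donor weights:", np.round(weights, 3))
# After the intervention, the synthetic path is donors_post @ weights,
# and the effect is the gap between the real treated unit and this synthetic one.
```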

P.S.

I can recommend several helpful books. On experiments, there’s the book “Trustworthy Online Controlled Experiments” by Kohavi, Tang & Xu. This is a book for practitioners; it does not take an in-depth look at the theory. It sets out, in an accessible way, how to conduct experiments correctly.

With regard to quasi-experimental methods, I recommend the book “Causal Inference: The Mixtape” by Scott Cunningham. This book, too, is intended for practitioners and does not take a deep dive into complex examples. Incidentally, this second book is freely available; you can go to the website and browse through it.
