A/B testing is a technique in which two versions of a product are created, each distributed to a different part of the audience; the metrics of the two versions are then compared, and the better-performing one is chosen.
A classic example is YouTube thumbnails, where we can test how click-through rate or retention changes depending on the thumbnail shown.
This is not as simple as it looks, since several biases can creep in:
- Sampling bias: the version B users may already be more engaged with the content creator, making version B look better than it is. To avoid this, we must assign versions to users in a completely random manner.
- Undercoverage bias: you need many samples for the result to be statistically significant. With only a few samples, the observed difference may be pure chance.
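To make the random assignment concrete, here is a minimal sketch of one common approach: hashing a user id together with an experiment name gives a split that is stable across sessions yet effectively random with respect to user traits. The function and experiment names are illustrative, not from the original text.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "thumbnail-test") -> str:
    """Deterministically assign a user to group A or B.

    Including the experiment name in the hash means the same user can
    land in different groups across different experiments, while staying
    in the same group every time they return to this one.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"
```

Because the assignment depends only on the hash, it avoids the sampling bias above: no user trait (such as engagement) can correlate with which version they see.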
A/B testing is also used in machine learning. Say we have a model A that recommends products to users. Before replacing model A with a new model B, we run an A/B test to make sure that B actually performs better in the wild.
Note
A/B testing is the most popular form, but we can also run A/B/C tests or beyond. That said, testing only two variants at a time is usually more effective.
Quantify the results
With A/B testing we try the feature on a variety of users, which gives us a quantifiable metric to understand whether the feature improves the user experience. Ideally everything would be A/B tested, but this isn't always possible because of time and cost constraints.
Once we obtain the results, we need to make sure they are statistically significant, using hypothesis testing techniques.
AB Testing in Practice
Hypothesis
Define what the change is and what, in theory, it should affect. For example: “removing YouTube dislikes could increase the number of videos published”.
Splitting
Split the user base into two groups: the control group (users who see the UI with no changes) and the experiment group (users who get the new experimental feature). It's important to define the target of the experiment in order to split the groups correctly.
When splitting the user base, we should follow some rules to avoid sampling biases.
For example:
- Target currently active users; avoid users who are no longer active, since they may not use the product at all;
- It is usually a good idea to select users from diverse regions of the world, since we want to see metrics for all those groups of people;
- Take users from all types of segments: new vs returning users, engaged vs cold-start users, etc. This helps ensure the chosen sample is representative of the real product user base.
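One way to honor the last rule is stratified splitting: divide users by segment first, then split 50/50 within each segment, so both groups mirror the real user base. A minimal sketch, assuming users are dicts and the segment field name (e.g. "region") is passed in; all names here are illustrative.

```python
import random
from collections import defaultdict

def stratified_split(users, strata_key, seed=42):
    """Split users into control/experiment while preserving segment ratios.

    `users` is a list of dicts; `strata_key` names the segment field
    (e.g. "region" or "user_type"). Within each stratum, users are
    shuffled with a seeded RNG and split 50/50, so every segment is
    equally represented in both groups.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for u in users:
        strata[u[strata_key]].append(u)
    control, experiment = [], []
    for group in strata.values():
        rng.shuffle(group)
        mid = len(group) // 2
        control.extend(group[:mid])
        experiment.extend(group[mid:])
    return control, experiment
```

The seeded RNG makes the split reproducible, which is handy when re-running an analysis.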
A common question: if a feature is aimed mainly at one target group, should we sample users only from that group, or also include users outside the target?
todo: answer this
Metrics
Define which metrics will give the most useful insights from the experiment. In this example we could track the number of videos posted, the ratio between likes and dislikes, etc. Also define a guardrail metric: a metric that must not degrade too much even while the others improve.
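A guardrail check can be stated as a simple ship/no-ship rule. The sketch below is a hypothetical decision helper, not part of the original text; the 2% threshold and metric names are illustrative assumptions.

```python
def evaluate_experiment(primary_lift: float, guardrails: dict, max_drop: float = 0.02):
    """Decide whether to ship based on a primary metric and guardrails.

    `primary_lift` is the relative change in the main metric (e.g. videos
    posted); `guardrails` maps metric names to their relative change.
    Ship only if the primary metric improved and no guardrail metric
    dropped by more than `max_drop` (2% by default, an arbitrary choice).
    Returns (should_ship, list_of_violated_guardrails).
    """
    violated = [name for name, lift in guardrails.items() if lift < -max_drop]
    return primary_lift > 0 and not violated, violated
```

For example, a 5% lift in videos posted would not be shipped if watch time dropped 10% at the same time.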
Trade-offs
Every experiment has trade-offs. Identify potential pitfalls of the proposed feature that may not be evident from data alone. For example, hiding dislikes may shift the overall sentiment on the platform, which isn't really quantifiable with metrics and can therefore be missed during the experiment.
Impact
How the result of the experiment can be useful to the team and to the future of the product.
Statistical Significance
For the experiment to be considered valid, we need a method to confidently say that repeating it would yield the same results. This is what statistical significance tells us, and it can be assessed using hypothesis testing.
In the case of A/B testing, the control group's metrics define the null hypothesis, while the experiment group's metrics define the alternative hypothesis. The p-value is the probability of observing data at least as extreme as ours if the null hypothesis is true; it is computed with different methods depending on assumptions about the data (distribution, variance, sample size, etc.). If the p-value falls below a chosen threshold (usually 0.05), we reject the null hypothesis and call the result statistically significant.
Note that statistical significance doesn’t guarantee the same result will occur in future repetitions, but it does suggest the observed effect is unlikely to be a fluke.
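For conversion-style metrics (e.g. click-through rate), one standard method is the two-proportion z-test. A minimal sketch using only the standard library, assuming large samples so the normal approximation holds:

```python
from math import sqrt, erf

def two_proportion_p_value(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    Returns the p-value: the probability of seeing a difference at least
    this extreme if the null hypothesis (equal rates) is true. Valid for
    large samples, where the normal approximation applies.
    """
    p_a, p_b = success_a / n_a, success_b / n_b
    # pooled rate under the null hypothesis that both groups convert equally
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF (via erf)
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
```

With 100/1000 conversions in control versus 150/1000 in the experiment group, the p-value comes out well below 0.05, so we would reject the null hypothesis; with 100 versus 102 it does not.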
statistics data-science
Source: