Author’s note: This blog series follows the logical arguments made about false positive rates in published scientific research and adapts them for online AB testing (Online Controlled Experiments). Dr Ioannidis’ 2005 article makes these points far more elegantly, so if you find this blog post interesting, please do check out the original article here.

What is AB Testing?

User-facing applications, such as websites, are constantly monitored, refined, and improved.

One of the ways to improve your website is automated AB testing - that is, comparing two variants (A and B) of the same application by randomly assigning live users to one of the two variants and gathering metrics of interest. Often, the metric is “conversion” - whether or not a user performs a behavior of interest, such as clicking on an ad or purchasing something.

Typically, the “better” variant is chosen based on a statistical hypothesis test, summarised by a p-value. If the p-value falls below some predetermined threshold, we conclude that the new version is probably better than the old, and the “better” version is deployed.
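To make that concrete, here is a minimal sketch of the decision rule in Python, using a two-proportion z-test. The traffic and conversion numbers are invented purely for illustration, and real experimentation platforms wrap this logic in their own tooling.

```python
# A minimal sketch of the p-value decision rule described above,
# using a two-proportion z-test. All counts are made-up numbers.
import numpy as np
from scipy.stats import norm

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for H0: rate_A == rate_B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled conversion rate under H0
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))                     # two-sided
    return z, p_value

# Hypothetical results: variant A converted 480/10,000 users, variant B 550/10,000.
z, p = two_proportion_z_test(480, 10_000, 550, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
if p < 0.05:                                          # predetermined threshold (alpha)
    print("Deploy variant B (statistically significant uplift).")
else:
    print("Keep variant A (no significant difference detected).")
```

The 0.05 threshold here is conventional rather than magical - and, as the rest of this post argues, it is exactly where things can go wrong.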

Is it Pseudoscience?

Short answer: Not yet. AB testing, or online experimentation, is a real and very effective form of science.

When it works, AB testing is so effective as to seem almost magical - the companies that harnessed it a decade ago have experienced unbelievable benefits.

But it can go wrong.

Sources of Bias in Online Experiments

AB tests have a high rate of nonreplication - that is, a later AB test of the same variables may find a different, or even opposite, effect.

This is a problem, because it means that one of those results is wrong. Either we incorrectly rejected a good variant of our website, or worse, we deployed something that really wasn’t helpful to our users at all.

Bias is any combination of factors in the design, running, or analysis of an experiment that produces untrue results more frequently than chance alone would allow. It’s a systematic error in findings, usually arising from some unobserved source of heterogeneity.

The problem with AB testing, when it’s based on p-values, is that the chance of getting a positive result (hooray!) is directly, mathematically, related to the chance of getting a false positive (oh no!).

That is, at a fixed sample size, as you make the test more powerful (more likely to find a true signal), you also make it more susceptible to bias (more likely to return a false positive).
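A quick simulation illustrates this. The sketch below (with made-up baseline rates, uplift, and sample sizes) runs the same z-test many times at a fixed sample size: relaxing the p-value threshold detects the real uplift more often, but it inflates the false positive rate by the same mechanism.

```python
# Simulated AB tests at a fixed sample size: raising the p-value threshold
# increases power and the false positive rate together. All rates and
# sample sizes are invented for illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def simulate_positive_rate(rate_a, rate_b, n_per_arm, alpha, n_sims=2_000):
    """Fraction of simulated AB tests that declare 'B wins' at the given alpha."""
    positives = 0
    for _ in range(n_sims):
        conv_a = rng.binomial(n_per_arm, rate_a)
        conv_b = rng.binomial(n_per_arm, rate_b)
        p_pool = (conv_a + conv_b) / (2 * n_per_arm)
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / n_per_arm)
        if se == 0:
            continue
        z = (conv_b / n_per_arm - conv_a / n_per_arm) / se
        if 2 * norm.sf(abs(z)) < alpha:
            positives += 1
    return positives / n_sims

for alpha in (0.01, 0.05, 0.20):
    power = simulate_positive_rate(0.050, 0.055, 5_000, alpha)   # real 10% relative uplift
    fpr = simulate_positive_rate(0.050, 0.050, 5_000, alpha)     # no real difference
    print(f"alpha={alpha:.2f}  power~{power:.2f}  false positive rate~{fpr:.2f}")
```

The false positive rate in the second column is, by construction, just the threshold itself: any change that lets more true effects through also lets more noise through.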

What Not to Do About It

Running an online AB test, or experiment, often requires investigators to provide evidence that the experiment will be productive. Online experiments are expensive: if it’s possible to gather the needed result with fewer samples, or in a shorter time, then we should do so.

The need to make online experiments more efficient has resulted in many pieces of bad advice floating around the Internet. Most of this advice boils down to “increase your risk of a false positive” and “outright engage in p-hacking” to reduce the required sample size of your experiment.

When followed to its logical conclusion, this approach is likely to leave you with a collection of findings that are nothing more than random false positives.
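One of the most common pieces of that bad advice is to peek at the running test and stop as soon as the p-value dips below 0.05. The sketch below uses invented traffic numbers and an identical conversion rate for both variants, so every declared “win” is a false positive.

```python
# A minimal simulation of "peek at the test and stop as soon as p < 0.05".
# Both variants convert at the same rate, so any declared winner is a false
# positive. Traffic numbers and checkpoints are invented for illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
true_rate = 0.05                              # both variants convert at 5%
checkpoints = range(1_000, 20_001, 1_000)     # peek after every 1,000 users per arm

def peeking_test_declares_winner():
    conv_a = conv_b = n = 0
    for checkpoint in checkpoints:
        step = checkpoint - n
        conv_a += rng.binomial(step, true_rate)
        conv_b += rng.binomial(step, true_rate)
        n = checkpoint
        p_pool = (conv_a + conv_b) / (2 * n)
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
        z = (conv_b / n - conv_a / n) / se if se > 0 else 0.0
        if 2 * norm.sf(abs(z)) < 0.05:        # "significant" at this peek: stop early
            return True
    return False

n_sims = 2_000
false_positives = sum(peeking_test_declares_winner() for _ in range(n_sims))
print("Nominal false positive rate: 5%")
print(f"Observed with peeking: {100 * false_positives / n_sims:.0f}%")
```

With twenty peeks like this, the observed false positive rate typically lands several times above the nominal 5%.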

Given that it costs time and money to conduct the experiment, it seems somewhat irresponsible to trade in your confidence in a real answer for the thrill of getting a positive result on your statistical test. If you’re going to indulge in this sort of thing, better to do it in private.

What to Actually Do About It

Statistical methods based on effect sizes and knowledge synthesis - referred to in offline science as “meta-analysis” - offer ways to improve statistical power and reduce sample sizes without a corresponding increase in false positives.

Meta-analysis refers to the statistical synthesis of results from a series of studies. If an uplift effect is consistent across a series of AB tests, meta-analysis procedures enable scientists to report that the uplift is robust across different contexts, and also to estimate the impact of the uplift more precisely than with any one test alone. If the uplift effect varies, meta-analysis procedures may enable us to identify which factors are predictive of a smaller or larger uplift for that variant.
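As a sketch of the simplest version of this, the code below pools three hypothetical AB tests of the same variant using inverse-variance (fixed-effect) weighting; all counts are invented for illustration. The pooled uplift comes with a narrower standard error than any single test.

```python
# Fixed-effect meta-analysis by inverse-variance weighting: each AB test
# contributes an estimated uplift (difference in conversion rate) and a
# standard error; the pooled estimate is more precise than any single test.
# The three test results below are invented for illustration.
import numpy as np

def uplift_and_se(conv_a, n_a, conv_b, n_b):
    """Estimated uplift (rate_B - rate_A) and its standard error for one test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return p_b - p_a, se

# Hypothetical results from three separate AB tests of the same variant.
tests = [(480, 10_000, 540, 10_000),
         (250, 5_000, 285, 5_000),
         (960, 20_000, 1_040, 20_000)]

estimates, ses = zip(*(uplift_and_se(*t) for t in tests))
weights = 1 / np.array(ses) ** 2                     # inverse-variance weights
pooled = np.sum(weights * np.array(estimates)) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))
ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)

for est, se in zip(estimates, ses):
    print(f"single test: uplift={est:+.4f}  (se={se:.4f})")
print(f"pooled:      uplift={pooled:+.4f}  (se={pooled_se:.4f}, "
      f"95% CI {ci[0]:+.4f} to {ci[1]:+.4f})")
```

Random-effects models, which allow the true uplift to vary between contexts, follow the same pattern with an extra between-test variance term.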

Meta-analysis of AB test results can also be used to design new variants to test. As a first step, a meta-analysis can determine whether a new test is necessary at all: it may be possible to answer a question about website design by analysing the cumulative effects of past AB tests, without paying for a new one. If a new AB test is needed, the meta-analysis can supply a prior estimate of the effect size, along with the contextual elements that contribute to the success of the variant.
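For the planning step, a pooled uplift estimate like the one above can feed a standard sample-size calculation. The sketch below uses the usual two-proportion approximation; the baseline rate, uplift values, alpha, and power are all illustrative assumptions, not recommendations.

```python
# Approximate sample size per arm for a two-sided two-proportion z-test,
# using an assumed baseline conversion rate and an assumed uplift (e.g. the
# pooled estimate from a meta-analysis). All numbers are illustrative.
from scipy.stats import norm

def users_per_arm(baseline, uplift, alpha=0.05, power=0.80):
    """Required users per arm to detect `uplift` over `baseline` at the given alpha and power."""
    p1, p2 = baseline, baseline + uplift
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / uplift ** 2

# Planning with a pooled uplift (~0.5 percentage points on a ~5% baseline)
# versus a more optimistic single-test estimate.
print(f"per-arm sample size (pooled uplift 0.005):     {users_per_arm(0.05, 0.005):,.0f}")
print(f"per-arm sample size (optimistic uplift 0.010): {users_per_arm(0.05, 0.010):,.0f}")
```

Halving the assumed uplift roughly quadruples the required sample size, which is why a well-grounded prior estimate matters.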

There are many additional benefits. Follow along with the upcoming series of articles to see how this works in practice.

If you want to jump to the end and implement the solution today, check out our ABSynthesis API here.