What is AB testing?

In the early days of the internet, basically everything was text-based and front-end development didn’t really exist. As designers got involved in the scene, user interfaces improved dramatically, and early websites with good design were rewarded with traffic.

However, as websites, infrastructure, and businesses became more complex, so did the process of building a great website. Around 2015, academics began to take notice that software companies were, in effect, applying the scientific method to development.

The work of engineers at many of the large software companies of the time led to the emergence of automated AB testing – that is, comparing two variants (A and B) of the same application by randomly assigning live users to one of the two variants and gathering metrics of interest.

As visitors arrive at a website (or app, or any other digital platform where data can be collected), a digital coin flip determines whether the visitor is shown the old (base) version or the new (variant) version. From there the experimenter counts how many visitors in each version do the desired thing. The “better” variant is then chosen based on a statistical hypothesis test, summarised by a p-value.

If the p-value is below some predetermined threshold (commonly 0.05), the line on the screen goes green, we declare the new version better than the old, and the “better” version is deployed.
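
In code, that whole loop – the coin flip, the counting, and the p-value decision – looks something like the sketch below. The conversion rates, sample size, and function names are invented for illustration rather than taken from any real platform.

```python
# A sketch of the basic AB test loop: coin-flip assignment, conversion
# counting, and a two-proportion z-test that produces the p-value used to
# decide. Rates, sample sizes, and names are invented for illustration.
import random
from scipy.stats import norm

def assign_and_count(visitors, base_rate=0.10, variant_rate=0.11, seed=42):
    """Simulate assignment and conversion counting for a number of visitors."""
    rng = random.Random(seed)
    counts = {"base": [0, 0], "variant": [0, 0]}  # [conversions, visitors]
    for _ in range(visitors):
        arm = "variant" if rng.random() < 0.5 else "base"  # digital coin flip
        rate = variant_rate if arm == "variant" else base_rate
        counts[arm][0] += rng.random() < rate              # did they convert?
        counts[arm][1] += 1
    return counts

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * norm.sf(abs(z))

counts = assign_and_count(visitors=20_000)
p = two_proportion_p_value(*counts["base"], *counts["variant"])
print(f"p = {p:.4f} -> {'ship the variant' if p < 0.05 else 'keep the base'}")
```

Commercial platforms dress this up considerably, but the decision at the end is the same p-value-versus-threshold comparison.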

AB tests are now so popular that it’s nearly impossible you haven’t been involved in one. For example, every time someone logs in to Booking.com, the website might look slightly different. That ever-shifting design is a direct result of Booking.com’s automated AB testing platform, and almost all major online companies are doing the same thing right now, with similar effects. When you suddenly can’t find the button you need in an app, chances are you’re in an AB test.

Why is it so popular?

The primary use of AB testing is actually in software development, as a complement to traditional unit testing. Those initial experiments weren’t really trying to detect ‘uplift’ in the way we’re used to now – they were detecting bugs by looking at user behaviour. If the experimental version of the site showed extremely poor metrics, that was an early and clear indicator of a bug.
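
In that spirit, a guardrail check of the kind such a ‘bug detector’ might run could look like the sketch below – the looks_broken name and the 20% threshold are assumptions for the sake of illustration.

```python
# A hypothetical guardrail check in the spirit of the early "bug detector" use
# of AB testing: if the variant's key metric collapses relative to the base,
# flag the release instead of looking for uplift. The 20% threshold and the
# function name are assumptions, not any platform's real API.
def looks_broken(conv_base, n_base, conv_variant, n_variant, max_relative_drop=0.20):
    base_rate = conv_base / n_base
    variant_rate = conv_variant / n_variant
    relative_drop = (base_rate - variant_rate) / base_rate
    return relative_drop > max_relative_drop

# Base converts at 10%, variant at 6%: a 40% relative drop is far more likely
# to be a broken checkout flow than a bad design choice.
print(looks_broken(conv_base=1_000, n_base=10_000, conv_variant=600, n_variant=10_000))  # True
```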

As a bug detection mechanism, AB testing works well. A genuine bug tends to drag the metrics down so far that false positives are rarely a concern and an incorrect decision is unlikely. As a result, development can happen far faster than it could without AB testing, because the risk of an individual developer or team breaking the site is dramatically reduced. AB testing is popular because it unlocked a new method of high-speed, low-risk software development.

The speed of development unlocked by AB testing quickly convinced managers to ‘port’ it to other uses: choosing ad keywords and front-end website optimisation. In the early days the outcome was earth-shattering. Companies adopting this approach started draining traffic from their competitors and increasing conversion from that traffic in a way nobody had seen before. The giants of tech in 2024 owe much to the tiny equation that drove this wave of profit.

Not only were companies more profitable, they had better leadership meetings and better investor calls. Internal business arguments that were once fought on strength of belief could now be settled with hard statistics. Gone were the ideological and philosophical bases of web design, with AB experimentation providing all the evidence an investor required to justify any change.

In 2024 there is still much to be gained from adopting a ‘culture of experimentation’. It is certainly true that in most businesses, data-driven decisions will be more effective than purely intuitive ones.

What’s wrong with it?

While AB testing has been incredibly useful in the age of big tech and bigger investment, there are unsolved problems that hint at the limitations of the approach.

Most concerning is that AB tests have a high rate of nonreplication: a later AB test of the same change may find a different, or even opposite, effect. There are countless examples within major companies of two AB tests of the same variant coming to opposite conclusions. It is partly for this reason that you’ll often see companies actively avoid replications – it’s not a fun announcement to tell everyone that the last round of expensive tests has been negated. On the other hand, with no replications you can stack uplift upon uplift indefinitely, making for wonderfully happy shareholder calls.
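
To get a feel for how easily this happens, here’s a toy simulation assuming two variants with no real difference between them at all: the same p < 0.05 decision rule is applied twice to fresh traffic, and we count how often the paired tests reach different conclusions.

```python
# A toy illustration of nonreplication, assuming two genuinely identical
# variants: the same p < 0.05 decision rule is run twice on fresh traffic,
# and we count how often the two "identical" tests disagree.
import random
from scipy.stats import norm

def significant(n_per_arm, rate, rng):
    """One simulated AB test on a true null effect; True if p < 0.05."""
    conv_a = sum(rng.random() < rate for _ in range(n_per_arm))
    conv_b = sum(rng.random() < rate for _ in range(n_per_arm))
    pooled = (conv_a + conv_b) / (2 * n_per_arm)
    se = (pooled * (1 - pooled) * (2 / n_per_arm)) ** 0.5
    z = (conv_b - conv_a) / n_per_arm / se
    return 2 * norm.sf(abs(z)) < 0.05

rng = random.Random(0)
pairs = 500
disagreements = sum(
    significant(5_000, 0.10, rng) != significant(5_000, 0.10, rng)
    for _ in range(pairs)
)
print(f"{disagreements} of {pairs} pairs of identical tests reached different conclusions")
```

At a 5% significance level you’d expect roughly one pair in ten to disagree purely by chance, before any of the messiness of real traffic is added.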

Industry methodologists have pointed out that the high rate of nonreplication is a consequence of accepting a “positive” result on the basis of a p-value alone. In response, the data science industry has made its calculation of p-values more nuanced: adding variance controls like CUPED, applying mathematical corrections, or experimenting with different ways of grouping the participants. All of these are band-aid solutions, though, because they don’t address the underlying problem – the software runs each experiment in isolation, essentially forgetting about all the visitors and experiments that came before it.
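
For concreteness, here’s a minimal sketch of the CUPED idea – adjusting each user’s in-experiment metric with a pre-experiment covariate so that the noise shrinks. The spend-based data here is simulated purely to show the mechanics.

```python
# A minimal sketch of the CUPED idea mentioned above: adjust each user's
# in-experiment metric using a pre-experiment covariate (here, an invented
# "spend before the test" figure), which shrinks variance without changing
# the expected uplift. Data and names are assumptions for illustration.
import numpy as np

def cuped_adjust(metric, covariate):
    """Return the CUPED-adjusted metric: y - theta * (x - mean(x))."""
    theta = np.cov(covariate, metric)[0, 1] / np.var(covariate)
    return metric - theta * (covariate - covariate.mean())

rng = np.random.default_rng(0)
pre_spend = rng.gamma(shape=2.0, scale=10.0, size=10_000)        # pre-experiment covariate
in_experiment = 0.8 * pre_spend + rng.normal(0, 5, size=10_000)  # correlated outcome

adjusted = cuped_adjust(in_experiment, pre_spend)
print(f"variance before: {in_experiment.var():.1f}, after CUPED: {adjusted.var():.1f}")
```

The adjusted metric carries the same expected uplift with far less noise, which is genuinely useful – but it still treats each experiment as an island.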

While this might seem like a small technical issue for any single AB test, scaled up to a whole programme it can mean that AB testing has little to no impact on improving the website at all.

What you can do

There are systematic ways to solve the problems of AB testing that aren’t just adjustments to the way we analyse the data.

By combining the outcomes of multiple experiments, we get a more realistic picture of what is actually making the difference to our results, and we can run more useful experiments each time. The approach is known as ‘meta-analysis’, and it is widely acknowledged by the scientific community as the highest standard of evidence. It’s one of a set of ‘knowledge synthesis’ models that are still relatively uncommon in the online space.
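
As a first taste, here’s a minimal sketch of the simplest such model: a fixed-effect, inverse-variance meta-analysis that pools the estimated uplift from several experiments on the same change, weighting each by how precisely it was measured. The uplift numbers below are invented purely to show the mechanics.

```python
# A minimal sketch of the simplest knowledge-synthesis model: a fixed-effect,
# inverse-variance meta-analysis that pools the estimated uplift from several
# experiments, weighting each by 1 / variance. The numbers are invented.
import numpy as np
from scipy.stats import norm

def fixed_effect_meta(estimates, std_errors):
    """Combine per-experiment effect estimates, weighting each by its precision."""
    estimates, std_errors = np.asarray(estimates), np.asarray(std_errors)
    weights = 1.0 / std_errors**2
    pooled = np.sum(weights * estimates) / np.sum(weights)
    pooled_se = np.sqrt(1.0 / np.sum(weights))
    p_value = 2 * norm.sf(abs(pooled / pooled_se))
    return pooled, pooled_se, p_value

# Three experiments on the same change: two small "wins" and one small "loss".
uplifts = [0.012, 0.018, -0.004]    # estimated uplift in conversion rate
std_errors = [0.008, 0.010, 0.006]  # standard error of each estimate

pooled, se, p = fixed_effect_meta(uplifts, std_errors)
print(f"pooled uplift: {pooled:.4f} +/- {se:.4f} (p = {p:.3f})")
```

Pooling like this lets a positive and a negative test of the same change inform a single estimate, instead of the most recent result simply overwriting the one before it.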

In this series we’ll be digging into the problems with p-values, selection bias, and forgetting the past. We’ll also be talking about solutions: outlining why meta-analysis is more than just another band-aid, and demonstrating just how much better your experiments can be when you add this final layer of analysis to your workflow.