Suppose that a Product Manager is running an A/B Test for a redesign of a landing page. Before starting the test, she used a sample size calculator to determine the sample size: 2,200 total website visitors. After reaching 2,200 visits, she ran a Chi-Square Test. The new website design performed slightly better, but the results were not statistically significant.
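To make the significance check concrete, here is a minimal sketch of a Chi-Square Test on a 2x2 table of conversions. The visitor and conversion counts are hypothetical (the text only gives the total of 2,200), the helper name `chi_square_2x2` is made up for illustration, and the version shown is the plain Pearson test without a continuity correction.

```python
import math

def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Returns (statistic, p_value).
    """
    observed = [
        [conv_a, n_a - conv_a],  # variant A: converted, did not convert
        [conv_b, n_b - conv_b],  # variant B: converted, did not convert
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: 1,100 visitors per design, 55 vs. 66 conversions.
stat, p = chi_square_2x2(conv_a=55, n_a=1100, conv_b=66, n_b=1100)
print(f"chi2 = {stat:.3f}, p = {p:.3f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

With these made-up counts the new design converts slightly better, yet the p-value stays well above 0.05, which is the situation described above.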
It might be tempting to run the test for another week to see if the difference becomes significant, but that would be a big mistake! By extending the A/B test past the original sample size, the product manager would introduce personal bias into the results of the test; she would be more likely to get the results she wants, regardless of whether those results reflect reality.
Here are two important rules for making sure that A/B tests remain unbiased:
- Don’t keep running the test past the predetermined sample size until “significant” results are found.
- Don’t stop a test before reaching the predetermined sample size, just because your results reach significance early (unless there are ethical reasons that require you to stop, like a prescription drug trial).
Test results are sensitive to changes in sample size, which is why it is important to calculate the sample size beforehand.
Inspect the graph in the workspace. It shows an A/B Test where the baseline conversion rate was 5%, and we want to see a lift of 50% (i.e., we want our second option to have at least a 7.5% conversion rate). A sample size calculator tells us that we need 210 observations. The chart shows the cumulative conversion rate after each new observation. When we reach our desired sample size of 210, our cumulative conversion rate is slightly higher than 5%, but the difference is not significant (indicated by red). By extending the experiment to 320 samples, the difference becomes significant (indicated by green). If we stopped the experiment at this point, we might conclude that our results are significant. However, we can see that this is a temporary fluctuation. After this brief moment of “significance”, the conversion rate decreases and our results become insignificant again. By arbitrarily extending the study until it reaches significance, we fool ourselves!
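A trace like the one in the chart can be approximated with a short simulation. The sketch below is not the code behind the graph. It assumes the new design has no real effect (the true rate equals the 5% baseline), uses a 0.05 significance threshold, and checks significance with a two-sided z-test against the baseline, which is only a rough normal approximation at small sample sizes. All names are illustrative.

```python
import math
import random

BASELINE = 0.05    # baseline conversion rate from the text
PLANNED_N = 210    # sample size given by the calculator in the text
EXTENDED_N = 320   # sample size of the extended experiment in the text
ALPHA = 0.05       # assumed significance threshold

def p_value_vs_baseline(conversions, n, baseline=BASELINE):
    """Two-sided z-test of an observed rate against a fixed baseline."""
    se = math.sqrt(baseline * (1 - baseline) / n)
    z = (conversions / n - baseline) / se
    return math.erfc(abs(z) / math.sqrt(2))

def trace_experiment(true_rate, n_max, seed):
    """Return (n, cumulative_rate, is_significant) after each observation."""
    rng = random.Random(seed)
    conversions = 0
    trace = []
    for n in range(1, n_max + 1):
        conversions += rng.random() < true_rate  # True counts as 1
        p = p_value_vs_baseline(conversions, n)
        trace.append((n, conversions / n, p < ALPHA))
    return trace

# Assumption: no real effect, so every "significant" moment is a false alarm.
trace = trace_experiment(true_rate=BASELINE, n_max=EXTENDED_N, seed=1)

n, rate, significant = trace[PLANNED_N - 1]
print(f"at n={n}: cumulative rate={rate:.3f}, significant={significant}")

# Count how often the red/green status flips between consecutive observations.
changes = sum(1 for a, b in zip(trace, trace[1:]) if a[2] != b[2])
print(f"significance status changed {changes} times in {EXTENDED_N} observations")
```

Trying a few different seeds shows how easily the status flips back and forth, even though nothing about the underlying conversion rate ever changes.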
Try this: Flip a coin five times. Which side came up more frequently? Perhaps you now suspect that the coin is biased. Keep flipping the coin until that side shows up even more frequently. By changing your sample size in the middle of an experiment, you can easily convince yourself that a fair coin is biased.
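The coin experiment can also be run many times in code to see how often a fair coin ends up looking biased. This is a minimal sketch under stated assumptions: the coin is truly fair, "biased" means an exact two-sided binomial p-value below 0.05, and the limit of 100 flips and 1,000 repetitions are arbitrary choices. The function names are illustrative.

```python
import math
import random
from functools import lru_cache

@lru_cache(maxsize=None)
def two_sided_p(heads, n):
    """Exact two-sided binomial p-value for the hypothesis that the coin is fair."""
    extreme = max(heads, n - heads)
    tail = sum(math.comb(n, k) for k in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def false_positive_rate(n_experiments, min_flips, max_flips, peek,
                        alpha=0.05, seed=0):
    """Share of fair-coin experiments that end by declaring the coin biased.

    peek=False: test once, after max_flips.
    peek=True:  test after every flip from min_flips on; stop at the first p < alpha.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        heads = 0
        declared_biased = False
        for n in range(1, max_flips + 1):
            heads += rng.random() < 0.5  # True counts as 1
            if peek and n >= min_flips and two_sided_p(heads, n) < alpha:
                declared_biased = True
                break
        if not peek:
            declared_biased = two_sided_p(heads, max_flips) < alpha
        hits += declared_biased
    return hits / n_experiments

fixed = false_positive_rate(1000, min_flips=5, max_flips=100, peek=False)
peeking = false_positive_rate(1000, min_flips=5, max_flips=100, peek=True)
print(f"fixed sample size:               {fixed:.1%} of fair coins look biased")
print(f"keep flipping until significant: {peeking:.1%} of fair coins look biased")
```

The fixed-sample rate should stay at or below the chosen threshold, while the keep-flipping rate comes out noticeably higher, even though every simulated coin is fair.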