In the previous exercise, we simulated 1,000 datasets and ran a Chi-Square test for each one, recording whether the results were ‘significant’ or ‘not significant’. This allowed us to estimate the proportion of simulated datasets that led to a ‘significant’ result.
In general, we hope that the test reflects reality. We therefore want the result to be ‘significant’ if there really is a difference in the probability of an open for the two email subjects (lift > 0). In that case, the proportion of significant results is the true positive rate, also called the power of the test. Most sample size calculators aim for a power of 80%.
On the other hand, if there’s no difference in the probability of an email being opened for the two email subjects (lift = 0), a ‘significant’ result would be a false positive (also called a type I error). This would lead us to invest time and resources into adding first names to email subjects when there’s no real pay-off in the long run.
The simulation code from the previous exercises is loaded for you in script.py, along with code to print out the proportion of tests where a significant result was recorded. Currently, the simulation is set up so that there is a real difference in the probability of an open for the two email subjects.
Press “Run” a few times and inspect the proportion of significant tests (printed to the output terminal) each time. If we ran a test with the provided sample size (100), baseline open rate (50%), and lift (30%), approximately what percent of the time would we correctly observe a significant result? Note that this is the “power” of the test.
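The simulation in script.py might look roughly like the following sketch (the variable names, seed, and per-group split here are assumptions; your script.py may differ). It repeatedly simulates a dataset for the two email subjects, runs a Chi-Square test, and records whether the result was significant; the proportion of significant results estimates the power:

```python
# Hypothetical sketch of the power simulation (names and setup assumed)
import numpy as np
from scipy.stats import chi2_contingency

sample_size = 100            # total emails per simulated test
baseline_rate = 0.5          # open rate for the control subject line
lift = 0.3                   # relative lift for the variant subject line
significance_threshold = 0.05

np.random.seed(42)           # for reproducibility of this sketch
results = []
for _ in range(1000):
    # Simulate opens (1) and non-opens (0) for each group
    control = np.random.binomial(1, baseline_rate, size=sample_size // 2)
    variant = np.random.binomial(1, baseline_rate * (1 + lift), size=sample_size // 2)
    # Build the 2x2 contingency table: rows are groups, columns are open/no-open
    table = [[control.sum(), len(control) - control.sum()],
             [variant.sum(), len(variant) - variant.sum()]]
    _, p_value, _, _ = chi2_contingency(table)
    results.append(p_value < significance_threshold)

# Proportion of simulated tests that came back 'significant' (estimated power)
print("Proportion significant:", np.mean(results))
```

Because lift > 0 here, a ‘significant’ result is a true positive, so this proportion is an estimate of the test’s power.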
Now, change the value of lift so that the proportion of significant tests is equal to the false positive rate, and press “Run” once more.
Note that the proportion of significant tests should be approximately equal to the significance threshold if you’ve done this correctly.
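To see why, consider the same simulation with lift set to 0 (again, a hypothetical sketch with assumed names; `correction=False` skips Yates’ continuity correction so the observed rate tracks the threshold more closely on small samples). With no real difference between the groups, every ‘significant’ result is a false positive:

```python
# Hypothetical sketch: with lift = 0, significant results are false positives
import numpy as np
from scipy.stats import chi2_contingency

sample_size = 100
baseline_rate = 0.5
lift = 0.0                   # no real difference between the two subjects
significance_threshold = 0.05

np.random.seed(0)
false_positives = []
for _ in range(1000):
    control = np.random.binomial(1, baseline_rate, size=sample_size // 2)
    variant = np.random.binomial(1, baseline_rate * (1 + lift), size=sample_size // 2)
    table = [[control.sum(), len(control) - control.sum()],
             [variant.sum(), len(variant) - variant.sum()]]
    _, p_value, _, _ = chi2_contingency(table, correction=False)
    false_positives.append(p_value < significance_threshold)

# This should land near the 5% significance threshold
print("False positive rate:", np.mean(false_positives))
```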