In the last exercise, we used some data from an A/B test to run a Chi-Square test. In the next few exercises, we’ll build up a simulation to understand the considerations that go into choosing a sample size for that test.

Again consider the A/B test example from the previous exercise, comparing email subjects with and without the recipient’s first name. Suppose we know that recipients have a 50% chance of opening the control email and a 65% chance of opening the name email. That is a 30% relative lift, since 0.65 / 0.50 = 1.30.

Here we use lift to refer to the inherent difference in the distributions of our two groups of data. In the A/B Testing: Sample Size Calculators lesson, we learned that minimum detectable effect is the smallest size of the difference between the two groups that we want our test to be able to detect. If we set up our experiment with a minimum detectable effect of 20%, our statistical test should be able to detect a difference with a “lift” or “effect” of 20% or greater. In this lesson we are going to simulate data that has a lift of 30% to demonstrate how the inherent lift impacts the power of our statistical test.
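As a quick check of the arithmetic, the lift can be computed from the two open rates and compared with a minimum detectable effect. This is only a sketch: the variable names are illustrative, and the 20% threshold is the example value from the paragraph above.

```python
# Open rates from the example above.
control_rate = 0.50
name_rate = 0.65

# Relative lift: the increase in the open rate as a fraction of the control rate.
lift = (name_rate - control_rate) / control_rate

# Example threshold from the text (an illustrative value, not a recommendation).
minimum_detectable_effect = 0.20

print(f"lift = {lift:.0%}")
print("lift meets the minimum detectable effect:", lift >= minimum_detectable_effect)
```

Whether a real test actually detects this lift still depends on the sample size, which is what the simulation in this lesson explores.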

We can use the aforementioned probabilities to simulate a dataset of 100 email recipients as follows:

import numpy as np

sample_control = np.random.choice(['yes', 'no'], size=50, p=[.5, .5])
sample_name = np.random.choice(['yes', 'no'], size=50, p=[.65, .35])

This gives us two simulated samples, of 50 recipients each, who hypothetically saw the name or control email subject. Each one looks something like ['yes' 'no' 'no' 'no' 'yes' 'yes' ...], where 'yes' corresponds to an opened email.
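One way to sanity-check the simulated samples is to compute the share of 'yes' values in each group. The sketch below assumes the same np.random.choice calls as above; the names control_open_rate and name_open_rate are introduced here only for illustration. With 50 recipients per group, the observed rates will usually land near 0.50 and 0.65 but rarely match them exactly.

```python
import numpy as np

sample_control = np.random.choice(['yes', 'no'], size=50, p=[.5, .5])
sample_name = np.random.choice(['yes', 'no'], size=50, p=[.65, .35])

# Comparing against 'yes' gives a boolean array; its mean is the
# proportion of opened emails in that simulated group.
control_open_rate = np.mean(sample_control == 'yes')
name_open_rate = np.mean(sample_name == 'yes')

print(f"control open rate: {control_open_rate:.2f}")
print(f"name open rate:    {name_open_rate:.2f}")
```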

Next, we can assemble these arrays into a data frame that looks a lot like the one we saw in exercise 1:

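import pandas as pd

# Label the first 50 rows as control and the last 50 as name
group = ['control']*50 + ['name']*50
# Combine the two simulated samples in the same order
outcome = list(sample_control) + list(sample_name)
sim_data = {"Email": group, "Opened": outcome}
sim_data = pd.DataFrame(sim_data)
print(sim_data.head())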


Email    Opened
control  no
control  yes
control  yes
control  no
control  no

Because of how we created this data frame, all of the “control” observations will be listed first, followed by all of the “name” observations.
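To connect this back to the test itself, here is a minimal sketch of running a Chi-Square test on one simulated dataset. It assumes the test is run with scipy.stats.chi2_contingency on a contingency table built with pd.crosstab; the previous exercise may have set this up differently, and the variable names here are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Rebuild the simulated dataset from the narrative.
sample_control = np.random.choice(['yes', 'no'], size=50, p=[.5, .5])
sample_name = np.random.choice(['yes', 'no'], size=50, p=[.65, .35])
sim_data = pd.DataFrame({
    "Email": ['control'] * 50 + ['name'] * 50,
    "Opened": list(sample_control) + list(sample_name),
})

# Count opened / not opened emails within each group.
ab_contingency = pd.crosstab(sim_data.Email, sim_data.Opened)
print(ab_contingency)

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the table of expected counts.
chi2, pval, dof, expected = chi2_contingency(ab_contingency)
print("p-value:", pval)
```

Because the data are random, the p-value changes from run to run, so a single simulated dataset of this size may or may not show a significant difference.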



In script.py, you’ll see the code from the narrative, which can be used to simulate a dataset for a Chi-Square test. You’ll notice that we’ve replaced all hard-coded numbers with the following variables: sample_size, control_rate, and name_rate (which is calculated using control_rate and lift).
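The contents of script.py are not reproduced here, so the following is only a sketch of what a parameterized version might look like. It assumes that sample_size is the total number of recipients, split evenly between the two groups, and that name_rate is the control rate scaled up by the lift. The function name simulate_email_data is hypothetical.

```python
import numpy as np
import pandas as pd

def simulate_email_data(sample_size, control_rate, lift):
    """Simulate one A/B dataset (hypothetical helper, not from script.py)."""
    # Assumption: lift is relative, so a 30% lift on 0.5 gives 0.65.
    name_rate = control_rate * (1 + lift)
    # Assumption: half of the recipients go to each group.
    group_size = sample_size // 2

    sample_control = np.random.choice(
        ['yes', 'no'], size=group_size, p=[control_rate, 1 - control_rate])
    sample_name = np.random.choice(
        ['yes', 'no'], size=group_size, p=[name_rate, 1 - name_rate])

    group = ['control'] * group_size + ['name'] * group_size
    outcome = list(sample_control) + list(sample_name)
    return pd.DataFrame({"Email": group, "Opened": outcome})

sim_data = simulate_email_data(sample_size=100, control_rate=0.5, lift=0.3)
print(sim_data.head())
```

Keeping the three inputs as parameters makes it easy to rerun the simulation with a different sample size or lift without touching the rest of the code.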

Change the sample size to 4 and press “Run”. Inspect the output. Does it look as expected?


Press “Run” a few more times and notice how the data changes each time even though you haven’t changed the code. This happens because we’ve provided probabilities for the outcomes (opened or not) rather than specific values, so each run draws a new random sample.
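If you ever need the same simulated data on every run, for example while debugging, one option is to fix NumPy's random seed before sampling. This is a side note rather than part of the exercise, and the seed value 42 is arbitrary.

```python
import numpy as np

# Setting the seed resets the random number generator to a known state.
np.random.seed(42)
first_run = np.random.choice(['yes', 'no'], size=4, p=[.5, .5])

# Resetting to the same seed reproduces the same draws.
np.random.seed(42)
second_run = np.random.choice(['yes', 'no'], size=4, p=[.5, .5])

print(first_run)
print(second_run)
print(np.array_equal(first_run, second_run))  # True
```

For the simulation in this lesson, leaving the seed unset is the point: each run is a fresh random sample.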
