In the last exercise, we used some data from an A/B test to run a Chi-Square test. In the next few exercises, we’ll build up a simulation to understand the considerations that go into choosing a sample size for that test.

Again consider the A/B test example from the previous exercise, comparing email subjects with and without the recipient’s first name. Suppose we know that visitors have a 50% chance of opening the control email and a 65% chance of opening the name email (30% lift!).

Here we use **lift** to refer to the inherent difference in the distributions of our two groups of data. In the *A/B Testing: Sample Size Calculators* lesson, we learned that **minimum detectable effect** is the smallest size of the difference between the two groups that we want our test to be able to detect. If we set up our experiment with a minimum detectable effect of at least 20%, our statistical test should detect a difference with a “lift” or “effect” of 20% or greater. In this lesson we are going to simulate data that has a lift of 30% to demonstrate how the inherent lift impacts the power of our statistical test.
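The arithmetic connecting these numbers can be checked directly: treating lift as a relative increase over the control rate, a 30% lift on a 50% open rate gives 65%. A minimal sketch (variable names are illustrative and match the ones used later in the instructions):

```python
control_rate = 0.5
lift = 0.30  # 30% relative lift over the control rate

# name_rate is the control rate scaled up by the lift
name_rate = control_rate * (1 + lift)
print(name_rate)  # 0.65
```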

We can use the aforementioned probabilities to simulate a dataset of 100 email recipients as follows:

```python
sample_control = np.random.choice(['yes', 'no'], size=50, p=[.5, .5])
sample_name = np.random.choice(['yes', 'no'], size=50, p=[.65, .35])
```

This gives us two simulated samples of 50 recipients each, who hypothetically saw the name or control email subject. Each one looks something like `['yes' 'no' 'no' 'no' 'yes' 'yes' ...]`, where `'yes'` corresponds to an opened email.

Next, we can assemble these arrays into a data frame that looks a lot like the one we saw in exercise 1:

```python
group = ['control'] * 50 + ['name'] * 50
outcome = list(sample_control) + list(sample_name)
sim_data = {"Email": group, "Opened": outcome}
sim_data = pd.DataFrame(sim_data)
print(sim_data.head())
```

Output:

| Email | Opened |
|---|---|
| control | no |
| control | yes |
| control | yes |
| control | no |
| control | no |

Because of how we created this data frame, all of the “control” observations will be listed first, followed by all of the “name” observations.
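Once the data frame is assembled, it can be fed into a Chi-Square test just as in exercise 1. Here is a minimal end-to-end sketch using SciPy; the `pd.crosstab` step and the fixed seed are assumptions added for illustration, not part of the lesson's script:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

np.random.seed(0)  # fixed seed so this sketch is reproducible

# Simulate the two samples as in the narrative
sample_control = np.random.choice(['yes', 'no'], size=50, p=[.5, .5])
sample_name = np.random.choice(['yes', 'no'], size=50, p=[.65, .35])

# Assemble the data frame
group = ['control'] * 50 + ['name'] * 50
outcome = list(sample_control) + list(sample_name)
sim_data = pd.DataFrame({"Email": group, "Opened": outcome})

# Cross-tabulate Email group vs. Opened outcome, then run the test
table = pd.crosstab(sim_data.Email, sim_data.Opened)
chi2, pval, dof, expected = chi2_contingency(table)
print(pval)
```

A p-value below your chosen significance threshold would lead you to conclude the open rates differ between the two groups.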

### Instructions

**1.**

In **script.py**, you’ll see the code from the narrative, which can be used to simulate a dataset for a Chi-Square test. You’ll notice that we’ve replaced all hard-coded numbers with the following variables: `sample_size`, `control_rate`, and `name_rate` (which is calculated using `control_rate` and `lift`).

Change the sample size to `4` and press “Run”. Inspect the output. Does it look as expected?

**2.**

Press “Run” a few more times and notice how the data changes each time even though you haven’t changed the code. This happens because we’ve provided probabilities for the outcomes (opened or not), rather than specific values.
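The run-to-run variation comes from `np.random.choice` drawing fresh random outcomes on every call. Outside of this exercise, if you need identical results on every run you can fix the random seed; a small sketch:

```python
import numpy as np

np.random.seed(42)  # fix the random state
first = np.random.choice(['yes', 'no'], size=4, p=[.5, .5])

np.random.seed(42)  # resetting the seed reproduces the exact same draw
second = np.random.choice(['yes', 'no'], size=4, p=[.5, .5])

print((first == second).all())  # True
```

In the power simulations to come, leaving the seed unset is deliberate: each run produces a new random dataset, which is exactly what lets us study how often the test detects a true effect.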