Suppose that you own a chain of stores that sell ants, called VeryAnts. There are three different locations: A, B, and C. You want to know if the average ant sales over the past year are significantly different between the three locations.
At first, it seems that you could perform t-tests between each pair of stores.
Recall that the significance level (the p-value threshold you choose) is the probability of incorrectly rejecting a true null hypothesis on a single t-test. The more t-tests you perform, the more likely you are to get at least one false positive, a Type I error.
For a p-value threshold (significance level) of 0.05, if the null hypothesis is true, then the probability of correctly obtaining a non-significant result is 1 – 0.05 = 0.95. When you run a second t-test, the probability of getting a correct result on both tests (treating the tests as independent) is 0.95 * 0.95, or 0.9025. That means your probability of making at least one error is now 1 – 0.9025 = 0.0975, close to 10%! This error probability only gets bigger the more t-tests you do.
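The arithmetic above can be sketched in a few lines of Python. The function name familywise_error is a hypothetical helper, not part of the lesson, and the calculation assumes the tests are independent, as the paragraph above does.

```python
def familywise_error(alpha: float, num_tests: int) -> float:
    """Probability of at least one false positive across independent tests.

    Assumes every null hypothesis is true and the tests are independent.
    """
    # (1 - alpha) is the chance a single test avoids a false positive;
    # raising it to num_tests gives the chance that all tests avoid one.
    return 1 - (1 - alpha) ** num_tests


for k in (1, 2, 3):
    print(k, round(familywise_error(0.05, k), 6))
# prints:
# 1 0.05
# 2 0.0975
# 3 0.142625
```

With three pairwise tests at a 0.05 threshold, the chance of at least one false positive is already above 14%.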
Instructions
We have created samples store_a, store_b, and store_c, representing the sales at VeryAnts at locations A, B, and C, respectively. We want to see if there’s a significant difference in sales between the three locations.
Explore datasets store_a, store_b, and store_c by finding and viewing the means and standard deviations of each one. Store the means in variables called store_a_mean, store_b_mean, and store_c_mean. Store the standard deviations in variables called store_a_sd, store_b_sd, and store_c_sd.
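As a rough illustration of this step, here is a minimal Python sketch using only the standard library. The sales figures are made-up placeholders, since the lesson supplies the real store_a, store_b, and store_c, and the use of the sample standard deviation (n – 1 denominator) is an assumption about what the exercise expects.

```python
import statistics

# Hypothetical stand-in data; the lesson provides the real samples.
store_a = [58, 62, 55, 60, 57, 63, 59, 61]
store_b = [64, 66, 61, 68, 63, 65, 67, 62]
store_c = [59, 61, 57, 62, 60, 58, 63, 60]

# Means of each sample.
store_a_mean = statistics.mean(store_a)
store_b_mean = statistics.mean(store_b)
store_c_mean = statistics.mean(store_c)

# Sample standard deviations (n - 1 in the denominator).
store_a_sd = statistics.stdev(store_a)
store_b_sd = statistics.stdev(store_b)
store_c_sd = statistics.stdev(store_c)

print(store_a_mean, store_b_mean, store_c_mean)
print(store_a_sd, store_b_sd, store_c_sd)
```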
Perform a Two Sample T-test between each pair of location data.
Store the results of the tests in variables called a_b_results, a_c_results, and b_c_results. View the results for each test.
Store the probability of making at least one Type I error when running three t-tests in a variable called error_prob. View error_prob.