In the last exercise, you saw that the probability of making a Type I error got dangerously high as you performed more t-tests.
When comparing more than two numerical datasets, a standard way to keep the Type I error probability at 0.05 is to use ANOVA. ANOVA (Analysis of Variance) tests the null hypothesis that all of the datasets you are considering come from populations with the same mean. If you reject the null hypothesis with ANOVA, you’re saying that at least one of the sets has a different mean; however, it does not tell you which datasets are different.
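To make the growth in Type I error concrete, here is a small sketch of the arithmetic for three groups. It assumes each pairwise t-test is independent and uses a significance threshold of 0.05; pairwise tests that share groups are not strictly independent, so treat the result as an approximation. The variable names are illustrative only.

```r
# Approximate chance of at least one false positive across several
# t-tests, assuming the tests are independent (a simplification).
alpha    <- 0.05                  # significance threshold per test
n_groups <- 3                     # e.g. three datasets to compare
n_tests  <- choose(n_groups, 2)   # number of pairwise comparisons: 3

familywise_error <- 1 - (1 - alpha)^n_tests
familywise_error                  # about 0.143, well above 0.05
```

A single ANOVA replaces these three tests with one test at the 0.05 level.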
You can use the stats package function aov() to perform ANOVA on multiple datasets. aov() takes a formula and a data frame that combines the different datasets. For example, if you were comparing scores on a video game between math majors, writing majors, and psychology majors, you could format the data in a data frame df_scores as follows:
group | score |
---|---|
math major | 88 |
math major | 81 |
writing major | 92 |
writing major | 80 |
psychology major | 94 |
psychology major | 83 |
You can then run an ANOVA test with this line:
results <- aov(score ~ group, data = df_scores)
Note: score ~ group is a formula that indicates the relationship you want to analyze (i.e., how each group, or major, relates to score on the video game).
To retrieve the p-value from the results of calling aov(), use the summary() function, which reports it in the Pr(>F) column:
summary(results)
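Putting the pieces together, the following sketch builds df_scores from the table above, runs the test, and pulls the p-value out of the summary as a plain number. The variable p_value and the indexing into the summary object are additions for illustration; the lesson itself only asks you to read the value from the printed table.

```r
# Build the example data frame from the table above.
df_scores <- data.frame(
  group = c("math major", "math major",
            "writing major", "writing major",
            "psychology major", "psychology major"),
  score = c(88, 81, 92, 80, 94, 83)
)

# One-way ANOVA: does mean score differ between groups?
results <- aov(score ~ group, data = df_scores)

# Print the ANOVA table; the p-value is in the Pr(>F) column.
print(summary(results))

# Optional: extract the p-value for the group term as a number.
# summary() on an aov fit returns a list whose first element is the table.
p_value <- summary(results)[[1]][["Pr(>F)"]][1]
p_value
```

With only two scores per group, this toy dataset has very little power, so it is meant to show the mechanics rather than a realistic analysis.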
The null hypothesis, in this case, is that all three populations have the same mean score on this video game. If you reject this null hypothesis (if the p-value is less than 0.05), you can say you are reasonably confident that at least one population has a mean score that differs from the others. After using only ANOVA, however, you can’t draw any conclusions about which populations differ.
Let’s look at an example of ANOVA in action.
Instructions
We’ve reformatted the store data from the last exercise into a data frame stores. View stores to see what columns it contains.
Open the hint for an explanation of the columns.
Perform an ANOVA on the stores data and save the test results to a variable results. Use the summary() function to view the p-value of the test. Does this p-value lead you to reject the null hypothesis?
Let’s say the sales at location B have suddenly soared (maybe there’s an ant convention happening nearby). The new sales for location B have been updated in the stores_new data frame.
Re-run the ANOVA test on stores_new and save the test results to a variable results_new. Use the summary() function to see what the p-value is now. Does this new value make sense?
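To build intuition for why the p-value should change, here is a sketch on made-up data. Everything in it is hypothetical: the data frames stores_demo and stores_demo_new, the column names store and sales, the numbers, and the helper anova_p() are stand-ins, not the lesson's actual stores data.

```r
# Hypothetical data: three locations with similar sales.
stores_demo <- data.frame(
  store = rep(c("A", "B", "C"), each = 5),
  sales = c(58, 61, 60, 57, 62,   # location A
            59, 63, 60, 58, 61,   # location B
            60, 57, 62, 59, 61)   # location C
)

# Same data, but with location B's sales shifted up by 20.
stores_demo_new <- stores_demo
is_b <- stores_demo_new$store == "B"
stores_demo_new$sales[is_b] <- stores_demo_new$sales[is_b] + 20

# Hypothetical helper: p-value of a one-way ANOVA of sales by store.
anova_p <- function(df) {
  fit <- aov(sales ~ store, data = df)
  summary(fit)[[1]][["Pr(>F)"]][1]
}

p_before <- anova_p(stores_demo)       # similar means: large p-value
p_after  <- anova_p(stores_demo_new)   # one mean far off: small p-value
c(before = p_before, after = p_after)
```

When one group's mean moves far from the others while the spread within each group stays the same, the between-group variance grows relative to the within-group variance, so the p-value drops.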