In the last exercise, you saw that the probability of making a Type I error got dangerously high as you performed more t-tests.
When comparing more than two numerical datasets, the best way to preserve a Type I error probability of 0.05 is to use ANOVA. ANOVA (Analysis of Variance) tests the null hypothesis that all of the datasets you are considering have the same mean. If you reject the null hypothesis with ANOVA, you’re saying that at least one of the sets has a different mean; however, it does not tell you which datasets are different.
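To see why repeated t-tests are a problem, here is a rough sketch of how the chance of at least one Type I error grows with the number of tests. It assumes the tests are independent and each uses a significance threshold of 0.05, which is a simplification; the variable names are illustrative only.

```r
# Probability of at least one false positive across k tests,
# assuming independent tests that each use alpha = 0.05.
alpha <- 0.05
k <- c(1, 3, 6, 10)   # pairwise t-tests needed for 2, 3, 4, and 5 groups

familywise_error <- 1 - (1 - alpha)^k
round(familywise_error, 3)
# 0.050 0.143 0.265 0.401
```

A single ANOVA replaces all of those pairwise tests with one test, so the Type I error probability stays at 0.05.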
You can use the stats package function aov() to perform ANOVA on multiple datasets.
aov() takes the different datasets combined into a data frame as an argument. For example, if you were comparing scores on a video game between math majors, writing majors, and psychology majors, you could format the data in a data frame df_scores as follows:
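As a minimal sketch of that layout, assuming one row per player with a group column and a score column (the scores below are made-up illustration values, not real data):

```r
# Hypothetical scores in "long" format: one row per player.
# group identifies the major; score is the video game score.
df_scores <- data.frame(
  group = rep(c("math major", "writing major", "psychology major"), each = 4),
  score = c(88, 81, 79, 92, 75, 84, 69, 80, 83, 90, 72, 77)
)

# Inspect the structure: two columns, twelve rows
str(df_scores)
```

Keeping every score in a single column, with a second column naming the group it belongs to, is what lets a formula such as score ~ group describe the comparison.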
You can then run an ANOVA test with this line:
results <- aov(score ~ group, data = df_scores)
score ~ group indicates the relationship you want to analyze (i.e., how each group, or major, relates to score on the video game).
To retrieve the p-value from the results of calling aov(), use the summary() function.
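The following sketch runs the whole flow on made-up data: fit the model with aov(), print the ANOVA table with summary(), and pull the p-value out as a number. The scores and the names anova_table and p_value are assumptions for illustration; the p-value appears in the Pr(>F) column of the printed table.

```r
# Made-up scores for illustration only
df_scores <- data.frame(
  group = factor(rep(c("math", "writing", "psychology"), each = 4)),
  score = c(88, 81, 79, 92, 75, 84, 69, 80, 83, 90, 72, 77)
)

# Fit the one-way ANOVA
results <- aov(score ~ group, data = df_scores)

# Print the ANOVA table; the p-value is in the Pr(>F) column
summary(results)

# Extract the p-value as a number: the first element of the summary
# is the ANOVA table, and its first row belongs to the group term
anova_table <- summary(results)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]

# TRUE would mean rejecting the null hypothesis at the 0.05 level
p_value < 0.05
```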
The null hypothesis, in this case, is that all three populations have the same mean score on this video game. If you reject this null hypothesis (if the p-value is less than 0.05), you can say you are reasonably confident that at least one pair of populations has different mean scores. After using only ANOVA, however, you can’t draw any conclusions about which two populations differ significantly.
Let’s look at an example of ANOVA in action.
We’ve reformatted the store data from the last exercise into a data frame, stores. Inspect stores to see what columns it contains.
Open the hint for an explanation of the columns.
Perform an ANOVA on the stores data and save the test results to a variable results. Use the summary() function to view the p-value of the test. Does this p-value lead you to reject the null hypothesis?
Let’s say the sales at location B have suddenly soared (maybe there’s an ant convention happening nearby). The new sales for location B have been updated in the stores_new data frame.
Re-run the ANOVA test on stores_new and save the test results to a variable results_new. Use the summary() function to see what the p-value is now. Does this new value make sense?
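One way to build intuition for that last question is to try the same before-and-after comparison on a small made-up dataset. The sketch below does not use the exercise's data: the column names store and sales, the helper functions, and every value are hypothetical. It raises one store's sales by a fixed amount and compares the ANOVA p-values.

```r
# Hypothetical daily sales for three stores (illustration values only)
sales_a <- c(30, 34, 29, 36, 31, 33)
sales_b <- c(32, 35, 30, 37, 33, 34)
sales_c <- c(31, 33, 28, 35, 32, 36)

# Build a long-format data frame, optionally shifting store B's sales upward
make_stores <- function(b_shift = 0) {
  data.frame(
    store = rep(c("A", "B", "C"), each = 6),
    sales = c(sales_a, sales_b + b_shift, sales_c)
  )
}

# Run a one-way ANOVA and return the p-value for the store term
anova_p_value <- function(df) {
  summary(aov(sales ~ store, data = df))[[1]][["Pr(>F)"]][1]
}

p_before <- anova_p_value(make_stores(0))    # stores have similar means
p_after  <- anova_p_value(make_stores(20))   # store B's sales have jumped

c(before = p_before, after = p_after)
```

With similar means across stores, the p-value is large; once one store's mean pulls far away from the others while the spread within each store stays the same, the p-value drops sharply.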