Learn

In the last exercise, you saw that the probability of making a Type I error got dangerously high as you performed more t-tests.
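
To see the arithmetic behind that inflation, here is a minimal sketch. It assumes each test uses a 0.05 significance threshold and that the tests are independent, which is a simplification for pairwise t-tests on shared data. The variable names are illustrative only.

```r
# Chance of at least one Type I error across k tests,
# each run at significance level alpha (assumes independent tests):
#   1 - (1 - alpha)^k
alpha <- 0.05
k <- 1:6

familywise_error <- 1 - (1 - alpha)^k

# One row per number of tests, with the combined error probability
print(data.frame(tests = k, error_probability = round(familywise_error, 3)))
```

With three tests (the number of pairwise comparisons among three groups), the probability is already about 0.143 under these assumptions.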

When comparing more than two numerical datasets, a standard way to keep the Type I error probability at 0.05 is to use ANOVA, which replaces the many pairwise t-tests with a single test. ANOVA (Analysis of Variance) tests the null hypothesis that all of the datasets you are considering have the same mean. If you reject the null hypothesis with ANOVA, you’re saying that at least one of the sets has a different mean; however, it does not tell you which datasets are different.

You can use the stats package function aov() to perform ANOVA on multiple datasets. aov() takes the different datasets combined into a data frame as an argument. For example, if you were comparing scores on a video game between math majors, writing majors, and psychology majors, you could format the data in a data frame df_scores as follows:

group              score
math major         88
math major         81
writing major      92
writing major      80
psychology major   94
psychology major   83
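
As a sketch, a data frame with this layout could be built as follows. The values are the ones in the table above; converting group to a factor is an optional step added here to make the grouping explicit.

```r
# Build the example data frame: one row per observation,
# with the group label in one column and the score in another
df_scores <- data.frame(
  group = c("math major", "math major",
            "writing major", "writing major",
            "psychology major", "psychology major"),
  score = c(88, 81, 92, 80, 94, 83)
)

# Treat group as a categorical variable (optional, but explicit)
df_scores$group <- factor(df_scores$group)

# Inspect the column types and the first few values
str(df_scores)
```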

You can then run an ANOVA test with this line:

results <- aov(score ~ group, data = df_scores)

Note: score ~ group is a formula that indicates the relationship you want to analyze (i.e., how each group, or major, relates to score on the video game).

To retrieve the p-value from the results of calling aov(), use the summary() function:

summary(results)
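
The printed summary shows the p-value in the Pr(>F) column. If you want the number itself, one possible approach is to pull it out of the summary table, as in this self-contained sketch that uses the example scores from the table above. The names anova_table and p_value are illustrative.

```r
# Example data from the table above
df_scores <- data.frame(
  group = c("math major", "math major",
            "writing major", "writing major",
            "psychology major", "psychology major"),
  score = c(88, 81, 92, 80, 94, 83)
)

# Fit the one-way ANOVA: does mean score differ by group?
results <- aov(score ~ group, data = df_scores)

# summary() on an aov object returns a list; its first element is the ANOVA table
anova_table <- summary(results)[[1]]

# The first row is the group term; "Pr(>F)" holds its p-value
p_value <- anova_table[["Pr(>F)"]][1]
print(p_value)

# Compare against the 0.05 threshold used in this lesson
print(p_value < 0.05)
```

For this tiny example the p-value is well above 0.05, so you would not reject the null hypothesis.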

The null hypothesis, in this case, is that all three populations have the same mean score on this video game. If you reject this null hypothesis (if the p-value is less than 0.05), you can say you are reasonably confident that at least one population has a mean score that differs from the others. After using only ANOVA, however, you can’t draw any conclusions about which populations differ significantly.

Let’s look at an example of ANOVA in action.

Instructions

1.

We’ve reformatted the store data from the last exercise into a data frame stores. View stores to see what columns it contains.

Open the hint for an explanation of the columns.

2.

Perform an ANOVA on the stores data and save the test results to a variable results. Use the summary() function to view the p-value of the test. Does this p-value lead you to reject the null hypothesis?

3.

Let’s say the sales at location B have suddenly soared (maybe there’s an ant convention happening nearby). The new sales for location B have been updated in the stores_new data frame.

Re-run the ANOVA test on stores_new and save the test results to a variable results_new. Use the summary() function to see what the p-value is now. Does this new value make sense?
