In the last exercise, you saw that the probability of making a Type I error got dangerously high as you performed more t-tests.
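As a rough illustration of why the error rate climbs, the sketch below computes the chance of at least one Type I error across several tests. It assumes the tests are independent and each uses a significance threshold of `0.05`. The variable names (`alpha`, `n_tests`, `familywise_error`) are illustrative and not part of the exercise.

```r
# Significance threshold used for each individual t-test
alpha <- 0.05

# Number of t-tests performed (e.g. 3 pairwise tests among 3 datasets)
n_tests <- c(1, 3, 6)

# Probability of at least one false positive, ASSUMING independent tests
familywise_error <- 1 - (1 - alpha)^n_tests

# Roughly 0.05, 0.14, and 0.26 for 1, 3, and 6 tests
print(round(familywise_error, 3))
```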

When comparing more than two numerical datasets, the best way to preserve a Type I error probability of `0.05` is to use ANOVA. *ANOVA (Analysis of Variance)* tests the null hypothesis that all of the datasets you are considering have the same mean. If you reject the null hypothesis with ANOVA, you’re saying that at least one of the sets has a different mean; however, it does not tell you which datasets are different.

You can use the `aov()` function from the `stats` package to perform ANOVA on multiple datasets. `aov()` takes a formula and a data frame that combines the different datasets. For example, if you were comparing scores on a video game between math majors, writing majors, and psychology majors, you could format the data in a data frame `df_scores` as follows:

| group | score |
|---|---|
| math major | 88 |
| math major | 81 |
| writing major | 92 |
| writing major | 80 |
| psychology major | 94 |
| psychology major | 83 |
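One way to build such a data frame by hand is sketched below, using the six rows from the table. The lesson does not say how `df_scores` was actually created, so treat this as an assumption for illustration only.

```r
# Build the example data frame from the table: one row per player,
# with the player's major in `group` and the game score in `score`
df_scores <- data.frame(
  group = c("math major", "math major",
            "writing major", "writing major",
            "psychology major", "psychology major"),
  score = c(88, 81, 92, 80, 94, 83)
)

# Inspect the result
print(df_scores)
```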

You can then run an ANOVA test with this line:

`results <- aov(score ~ group, data = df_scores)`

Note: `score ~ group` indicates the relationship you want to analyze (i.e., how each `group`, or major, relates to `score` on the video game).

To retrieve the p-value from the results of calling `aov()`, use the `summary()` function:

`summary(results)`

The null hypothesis, in this case, is that all three populations have the same mean score on this video game. If you reject this null hypothesis (if the p-value is less than `0.05`), you can say you are reasonably confident that at least one population’s mean score differs from the others. After using only ANOVA, however, you can’t draw any conclusions about which populations differ.
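If you want the p-value as a number rather than reading it off the printed summary, one possible approach is sketched below. It rebuilds the example `df_scores` data so that it runs on its own, then reads the `Pr(>F)` column of the summary table. The names `anova_table`, `p_value`, `alpha`, and `reject_null` are illustrative. With only two scores per group, this is toy data, not a realistic study.

```r
# Example data from the table above (rebuilt so this block is self-contained)
df_scores <- data.frame(
  group = c("math major", "math major",
            "writing major", "writing major",
            "psychology major", "psychology major"),
  score = c(88, 81, 92, 80, 94, 83)
)

# Run the one-way ANOVA: does mean score differ by group?
results <- aov(score ~ group, data = df_scores)

# summary() on an aov object returns a list; its first element is the ANOVA table
anova_table <- summary(results)[[1]]

# The first row is the `group` term; the second row holds the residuals
p_value <- anova_table[["Pr(>F)"]][1]

# Compare against the significance threshold
alpha <- 0.05
reject_null <- p_value < alpha

print(p_value)
print(reject_null)
```

For these six scores the group means are close together relative to the spread within each group, so the p-value is well above `0.05` and the null hypothesis is not rejected.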

Let’s look at an example of ANOVA in action.

### Instructions

**1.**

We’ve reformatted the store data from the last exercise into a data frame `stores`. View `stores` to see what columns it contains.

Open the hint for an explanation of the columns.

**2.**

Perform an ANOVA on the `stores` data and save the test results to a variable `results`. Use the `summary()` function to view the p-value of the test. Does this p-value lead you to reject the null hypothesis?

**3.**

Let’s say the sales at location B have suddenly *soared* (maybe there’s an ant convention happening nearby). The new sales for location B have been updated in the `stores_new` data frame.

Re-run the ANOVA test on `stores_new` and save the test results to a variable `results_new`. Use the `summary()` function to see what the p-value is now. Does this new value make sense?