Skip to Content
Learn
Aggregates in R
Calculating Aggregate Functions I

When we have a bunch of data, we often want to calculate aggregate statistics (mean, standard deviation, median, percentiles, etc.) over certain subsets of the data.

Suppose we have a grade book with columns student, assignment_name, and grade:

student assignment_name grade
Amy Assignment 1 96
Amy Assignment 2 87
Bob Assignment 1 91
Bob Assignment 2 75
Chris Assignment 1 83
Chris Assignment 2 88

We want to get an average grade for each student across all assignments. We can do this using the helpful dplyr function group_by().

For this example, we’d use the following piece of code:

grades <- df %>% group_by(student) %>% summarize(mean_grade = mean(grade))

The output might look something like this:

student mean_grade
Amy 91.5
Bob 83
Chris 85.5

In general, we use the following syntax to calculate aggregates:

df %>% group_by(column_1) %>% summarize(aggregate_name = command(column_2))
  • column_1 (student in our example) is the column that we want to group_by()
  • column_2 (grade in our example) is the column that we want to apply command(), a summary function, to using summarize()
  • aggregate_name is the name assigned to the calculated aggregate

In addition to the summary functions discussed in the last exercise (mean(), median(), sd(), var(), min(), max(), IQR() and n_distinct()), another helpful summary function, especially for grouped data, is n(). n() will return the count of the rows within a group, and does not require a column as an argument. To get the count of the rows in each group of students from our example:

grades <- df %>% group_by(student) %>% summarize(count = n())

Instructions

1.

Let’s return to our orders data from ShoeFly.com.

In the previous exercise, our finance department wanted to know the most expensive shoe that we sold.

Now, they want to know the price of the most expensive shoe for each shoe_type (i.e., the price of the most expensive boot, the price of the most expensive ballet flat, etc.).

Save your answer to the variable pricey_shoes, and view it.

2.

The inventory team wants to know how many of each shoe_type has been sold so they can forecast inventory for the future.

Save your answer to the variable shoes_sold, and view it.

Folder Icon

Take this course for free

Already have an account?