While group_by()
is most often used with summarize()
to calculate summary statistics, it can also be used with the dplyr function filter()
to filter rows of a data frame based on per-group metrics.
Suppose you work at an educational technology company that offers online courses and collects user data in an enrollments
data frame:
user_id | course | quiz_score |
---|---|---|
1234 | learn_r | 80 |
1234 | learn_python | 95 |
4567 | learn_r | 90 |
4567 | learn_python | 55 |
You want to identify all the enrollments in difficult courses, which you define as courses with an average quiz_score
less than 80
. To filter the data frame to just these rows:
enrollments %>% group_by(course) %>% filter(mean(quiz_score) < 80)
group_by()
groups the data frame bycourse
into two groups:learn-r
andlearn-python
filter()
will keep all the rows of the data frame whose per-group (per-course) averagequiz_score
is less than80
Rather than filtering rows by the individual column values, the rows will be filtered by their group value since a summary function is used! The resulting data frame would look like this:
user_id | course | quiz_score |
---|---|---|
1234 | learn_python | 95 |
4567 | learn_python | 55 |
- The average
quiz_score
for thelearn-r
course is85
, so all the rows ofenrollments
with a value oflearn-r
in thecourse
column are filtered out. - The average
quiz_score
for thelearn-python
course is75
, so all the rows ofenrollments
with a value oflearn-python
in thecourse
column remain.
Instructions
Your boss at ShoeFly.com wants to gain a better insight into the orders of the most popular shoe_type
s.
Group orders
by shoe_type
and filter to only include orders with a shoe_type
that has been ordered more than 16
times. Save the result to most_pop_orders
, and view it.
You can include any of the summary functions as part of an argument to filter()
, including n()
!