Learn

While group_by() is most often used with summarize() to calculate summary statistics, it can also be used with the dplyr function filter() to filter rows of a data frame based on per-group metrics.

Suppose you work at an educational technology company that offers online courses and collects user data in an enrollments data frame:

user_id  course        quiz_score
1234     learn_r       80
1234     learn_python  95
4567     learn_r       90
4567     learn_python  55

You want to identify all the enrollments in difficult courses, which you define as courses with an average quiz_score less than 80. To filter the data frame to just these rows:

enrollments %>% group_by(course) %>% filter(mean(quiz_score) < 80)
  • group_by() groups the data frame by course into two groups: learn_r and learn_python
  • filter() keeps every row belonging to a group (course) whose average quiz_score is less than 80

Because the condition uses a summary function, mean(), on a grouped data frame, filter() evaluates it once per group and keeps or drops all of a group's rows together, rather than testing each row's own quiz_score. The resulting data frame would look like this:

user_id  course        quiz_score
1234     learn_python  95
4567     learn_python  55
  • The average quiz_score for the learn_r course is 85, so all the rows of enrollments with a value of learn_r in the course column are filtered out.
  • The average quiz_score for the learn_python course is 75, so all the rows of enrollments with a value of learn_python in the course column remain.
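
The example above can be reproduced end to end with a minimal sketch like the following. It assumes dplyr is installed and rebuilds enrollments by hand from the table shown earlier; the names course_means and difficult are illustrative and not part of the lesson.

```r
library(dplyr)

# Rebuild the example enrollments data frame from the table above
enrollments <- data.frame(
  user_id = c(1234, 1234, 4567, 4567),
  course = c("learn_r", "learn_python", "learn_r", "learn_python"),
  quiz_score = c(80, 95, 90, 55)
)

# Per-course averages: the value that filter() compares against 80
# (course_means is an illustrative name)
course_means <- enrollments %>%
  group_by(course) %>%
  summarize(mean_score = mean(quiz_score))

# Keep only the rows from courses whose average quiz_score is below 80
# (difficult is an illustrative name)
difficult <- enrollments %>%
  group_by(course) %>%
  filter(mean(quiz_score) < 80)

print(course_means)
print(difficult)
```

Unlike summarize(), which collapses each group to a single row, the grouped filter() returns the original rows unchanged, so user_id and quiz_score are still available in the result.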

Instructions

1.

Your boss at ShoeFly.com wants better insight into the orders of the most popular shoe_types.

Group orders by shoe_type and filter to only include orders with a shoe_type that has been ordered more than 16 times. Save the result to most_pop_orders, and view it.

You can use any summary function inside filter(), including n(), which returns the number of rows in the current group!
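
As a rough illustration of n() inside a grouped filter(), here is a small sketch on made-up data. The toy_enrollments data frame, its values, the threshold of 2, and the name popular are all hypothetical; it is not the lesson's orders data.

```r
library(dplyr)

# Hypothetical toy data, not the lesson's orders data frame
toy_enrollments <- data.frame(
  user_id = c(1, 2, 3, 4, 5),
  course = c("learn_r", "learn_r", "learn_r", "learn_python", "learn_python"),
  quiz_score = c(70, 85, 90, 60, 75)
)

# n() returns the number of rows in the current group,
# so this keeps only the courses with more than 2 enrollments
popular <- toy_enrollments %>%
  group_by(course) %>%
  filter(n() > 2)

print(popular)
```

Here learn_r has three rows and learn_python has two, so only the learn_r rows pass the filter.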
