Key Concepts

Review core concepts you need to learn to master this subject

dplyr package

The dplyr package provides functions for exploring and manipulating datasets. At the most basic level, its functions are data manipulation “verbs” such as select(), filter(), mutate(), arrange(), and summarise(), among others, which can be chained to perform multiple steps in a few lines of code. The package works well both for simple operations on a single dataset and for producing complex results from large datasets.
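
As a rough illustration of chaining verbs, the sketch below pipes a small data frame through several steps. The countries data and its values are made up for this example and are not part of the lesson's dataset.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
countries <- tibble(
  country    = c("A", "B", "C", "D"),
  continent  = c("X", "X", "Y", "Y"),
  population = c(10, 20, 30, 40),
  gdp        = c(100, 400, 300, 1200)
)

result <- countries %>%
  filter(population > 10) %>%                    # keep rows with population above 10
  mutate(gdp_per_capita = gdp / population) %>%  # add a derived column
  arrange(desc(gdp_per_capita)) %>%              # sort from highest to lowest
  select(country, gdp_per_capita)                # keep only these two columns

result
```

Each verb takes a data frame and returns a data frame, which is what lets the pipe operator %>% pass the result of one step into the next.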

dplyr summarise()

dplyr’s summarise() function can collapse a data frame to a single row, summarising the specified columns by applying a summary function.

Summary functions, or functions that take a vector of values and return a single value, include: mean(), median(), sd() (standard deviation), var() (variance), min(), max(), IQR() (interquartile range), n_distinct() (number of unique values), and sum().
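
The examples in this section pass na.rm = TRUE to the summary functions. A minimal sketch of why, using a made-up vector x that contains one missing value:

```r
library(dplyr)  # needed here only for n_distinct()

# Hypothetical values, invented for illustration
x <- c(2, 4, 4, 6, NA)

mean_with_na    <- mean(x)                  # NA, because one value is missing
mean_without_na <- mean(x, na.rm = TRUE)    # 4, the NA is dropped first
median_value    <- median(x, na.rm = TRUE)  # 4
total           <- sum(x, na.rm = TRUE)     # 16
unique_count    <- n_distinct(x, na.rm = TRUE)  # 3 distinct non-missing values
```

Each call takes the whole vector and returns a single value, which is the property summarise() relies on.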

For example, to find the mean population from a countries data frame:

countries %>% summarise(mean_pop = mean(population, na.rm = TRUE))

To find the mean and standard deviation of both the population and gdp columns from a countries data frame:

countries %>% summarise(mean_pop = mean(population, na.rm = TRUE), sd_pop = sd(population, na.rm = TRUE), mean_gdp = mean(gdp, na.rm = TRUE), sd_gdp = sd(gdp, na.rm = TRUE))

dplyr group_by()

dplyr’s group_by() function can group together rows of a data frame with the same value(s) in either a specified column or multiple columns, allowing for the application of summary functions on the individual groups.

group_by() changes the unit of analysis from a complete dataset to individual groups.

For example, consider a data frame countries. To find the mean and standard deviation of the population column grouped by continent:

countries %>% group_by(continent) %>% summarise(mean_pop = mean(population, na.rm = TRUE), sd_pop = sd(population, na.rm = TRUE))

To find the mean math_score and reading_score by age and gender from a students data frame:

students %>% group_by(age, gender) %>% summarise(mean_math_score = mean(math_score, na.rm = TRUE), mean_reading_score = mean(reading_score, na.rm = TRUE))

filter() with group_by()

Combining dplyr’s group_by() and filter() functions allows for the filtering of rows of a data frame based on per-group metrics.

Grouping a data frame by a selection of columns followed by a call to filter() allows for filtering a data frame based on per-group summary functions.

Given a students data frame, to keep all the rows of students whose per-age average math_score is greater than 80:

students %>% group_by(age) %>% filter(mean(math_score, na.rm = TRUE) > 80)
  • group_by() groups the data frame by age
  • filter() will keep all the rows of the data frame whose per-group (per-age) average math_score is greater than 80
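
To make the per-group behaviour concrete, here is a small sketch with a made-up students data frame. The names and scores are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
students <- tibble(
  name       = c("s1", "s2", "s3", "s4"),
  age        = c(10, 10, 11, 11),
  math_score = c(90, 80, 70, 85)
)

kept <- students %>%
  group_by(age) %>%
  filter(mean(math_score, na.rm = TRUE) > 80) %>%  # condition is evaluated once per age group
  ungroup()                                         # drop the grouping before further use

kept
```

The age-10 group has a mean of 85, so both of its rows are kept, including s2, whose own score is exactly 80. The age-11 group has a mean of 77.5, so both of its rows are dropped, including s4, whose own score is 85. The filter acts on whole groups, not on individual scores.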

group_by() with mutate()

Combining dplyr’s group_by() and mutate() functions allows for the creation of new variables that involve per-group metrics in their calculation.

Grouping a data frame by a selection of columns followed by a call to mutate() allows for the creation of new columns based on per-group summary functions.

Given a students data frame, to add a new column containing the difference between a student’s score and the mean math_score for each student’s respective age group:

students %>% group_by(age) %>% mutate(diff_from_age_mean = math_score - mean(math_score, na.rm = TRUE))
  • group_by() groups the data frame by age
  • mutate() will add a new column diff_from_age_mean which is calculated as the difference between a row’s individual math_score and the mean(math_score) for that row’s age-group
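
A short sketch of the same call on a made-up students data frame shows that every row is kept and each one is compared with the mean of its own age group. The names and scores are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
students <- tibble(
  name       = c("s1", "s2", "s3", "s4"),
  age        = c(10, 10, 11, 11),
  math_score = c(90, 80, 70, 85)
)

with_diff <- students %>%
  group_by(age) %>%
  mutate(diff_from_age_mean = math_score - mean(math_score, na.rm = TRUE)) %>%
  ungroup()  # drop the grouping before further use

with_diff
```

Unlike summarise(), which returns one row per group, mutate() returns all four rows with the extra column: 5 and -5 for the age-10 group (mean 85), and -7.5 and 7.5 for the age-11 group (mean 77.5).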

Aggregates in R

Lesson 1 of 1

  1. In this lesson you will learn about aggregates in R using dplyr. An aggregate statistic is a way of creating a single number that describes a group of numbers. Common aggregate statistics include m…
  2. In this exercise, you will learn how to combine all of the values from a column for a single calculation. This can be done with the help of the dplyr function summarize(), which returns a new da…
  3. When we have a bunch of data, we often want to calculate aggregate statistics (mean, standard deviation, median, percentiles, etc.) over certain subsets of the data. Suppose we have a grade book w…
  4. Sometimes, we want to group by more than one column. We can do this by passing multiple column names as arguments to the group_by function. Imagine that we run a chain of stores and have data abo…
  5. While group_by() is most often used with summarize() to calculate summary statistics, it can also be used with the dplyr function filter() to filter rows of a data frame based on per-group metrics…
  6. group_by() can also be used with the dplyr function mutate() to add columns to a data frame that involve per-group metrics. Consider the same educational technology company’s enrollments table fro…
  7. This lesson introduced you to aggregates in R using dplyr. You learned: How to calculate summary statistics with summarize() How to perform aggregate statistics over individual rows with the…
