In this exercise, you will learn how to combine all of the values from a column for a single calculation. This can be done with the help of the dplyr function summarize()
, which returns a new data frame containing the desired calculation.
Some examples of this type of calculation include:
- The data frame
customers
contains the names and ages of all of your customers. You want to find the median age:
customers %>% select(age) # c(23, 25, 31, 35, 35, 46, 62) customers %>% summarize(median_age = median(age)) # 35
- The data frame
shipments
contains address information for all shipments that you’ve sent out in the past year. You want to know how many different states you have shipped to.
shipments %>% select(states) # c('CA', 'CA', 'CA', 'CA', 'NY', 'NY', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ') shipments %>% summarize(n_distinct_states = n_distinct(states)) # 3
- The data frame
inventory
contains a list of types of t-shirts that your company makes. You want to know the standard deviation of the prices of your inventory.
inventory %>% select(price) # c(31, 23, 30, 27, 30, 22, 27, 22, 39, 27, 36) inventory %>% summarize(sd_price = sd(price)) # 5.465595
The general syntax for these calculations is:
df %>% summarize(var_name = command(column_name))
df
is the data frame you are working withsummarize
is a dplyr function that reduces multiple values to a single valuevar_name
is the name you assign to the column that stores the results of the summary function in the returned data framecommand
is the summary function that is applied to the column bysummarize()
column_name
is the name of the column ofdf
that is being summarized
The following table includes common summary functions that can be given as an argument to summarize()
:
Command | Description |
---|---|
mean() |
Average of all values in column |
median() |
Median value of column |
sd() |
Standard deviation of column |
var() |
Variance of column |
min() |
Minimum value in column |
max() |
Maximum value in column |
IQR() |
Interquartile range of column |
n_distinct() |
Number of unique values in column |
sum() |
Sum values of column |
Instructions
ShoeFly.com has a new batch of orders stored in the data frame orders
. Inspect the first 10
rows of the data frame using head()
.
Our finance department wants to know the price of the most expensive pair of shoes purchased. Save your answer to the variable most_expensive
.
Woah, wait a minute! Take a look at the output of the code you just ran. The result for the most expensive pair of shoes is coming back as NA
. Why is this happening?
If you scroll up in the rendered notebook to where orders.csv
is loaded, you can see a warning about row 99
of the file. There is a missing column of information! It appears that the price
for row 99
was not in the file, and this is causing your maximum value calculation to return NA
.
Add the following as an additional argument to max()
so that it removes all missing values before computing the maximum value.
na.rm = TRUE
Our fashion department wants to know how many different colors of shoes we are selling. Save your answer to the variable num_colors
.