In the previous exercise, we learned how to use rename()
to join two data frames whose columns don’t match.
A better option, however, exists. We can add the by
argument when calling inner_join()
to specify which columns we want to join on. In the example below, the “left” table is the one that comes first (orders
), and the “right” table is the one that comes second (customers
). This syntax says that we should match the customer_id
from orders to the id
in customers
.
orders %>% inner_join(customers, by = c('customer_id' = 'id'))
The resulting data frame will look like this:
id | customer_id | product_id | quantity | timestamp | customer_name | address | phone_number |
---|---|---|---|---|---|---|---|
1 | 2 | 3 | 1 | 2017-01-01 00:00:00 | Jane Doe | 456 Park Ave | 949-867-5309 |
2 | 2 | 2 | 3 | 2017-01-01 00:00:00 | Jane Doe | 456 Park Ave | 949-867-5309 |
3 | 3 | 1 | 1 | 2017-01-01 00:00:00 | Joe Schmo | 789 Broadway | 112-358-1321 |
4 | 3 | 2 | 2 | 2016-02-01 00:00:00 | Joe Schmo | 789 Broadway | 112-358-1321 |
5 | 3 | 3 | 3 | 2017-02-01 00:00:00 | Joe Schmo | 789 Broadway | 112-358-1321 |
6 | 1 | 4 | 2 | 2017-03-01 00:00:00 | John Smith | 123 Main St. | 212-123-4567 |
7 | 1 | 1 | 1 | 2017-02-02 00:00:00 | John Smith | 123 Main St. | 212-123-4567 |
8 | 1 | 4 | 1 | 2017-02-02 00:00:00 | John Smith | 123 Main St. | 212-123-4567 |
If we use this syntax, we’ll end up with two columns called id
, one from the first table and one from the second. R won’t let you have two columns with the same name, so it will change them to id_x
and id_y
.
It will look like this:
id_x | customer_id | product_id | quantity | timestamp | id_y | customer_name | address | phone_number |
---|---|---|---|---|---|---|---|---|
1 | 2 | 3 | 1 | 2017-01-01 00:00:00 | 2 | Jane Doe | 456 Park Ave | 949-867-5309 |
2 | 2 | 2 | 3 | 2017-01-01 00:00:00 | 2 | Jane Doe | 456 Park Ave | 949-867-5309 |
3 | 3 | 1 | 1 | 2017-01-01 00:00:00 | 3 | Joe Schmo | 789 Broadway | 112-358-1321 |
4 | 3 | 2 | 2 | 2016-02-01 00:00:00 | 3 | Joe Schmo | 789 Broadway | 112-358-1321 |
5 | 3 | 3 | 3 | 2017-02-01 00:00:00 | 3 | Joe Schmo | 789 Broadway | 112-358-1321 |
6 | 1 | 4 | 2 | 2017-03-01 00:00:00 | 1 | John Smith | 123 Main St. | 212-123-4567 |
7 | 1 | 1 | 1 | 2017-02-02 00:00:00 | 1 | John Smith | 123 Main St. | 212-123-4567 |
8 | 1 | 4 | 1 | 2017-02-02 00:00:00 | 1 | John Smith | 123 Main St. | 212-123-4567 |
The new column names id_x
and id_y
aren’t very helpful for us when we read the table. We can help make them more useful by using the keyword suffix
. We can provide a vector of suffixes to use instead of “_x” and “_y”.
For example, we could use the following code to make the suffixes reflect the table names:
orders %>% inner_join(customers, by = c('customer_id' = 'id'), suffix = c('_order','_customer'))
The resulting table would look like this:
id_order | customer_id | product_id | quantity | timestamp | id_customer | customer_name | address | phone_number |
---|---|---|---|---|---|---|---|---|
1 | 2 | 3 | 1 | 2017-01-01 00:00:00 | 2 | Jane Doe | 456 Park Ave | 949-867-5309 |
2 | 2 | 2 | 3 | 2017-01-01 00:00:00 | 2 | Jane Doe | 456 Park Ave | 949-867-5309 |
3 | 3 | 1 | 1 | 2017-01-01 00:00:00 | 3 | Joe Schmo | 789 Broadway | 112-358-1321 |
4 | 3 | 2 | 2 | 2016-02-01 00:00:00 | 3 | Joe Schmo | 789 Broadway | 112-358-1321 |
5 | 3 | 3 | 3 | 2017-02-01 00:00:00 | 3 | Joe Schmo | 789 Broadway | 112-358-1321 |
6 | 1 | 4 | 2 | 2017-03-01 00:00:00 | 1 | John Smith | 123 Main St. | 212-123-4567 |
7 | 1 | 1 | 1 | 2017-02-02 00:00:00 | 1 | John Smith | 123 Main St. | 212-123-4567 |
8 | 1 | 4 | 1 | 2017-02-02 00:00:00 | 1 | John Smith | 123 Main St. | 212-123-4567 |
Instructions
Join the orders
and products
data frames using an inner_join()
, with orders
as the first argument and products
as the second argument. Also include the by
argument to indicate which columns to join on. Save your results to the variable orders_products
, and view it.
Now join the products
and orders
data frames using an inner_join()
, with products
as the first argument and orders
as the second argument. Also include the by
argument to indicate which columns to join on, as well as a suffix
argument c('_product','_order')
. Save your results to the variable products_orders
, and view it.