Learn

In order to efficiently store data, we often spread related information across multiple tables.

For instance, imagine that we own an e-commerce business and we want to track the products that have been ordered from our website.

We could have one table with all of the following information:

  • order_id
  • customer_id
  • customer_name
  • customer_address
  • customer_phone_number
  • product_id
  • product_description
  • product_price
  • quantity
  • timestamp

However, a lot of this information would be repeated. If the same customer makes multiple orders, that customer’s name, address, and phone number will be reported multiple times. If the same product is ordered by multiple customers, then the product price and description will be repeated. This will make our orders table big and unmanageable.

So instead, we can split our data into three tables:

  • orders would contain the information necessary to describe an order: order_id, customer_id, product_id, quantity, and timestamp
  • products would contain the information to describe each product: product_id, product_description and product_price
  • customers would contain the information for each customer: customer_id, customer_name, customer_address, and customer_phone_number

In this lesson, we will learn the dplyr commands that help us work with data stored in multiple tables.

Instructions

1.

In notebook.Rmd, we’ve loaded in three data frames: orders, products, and customers.

Begin by inspecting orders using head(). What columns are in the data frame?

2.

Now inspect products using head().

Are there any columns in common with orders?

3.

Finally inspect customers using head().

Are there any columns in common with orders or products?

Take this course for free

Mini Info Outline Icon
By signing up for Codecademy, you agree to Codecademy's Terms of Service & Privacy Policy.

Or sign up using:

Already have an account?