We often describe data that is easy to analyze and visualize as “tidy data”. What does it mean to have tidy data?

For data to be tidy, it must have:

  • Each variable as a separate column
  • Each row as a separate observation

For example, we would want to reshape a table like:

Account     Checking  Savings
“12456543”  8500      8900
“12283942”  6410      8020
“12839485”  78000     92000

Into a table that looks more like:

Account     Account Type  Amount
“12456543”  “Checking”    8500
“12456543”  “Savings”     8900
“12283942”  “Checking”    6410
“12283942”  “Savings”     8020
“12839485”  “Checking”    78000
“12839485”  “Savings”     92000
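
The reshaping above can be sketched in a few lines of base R. This is only an illustration, not necessarily how the lesson performs the step: the data frame names wide and long are made up for the example, and the column is called Account_Type (with an underscore) so that it is a syntactic R name.

```r
# Wide table: one column per account type (values taken from the example above)
wide <- data.frame(
  Account  = c("12456543", "12283942", "12839485"),
  Checking = c(8500, 6410, 78000),
  Savings  = c(8900, 8020, 92000),
  stringsAsFactors = FALSE
)

# Tidy table: one row per (account, account type) observation.
# t() turns the two amount columns into a 2 x 3 matrix, and as.vector()
# reads it column by column, giving Checking then Savings for each account.
long <- data.frame(
  Account      = rep(wide$Account, each = 2),
  Account_Type = rep(c("Checking", "Savings"), times = nrow(wide)),
  Amount       = as.vector(t(wide[, c("Checking", "Savings")])),
  stringsAsFactors = FALSE
)

print(long)
```

Each variable (Account, Account_Type, Amount) now has its own column, and each row is a single observation.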

The first step in diagnosing whether a dataset is tidy is to explore and probe it using base R and dplyr functions.

You’ve seen most of the functions we often use to diagnose a dataset for cleaning. Some of the most useful ones are:

  • head() — display the first 6 rows of the table
  • summary() — display the summary statistics of the table
  • colnames() — display the column names of the table
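
As a quick illustration of the three functions listed above, here is a minimal sketch run on a small stand-in data frame. The name accounts and its contents are hypothetical and are not part of the lesson's notebook.

```r
# Hypothetical data frame standing in for a dataset you want to inspect
accounts <- data.frame(
  account  = c("12456543", "12283942", "12839485"),
  checking = c(8500, 6410, 78000),
  savings  = c(8900, 8020, 92000),
  stringsAsFactors = FALSE
)

# First rows of the table (6 by default; this table only has 3)
print(head(accounts))

# Summary statistics for each column
print(summary(accounts))

# Column names as a character vector
print(colnames(accounts))
```

Looking at the column names and the first few rows is usually enough to spot values, such as account types, that are stored as column headers rather than in a column of their own.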



Provided in notebook.Rmd are two data frames, grocery_1 and grocery_2.

Begin by viewing the head() of both grocery_1 and grocery_2.


Explore the data frames using the other functions listed.

Which data frame is “clean”, tidy, and ready for analysis? Create a variable named clean_data_frame and assign it the value 1 if grocery_1 is the clean and tidy data frame, or 2 if grocery_2 is.

