We often describe data that is easy to analyze and visualize as “tidy data”. What does it mean to have tidy data?
For data to be tidy, it must have:
- Each variable as a separate column
- Each row as a separate observation
For example, we would want to reshape a table like:
Account | Checking | Savings |
---|---|---|
“12456543” | 8500 | 8900 |
“12283942” | 6410 | 8020 |
“12839485” | 78000 | 92000 |
Into a table that looks more like:
Account | Account Type | Amount |
---|---|---|
“12456543” | “Checking” | 8500 |
“12456543” | “Savings” | 8900 |
“12283942” | “Checking” | 6410 |
“12283942” | “Savings” | 8020 |
“12839485” | “Checking” | 78000 |
“12839485” | “Savings” | 92000 |
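The reshaping above can be sketched in R. This is only a minimal illustration, assuming the tidyr package is installed; the names accounts_wide and accounts_long are hypothetical, and the new column is called Account_Type (with an underscore) to keep it a valid R name.

```r
library(tidyr)

# Wide table from the example: one column per account type
accounts_wide <- data.frame(
  Account = c("12456543", "12283942", "12839485"),
  Checking = c(8500, 6410, 78000),
  Savings = c(8900, 8020, 92000)
)

# Gather the Checking and Savings columns into two new columns:
# one holding the account type, one holding the amount
accounts_long <- pivot_longer(
  accounts_wide,
  cols = c(Checking, Savings),
  names_to = "Account_Type",
  values_to = "Amount"
)

print(accounts_long)
```

Each account now appears on two rows, one per account type, so every row is a single observation and every column is a single variable.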
The first step in diagnosing whether a dataset is tidy is to explore and probe it with base R and dplyr functions.
You’ve seen most of the functions we often use to diagnose a dataset for cleaning. Some of the most useful ones are:
- head(): display the first 6 rows of the table
- summary(): display the summary statistics of the table
- colnames(): display the column names of the table
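As a rough sketch of how these functions fit together, the snippet below runs them on a small made-up data frame. The name toy_accounts and its values are hypothetical and exist only for illustration; they are not the lesson's grocery data.

```r
# Hypothetical data frame for illustration only
toy_accounts <- data.frame(
  Account = as.character(1001:1008),
  Checking = c(8500, 6410, 78000, 120, 950, 4300, 2100, 60),
  Savings = c(8900, 8020, 92000, 0, 1500, 7000, 300, 45)
)

# head() returns the first 6 rows by default
first_rows <- head(toy_accounts)
print(first_rows)

# colnames() returns the column names as a character vector
column_names <- colnames(toy_accounts)
print(column_names)

# summary() reports per-column statistics (min, quartiles, mean, max
# for numeric columns; length and class for character columns)
print(summary(toy_accounts))
```

When inspecting the output, one thing to look for is column names that are really values of a variable (here, Checking and Savings are both account types), which suggests the table is not yet tidy.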
Instructions
Provided in notebook.Rmd are two data frames, grocery_1 and grocery_2.
Begin by viewing the head() of both grocery_1 and grocery_2.
Explore the data frames using the other functions listed.
Which data frame is “clean”, tidy, and ready for analysis? Create a variable named clean_data_frame and assign it the value 1 if grocery_1 is a clean and tidy data frame or 2 if grocery_2 is a clean and tidy data frame.