Learn
Data Cleaning with Pandas
Diagnose the Data

We often describe data that is easy to analyze and visualize as “tidy data”. What does it mean to have tidy data?

For data to be tidy, it must have:

  • Each variable as a separate column
  • Each row as a separate observation

For example, we would want to reshape a table like:

Account Checkings Savings
“12456543” 8500 8900
“12283942” 6410 8020
“12839485” 78000 92000

Into a table that looks more like:

Account Account Type Amount
“12456543” “Checking” 8500
“12456543” “Savings” 8900
“12283942” “Checking” 6410
“12283942” “Savings” 8020
“12839485” “Checking” 78000
“12839485” “Savings” 920000

The first step of diagnosing whether or not a dataset is tidy is using pandas functions to explore and probe the dataset.

You’ve seen most of the functions we often use to diagnose a dataset for cleaning. Some of the most useful ones are:

  • .head() — display the first 5 rows of the table
  • .info() — display a summary of the table
  • .describe() — display the summary statistics of the table
  • .columns — display the column names of the table
  • .value_counts() — display the distinct values for a column

Instructions

1.

We have provided two DataFrames, df1 and df2.

Inspect both of these DataFrames using the functions listed above.

Start by printing the .head() of both df1 and df2.

2.

Explore the DataFrames using the other functions listed.

Which DataFrame is “clean”, and ready for analysis? Create a variable named clean and set it equal to 1 for df1 and 2 for df2.

Folder Icon

Sign up to start coding

Already have an account?