Learn

Before moving on to analysis, we also need to inspect and validate numeric data. There’s no one rule for what valid numeric data looks like, since the definition of “invalid” data depends on what we expect “valid” data to look like. For example, an entry of -500 might be suspicious in a column of Celsius temperatures but reasonable in a column measuring the sea floor relative to sea level.

While there is no one rule for cleaning data, there are some common questions that can help guide our inspections:

  1. Are there any blank or missing data points?
  2. What range of numbers (or dates) are we expecting?
  3. Are there any data points that don’t meet those expectations?

We’ve already learned the tools we need to answer these questions! For example:

  • missing data points: use the Filter menu and select (Blanks)
  • unexpected values: use sorting or functions to inspect minimum and maximum values or trends over time

Of course, identifying invalid or messy data is only the first half of the process. Once we identify these datapoints we have to decide what to do with them. There is a whole science to handling missing data, so be sure to check out our course on Handling Missing Data before working with missing data.

For this course, imagine that you were asked to review the datasets to the right. You’ve opened them in Excel and are inspecting the data. What would you do in each case? Think critically about if you can (or should) use the data and what types of analysis would be valid, and what wouldn’t.

Flip the cards to read our answer.

When you’re ready to practice inspecting data for missing or suspect entries, move on to the next set of instructions!

Instructions

When you’re ready to practice what you’ve learned, download and work through our interactive exercise spreadsheet.

Even if you get an exercise correct on the first try, select the other possible answers to see our explanations of why they are incorrect.

If you get stuck, feel free to download our solutions spreadsheet to compare your work to ours!

Reflection question: If there are no missing or suspect entries, are we good to proceed with analysis?

Our Answer
It is still possible that there's some missing data we simply can't detect. It is also possible that there is incorrect data that doesn't stand out as unusual. This is why it is important to know where your data is coming from and how it was handled before you started working on it. A dataset with no details on its origin, method of collection, or chain of custody cannot be trusted to produce accurate results.

Take this course for free

Mini Info Outline Icon
By signing up for Codecademy, you agree to Codecademy's Terms of Service & Privacy Policy.

Or sign up using:

Already have an account?