Before moving on to analysis, we also need to inspect and validate numeric data. There’s no one rule for what valid numeric data looks like, since the definition of “invalid” data depends on what we expect “valid” data to look like. For example, an entry of
-500 might be suspicious in a column of Celsius temperatures but reasonable in a column measuring the sea floor relative to sea level.
While there is no one rule for cleaning data, there are some common questions that can help guide our inspections:
- Are there any blank or missing data points?
- What range of numbers (or dates) are we expecting?
- Are there any data points that don’t meet those expectations?
We’ve already learned the tools we need to answer these questions! For example:
- missing data points: use the
Filtermenu and select
- unexpected values: use sorting or functions to inspect minimum and maximum values or trends over time
Of course, identifying invalid or messy data is only the first half of the process. Once we identify these datapoints we have to decide what to do with them. There is a whole science to handling missing data, so be sure to check out our course on Handling Missing Data before working with missing data.
For this course, imagine that you were asked to review the datasets to the right. You’ve opened them in Excel and are inspecting the data. What would you do in each case? Think critically about if you can (or should) use the data and what types of analysis would be valid, and what wouldn’t.
Flip the cards to read our answer.
When you’re ready to practice inspecting data for missing or suspect entries, move on to the next set of instructions!
When you’re ready to practice what you’ve learned, download and work through our interactive exercise spreadsheet.
Even if you get an exercise correct on the first try, select the other possible answers to see our explanations of why they are incorrect.
If you get stuck, feel free to download our solutions spreadsheet to compare your work to ours!
Reflection question: If there are no missing or suspect entries, are we good to proceed with analysis?