After importing a data file, we’re often eager to get started with exploratory data analysis. Unfortunately, raw data is often messy data. If we don’t take some time to inspect and clean a dataset, we may get meaningless or misleading results.
For example, let’s take a look at this table of monthly sales:
Suppose we want to know the sales associated with
january. If we just dive in and filter the table on
month equals january, we’ll get a value of 500.
But there are several other rows of the original table that ought to be associated with
january! Unfortunately, Excel can’t tell that
january really mean the same thing. Even harder to spot is the fact that one of the rows labeled
jan has three spaces at the end. We can’t see this extra whitespace easily, but Excel will treat it as a different category.
In a small dataset, these inconsistencies can be edited by hand. For larger datasets, Excel provides text cleaning functions that we can write once and then drag down a column like any other formula. You can even nest these functions (like in the last slide to the right), though nesting is outside the scope of this course.
We’ve placed a slideshow demonstrating common text-cleaning functions in the learning environment. When you’re ready to try these out yourself, move on to the next set of instructions!
Reflection Question: in the spreadsheet, why did we trim the data first?