After importing a data file, we’re often eager to get started with exploratory data analysis. Unfortunately, raw data is often messy data. If we don’t take some time to inspect and clean a dataset, we may get meaningless or misleading results.
For example, let’s take a look at this table of monthly sales:
Suppose we want to know the sales associated with january
. If we just dive in and filter the table on month equals january
, we’ll get a value of 500.
But there are several other rows of the original table that ought to be associated with january
! Unfortunately, Excel can’t tell that jan
and january
really mean the same thing. Even harder to spot is the fact that one of the rows labeled jan
has three spaces at the end. We can’t see this extra whitespace easily, but Excel will treat it as a different category.
In a small dataset, these inconsistencies can be edited by hand. For larger datasets, Excel provides text cleaning functions that we can write once and then drag down a column like any other formula. You can even nest these functions (like in the last slide to the right), though nesting is outside the scope of this course.
We’ve placed a slideshow demonstrating common text-cleaning functions in the learning environment. When you’re ready to try these out yourself, move on to the next set of instructions!
Instructions
When you’re ready to practice what you’ve learned, download and work through our interactive exercise spreadsheet. Feel free to download our solutions spreadsheet to compare your work to ours!
Reflection Question: in the spreadsheet, why did we trim the data first?