Learn

After importing a data file, we’re often eager to get started with exploratory data analysis. Unfortunately, raw data is often messy data. If we don’t take some time to inspect and clean a dataset, we may get meaningless or misleading results.

For example, let’s take a look at this table of monthly sales:

An Excel table with Month and Sales columns. The month column has entries: jan, jan, february, february, april, march, mar, and january. The sales column has entries: 1000, 100, 50, 30, 30, 22, 1000, 500

Suppose we want to know the sales associated with january. If we just dive in and filter the table on month equals january, we’ll get a value of 500.

The same Excel table, now filtered to only show the single row with month january and sales 500

But there are several other rows of the original table that ought to be associated with january! Unfortunately, Excel can’t tell that jan and january really mean the same thing. Even harder to spot is the fact that one of the rows labeled jan has three spaces at the end. We can’t see this extra whitespace easily, but Excel will treat it as a different category.

In a small dataset, these inconsistencies can be edited by hand. For larger datasets, Excel provides text cleaning functions that we can write once and then drag down a column like any other formula. You can even nest these functions (like in the last slide to the right), though nesting is outside the scope of this course.

We’ve placed a slideshow demonstrating common text-cleaning functions in the learning environment. When you’re ready to try these out yourself, move on to the next set of instructions!

Instructions

When you’re ready to practice what you’ve learned, download and work through our interactive exercise spreadsheet. Feel free to download our solutions spreadsheet to compare your work to ours!

Reflection Question: in the spreadsheet, why did we trim the data first?

Our Answer
It is usually best to trim whitespace first. The =LEFT() function does not distinguish between whitespace and other characters. If we don't trim first, =LEFT() could take a row with whitespace at the beginning and remove some of the actual data!

Take this course for free

Mini Info Outline Icon
By signing up for Codecademy, you agree to Codecademy's Terms of Service & Privacy Policy.

Or sign up using:

Already have an account?