Text columns also frequently require cleaning in order to standardize them.
Changing Case
Python is case-sensitive which means it differentiates the words Mountains, mountains, and moUntaiNS. If capitalization is not uniform throughout a dataset, things like value counts can be incorrect.
We can change the capitalization structure of a text column using the following methods:
.lower()
converts to lowercase.upper()
converts to uppercase.title()
converts to title case
Text | Method | Modified Text |
---|---|---|
Great Smoky mountains | .lower() |
great smoky mountains |
Great Smoky mountains | .upper() |
GREAT SMOKY MOUNTAINS |
Great Smoky mountains | .title() |
Great Smoky Mountains |
Removing Whitespaces
Many data processes accidentally introduce unwanted whitespace. We can remove this using the syntax
df['Column'] = df['Column'].str.strip()
Specifically, this method removes leading and trailing whitespaces in a string which can be difficult to identify:
Text | Whitespaces | Stripped Text |
---|---|---|
‘ Grand Canyon’ | leading whitespaces | ‘Grand Canyon’ |
‘Grand Canyon ‘ | trailing whitespaces | ‘Grand Canyon’ |
‘ Grand Canyon ‘ | leading and trailing whitespaces | ‘Grand Canyon’ |
Replacing Characters
The last method we’ll introduce to modify a text column is the .replace()
method. This method replaces a specified character pattern with another character pattern using the following syntax:
df['Column'] = df['Column'].str.replace( pat='old_pattern', repl='new_pattern', regex=False)
- The
pat
parameter takes the string we want to replace - The
repl
parameter takes the string we want to replacepat
with regex=False
tells pandas to simply look for every case ofold_pattern
and replace withnew_pattern
(regex
is a more advanced find-and-replace computer language, which we don’t need for now.)
Here is an example of replacing patterns:
Original Text | Old Pattern | New Pattern | Replaced Text |
---|---|---|---|
‘Rocky.Mountains’ | '.' |
' ' |
‘Rocky Mountains’ |
The .replace()
method is often used to replace abbreviations, correct misspellings, and update text.
Instructions
First, run the Setup
cell to import libraries and datasets.
The texts in the ParkTitle
column are not correctly titled. Specifically, the words 'park'
and 'preserve'
are all lowercase. Modify the text in the ParkTitle
column to title case.
When we split the Location
column in the previous exercise, the resulting State
column contained an unnecessary leading whitespace (ex: ' TN'
). Modify the State
column by removing the whitespace.
In the ParkTitle
column, the dataset author used '&'
and 'And'
interchangeably. For example, in rows 17 and 37 we have
- | ParkTitle |
---|---|
17 | New River Gorge National Park And Preserve |
37 | Great Sand Dunes National Park & Preserve |
Modify the ParkTitle
column by replacing every instance of &
with And