Learn

Text columns also frequently require cleaning in order to standardize them.

Changing Case

Python is case-sensitive which means it differentiates the words Mountains, mountains, and moUntaiNS. If capitalization is not uniform throughout a dataset, things like value counts can be incorrect.

We can change the capitalization structure of a text column using the following methods:

  • .lower() converts to lowercase
  • .upper() converts to uppercase
  • .title() converts to title case
Text Method Modified Text
Great Smoky mountains .lower() great smoky mountains
Great Smoky mountains .upper() GREAT SMOKY MOUNTAINS
Great Smoky mountains .title() Great Smoky Mountains

Removing Whitespaces

Many data processes accidentally introduce unwanted whitespace. We can remove this using the syntax

df['Column'] = df['Column'].str.strip()

Specifically, this method removes leading and trailing whitespaces in a string which can be difficult to identify:

Text Whitespaces Stripped Text
‘ Grand Canyon’ leading whitespaces ‘Grand Canyon’
‘Grand Canyon ‘ trailing whitespaces ‘Grand Canyon’
‘ Grand Canyon ‘ leading and trailing whitespaces ‘Grand Canyon’

Replacing Characters

The last method we’ll introduce to modify a text column is the .replace() method. This method replaces a specified character pattern with another character pattern using the following syntax:

df['Column'] = df['Column'].str.replace( pat='old_pattern', repl='new_pattern', regex=False)
  • The pat parameter takes the string we want to replace
  • The repl parameter takes the string we want to replace pat with
  • regex=False tells pandas to simply look for every case of old_pattern and replace with new_pattern (regex is a more advanced find-and-replace computer language, which we don’t need for now.)

Here is an example of replacing patterns:

Original Text Old Pattern New Pattern Replaced Text
‘Rocky.Mountains’ '.' ' ' ‘Rocky Mountains’

The .replace() method is often used to replace abbreviations, correct misspellings, and update text.

Instructions

1.

First, run the Setup cell to import libraries and datasets.

The texts in the ParkTitle column are not correctly titled. Specifically, the words 'park' and 'preserve' are all lowercase. Modify the text in the ParkTitle column to title case.

2.

When we split the Location column in the previous exercise, the resulting State column contained an unnecessary leading whitespace (ex: ' TN'). Modify the State column by removing the whitespace.

3.

In the ParkTitle column, the dataset author used '&' and 'And' interchangeably. For example, in rows 17 and 37 we have

- ParkTitle
17 New River Gorge National Park And Preserve
37 Great Sand Dunes National Park & Preserve

Modify the ParkTitle column by replacing every instance of & with And

Take this course for free

Mini Info Outline Icon
By signing up for Codecademy, you agree to Codecademy's Terms of Service & Privacy Policy.

Or sign up using:

Already have an account?