Data Cleaning in R
String Parsing

Sometimes we need to modify strings in our data frames to help us transform them into more meaningful metrics. For example, in our fruits table from before:

item          price    calories
"banana"      "$1"     105
"apple"       "$0.75"  95
"peach"       "$3"     55
"peach"       "$4"     55
"clementine"  "$2.5"   35

We can see that the 'price' column is actually composed of character strings representing dollar amounts. This column could be much better represented as numeric, so that we could take the mean, calculate other aggregate statistics, or compare different fruits to one another in terms of price.

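First, we can use a regular expression, a sequence of characters that describes a pattern of text to be matched, to remove all of the dollar signs. The base R function gsub() replaces every match of a pattern with a given string. Because $ has a special meaning in regular expressions (it anchors a match to the end of the string), we escape it as '\\$', and we replace each match with an empty string '':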

fruit <- fruit %>% mutate(price = gsub('\\$', '', price))
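To see what the pattern does in isolation, here is a minimal sketch that applies gsub() to a plain character vector rather than a data frame column. The vector name prices is hypothetical and simply mirrors the values in the table above; the fixed = TRUE variant is shown as an alternative that treats the pattern as literal text, not as something the lesson itself uses.

```r
# Hypothetical vector mirroring the price column from the fruit table
prices <- c("$1", "$0.75", "$3", "$4", "$2.5")

# "$" is a regex metacharacter (end-of-string anchor), so it must be escaped.
# In an R string literal the backslash itself is written as "\\".
no_dollar <- gsub("\\$", "", prices)

# Alternative: fixed = TRUE matches the pattern as literal text, no escaping needed
no_dollar_fixed <- gsub("$", "", prices, fixed = TRUE)

print(no_dollar)
print(identical(no_dollar, no_dollar_fixed))
```

Note that the result is still a character vector; gsub() only edits the text and does not change the type.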

Then, we can use the base R function as.numeric() to convert character strings containing numerical values to numeric:

fruit <- fruit %>% mutate(price = as.numeric(price))
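The order of the two steps matters: as.numeric() returns NA (with a warning) for any string it cannot parse as a number, so the dollar signs have to be removed first. A small sketch on plain vectors, where the names cleaned and uncleaned are hypothetical:

```r
# Strings that already had the "$" removed
cleaned <- c("1", "0.75", "3", "4", "2.5")
price_num <- as.numeric(cleaned)

# Once the values are numeric, aggregate statistics work
avg_price <- mean(price_num)
print(avg_price)

# A string that still contains "$" cannot be parsed and becomes NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning here.
uncleaned <- suppressWarnings(as.numeric("$1"))
print(is.na(uncleaned))
```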

Now, we have a data frame that looks like:

item          price  calories
"banana"      1      105
"apple"       0.75   95
"peach"       3      55
"peach"       4      55
"clementine"  2.5    35
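Putting both steps together, the following is a sketch of the whole transformation, assuming the dplyr package is installed. The fruit data frame is rebuilt here from the table values so that the example runs on its own, and the per-item average at the end (avg_by_item, a hypothetical name) is just one illustration of the comparisons that a numeric price column makes possible.

```r
library(dplyr)

# Rebuild the example table; price starts out as character strings
fruit <- data.frame(
  item = c("banana", "apple", "peach", "peach", "clementine"),
  price = c("$1", "$0.75", "$3", "$4", "$2.5"),
  calories = c(105, 95, 55, 55, 35),
  stringsAsFactors = FALSE
)

# Strip the dollar sign and convert to numeric in a single mutate()
fruit <- fruit %>%
  mutate(price = as.numeric(gsub("\\$", "", price)))

# With a numeric column, we can aggregate, for example the mean price per item
avg_by_item <- fruit %>%
  group_by(item) %>%
  summarize(mean_price = mean(price))

print(fruit)
print(avg_by_item)
```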

Instructions

1.

We saw in the last exercise that finding the mean of the score column is hard to do when the data is stored as characters and not numbers.

View the head() of students to take a look at the values in the score column.

2.

Remove the '%' symbol from the score column, and save the resulting data frame to students. View students.

3.

Convert the score column to a numerical type using the as.numeric() function. Save this new data frame to students, and view it.
