In trying to get clean data, we want to make sure each column represents one type of measurement. Often, multiple measurements are recorded in the same column, and we want to separate these out so that we can do individual analysis on each variable.
Let’s say we have a column “birthday” with data formatted in MMDDYYYY format. In other words, “11011993” represents a birthday of November 1, 1993. We want to split this data into day, month, and year so that we can use these columns as separate features.
In this case, we know the exact structure of these strings. The first two characters will always correspond to the month, the second two to the day, and the rest of the string will always correspond to year. We can easily break the data into three separate columns by splitting the strings into substrings using str_sub()
, a helpful function from the stringr package:
# Create the 'month' column df %>% mutate(month = str_sub(birthday,1,2)) # Create the 'day' column df %>% mutate(day = str_sub(birthday,3,4)) # Create the 'year' column df %>% mutate(year = str_sub(birthday,5))
- The first command takes the characters starting at index
1
and ending at index2
of each value in thebirthday
column and puts it into amonth
column. - The second command takes the characters starting at index
3
and ending at index4
of each value in thebirthday
column and puts it into aday
column. - The third command takes the characters starting at index
5
and ending at the end of the value in thebirthday
column and puts it into ayear
column.
This would transform a table like:
id | birthday |
---|---|
1011 | “12241989” |
1112 | “10311966” |
1113 | “01052011” |
into a table like:
id | birthday | month | day | year |
---|---|---|---|---|
1011 | “12241989” | “12” | “24” | “1989” |
1112 | “10311966” | “10” | “31” | “1966” |
1113 | “01052011” | “01” | “05” | “2011” |
We will practice changing string columns into numerical columns (like converting "10"
to 10
) in a future exercise.
Instructions
Print out the columns of the students
data frame.
The column gender_age
sounds like it contains both gender
and age
!
View the head()
of students
to see what kind of data gender_age
contains.
It looks like the first character of the values in gender_age
contains the gender, while the rest of the string contains the age. Let’s separate out the gender data into a new column called gender
. Save the result to students
, and view the head()
.
Now, separate out the age data into a new column called age
. Save the updated data frame to students
, and view the head()
.
Now, we don’t need that gender_age
column anymore. Drop gender_age
from students
, and save the result to students
. View the head()
of students
.