Often we see duplicated rows of data in the data frames we are working with. This can happen due to errors in data collection or in saving and loading the data.
To check for duplicates, we can use the base R function `duplicated()`, which returns a logical vector telling us which rows are duplicates.
Let's say we have a data frame `fruits` that represents this table:
| item | price | calories |
|---|---|---|
| "banana" | "$1" | 105 |
| "apple" | "$0.75" | 95 |
| "apple" | "$0.75" | 95 |
| "peach" | "$3" | 55 |
| "peach" | "$4" | 55 |
| "clementine" | "$2.5" | 35 |
If we call `fruits %>% duplicated()`, we would get the following vector:
>> [1] FALSE FALSE TRUE FALSE FALSE FALSE
We can see that the third row, which represents an `"apple"` with price `"$0.75"` and `95` calories, is a duplicate row. Every value in this row is the same as in another row (the previous row).
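As a minimal sketch, assuming dplyr is loaded for the pipe, the table above could be built and checked like this (the values come straight from the table):

```r
library(dplyr)

# Recreate the fruits table from above
fruits <- data.frame(
  item     = c("banana", "apple", "apple", "peach", "peach", "clementine"),
  price    = c("$1", "$0.75", "$0.75", "$3", "$4", "$2.5"),
  calories = c(105, 95, 95, 55, 55, 35)
)

# Flag each row that is an exact copy of an earlier row
fruits %>% duplicated()
#> [1] FALSE FALSE  TRUE FALSE FALSE FALSE
```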
We can use the dplyr `distinct()` function to remove all rows of a data frame that are duplicates of another row. If we call `fruits %>% distinct()`, we would get the table:
| item | price | calories |
|---|---|---|
| "banana" | "$1" | 105 |
| "apple" | "$0.75" | 95 |
| "peach" | "$3" | 55 |
| "peach" | "$4" | 55 |
| "clementine" | "$2.5" | 35 |
The "apple"
row was deleted because it was exactly the same as another row. But the two "peach"
rows remain because there is a difference in the price column.
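Continuing the same sketch, removing fully duplicated rows takes one call:

```r
# Keep only rows that are not exact copies of another row
fruits %>% distinct()
```

Only the repeated `"apple"` row is dropped; both `"peach"` rows survive because their prices differ.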
If we wanted to remove every row with a duplicate value in the item column, we could specify a subset:
`fruits %>% distinct(item, .keep_all = TRUE)`
By default, this keeps the first occurrence of the duplicate:
| item | price | calories |
|---|---|---|
| "banana" | "$1" | 105 |
| "apple" | "$0.75" | 95 |
| "peach" | "$3" | 55 |
| "clementine" | "$2.5" | 35 |
Make sure that the columns you drop duplicates from are specifically the ones where duplicates don't belong. You wouldn't want to drop duplicates with the `price` column as a subset, for example, because it's okay if multiple items cost the same amount!
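For comparison, here is a quick sketch of the two subsetted calls discussed above, using the same hypothetical `fruits` data frame:

```r
# Deduplicate on item: keeps the first "peach" ($3) and drops the $4 one
fruits %>% distinct(item, .keep_all = TRUE)

# Deduplicating on price instead would be risky: it could drop distinct
# items that merely happen to share a price with an earlier row
fruits %>% distinct(price, .keep_all = TRUE)
```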
Instructions
The `students` data frame has a column `id` that is neither unique nor required for our analysis. Drop the `id` column from the data frame and save the result to `students`. View the `head()` of `students`.
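A possible solution sketch, assuming `students` is already loaded and dplyr's `select()` is used to drop the column:

```r
# Drop the id column and overwrite students
students <- students %>% select(-id)

# Inspect the first few rows
head(students)
```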
It seems like in the data collection process, some rows may have been recorded twice. Use the `duplicated()` function on the `students` data frame to make a vector object called `duplicates`.
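One way to write this checkpoint:

```r
# TRUE for every row that repeats an earlier row exactly
duplicates <- students %>% duplicated()
```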
`table()` is a base R function that takes any R object as an argument and returns a table with the counts of each unique value in the object. Pipe the result from the previous checkpoint into `table()` to see how many rows are exact duplicates. Make sure to save the result to `duplicate_counts`, and view `duplicate_counts`.
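Building on the `duplicates` vector from the previous step, this might look like:

```r
# Count FALSE (unique) vs. TRUE (duplicate) rows
duplicate_counts <- duplicates %>% table()
duplicate_counts
```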
Get rid of the duplicate rows in the `students` data frame and save this new data frame as `unique_students`.
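A sketch using `distinct()`:

```r
# Drop exact duplicate rows
unique_students <- students %>% distinct()
```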
Use the `duplicated()` function again to make an object called `updated_duplicates` after dropping the duplicates. Pipe the result into `table()` to see if any duplicates remain, and view `updated_duplicates`. Are there any `TRUE`s left?
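The final check might look like this; if the cleanup worked, the table should contain only FALSE counts:

```r
# Re-check for duplicates after cleaning; expect no TRUE values
updated_duplicates <- unique_students %>% duplicated() %>% table()
updated_duplicates
```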