Python:Pandas .drop_duplicates()

Published Mar 18, 2023 · Updated May 23, 2025

The .drop_duplicates() method in Pandas is used to remove duplicate rows from a DataFrame. This method identifies and eliminates duplicate rows across all columns or based on specified columns, helping to clean and preprocess data for analysis. When working with real-world datasets, duplicate records often arise due to data entry errors, merging datasets, or data collection issues, making .drop_duplicates() an essential tool in a data scientist’s toolkit.

By default, .drop_duplicates() retains the first occurrence of each duplicate row and removes all subsequent duplicates. This method is particularly useful in data cleaning pipelines, ensuring data integrity for accurate analysis, and reducing memory usage by eliminating redundant information.


Syntax

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

Parameters:

  • subset: A column label or a list of column labels to consider for identifying duplicates. Default is None, which uses all columns.
  • keep: Determines which duplicates to keep. Options include:
    • 'first': Keep the first occurrence (default)
    • 'last': Keep the last occurrence
    • False: Drop all duplicates
  • inplace: Boolean value that determines whether to modify the DataFrame directly (True) or return a new DataFrame with duplicates removed (False). Default is False.
  • ignore_index: Boolean value that, when True, resets the index of the resulting DataFrame. Default is False.

Return value:

Returns a DataFrame with duplicate rows removed or None if inplace=True.

Example 1: Basic Duplicate Removal

The following example demonstrates how to remove duplicate rows from a DataFrame containing customer information:

# Import pandas library
import pandas as pd
# Create a sample DataFrame with duplicated rows
data = {
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
}
# Create DataFrame
df = pd.DataFrame(data)
# Display the original DataFrame
print("Original DataFrame:")
print(df)
# Remove duplicates and create a new DataFrame
unique_df = df.drop_duplicates()
# Display the DataFrame with duplicates removed
print("\nDataFrame after removing duplicates:")
print(unique_df)

The output generated by the above code will be:

Original DataFrame:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago

DataFrame after removing duplicates:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago

In this example, the second row with ‘Alice’ from ‘NY’ is removed because it’s a duplicate of the first row. By default, .drop_duplicates() keeps the first occurrence of each unique row.

Example 2: Removing Duplicates Based on Specific Columns

This example shows how to remove duplicates based on specific columns in a customer order database:

# Import pandas library
import pandas as pd
# Create a sample order DataFrame
orders = pd.DataFrame({
    'CustomerID': [101, 102, 101, 103, 104, 103],
    'OrderDate': ['2023-01-15', '2023-01-20', '2023-02-10', '2023-02-15', '2023-03-01', '2023-03-05'],
    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Headphones', 'Monitor'],
    'Amount': [1200, 800, 500, 1100, 150, 350]
})
# Display the original DataFrame
print("Original Orders DataFrame:")
print(orders)
# Remove duplicates based only on CustomerID (keep first order from each customer)
unique_customers = orders.drop_duplicates(subset=['CustomerID'])
# Display the result
print("\nOrders after removing duplicates based on CustomerID:")
print(unique_customers)
# Remove duplicates based on multiple columns (CustomerID and Product)
unique_product_orders = orders.drop_duplicates(subset=['CustomerID', 'Product'])
# Display the result
print("\nOrders after removing duplicates based on CustomerID and Product:")
print(unique_product_orders)

The output produced by the above code will be:

Original Orders DataFrame:
   CustomerID   OrderDate     Product  Amount
0         101  2023-01-15      Laptop    1200
1         102  2023-01-20       Phone     800
2         101  2023-02-10      Tablet     500
3         103  2023-02-15      Laptop    1100
4         104  2023-03-01  Headphones     150
5         103  2023-03-05     Monitor     350

Orders after removing duplicates based on CustomerID:
   CustomerID   OrderDate     Product  Amount
0         101  2023-01-15      Laptop    1200
1         102  2023-01-20       Phone     800
3         103  2023-02-15      Laptop    1100
4         104  2023-03-01  Headphones     150

Orders after removing duplicates based on CustomerID and Product:
   CustomerID   OrderDate     Product  Amount
0         101  2023-01-15      Laptop    1200
1         102  2023-01-20       Phone     800
2         101  2023-02-10      Tablet     500
3         103  2023-02-15      Laptop    1100
4         104  2023-03-01  Headphones     150
5         103  2023-03-05     Monitor     350

In this example, duplicates are first removed based solely on the ‘CustomerID’ column, keeping only the first order from each customer. Then, duplicates are removed based on both ‘CustomerID’ and ‘Product’, keeping unique product orders for each customer.

Codebyte Example: Working with Different Keep Options

This example demonstrates the various keep options when working with sensor data collection:

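The interactive Codebyte snippet is not reproduced here; the following is a sketch of what such a sensor-data example might look like, using hypothetical sensor IDs, timestamps, and temperature readings:

```python
import pandas as pd

# Hypothetical sensor readings; every sensor reports more than once
sensors = pd.DataFrame({
    'SensorID': ['S1', 'S2', 'S1', 'S2', 'S3', 'S3'],
    'Timestamp': ['08:00', '08:00', '09:00', '09:00', '08:30', '09:30'],
    'Temperature': [21.5, 19.8, 22.1, 20.3, 18.9, 19.4]
})

# keep='first' (default): retain the earliest reading from each sensor
first_readings = sensors.drop_duplicates(subset=['SensorID'], keep='first')
print(first_readings)

# keep='last': retain the latest reading from each sensor
last_readings = sensors.drop_duplicates(subset=['SensorID'], keep='last')
print(last_readings)

# keep=False: drop every row whose SensorID repeats; empty here,
# because all three sensors have multiple readings
no_duplicates = sensors.drop_duplicates(subset=['SensorID'], keep=False)
print(no_duplicates)

# inplace=True modifies sensors directly and returns None
sensors.drop_duplicates(subset=['SensorID'], inplace=True)
print(sensors)
```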

This example illustrates the different keep options:

  • With keep='first', the earliest reading from each sensor is retained
  • With keep='last', the latest reading from each sensor is kept
  • With keep=False, no duplicates are retained (empty DataFrame since all sensors have multiple readings)
  • Using inplace=True modifies the original DataFrame directly

Frequently Asked Questions

1. Does .drop_duplicates() consider all columns by default?

Yes, by default, .drop_duplicates() considers all columns when identifying duplicates. To focus on specific columns, use the subset parameter.

2. How does .drop_duplicates() handle NaN values?

Unlike the usual comparison semantics where NaN != NaN, .drop_duplicates() treats NaN values as equal to each other when comparing rows, so rows that differ only in matching NaN positions are identified as duplicates.
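A minimal sketch with hypothetical data illustrating this behavior:

```python
import pandas as pd
import numpy as np

# Two rows contain NaN in the same position
df = pd.DataFrame({'A': [1, 1, 2], 'B': [np.nan, np.nan, 3.0]})

# The two (1, NaN) rows are treated as duplicates of each other,
# so only the first of them survives
deduped = df.drop_duplicates()
print(deduped)
```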

3. What is the time complexity of .drop_duplicates()?

The average-case time complexity is O(n), where n is the number of rows, since pandas uses hashing to identify duplicates; the cost also grows with the number of columns being considered.

4. Can I use .drop_duplicates() with MultiIndex DataFrames?

Yes, .drop_duplicates() works with MultiIndex DataFrames, but the subset parameter accepts column labels only, not index levels. To deduplicate on index levels, call .reset_index() first so the levels become columns, or filter with df.index.duplicated().
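A small sketch with hypothetical data, contrasting column-based deduplication with index-level deduplication via Index.duplicated():

```python
import pandas as pd

# Hypothetical MultiIndex frame with a repeated (group, id) label
idx = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 1), ('b', 2)], names=['group', 'id'])
df = pd.DataFrame({'val': [10, 10, 20]}, index=idx)

# .drop_duplicates() compares column values only
col_dedup = df.drop_duplicates()
print(col_dedup)

# To deduplicate on index levels, filter with Index.duplicated
idx_dedup = df[~df.index.duplicated(keep='first')]
print(idx_dedup)
```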

5. How does the ignore_index parameter work?

When ignore_index=True, the resulting DataFrame will have a new index with consecutive integers starting from 0, regardless of the original index values.
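A quick sketch with hypothetical data showing the difference:

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 1, 2, 3]})

# Default: surviving rows keep their original labels (0, 2, 3)
kept_labels = df.drop_duplicates()
print(kept_labels.index.tolist())

# ignore_index=True: the result is relabeled 0, 1, 2, ...
reset_labels = df.drop_duplicates(ignore_index=True)
print(reset_labels.index.tolist())
```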
