Python:Pandas .drop_duplicates()
The .drop_duplicates() method in Pandas is used to remove duplicate rows from a DataFrame. This method identifies and eliminates duplicate rows across all columns or based on specified columns, helping to clean and preprocess data for analysis. When working with real-world datasets, duplicate records often arise due to data entry errors, merging datasets, or data collection issues, making .drop_duplicates() an essential tool in a data scientist’s toolkit.
By default, .drop_duplicates() retains the first occurrence of each duplicate row and removes all subsequent duplicates. This method is particularly useful in data cleaning pipelines, ensuring data integrity for accurate analysis, and reducing memory usage by eliminating redundant information.
Syntax
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
Parameters:
- `subset`: A column label or a list of column labels to consider for identifying duplicates. Default is `None`, which uses all columns.
- `keep`: Determines which duplicates to keep. Options include:
  - `'first'`: Keep the first occurrence (default)
  - `'last'`: Keep the last occurrence
  - `False`: Drop all duplicates
- `inplace`: Boolean value that determines whether to modify the `DataFrame` directly (`True`) or return a new `DataFrame` with duplicates removed (`False`). Default is `False`.
- `ignore_index`: Boolean value that, when `True`, resets the index of the resulting `DataFrame`. Default is `False`.
Return value:
Returns a DataFrame with duplicate rows removed or None if inplace=True.
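A small sketch of the difference between the two return behaviors (the DataFrame below is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': ['x', 'x', 'y']})

# inplace=False (default): df is untouched, a new DataFrame is returned
result = df.drop_duplicates()
print(len(df), len(result))  # prints: 3 2

# inplace=True: df is modified in place and None is returned
returned = df.drop_duplicates(inplace=True)
print(returned is None, len(df))  # prints: True 2
```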
Example 1: Basic Duplicate Removal
The following example demonstrates how to remove duplicate rows from a DataFrame containing customer information:
```python
# Import pandas library
import pandas as pd

# Create a sample DataFrame with duplicated rows
data = {
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
}

# Create DataFrame
df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Remove duplicates and create a new DataFrame
unique_df = df.drop_duplicates()

# Display the DataFrame with duplicates removed
print("\nDataFrame after removing duplicates:")
print(unique_df)
```
The output generated by the above code will be:
```
Original DataFrame:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago

DataFrame after removing duplicates:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
```
In this example, the second row with ‘Alice’ from ‘NY’ is removed because it’s a duplicate of the first row. By default, .drop_duplicates() keeps the first occurrence of each unique row.
Example 2: Removing Duplicates Based on Specific Columns
This example shows how to remove duplicates based on specific columns in a customer order database:
```python
# Import pandas library
import pandas as pd

# Create a sample order DataFrame
orders = pd.DataFrame({
    'CustomerID': [101, 102, 101, 103, 104, 103],
    'OrderDate': ['2023-01-15', '2023-01-20', '2023-02-10',
                  '2023-02-15', '2023-03-01', '2023-03-05'],
    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Headphones', 'Monitor'],
    'Amount': [1200, 800, 500, 1100, 150, 350]
})

# Display the original DataFrame
print("Original Orders DataFrame:")
print(orders)

# Remove duplicates based only on CustomerID (keep first order from each customer)
unique_customers = orders.drop_duplicates(subset=['CustomerID'])

# Display the result
print("\nOrders after removing duplicates based on CustomerID:")
print(unique_customers)

# Remove duplicates based on multiple columns (CustomerID and Product)
unique_product_orders = orders.drop_duplicates(subset=['CustomerID', 'Product'])

# Display the result
print("\nOrders after removing duplicates based on CustomerID and Product:")
print(unique_product_orders)
```
The output produced by the above code will be:
```
Original Orders DataFrame:
   CustomerID   OrderDate     Product  Amount
0         101  2023-01-15      Laptop    1200
1         102  2023-01-20       Phone     800
2         101  2023-02-10      Tablet     500
3         103  2023-02-15      Laptop    1100
4         104  2023-03-01  Headphones     150
5         103  2023-03-05     Monitor     350

Orders after removing duplicates based on CustomerID:
   CustomerID   OrderDate     Product  Amount
0         101  2023-01-15      Laptop    1200
1         102  2023-01-20       Phone     800
3         103  2023-02-15      Laptop    1100
4         104  2023-03-01  Headphones     150

Orders after removing duplicates based on CustomerID and Product:
   CustomerID   OrderDate     Product  Amount
0         101  2023-01-15      Laptop    1200
1         102  2023-01-20       Phone     800
2         101  2023-02-10      Tablet     500
3         103  2023-02-15      Laptop    1100
4         104  2023-03-01  Headphones     150
5         103  2023-03-05     Monitor     350
```
In this example, duplicates are first removed based solely on the ‘CustomerID’ column, keeping only the first order from each customer. Then, duplicates are removed based on both ‘CustomerID’ and ‘Product’, keeping unique product orders for each customer.
Codebyte Example: Working with Different Keep Options
This example demonstrates the various keep options when working with sensor data collection:
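The Codebyte itself is interactive; a runnable sketch of equivalent code might look like the following (the sensor IDs and readings are illustrative, not from the original Codebyte):

```python
import pandas as pd

# Sample sensor data: every sensor reports more than once
sensors = pd.DataFrame({
    'SensorID': ['S1', 'S2', 'S1', 'S2', 'S3', 'S3'],
    'Reading': [20.1, 18.5, 21.3, 19.0, 22.7, 23.1]
})

# keep='first': retain the earliest reading from each sensor
print(sensors.drop_duplicates(subset=['SensorID'], keep='first'))

# keep='last': retain the latest reading from each sensor
print(sensors.drop_duplicates(subset=['SensorID'], keep='last'))

# keep=False: drop all duplicated sensors
# (empty result here, since every sensor has multiple readings)
print(sensors.drop_duplicates(subset=['SensorID'], keep=False))

# inplace=True: modify the original DataFrame directly
sensors.drop_duplicates(subset=['SensorID'], inplace=True)
print(sensors)
```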
The `keep` options behave as follows:

- With `keep='first'`, the earliest reading from each sensor is retained
- With `keep='last'`, the latest reading from each sensor is kept
- With `keep=False`, no duplicates are retained (an empty DataFrame here, since all sensors have multiple readings)
- Using `inplace=True` modifies the original DataFrame directly
Frequently Asked Questions
1. Does .drop_duplicates() consider all columns by default?
Yes, by default, .drop_duplicates() considers all columns when identifying duplicates. To focus on specific columns, use the subset parameter.
2. How does .drop_duplicates() handle NaN values?
NaN values are considered equal to other NaN values when comparing rows for duplicates.
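A quick check of this behavior (the data below is illustrative):

```python
import pandas as pd
import numpy as np

# Two rows that differ from each other only in sharing a NaN
df = pd.DataFrame({'A': [1, 1, 2], 'B': [np.nan, np.nan, 5.0]})

# The two (1, NaN) rows are treated as duplicates even though NaN != NaN
deduped = df.drop_duplicates()
print(deduped)  # rows 0 and 2 remain
```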
3. What is the time complexity of .drop_duplicates()?
The time complexity is approximately O(n), where n is the number of rows, although it can vary based on the number of columns being considered.
4. Can I use .drop_duplicates() with MultiIndex DataFrames?
Yes, .drop_duplicates() works with MultiIndex DataFrames, but duplicates are identified from column values only; the index is ignored, and subset accepts column labels, not index levels. To deduplicate on index levels, move them into columns first with .reset_index().
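A short sketch of the reset_index() approach (the index levels and values are illustrative):

```python
import pandas as pd

# A DataFrame with a two-level MultiIndex
idx = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 1), ('b', 2)], names=['letter', 'num'])
df = pd.DataFrame({'value': [10, 10, 20]}, index=idx)

# Duplicates are judged on columns only; the MultiIndex is ignored
print(df.drop_duplicates())  # 2 rows remain

# To include index levels, turn them into columns first
deduped = df.reset_index().drop_duplicates(subset=['letter', 'num'])
print(deduped)
```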
5. How does the ignore_index parameter work?
When ignore_index=True, the resulting DataFrame will have a new index with consecutive integers starting from 0, regardless of the original index values.
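A minimal illustration (the data below is made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3]})

# Without ignore_index, surviving rows keep their original labels
print(df.drop_duplicates().index.tolist())  # prints: [0, 2, 3]

# With ignore_index=True, the result is relabeled 0, 1, 2, ...
relabeled = df.drop_duplicates(ignore_index=True)
print(relabeled.index.tolist())  # prints: [0, 1, 2]
```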