How to Remove Duplicate Rows Across Multiple Columns in pandas
Use the DataFrame.drop_duplicates() method with the subset parameter to define which columns to evaluate and the keep parameter to control whether to retain the first occurrence, last occurrence, or no occurrences of duplicate rows.
The pandas-dev/pandas library provides a high-performance API to remove duplicate rows through the drop_duplicates method, which leverages optimized Cython hash tables to process millions of rows efficiently. Whether you need to deduplicate based on a single column or across multiple columns simultaneously, this method offers precise control over which rows persist in your DataFrame.
Architecture Behind drop_duplicates
DataFrame.drop_duplicates Public API
The entry point for removing duplicates resides in pandas/core/frame.py, where the DataFrame.drop_duplicates method accepts the subset, keep, inplace, and ignore_index parameters. This public wrapper validates user inputs and delegates to the generic implementation shared across pandas data structures.
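For reference, here is the full parameter set spelled out with its documented defaults — a minimal sketch using a throwaway two-row frame:
import pandas as pd

df = pd.DataFrame({"A": [1, 1], "B": ["x", "x"]})
# All four public parameters shown with their default values per the pandas docs
out = df.drop_duplicates(subset=None, keep="first", inplace=False, ignore_index=False)
print(out)
#    A  B
# 0  1  x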
NDFrame._drop_duplicates Implementation
The core logic lives in pandas/core/generic.py within the NDFrame._drop_duplicates method. As implemented in pandas-dev/pandas, this shared routine handles both Series and DataFrame objects by constructing a hash table from the specified columns (or the entire row if subset is None) and determining which indices survive the deduplication process based on the keep strategy.
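For intuition, here is a minimal pure-Python sketch of the keep semantics. The duplicated_mask helper below is hypothetical, for illustration only — the real routine runs through optimized Cython hash tables, not a Python loop:
def duplicated_mask(rows, keep="first"):
    """Mark rows to drop, mimicking the documented keep semantics (illustrative sketch)."""
    positions_by_key = {}
    for i, key in enumerate(rows):
        positions_by_key.setdefault(key, []).append(i)
    drop = [False] * len(rows)
    for positions in positions_by_key.values():
        if len(positions) == 1:
            continue  # unique row: never dropped
        if keep == "first":
            doomed = positions[1:]   # keep the first occurrence
        elif keep == "last":
            doomed = positions[:-1]  # keep the last occurrence
        else:                        # keep=False: drop every occurrence
            doomed = positions
        for i in doomed:
            drop[i] = True
    return drop

# Multi-column deduplication compares rows as tuples of the subset columns
rows = [(1, "x"), (1, "x"), (2, "y")]
print(duplicated_mask(rows))              # [False, True, False]
print(duplicated_mask(rows, keep=False))  # [True, True, False]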
Index-Level Duplicate Handling
When your subset includes index levels, pandas/core/indexes/base.py contains Index.drop_duplicates, which applies the same duplicate-removal semantics specifically to Index objects. This ensures consistent behavior whether you are filtering rows or index values.
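For example, Index.drop_duplicates accepts the same keep values as its DataFrame counterpart:
import pandas as pd

idx = pd.Index([1, 1, 2, 2, 3])
print(idx.drop_duplicates())            # -> Index containing [1, 2, 3]
print(idx.drop_duplicates(keep=False))  # -> Index containing [3]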
Cython Hash Table Engine
The computationally intensive duplicate detection occurs in pandas/_libs/hashtable_func_helper.pxi.in, where the Cython hash table routines are generated. This low-level implementation combines the selected column values row-wise and returns a boolean mask marking which rows are duplicates, delivering high-speed performance on large datasets.
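You can inspect that boolean mask yourself through the public DataFrame.duplicated method, which drop_duplicates builds on:
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": ["x", "x", "y"]})
# True marks rows flagged as duplicates under the keep strategy
print(df.duplicated(subset=["A", "B"]))
# 0    False
# 1     True
# 2    False
# dtype: bool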
Method Parameters for Multi-Column Deduplication
To remove duplicate rows across multiple columns, configure these parameters:
- subset – Accepts a single label, list-like, or NumPy array of column names to consider. When omitted, pandas evaluates all columns.
- keep – Controls which duplicate to retain: 'first' (default), 'last', or False (drop all rows that have duplicates).
- inplace – When True, modifies the original DataFrame and returns None instead of a new object.
- ignore_index – When True, resets the index to a fresh RangeIndex (0, 1, 2, ...) after removing rows, eliminating gaps from deleted indices.
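These options compose in a single call. A minimal sketch, assuming you want the last occurrence of each duplicate and a renumbered index:
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": ["x", "x", "y"]})
# Evaluate columns A and B, keep the last occurrence, renumber the index
deduped = df.drop_duplicates(subset=["A", "B"], keep="last", ignore_index=True)
print(deduped)
#    A  B
# 0  1  x
# 1  2  y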
Practical Code Examples
import pandas as pd
# Sample data with duplicates across columns A and B
df = pd.DataFrame({
    "A": [1, 1, 2, 2, 3],
    "B": ["x", "x", "y", "y", "z"],
    "C": [10, 10, 20, 30, 40],
})
Remove Duplicates Across Specific Columns
Pass a list of column names to subset to evaluate uniqueness based only on those columns:
# Keep first occurrence of duplicates in columns A and B
df_unique = df.drop_duplicates(subset=["A", "B"])
print(df_unique)
Output:
A B C
0 1 x 10
2 2 y 20
4 3 z 40
Retain the Last Occurrence
Use keep='last' to preserve the final instance of each duplicate group rather than the first:
df_last = df.drop_duplicates(subset=["A", "B"], keep="last")
print(df_last)
Output:
A B C
1 1 x 10
3 2 y 30
4 3 z 40
Drop All Instances of Duplicates
Set keep=False to remove every row that appears more than once, leaving only unique rows:
df_strict = df.drop_duplicates(subset=["A", "B"], keep=False)
print(df_strict)
Output:
A B C
4 3 z 40
In-Place Removal with Index Reset
Combine inplace=True and ignore_index=True to modify the original DataFrame and receive a clean sequential index:
df.drop_duplicates(subset=["A", "B"], inplace=True, ignore_index=True)
print(df)
Output:
A B C
0 1 x 10
1 2 y 20
2 3 z 40
Summary
- Use DataFrame.drop_duplicates (defined in pandas/core/frame.py) to remove duplicate rows efficiently across one or more columns.
- Specify subset as a list of column labels to limit duplicate detection to specific columns; omit it to consider the entire row.
- Control the retention policy with keep='first', 'last', or False (drop all duplicates entirely).
- Leverage ignore_index=True to generate a fresh RangeIndex after row removal, avoiding non-sequential indices.
- The heavy lifting happens in Cython (pandas/_libs/hashtable_func_helper.pxi.in) via the shared NDFrame._drop_duplicates logic in pandas/core/generic.py.
Frequently Asked Questions
How do I remove duplicate rows based on specific columns only?
Use the subset parameter. Pass a list of column names to subset to instruct pandas to evaluate uniqueness only across those columns, ignoring values in other columns. For example: df.drop_duplicates(subset=['col1', 'col2']).
What is the difference between keep='first' and keep='last'?
The difference lies in which row survives within each duplicate group. When keep='first' (the default), pandas retains the initial occurrence of each duplicate and removes subsequent matches. When keep='last', pandas retains the final occurrence and removes earlier matches, effectively keeping the bottom-most duplicate in the DataFrame.
How can I drop all instances of duplicate rows completely?
Set keep=False. This option drops every row that has one or more duplicates, returning a DataFrame containing only rows that appear exactly once in the original data. No instances of the duplicated combinations remain in the result.
Does drop_duplicates modify the original DataFrame?
Only if you specify inplace=True. By default, drop_duplicates returns a new DataFrame and leaves the original unchanged. When inplace=True, the method modifies the original object in-memory and returns None, similar to other pandas in-place operations.
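A quick check of the return value makes this contract visible:
import pandas as pd

df = pd.DataFrame({"A": [1, 1], "B": ["x", "x"]})
result = df.drop_duplicates(inplace=True)
print(result)   # None -- df itself was modified
print(len(df))  # 1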