How to Remove Duplicate Rows Across Multiple Columns in pandas

Use the DataFrame.drop_duplicates() method with the subset parameter to define which columns to evaluate and the keep parameter to control whether to retain the first occurrence, last occurrence, or no occurrences of duplicate rows.

The pandas-dev/pandas library provides a high-performance API to remove duplicate rows through the drop_duplicates method, which leverages optimized Cython hash tables to process millions of rows efficiently. Whether you need to deduplicate based on a single column or across multiple columns simultaneously, this method offers precise control over which rows persist in your DataFrame.
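
As a quick illustration, the general pattern looks like this (the frame and column names below are placeholders, not data from the library itself):

import pandas as pd

df = pd.DataFrame({"col1": [1, 1, 2], "col2": ["a", "a", "b"], "other": [5, 6, 7]})

# Keep the first row for each unique (col1, col2) combination
deduped = df.drop_duplicates(subset=["col1", "col2"], keep="first")
print(deduped)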

Architecture Behind drop_duplicates

DataFrame.drop_duplicates Public API

The entry point for removing duplicates resides in pandas/core/frame.py, where the DataFrame.drop_duplicates method accepts the subset, keep, inplace, and ignore_index parameters. This public wrapper validates user inputs and delegates to the generic implementation shared across pandas data structures.
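
To see every parameter in one place, here is a minimal call that passes each one explicitly; the values other than subset are the documented defaults:

import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": ["x", "x", "y"]})

result = df.drop_duplicates(
    subset=["A", "B"],   # columns to compare
    keep="first",        # which duplicate to retain
    inplace=False,       # return a new DataFrame instead of mutating df
    ignore_index=False,  # preserve the original index labels
)
print(result)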

NDFrame._drop_duplicates Implementation

The core logic lives in pandas/core/generic.py within the NDFrame._drop_duplicates method. As implemented in pandas-dev/pandas, this shared routine handles both Series and DataFrame objects by constructing a hash table from the specified columns (or the entire row if subset is None) and determining which indices survive the deduplication process based on the keep strategy.
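
In user-facing terms, this boils down to building the boolean mask exposed by DataFrame.duplicated and filtering with it. The sketch below illustrates that observable equivalence; it is not the internal code path itself:

import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": ["x", "x", "y"], "C": [10, 11, 20]})

# duplicated() returns True for every row flagged for removal under keep="first"
mask = df.duplicated(subset=["A", "B"], keep="first")
manual = df[~mask]

assert manual.equals(df.drop_duplicates(subset=["A", "B"]))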

Index-Level Duplicate Handling

When your subset includes index levels, pandas/core/indexes/base.py contains Index.drop_duplicates, which applies the same duplicate-removal semantics specifically to Index objects. This ensures consistent behavior whether you are filtering rows or index values.
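
For example, the same keep semantics apply when deduplicating an Index directly:

import pandas as pd

idx = pd.Index([1, 1, 2, 2, 3])

print(idx.drop_duplicates())             # keeps the first 1, 2, and 3
print(idx.drop_duplicates(keep="last"))  # keeps the last 1, 2, and 3
print(idx.drop_duplicates(keep=False))   # keeps only 3, which is never repeated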

Cython Hash Table Engine

The computationally intensive duplicate detection occurs in pandas/_libs/hashtable_func_helper.pxi.in, a Cython template that generates the hash table routines. This low-level implementation combines the selected column values for each row into hashable keys and returns a boolean mask flagging the duplicate rows to drop, delivering high-speed performance on large datasets.
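
Conceptually, the hash-based pass works like the pure-Python sketch below, which groups row positions by key and flags every position that should be dropped. This is only an illustration of the idea, not the actual Cython code:

def duplicated_mask(rows, keep="first"):
    # Group row positions by their hashable key
    seen = {}
    for i, key in enumerate(rows):
        seen.setdefault(key, []).append(i)

    mask = [True] * len(rows)  # True means "drop this row"
    for positions in seen.values():
        if keep == "first":
            mask[positions[0]] = False
        elif keep == "last":
            mask[positions[-1]] = False
        elif keep is False and len(positions) == 1:
            mask[positions[0]] = False
    return mask

rows = list(zip([1, 1, 2], ["x", "x", "y"]))  # keys built from columns A and B
print(duplicated_mask(rows))                  # [False, True, False]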

Method Parameters for Multi-Column Deduplication

To remove duplicate rows across multiple columns, configure these parameters:

  • subset – Accepts a single label, list-like, or NumPy array of column names to consider. When omitted, pandas evaluates all columns (see the sketch after this list).
  • keep – Controls which duplicate to retain: 'first' (default), 'last', or False (drop all rows that have duplicates).
  • inplace – When True, modifies the original DataFrame and returns None instead of a new object.
  • ignore_index – When True, resets the index to a fresh RangeIndex (0, 1, 2...) after removing rows, eliminating gaps from deleted indices.
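
The subset forms are interchangeable where you would expect: a single label behaves like a one-element list, and omitting subset compares whole rows. A short sketch with made-up data:

import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": ["x", "y", "y"], "C": [10, 11, 20]})

# A single label and a one-element list give the same result
assert df.drop_duplicates(subset="A").equals(df.drop_duplicates(subset=["A"]))

# Omitting subset compares entire rows (here, no full-row duplicates exist)
print(df.drop_duplicates())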

Practical Code Examples

import pandas as pd

# Sample data with duplicates across columns A and B

df = pd.DataFrame({
    "A": [1, 1, 2, 2, 3],
    "B": ["x", "x", "y", "y", "z"],
    "C": [10, 10, 20, 30, 40]
})
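
Before removing anything, you can preview which rows pandas flags as duplicates with DataFrame.duplicated, which accepts the same subset and keep parameters:

# Preview the duplicate flags for columns A and B (keep="first" by default)
print(df.duplicated(subset=["A", "B"]))
# 0    False
# 1     True
# 2    False
# 3     True
# 4    False
# dtype: bool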

Remove Duplicates Across Specific Columns

Pass a list of column names to subset to evaluate uniqueness based only on those columns:


# Keep first occurrence of duplicates in columns A and B

df_unique = df.drop_duplicates(subset=["A", "B"])
print(df_unique)

Output:


   A  B   C
0  1  x  10
2  2  y  20
4  3  z  40

Retain the Last Occurrence

Use keep='last' to preserve the final instance of each duplicate group rather than the first:

df_last = df.drop_duplicates(subset=["A", "B"], keep="last")
print(df_last)

Output:


   A  B   C
1  1  x  10
3  2  y  30
4  3  z  40

Drop All Instances of Duplicates

Set keep=False to remove every row that appears more than once, leaving only unique rows:

df_strict = df.drop_duplicates(subset=["A", "B"], keep=False)
print(df_strict)

Output:


   A  B   C
4  3  z  40

In-Place Removal with Index Reset

Combine inplace=True and ignore_index=True to modify the original DataFrame and receive a clean sequential index:

df.drop_duplicates(subset=["A", "B"], inplace=True, ignore_index=True)
print(df)

Output:


   A  B   C
0  1  x  10
1  2  y  20
2  3  z  40

Summary

  • Use DataFrame.drop_duplicates (defined in pandas/core/frame.py) to remove duplicate rows efficiently across one or more columns.
  • Specify subset as a list of column labels to limit duplicate detection to specific dimensions; omit it to consider the entire row.
  • Control retention policy with keep='first', 'last', or False (drop all duplicates entirely).
  • Leverage ignore_index=True to generate a fresh RangeIndex after row removal, avoiding non-sequential indices.
  • The heavy lifting happens in Cython (pandas/_libs/hashtable_func_helper.pxi.in) via the shared NDFrame._drop_duplicates logic in pandas/core/generic.py.

Frequently Asked Questions

How do I remove duplicate rows based on specific columns only?

Use the subset parameter. Pass a list of column names to subset to instruct pandas to evaluate uniqueness only across those columns, ignoring values in other columns. For example: df.drop_duplicates(subset=['col1', 'col2']).

What is the difference between keep='first' and keep='last'?

The difference is the position of the retained row within each duplicate group. When keep='first' (the default), pandas retains the initial occurrence of each duplicate and removes subsequent matches. When keep='last', pandas retains the final occurrence and removes earlier matches, effectively keeping the bottom-most duplicate in the DataFrame.

How can I drop all instances of duplicate rows completely?

Set keep=False. This option drops every row that has one or more duplicates, returning a DataFrame containing only rows that appear exactly once in the original data. No instances of the duplicated combinations remain in the result.

Does drop_duplicates modify the original DataFrame?

Only if you specify inplace=True. By default, drop_duplicates returns a new DataFrame and leaves the original unchanged. When inplace=True, the method modifies the original object in-memory and returns None, similar to other pandas in-place operations.
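
A minimal example of the in-place behavior:

import pandas as pd

df = pd.DataFrame({"A": [1, 1], "B": ["x", "x"]})

returned = df.drop_duplicates(inplace=True)
print(returned)  # None -- the method mutated df instead of returning a copy
print(len(df))   # 1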
