Why pandas drop_duplicates Is Not Removing Rows from Your DataFrame

DataFrame.drop_duplicates() removes only rows that are identical under the criteria you provide. The most common reasons rows appear to remain are the default keep='first' setting, a missing subset definition, and the inplace=False default.

In the pandas-dev/pandas repository, the drop_duplicates() method defined in pandas/core/frame.py follows strict rules for identifying and removing duplicate records. When this method appears to fail, it is usually because the user's expectations conflict with pandas' default handling of duplicate groups, column comparisons, or object mutation.

Common Reasons drop_duplicates Appears to Fail

Default keep='first' Preserves Initial Occurrences

In pandas/core/frame.py, DataFrame.drop_duplicates passes the keep argument to DataFrame.duplicated, which builds the boolean mask of rows to drop. By default, keep='first' retains the first row of each duplicate group and removes only the later matches. If you expect every row of a duplicated group to disappear, you must explicitly set keep=False.
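To see what each keep setting does, it can help to inspect the boolean mask from DataFrame.duplicated directly. The small frame below is made up for illustration.

```python
import pandas as pd

# Illustrative data: rows 0 and 1 are identical.
df = pd.DataFrame({"A": [1, 1, 2], "B": ["x", "x", "y"]})

# duplicated() marks the rows that drop_duplicates() would remove.
mask_first = df.duplicated(keep="first")  # flags later repeats only
mask_last = df.duplicated(keep="last")    # flags earlier repeats only
mask_none = df.duplicated(keep=False)     # flags every row in a duplicate group

print(mask_first.tolist())  # [False, True, False]
print(mask_last.tolist())   # [True, False, False]
print(mask_none.tolist())   # [True, True, False]

# Two rows survive with the default, one with keep=False.
print(len(df.drop_duplicates()), len(df.drop_duplicates(keep=False)))  # 2 1
```

Checking the mask first is a cheap way to confirm which rows pandas considers duplicates before anything is dropped.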

Unspecified subset Compares All Columns

When you do not provide the subset parameter, pandas compares every column to determine uniqueness. Rows are considered distinct if any column differs, even when the specific fields you care about are duplicated. A single differing timestamp or ID column is enough to keep two otherwise identical rows.
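A minimal sketch of the difference, using a hypothetical events table (the column names and values are invented for this example):

```python
import pandas as pd

# Hypothetical data: the same user appears twice with different timestamps.
events = pd.DataFrame({
    "user_id": [101, 101, 102],
    "email": ["a@example.com", "a@example.com", "b@example.com"],
    "seen_at": ["2024-01-01", "2024-01-02", "2024-01-01"],
})

# All columns are compared: seen_at differs, so no row is a full duplicate.
all_cols = events.drop_duplicates()
print(len(all_cols))  # 3

# Only user_id and email are compared: the second row for user 101 is dropped.
by_user = events.drop_duplicates(subset=["user_id", "email"])
print(by_user.index.tolist())  # [0, 2]
```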

inplace=False Returns a New DataFrame

With inplace=False (the default), drop_duplicates returns a new DataFrame and leaves the original unmodified in memory. The method therefore appears ineffective if you do not assign the result to a variable.
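The short example below, with made-up data, contrasts a discarded result, an assigned result, and an in-place call:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2]})

# The returned frame is discarded here, so df still has three rows.
df.drop_duplicates()
print(len(df))  # 3

# Assigning the result is the usual fix.
deduped = df.drop_duplicates()
print(len(deduped))  # 2

# With inplace=True the method mutates df and, per the pandas docs, returns None.
returned = df.drop_duplicates(inplace=True)
print(returned, len(df))  # None 2
```

Assignment is generally the safer habit, since it keeps the original frame available for comparison.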

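NaN Values Are Treated as Equal

Although NaN == NaN evaluates to False in element-wise comparison, drop_duplicates treats missing values in the same column as equal to one another when it builds the duplicate mask. Rows that match in every compared column, including their NaN positions, are flagged as duplicates and dropped. If rows that look identical apart from NaN survive, the cause is a difference in some other compared column, not the NaN itself.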
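A quick check of how missing values behave during deduplication in current pandas releases (the data here is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 1.0, np.nan, np.nan], "B": ["x", "x", "y", "y"]})

# Element-wise comparison follows IEEE 754: NaN is not equal to itself.
print(np.nan == np.nan)  # False

# Deduplication nevertheless groups missing values in the same column together.
print(df.duplicated().tolist())             # [False, True, False, True]
print(df.drop_duplicates().index.tolist())  # [0, 2]
```

Rows 2 and 3 both hold NaN in column A and "y" in column B, and only the first of them is kept.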

ignore_index Resets Row Labels

When ignore_index=True is specified, the result is relabeled from 0 to n-1 as the final step. This can create the illusion that original rows were retained because the returned index no longer maps to the original positions, even though duplicate data was actually removed.
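Comparing the index labels with and without ignore_index makes the relabeling visible. The frame below is a made-up example:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2, 3]})

# Default: surviving rows keep their original labels, so the gaps show what was dropped.
kept_labels = df.drop_duplicates().index.tolist()
print(kept_labels)  # [0, 2, 4]

# ignore_index=True: the same rows survive, but the labels are renumbered.
relabeled = df.drop_duplicates(ignore_index=True).index.tolist()
print(relabeled)  # [0, 1, 2]
```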

Implementation in pandas/core/frame.py

The behavior of drop_duplicates is governed by three key locations in the source code (line numbers are approximate and vary by release):

  • pandas/core/frame.py (lines ~7633): The public DataFrame.drop_duplicates API validates its arguments.
  • pandas/core/frame.py (lines ~7653): The method obtains the duplicate mask from DataFrame.duplicated, which applies the keep and subset options, and then handles ignore_index and inplace.
  • pandas/core/indexes/base.py: Contains the low-level Index.drop_duplicates logic used for index-level deduplication.
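The flow described above can be approximated with public API calls only. The helper below, drop_duplicates_sketch, is a hypothetical name introduced for illustration; it is a sketch of the observable behavior, not a copy of the pandas source.

```python
import pandas as pd

def drop_duplicates_sketch(df, subset=None, keep="first", ignore_index=False):
    """Illustrative re-creation built on public API; not the pandas implementation."""
    # Boolean mask: True marks rows considered duplicates under subset/keep.
    mask = df.duplicated(subset=subset, keep=keep)
    result = df[~mask]
    if ignore_index:
        # Renumber the surviving rows from 0.
        result = result.reset_index(drop=True)
    return result

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": ["x", "x", "y", "z"]})

# Compare the sketch against the real method for a few argument combinations.
for kwargs in ({}, {"keep": False}, {"subset": ["A"], "keep": "last"}, {"ignore_index": True}):
    same = drop_duplicates_sketch(df, **kwargs).equals(df.drop_duplicates(**kwargs))
    print(kwargs, same)  # each line is expected to end with True
```

Thinking of the method as "mask, filter, optionally relabel" makes it easier to debug: when a row survives unexpectedly, printing df.duplicated(...) with the same arguments shows whether pandas flagged it.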

Practical Examples

import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": ["x", "x", "y", "z"],
    "C": [10, 10, 20, 20]
})

# 1️⃣ Keep the first occurrence (default) → only the second "(1, x, 10)" is dropped
print(df.drop_duplicates())
#    A  B   C
# 0  1  x  10
# 2  2  y  20
# 3  2  z  20

# 2️⃣ Drop all rows that belong to a duplicate group
print(df.drop_duplicates(keep=False))
#    A  B   C
# 2  2  y  20
# 3  2  z  20

# 3️⃣ Consider only column 'A' for duplication
print(df.drop_duplicates(subset=["A"]))
#    A  B   C
# 0  1  x  10
# 2  2  y  20

# 4️⃣ Modify the original DataFrame in place
df.drop_duplicates(inplace=True)
print(df)
#    A  B   C
# 0  1  x  10
# 2  2  y  20
# 3  2  z  20

# 5️⃣ Missing values (NaN/None) in the same column count as equal, so row 1 is dropped
df_nan = pd.DataFrame({"A": [1, 1, None], "B": [None, None, None]})
print(df_nan.drop_duplicates())
#      A     B
# 0  1.0  None
# 2  NaN  None

Summary

  • keep='first' retains the first duplicate by default; use keep=False to remove all instances of duplicated rows.
  • Without subset, pandas checks all columns for equality, not just the columns you intended.
  • inplace=False returns a new DataFrame; assign the result to a variable or use inplace=True to modify the original.
  • NaN values in the same column are treated as equal during deduplication, so rows that match including their NaN positions are dropped like any other duplicates.
  • ignore_index=True resets the index, which can obscure which rows were actually removed versus which were re-labeled.

Frequently Asked Questions

Why does drop_duplicates keep the first row by default?

The keep='first' default in DataFrame.drop_duplicates preserves the first occurrence of each duplicate group while eliminating the later ones, so you do not lose the first instance of a record. It is only the default value of the keep parameter in pandas/core/frame.py, not fixed behavior: pass keep='last' to retain the final occurrence, or keep=False to drop every row in a duplicate group.

Are NaN values treated as duplicates in pandas?

Yes. Under the IEEE 754 standard, element-wise comparison of floating-point "Not a Number" values returns False, but drop_duplicates does not rely on that comparison. When the duplicate mask is computed, missing values in the same column are grouped together, so rows containing NaN are collapsed into a single entry when all other compared values match.

Does drop_duplicates modify the DataFrame in place?

By default, inplace=False, so drop_duplicates returns a new DataFrame and leaves the original object untouched. You must either assign the returned value to a variable (e.g., df = df.drop_duplicates()) or pass inplace=True to modify the existing DataFrame, in which case the method returns None.

How do I remove duplicates based only on specific columns?

Pass a list of column names to the subset parameter, such as df.drop_duplicates(subset=['column_a', 'column_b']). This restricts the uniqueness check to only those columns, so differences in other fields are ignored.
