Why pandas drop_duplicates() Is Not Removing Rows from Your DataFrame
DataFrame.drop_duplicates() removes only rows that are truly identical under the criteria you provide. The default keep='first' setting, an unspecified subset, and the inplace=False return behavior are the most common reasons rows appear to remain.
In the pandas-dev/pandas repository, the drop_duplicates() method defined in pandas/core/frame.py follows strict rules for identifying and removing duplicate records. When this method appears to fail, it is usually because the user's expectations conflict with pandas' default handling of duplicate groups, column comparisons, or object mutation.
Common Reasons drop_duplicates Appears to Fail
Default keep='first' Preserves Initial Occurrences
In pandas/core/frame.py, the DataFrame.drop_duplicates method delegates to the internal _drop_duplicates helper, where the keep argument is evaluated. By default, keep='first' retains the initial appearance of each duplicate group and removes only subsequent matches. If you expect all duplicate rows to disappear, you must explicitly set keep=False.
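A minimal illustration of the three keep modes, using a small illustrative frame (column names are arbitrary):

```python
import pandas as pd

# Two identical rows (index 0 and 1) plus one unique row.
df = pd.DataFrame({"id": [1, 1, 2], "val": ["a", "a", "b"]})

keep_first = df.drop_duplicates()             # default keep='first': index 0 survives
keep_last = df.drop_duplicates(keep="last")   # index 1 survives instead
keep_none = df.drop_duplicates(keep=False)    # both members of the pair are dropped

print(keep_first.index.tolist())  # [0, 2]
print(keep_last.index.tolist())   # [1, 2]
print(keep_none.index.tolist())   # [2]
```

Only keep=False removes every member of a duplicate group; 'first' and 'last' always leave one representative behind.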
Unspecified subset Compares All Columns
When you do not provide the subset parameter, pandas compares every column to determine uniqueness. As implemented in the logic that utilizes Index.drop_duplicates from pandas/core/indexes/base.py, rows are considered distinct if any column differs, even when the specific fields you care about are duplicated.
inplace=False Returns a New DataFrame
The drop_duplicates implementation creates a new DataFrame via self._constructor and returns it when inplace=False (the default). The original DataFrame remains unmodified in memory, which causes the method to appear ineffective if you do not assign the result to a variable.
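A quick demonstration of the discarded-return pitfall:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1]})

df.drop_duplicates()        # returns a new frame; the result is discarded here
print(len(df))              # 2: the original is unchanged

df = df.drop_duplicates()   # assign the returned frame back
print(len(df))              # 1
```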
NaN Values Are Treated as Equal During Deduplication
Unlike ordinary IEEE floating-point comparison, where NaN != NaN, pandas' duplicate detection treats NaN values as equal to one another when building the duplicate mask. Rows that match except for NaN values in the same positions are therefore flagged as duplicates and dropped, which can surprise users who expect NaN-containing rows to always survive.
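This is easy to verify directly (behavior checked against recent pandas releases):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, np.nan, 1.0]})

# The two NaN rows match each other, so one of them is dropped.
deduped = df.drop_duplicates()
print(len(deduped))  # 2
```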
ignore_index Resets Row Labels
When ignore_index=True is specified, the final step in _drop_duplicates resets the index of the result. This can create the illusion that original rows were retained because the returned index no longer maps to the original positions, even though duplicate data was actually removed.
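Comparing the two index styles side by side makes the effect visible:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2]})

kept = df.drop_duplicates()                        # labels [0, 2] reveal what was dropped
relabeled = df.drop_duplicates(ignore_index=True)  # labels [0, 1] hide the original positions

print(kept.index.tolist())       # [0, 2]
print(relabeled.index.tolist())  # [0, 1]
```

Both results contain the same two rows; only the labels differ.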
Implementation in pandas/core/frame.py
The behavior of drop_duplicates is governed by two key locations in the source code:
- pandas/core/frame.py (lines ~7633): The public DataFrame.drop_duplicates API parses arguments and delegates to the internal helper.
- pandas/core/frame.py (lines ~7653): The _drop_duplicates method builds the duplicate mask using Series.duplicated and applies the keep, subset, and ignore_index options.
- pandas/core/indexes/base.py: Contains the low-level Index.drop_duplicates logic used for index-level deduplication.
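The relationship between the public API and the duplicate mask can be reproduced from user code: drop_duplicates is observably equivalent to boolean indexing with duplicated(). This is a behavioral check, not the literal internals:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": ["x", "x", "y"]})

mask = df.duplicated(keep="first")   # boolean Series: True marks rows to drop
via_mask = df[~mask]
via_api = df.drop_duplicates()

print(mask.tolist())             # [False, True, False]
print(via_mask.equals(via_api))  # True
```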
Practical Examples
import pandas as pd
df = pd.DataFrame({
"A": [1, 1, 2, 2],
"B": ["x", "x", "y", "z"],
"C": [10, 10, 20, 20]
})
# 1️⃣ Keep the first occurrence (default) → only the second “(1, x, 10)” is dropped
print(df.drop_duplicates())
# A B C
# 0 1 x 10
# 2 2 y 20
# 3 2 z 20
# 2️⃣ Drop **all** duplicates
print(df.drop_duplicates(keep=False))
# A B C
# 2 2 y 20
# 3 2 z 20
# 3️⃣ Consider only column 'A' for duplication
print(df.drop_duplicates(subset=["A"]))
# A B C
# 0 1 x 10
# 2 2 y 20
# 4️⃣ Modify the original DataFrame in place
df.drop_duplicates(inplace=True)
print(df)
# A B C
# 0 1 x 10
# 2 2 y 20
# 3 2 z 20
# 5️⃣ NaN/None values compare equal for deduplication, so the duplicate (1, NaN) row is dropped
df_nan = pd.DataFrame({"A": [1, 1, None], "B": [None, None, None]})
print(df_nan.drop_duplicates())
# A B
# 0 1.0 None
# 2 NaN None
Summary
- keep='first' retains the first duplicate by default; use keep=False to remove all instances of duplicated rows.
- Without subset, pandas checks all columns for equality, not just the columns you intended.
- inplace=False returns a new DataFrame; assign the result to a variable or use inplace=True to modify the original.
- NaN values are treated as equal during duplicate detection, so otherwise-identical rows are deduplicated even when they contain NaN.
- ignore_index=True resets the index, which can obscure which rows were actually removed versus which were re-labeled.
Frequently Asked Questions
Why does drop_duplicates keep the first row by default?
The keep='first' default in DataFrame.drop_duplicates preserves the original occurrence of each unique group while eliminating subsequent duplicates. This default, applied in pandas/core/frame.py within the _drop_duplicates helper, ensures you never silently lose the first instance of a record.
Are NaN values treated as equal by drop_duplicates?
Yes. Although IEEE floating-point rules say NaN != NaN, pandas treats all NaN values as interchangeable when building the duplicate mask, so rows containing NaN in matching positions are collapsed like any other duplicates. If you need NaN-containing rows to survive, filter them out before deduplicating and concatenate them back afterward.
Does drop_duplicates modify the DataFrame in place?
By default, inplace=False, so drop_duplicates builds a new DataFrame (via the self._constructor logic) and leaves the original object untouched. You must either assign the returned value to a variable (e.g., df = df.drop_duplicates()) or pass inplace=True to modify the existing DataFrame in place.
How do I remove duplicates based only on specific columns?
Pass a list of column names to the subset parameter, such as df.drop_duplicates(subset=['column_a', 'column_b']). This restricts the uniqueness check to only those columns, leveraging the underlying logic in pandas/core/indexes/base.py to ignore differences in other fields.