How to Drop Duplicates in Pandas and Keep the Row with the Highest Value
Sort the DataFrame by the value column in descending order, then call drop_duplicates(subset=['A'], keep='first') to retain the first occurrence—which corresponds to the highest value—for each duplicate group.
The drop_duplicates method in the pandas-dev/pandas repository provides a vectorized way to remove duplicate rows based on specific columns. The method itself only keeps the first or last occurrence in the DataFrame's current row order, so conditional retention, such as keeping the row with the maximum value in another column, requires either pre-sorting the data or selecting rows by index label. This article covers both approaches and points to the relevant source in pandas/core/frame.py.
How drop_duplicates Works in pandas/core/frame.py
The DataFrame.drop_duplicates method is defined in pandas/core/frame.py (beginning around line 7633 in the current main branch). It accepts a subset argument specifying which columns to evaluate for duplicates, and a keep argument that controls which duplicate to retain ('first', 'last', or False to drop all instances).
By default, keep='first' retains the initial occurrence of each duplicate group based on the current row order. This behavior becomes the foundation for the sorting strategy: if you pre-sort the data so that the desired rows appear first, drop_duplicates will naturally retain them.
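To make the three keep settings concrete, here is a minimal sketch on a made-up DataFrame. The column names and values are illustrative only and do not come from the pandas source.

```python
import pandas as pd

# Illustrative data: 'foo' appears twice in column A, 'bar' once
df = pd.DataFrame({'A': ['foo', 'bar', 'foo'], 'B': [10, 20, 30]})

first = df.drop_duplicates(subset=['A'], keep='first')   # first 'foo' survives
last = df.drop_duplicates(subset=['A'], keep='last')     # last 'foo' survives
neither = df.drop_duplicates(subset=['A'], keep=False)   # every 'foo' is dropped

print(first.index.tolist())    # [0, 1]
print(last.index.tolist())     # [1, 2]
print(neither.index.tolist())  # [1]
```

Because the choice depends only on row order, changing the order before the call changes which row survives.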
Sorting Strategy – Keep the Highest Value
To remove duplicates based on column A while retaining the row with the highest value in column B, first sort the DataFrame by column B in descending order. This positions the maximum values at the beginning of each group. Then invoke drop_duplicates with subset=['A'] and keep='first'.
import pandas as pd

# Sample data
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'baz'],
    'B': [10, 20, 30, 15, 5],
    'C': [1, 2, 3, 4, 5]
})

# Step 1: sort by B descending so the highest B per A comes first
df_sorted = df.sort_values('B', ascending=False)

# Step 2: drop duplicates on column A, keeping the first (i.e., highest B)
result = df_sorted.drop_duplicates(subset=['A'], keep='first')
print(result)

Output:
     A   B  C
2  foo  30  3
1  bar  20  2
4  baz   5  5
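The result above comes back in sorted order, not the original row order. If the original order matters, one option is to sort by index afterwards. The sketch below assumes the DataFrame started with a unique, monotonically increasing index (such as the default RangeIndex); with any other index, sort_index would not reproduce the original order.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'baz'],
    'B': [10, 20, 30, 15, 5],
    'C': [1, 2, 3, 4, 5],
})

result = (
    df.sort_values('B', ascending=False)            # highest B first
      .drop_duplicates(subset=['A'], keep='first')  # one row per A
      .sort_index()                                 # back to original row order
)
print(result.index.tolist())  # [1, 2, 4]
```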
Alternative Approach – GroupBy with idxmax
If you would rather not sort the whole DataFrame, use groupby combined with idxmax to find the index label of the maximum value in each group. By default groupby orders groups by sorted key; passing sort=False keeps groups in order of first appearance. df.loc then looks up those labels through the DataFrame's Index (implemented in pandas/core/indexes/base.py) to retrieve the rows.
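import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'baz'],
    'B': [10, 20, 30, 15, 5],
    'C': [1, 2, 3, 4, 5]
})

# Find the index label of the max B for each A.
# sort=False keeps groups in order of first appearance (foo, bar, baz);
# the default sort=True would return them in key order (bar, baz, foo).
idx = df.groupby('A', sort=False)['B'].idxmax()

# Use those labels to locate the rows, then renumber the result
result = df.loc[idx].reset_index(drop=True)
print(result)

Output:
     A   B  C
0  foo  30  3
1  bar  20  2
2  baz   5  5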
Both approaches are fully vectorized and avoid explicit Python loops.
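As a sanity check, the two approaches can be compared directly. The sketch below assumes unique index labels, no missing values in B, and no ties for the maximum within a group; under those assumptions both should select the same rows, so sorting each result by index makes them comparable.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'baz'],
    'B': [10, 20, 30, 15, 5],
    'C': [1, 2, 3, 4, 5],
})

# Approach 1: sort, then drop duplicates
via_sort = df.sort_values('B', ascending=False).drop_duplicates(subset=['A'])

# Approach 2: look up the label of the max B in each group
via_idxmax = df.loc[df.groupby('A')['B'].idxmax()]

# Row order differs between the two, so align on the index before comparing
same_rows = via_sort.sort_index().equals(via_idxmax.sort_index())
print(same_rows)  # True
```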
Source Code Implementation Details
The duplicate removal logic relies on three core modules within the pandas codebase:
- pandas/core/frame.py – Contains the main DataFrame.drop_duplicates method definition (around line 7633), which implements the subset and keep logic.
- pandas/core/generic.py – Provides the base class shared by DataFrame and Series, offering common API utilities leveraged by drop_duplicates.
- pandas/core/indexes/base.py – Implements Index.drop_duplicates, which handles duplicate detection and removal at the index level.
These vectorized implementations avoid explicit Python loops, ensuring efficient execution on large datasets.
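The observable behavior of drop_duplicates can be reproduced with the boolean mask returned by DataFrame.duplicated. The sketch below checks that equivalence on the sample data; it illustrates the public API only and is not a copy of the library's internals.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'baz'],
    'B': [10, 20, 30, 15, 5],
})

# True for every row whose A value has already been seen
mask = df.duplicated(subset=['A'], keep='first')
print(mask.tolist())  # [False, False, True, True, False]

manual = df[~mask]                                        # filter with the mask
builtin = df.drop_duplicates(subset=['A'], keep='first')  # built-in method
print(manual.equals(builtin))  # True
```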
Summary
- Sort descending by the value column (B) before calling drop_duplicates(subset=['A'], keep='first') to retain maximum values while removing duplicates.
- Use groupby with idxmax to select maximum-value rows without sorting the DataFrame.
- Reference pandas/core/frame.py for the core drop_duplicates implementation and pandas/core/indexes/base.py for index-level operations.
- Both methods utilize vectorized operations that scale efficiently to large datasets.
Frequently Asked Questions
What is the difference between keep='first' and keep='last' in drop_duplicates?
The keep parameter in DataFrame.drop_duplicates determines which duplicate row to retain. When set to 'first' (the default), pandas keeps the first occurrence of each duplicate group encountered in the DataFrame. When set to 'last', it retains the final occurrence. Setting keep=False removes all duplicate rows entirely.
Can I use this method to keep the row with the minimum value instead?
Yes. To keep the row with the lowest value in column B, sort the DataFrame by column B with ascending=True (the default), then call drop_duplicates(subset=['A'], keep='first'). Alternatively, use df.groupby('A')['B'].idxmin() to identify indices of minimum values without sorting.
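A short sketch of both minimum-value variants, reusing the article's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'baz'],
    'B': [10, 20, 30, 15, 5],
    'C': [1, 2, 3, 4, 5],
})

# Variant 1: ascending sort puts the lowest B per A first
lowest = df.sort_values('B').drop_duplicates(subset=['A'], keep='first')
print(lowest.index.tolist())  # [4, 0, 3]

# Variant 2: idxmin returns the label of the lowest B in each group
idx = df.groupby('A', sort=False)['B'].idxmin()
print(idx.tolist())  # [0, 3, 4]
```

Both variants pick rows 0, 3, and 4; only the order in which they are returned differs.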
How does drop_duplicates handle ties when multiple rows have the same maximum value?
When multiple rows share the same maximum value in column B and the same key in column A, only one of them survives. With the sort method, it is whichever tied row lands first after sorting; the default quicksort is not guaranteed to be stable, so pass kind='stable' to sort_values if tied rows should keep their original relative order. With idxmax, it is the first occurrence of the maximum in the original DataFrame. To control tie-breaking explicitly, add further columns to the sort order.
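The following sketch shows two ways to make tie-breaking deterministic with the sort method. The data is made up for illustration, and using column C as the tie-breaker is an assumption, not something the original example prescribes.

```python
import pandas as pd

# Illustrative data: both 'foo' rows tie on B, and so do both 'bar' rows
df = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar'],
    'B': [30, 30, 20, 20],
    'C': [1, 9, 7, 2],
})

# Option 1: break ties on B by preferring the larger C
by_c = (
    df.sort_values(['B', 'C'], ascending=[False, False])
      .drop_duplicates(subset=['A'], keep='first')
)
print(by_c.index.tolist())  # [1, 2]

# Option 2: a stable sort keeps tied rows in their original order,
# so the earliest tied row in the DataFrame wins
first_seen = (
    df.sort_values('B', ascending=False, kind='stable')
      .drop_duplicates(subset=['A'], keep='first')
)
print(first_seen.index.tolist())  # [0, 2]
```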
Is there a performance difference between sorting and using groupby with idxmax?
The sorting approach performs an O(n log n) sort and returns rows in sorted order, while groupby + idxmax groups rows by hashing the keys, which is roughly linear in the number of rows, and returns one row per group in group-key order (or in order of first appearance with sort=False). Which one is faster in practice depends on the data size, the number of groups, and the dtypes involved, so benchmark both on your own data before choosing.
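A rough way to compare the two on your own machine is a small timeit harness. Everything in this sketch is an assumption made for illustration: the synthetic data, the row and group counts, and the helper names via_sort and via_idxmax. Timings will vary with pandas version and hardware, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: 100,000 rows spread over up to 1,000 groups (hypothetical sizes)
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    'A': rng.integers(0, 1_000, size=n),
    'B': rng.random(n),
})

def via_sort(frame):
    """Sort by B descending, then keep the first row per A."""
    return frame.sort_values('B', ascending=False).drop_duplicates(subset=['A'])

def via_idxmax(frame):
    """Look up the label of the max B per A and select those rows."""
    return frame.loc[frame.groupby('A')['B'].idxmax()]

t_sort = timeit.timeit(lambda: via_sort(df), number=5)
t_idxmax = timeit.timeit(lambda: via_idxmax(df), number=5)

print(f"sort + drop_duplicates: {t_sort:.3f}s for 5 runs")
print(f"groupby + idxmax:       {t_idxmax:.3f}s for 5 runs")
```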