How to Use Pandas Drop Duplicates with a Condition to Retain the Original Row
Use the keep='first' parameter in DataFrame.drop_duplicates() to retain the original occurrence, or pre-sort your DataFrame by a specific condition and use keep='last' to retain the row that best meets your criteria.
When working with the pandas-dev/pandas library, removing duplicate records while preserving specific rows based on custom business logic is a common data cleaning task. The pandas drop duplicates with condition technique allows you to control exactly which duplicate row is retained—whether it's the first occurrence, the last, or the row with the maximum value in another column—using the keep parameter or pre-sorting strategies.
Understanding the Keep Parameter in Pandas Drop Duplicates
The drop_duplicates() method in pandas/core/frame.py provides the keep parameter to determine which duplicate values to retain. According to the source code implementation, this parameter accepts three distinct values that directly control retention logic.
Retaining the First Occurrence (Original Row)
When you specify keep='first', pandas flags every occurrence after the first as a duplicate and drops it, retaining the first instance. This is the default behavior and preserves what is typically considered the "original" row in the dataset.
import pandas as pd
df = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'value': ['a', 'b', 'c', 'd']
})
# Retain the original (first) row for each duplicate
result = df.drop_duplicates(subset=['id'], keep='first')
Retaining the Last Occurrence Based on Position
Conversely, keep='last' retains the final occurrence of each duplicate set. This is useful when your data represents chronological updates and the most recent entry contains the authoritative information.
# Retain the most recent (last) row for each duplicate
result = df.drop_duplicates(subset=['id'], keep='last')
Implementing Custom Conditions with Pandas Drop Duplicates
While the keep parameter handles positional logic, true conditional retention—such as keeping the row with the highest score or latest timestamp—requires combining sort_values() with drop_duplicates().
Sorting Before Deduplication to Control Retention
To retain the row meeting a specific condition (e.g., maximum value in another column), first sort your DataFrame by that condition column in descending order, then use keep='first' to capture the top value.
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'score': [10, 20, 15, 25],
    'data': ['x', 'y', 'z', 'w']
})
# Keep the row with the highest score in each group
df_sorted = df.sort_values('score', ascending=False)
result = df_sorted.drop_duplicates(subset=['group'], keep='first')
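One caveat: the sort persists in the result, so the surviving rows come back in score order rather than their original positions. If downstream code depends on the original row order, a sort_index() call restores it. A minimal sketch continuing the example above:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'score': [10, 20, 15, 25],
    'data': ['x', 'y', 'z', 'w']
})

# Sort descending so the highest score in each group comes first,
# dedupe on 'group', then restore the original row order.
result = (
    df.sort_values('score', ascending=False)
      .drop_duplicates(subset=['group'], keep='first')
      .sort_index()
)
# Retains the score-20 row for group A and the score-25 row for group B,
# in their original index order.
```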
Using Boolean Indexing with duplicated() for Complex Conditions
For scenarios requiring logic beyond simple sorting, combine the duplicated() method with boolean indexing. This approach, implemented in the pandas source code at pandas/core/frame.py, returns a boolean Series that you can invert and combine with additional conditions.
# Identify duplicates while keeping the first occurrence
is_duplicate = df.duplicated(subset=['group'], keep='first')
# Complex condition: keep first occurrences OR any row with score > 15
mask = ~is_duplicate | (df['score'] > 15)
result = df[mask]
Source Code Implementation Details
The drop_duplicates() method is implemented in pandas/core/frame.py within the DataFrame class. The method signature reveals the core parameters controlling duplicate removal:
def drop_duplicates(
    self,
    subset: Hashable | Sequence[Hashable] | None = None,
    keep: Literal["first"] | Literal["last"] | Literal[False] = "first",
    inplace: bool = False,
    ignore_index: bool = False,
) -> DataFrame | None:
According to the source code, the method internally calls self.duplicated(subset=subset, keep=keep) to generate a boolean mask, then inverts this mask (~mask) to select non-duplicate rows. When keep=False, all duplicate entries are removed entirely, leaving only unique observations.
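That mask-and-invert logic can be reproduced directly with public API calls. The snippet below is a sketch of the equivalence (not the library's internal code), and also shows what keep=False does:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 2, 3], 'value': ['a', 'b', 'c', 'd']})

# drop_duplicates() behaves like boolean indexing with an inverted
# duplicated() mask
via_mask = df[~df.duplicated(subset=['id'], keep='first')]
via_method = df.drop_duplicates(subset=['id'], keep='first')
assert via_mask.equals(via_method)

# keep=False removes every member of a duplicate set entirely
only_unique = df.drop_duplicates(subset=['id'], keep=False)
# Only ids 1 and 3 survive; both rows with id == 2 are dropped.
```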
Performance Considerations
When processing large datasets, the order of operations significantly impacts performance. Sorting before deduplication adds an O(n log n) step, while drop_duplicates() alone runs in roughly O(n) time thanks to hash-based duplicate detection. On wide DataFrames, pass the subset parameter so only the relevant columns are hashed, rather than comparing every column of every row.
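A rough illustration of the cost difference (a sketch for intuition, not a rigorous benchmark; the synthetic data and sizes are arbitrary):

```python
import time
import numpy as np
import pandas as pd

# Synthetic frame with heavy duplication on the 'key' column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'key': rng.integers(0, 1_000, size=200_000),
    'payload': rng.random(200_000),
})

t0 = time.perf_counter()
plain = df.drop_duplicates(subset=['key'])          # hash-based, no sort
t1 = time.perf_counter()
sorted_dedup = (
    df.sort_values('payload', ascending=False)      # adds an O(n log n) sort
      .drop_duplicates(subset=['key'], keep='first')
)
t2 = time.perf_counter()

print(f"plain: {t1 - t0:.3f}s, sort+dedup: {t2 - t1:.3f}s")

# Both paths yield exactly one row per distinct key
assert len(plain) == len(sorted_dedup) == df['key'].nunique()
```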
Additionally, setting inplace=True reassigns the result to the existing object rather than returning a new DataFrame, which can simplify code. Note, however, that according to the implementation in pandas/core/frame.py, the inplace path still builds a deduplicated copy internally before reassignment, so it does not reduce peak memory usage.
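A quick sketch of the inplace contract: the call mutates the bound object and returns None, so it cannot be chained:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 2, 3], 'value': ['a', 'b', 'c', 'd']})

ret = df.drop_duplicates(subset=['id'], keep='first', inplace=True)
assert ret is None                    # inplace calls return None
assert list(df['id']) == [1, 2, 3]    # df itself was modified
```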
Summary
- Use keep='first' in drop_duplicates() to retain the original (first) occurrence of duplicate rows.
- Use keep='last' to retain the final occurrence when working with time-series or updated records.
- For conditional retention based on other column values (e.g., keeping the row with maximum score), sort by the condition column first, then apply drop_duplicates().
- The duplicated() method provides boolean masking for complex, multi-condition deduplication logic.
- The implementation resides in pandas/core/frame.py and operates by inverting a boolean mask generated from hash-based duplicate detection.
Frequently Asked Questions
What is the difference between keep='first' and keep='last' in pandas drop_duplicates?
With keep='first', the internal duplicated() mask flags every duplicate as True except the initial occurrence, so the first (original) row encountered in the DataFrame is retained. Conversely, keep='last' flags all duplicates except the final occurrence, retaining the last row in the sequence. According to the source code in pandas/core/frame.py, this logic relies on hash tables that flag positions based on their order of appearance.
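The difference is easiest to see side by side; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 2, 3], 'value': ['a', 'b', 'c', 'd']})

first = df.drop_duplicates(subset=['id'], keep='first')
last = df.drop_duplicates(subset=['id'], keep='last')

# For duplicated id 2, keep='first' retains value 'b';
# keep='last' retains value 'c'.
assert list(first['value']) == ['a', 'b', 'd']
assert list(last['value']) == ['a', 'c', 'd']
```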
How do I drop duplicates but keep the row with the maximum value in another column?
To retain the row with the maximum value in a specific column while removing duplicates, first sort your DataFrame by that column in descending order using sort_values(), then apply drop_duplicates() with keep='first'. For example: df.sort_values('score', ascending=False).drop_duplicates('group', keep='first'). This technique leverages the fact that drop_duplicates retains the first encountered row after sorting places the desired maximum value at the top of each duplicate group.
Can I use drop_duplicates with a custom function instead of the keep parameter?
While drop_duplicates() does not accept arbitrary functions for the keep parameter, you can achieve custom conditional logic by combining groupby() with idxmax(), idxmin(), or apply(). For instance, to keep rows based on a complex condition, use df.loc[df.groupby('key')['value'].idxmax()] instead of drop_duplicates. This approach provides the flexibility of custom functions while maintaining the performance benefits of vectorized pandas operations.
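A runnable sketch of the groupby-plus-idxmax alternative described above:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'score': [10, 20, 15, 25],
})

# idxmax() returns the index label of the maximum 'score' per group;
# .loc then selects exactly those rows, leaving one row per group.
best = df.loc[df.groupby('group')['score'].idxmax()]
assert list(best['score']) == [20, 25]
```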
Does drop_duplicates modify the original DataFrame in place?
By default, drop_duplicates() returns a new DataFrame and leaves the original unchanged. You can modify the original DataFrame in place by setting inplace=True, which updates the existing object and returns None. According to the implementation in pandas/core/frame.py, even when inplace=True, the method internally creates a deduplicated copy before reassigning it to the original DataFrame's underlying data, so in-place mode saves a variable rebinding rather than memory.