How to Use Pandas Drop Duplicates with a Condition to Retain the Original Row

Use the keep='first' parameter in DataFrame.drop_duplicates() to retain the original occurrence. For conditional retention, pre-sort your DataFrame so the row that best meets your criteria comes first within each duplicate group, then use keep='first' (or sort the other way and use keep='last').

When working with the pandas-dev/pandas library, removing duplicate records while preserving specific rows based on custom business logic is a common data-cleaning task. Conditional deduplication lets you control exactly which duplicate row is retained (the first occurrence, the last, or the row with the maximum value in another column) using the keep parameter or a pre-sorting strategy.

Understanding the Keep Parameter in Pandas Drop Duplicates

The drop_duplicates() method in pandas/core/frame.py provides the keep parameter to determine which duplicate values to retain. According to the source code implementation, this parameter accepts three distinct values that directly control retention logic.

Retaining the First Occurrence (Original Row)

When you specify keep='first', the underlying duplicated() check marks every occurrence after the first as a duplicate, so drop_duplicates() removes those later rows. This is the default behavior and preserves what is typically considered the "original" row in the dataset.

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'value': ['a', 'b', 'c', 'd']
})

# Retain the original (first) row for each duplicate

result = df.drop_duplicates(subset=['id'], keep='first')
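Running the example above keeps rows 0, 1, and 3: the second id=2 row is dropped. Rebuilding the same sample frame, the result can be verified like this:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'value': ['a', 'b', 'c', 'd']
})

result = df.drop_duplicates(subset=['id'], keep='first')

# The first id=2 row ('b') survives; the later one ('c') is dropped
print(result['value'].tolist())  # → ['a', 'b', 'd']
```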

Retaining the Last Occurrence Based on Position

Conversely, keep='last' retains the final occurrence of each duplicate set. This is useful when your data represents chronological updates and the most recent entry contains the authoritative information.


# Retain the most recent (last) row for each duplicate

result = df.drop_duplicates(subset=['id'], keep='last')
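With the same sample frame as before, keep='last' retains the later id=2 row instead:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'value': ['a', 'b', 'c', 'd']
})

result = df.drop_duplicates(subset=['id'], keep='last')

# Now the later id=2 row ('c') survives instead of 'b'
print(result['value'].tolist())  # → ['a', 'c', 'd']
```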

Implementing Custom Conditions with Pandas Drop Duplicates

While the keep parameter handles positional logic, true conditional retention—such as keeping the row with the highest score or latest timestamp—requires combining sort_values() with drop_duplicates().

Sorting Before Deduplication to Control Retention

To retain the row meeting a specific condition (e.g., maximum value in another column), first sort your DataFrame by that condition column in descending order, then use keep='first' to capture the top value.

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'score': [10, 20, 15, 25],
    'data': ['x', 'y', 'z', 'w']
})

# Keep the row with the highest score in each group

df_sorted = df.sort_values('score', ascending=False)
result = df_sorted.drop_duplicates(subset=['group'], keep='first')
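For the sample frame above, this keeps the score-20 row for group A and the score-25 row for group B. A self-contained check:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'score': [10, 20, 15, 25],
    'data': ['x', 'y', 'z', 'w']
})

# Sort descending so the highest score leads each duplicate group,
# then keep='first' retains exactly that row
df_sorted = df.sort_values('score', ascending=False)
result = df_sorted.drop_duplicates(subset=['group'], keep='first')

print(result.set_index('group')['score'].to_dict())  # → {'B': 25, 'A': 20}
```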

Using Boolean Indexing with duplicated() for Complex Conditions

For scenarios requiring logic beyond simple sorting, combine the duplicated() method with boolean indexing. duplicated(), whose DataFrame implementation also lives in pandas/core/frame.py, returns a boolean Series that you can invert and combine with additional conditions.


# Identify duplicates while keeping the first occurrence

is_duplicate = df.duplicated(subset=['group'], keep='first')

# Complex condition: keep first occurrences OR rows with score > 15

mask = ~is_duplicate | (df['score'] > 15)
result = df[mask]
# Note: with this sample data every row is either a first occurrence
# or has score > 15, so all four rows survive

Source Code Implementation Details

The drop_duplicates() method is implemented in pandas/core/frame.py within the DataFrame class. The method signature reveals the core parameters controlling duplicate removal:

def drop_duplicates(
    self,
    subset: Hashable | Sequence[Hashable] | None = None,
    keep: Literal["first"] | Literal["last"] | Literal[False] = "first",
    inplace: bool = False,
    ignore_index: bool = False,
) -> DataFrame | None:

According to the source code, the method internally calls self.duplicated(subset=subset, keep=keep) to generate a boolean mask, then inverts this mask (~mask) to select non-duplicate rows. When keep=False, all duplicate entries are removed entirely, leaving only unique observations.
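The keep=False case, which removes every member of a duplicate set, can be demonstrated with the earlier sample frame:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'value': ['a', 'b', 'c', 'd']
})

# keep=False drops both id=2 rows, leaving only ids that appear once
unique_only = df.drop_duplicates(subset=['id'], keep=False)
print(unique_only['id'].tolist())  # → [1, 3]
```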

Performance Considerations

When processing large datasets, the order of operations significantly impacts performance. Sorting before deduplication adds an O(n log n) step, while drop_duplicates() alone runs in roughly O(n) time via hashing. For wide DataFrames, pass the subset parameter so that only the columns defining uniqueness are hashed, rather than every column.
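As a rough, machine-dependent sketch of the trade-off, the following compares sort-then-dedupe against a groupby-based alternative on synthetic data. Both retain the maximum score per group; the printed timings are illustrative only:

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000
df = pd.DataFrame({
    'group': rng.integers(0, 10_000, n),
    'score': rng.random(n),
})

t0 = time.perf_counter()
a = df.sort_values('score', ascending=False).drop_duplicates('group', keep='first')
t1 = time.perf_counter()
b = df.loc[df.groupby('group')['score'].idxmax()]
t2 = time.perf_counter()

# Both strategies keep the maximum score per group
assert a.set_index('group')['score'].sort_index().equals(
    b.set_index('group')['score'].sort_index()
)
print(f"sort+dedupe: {t1 - t0:.3f}s  groupby+idxmax: {t2 - t1:.3f}s")
```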

Additionally, setting inplace=True updates the existing object rather than returning a new DataFrame. It does not eliminate the copy, however: according to the implementation in pandas/core/frame.py, the operation still builds the deduplicated result as a new object internally before reassignment.

Summary

  • Use keep='first' in drop_duplicates() to retain the original (first) occurrence of duplicate rows.
  • Use keep='last' to retain the final occurrence when working with time-series or updated records.
  • For conditional retention based on other column values (e.g., keeping the row with maximum score), sort by the condition column first, then apply drop_duplicates().
  • The duplicated() method provides boolean masking for complex, multi-condition deduplication logic.
  • The implementation resides in pandas/core/frame.py and operates by inverting a boolean mask generated from hash-based duplicate detection.

Frequently Asked Questions

What is the difference between keep='first' and keep='last' in pandas drop_duplicates?

The keep='first' parameter retains the initial occurrence of each duplicate set and drops the rest, so the first (original) row encountered in the DataFrame survives. Conversely, keep='last' retains the final occurrence and drops the earlier ones. According to the source code in pandas/core/frame.py, this logic is delegated to duplicated(), which uses hash-based detection to flag positions by their order of appearance.
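The difference is easiest to see in the boolean masks that duplicated() produces:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 2, 3]})

print(df.duplicated(subset=['id'], keep='first').tolist())
# → [False, False, True, False]  (the later id=2 row is flagged)
print(df.duplicated(subset=['id'], keep='last').tolist())
# → [False, True, False, False]  (the earlier id=2 row is flagged)
```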

How do I drop duplicates but keep the row with the maximum value in another column?

To retain the row with the maximum value in a specific column while removing duplicates, first sort your DataFrame by that column in descending order using sort_values(), then apply drop_duplicates() with keep='first'. For example: df.sort_values('score', ascending=False).drop_duplicates('group', keep='first'). This works because sorting places the desired maximum at the top of each duplicate group, and drop_duplicates then retains that first encountered row.
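A runnable version of that one-liner, using the same column names:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'score': [10, 20, 15, 25],
})

best = df.sort_values('score', ascending=False).drop_duplicates('group', keep='first')
print(best.set_index('group')['score'].to_dict())  # → {'B': 25, 'A': 20}
```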

Can I use drop_duplicates with a custom function instead of the keep parameter?

While drop_duplicates() does not accept arbitrary functions for the keep parameter, you can achieve custom conditional logic by combining groupby() with idxmax(), idxmin(), or apply(). For instance, to keep rows based on a complex condition, use df.loc[df.groupby('key')['value'].idxmax()] instead of drop_duplicates. This approach provides the flexibility of custom functions while maintaining the performance benefits of vectorized pandas operations.
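A minimal sketch of the groupby alternative (the 'key' and 'value' column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['A', 'A', 'B', 'B'],
    'value': [10, 20, 15, 25],
})

# idxmax() returns the row label of the maximum per group;
# .loc then pulls those full rows back out
result = df.loc[df.groupby('key')['value'].idxmax()]
print(result['value'].tolist())  # → [20, 25]
```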

Does drop_duplicates modify the original DataFrame in place?

By default, drop_duplicates() returns a new DataFrame and leaves the original unchanged. However, you can modify the original DataFrame in place by setting the inplace=True parameter, which updates the existing object without creating a copy. According to the implementation in pandas/core/frame.py, even when inplace=True, the method internally creates a deduplicated copy before reassigning it to the original DataFrame's underlying data structure.
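A small check of the in-place behavior (note that inplace methods return None):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 2, 3]})

ret = df.drop_duplicates(subset=['id'], inplace=True)
print(ret)      # → None (inplace methods do not return the frame)
print(len(df))  # → 3 (the original object was modified)
```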
