How to Use Pandas Unique on a Whole DataFrame Based on a Column

Use DataFrame.drop_duplicates(subset='column') to retrieve complete rows for each unique value in a specific column, or use Series.unique() when you only need the distinct values as a NumPy array.

pandas (developed in the pandas-dev/pandas repository) builds its deduplication methods on compiled hashtable routines. When working with tabular data, getting "unique" results for a whole DataFrame based on one column usually means keeping the full row associated with each distinct value, which calls for a different approach than simple value extraction.

Choosing Between Series.unique() and DataFrame.drop_duplicates()

Pandas offers two primary mechanisms for handling uniqueness, each defined in separate core modules:

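Series.unique() (defined in pandas/core/series.py) returns only the distinct values from a single column, in order of first appearance. The result is a NumPy array for NumPy-backed dtypes and an ExtensionArray for extension dtypes such as categorical. This method discards all other columns and does not preserve the original DataFrame structure.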

DataFrame.drop_duplicates() (implemented in pandas/core/frame.py) returns a new DataFrame containing the first (or last) occurrence of each unique value in the specified column while retaining all other columns intact. This matches the conventional interpretation of getting unique rows based on a column value.

Both methods delegate the heavy lifting to compiled hashtable code rather than Python-level loops, so they generally remain fast on datasets containing millions of rows.
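The difference in return types can be checked directly. The sketch below uses a small hypothetical orders table (the column names customer_id and amount are illustrative, not from any real dataset) and an integer key column so that unique() returns a plain NumPy array:

```python
import numpy as np
import pandas as pd

# Hypothetical data; column names are assumptions made for this sketch.
orders = pd.DataFrame({
    "customer_id": [101, 102, 101, 103],
    "amount": [10, 25, 40, 5],
})

# Distinct values only: the "amount" column is gone.
distinct_ids = orders["customer_id"].unique()

# Full rows: the first row seen for each customer_id is kept.
first_orders = orders.drop_duplicates(subset="customer_id")

print(isinstance(distinct_ids, np.ndarray))  # True
print(distinct_ids.tolist())                 # [101, 102, 103]
print(first_orders.shape)                    # (3, 2)
```

The array carries no row context, while the deduplicated frame still has both columns and the original index labels of the surviving rows.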

Implementation Details in the Source Code

The underlying functionality resides in two critical files within the pandas source tree:

  • pandas/core/series.py: Contains the unique() method for Series objects, which delegates to hashtable-based algorithms for distinct value extraction.
  • pandas/core/frame.py: Houses the drop_duplicates() method, which manages row-wise deduplication and exposes the subset, keep, inplace, and ignore_index parameters.

Additionally, Index objects have their own unique() and drop_duplicates() methods, which are defined in pandas/core/indexes/base.py.
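Two observable consequences of this design can be sketched with public API calls only: unique() reports values in order of first appearance rather than sorted, and drop_duplicates() gives the same result as filtering with the boolean mask from DataFrame.duplicated(). The frame and its column names below are illustrative assumptions.

```python
import pandas as pd

# unique() keeps the order of first appearance; it does not sort.
s = pd.Series([3, 1, 3, 2, 1])
print(s.unique().tolist())  # [3, 1, 2]

# Illustrative frame; "key" and "val" are made-up column names.
frame = pd.DataFrame({"key": [3, 1, 3, 2, 1], "val": list("abcde")})

# duplicated() marks every repeat of a key after its first occurrence.
mask = frame.duplicated(subset="key")
manual = frame[~mask]

# Filtering with the inverted mask matches drop_duplicates().
print(manual.equals(frame.drop_duplicates(subset="key")))  # True
```

The mask form is handy when you want to inspect or count the duplicate rows before discarding them.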

Practical Code Examples

Extracting Unique Values from a Single Column

When you need only the distinct values without row context, access the column as a Series and call unique():

import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "London", "Paris", "Berlin"],
    "population": [2.1, 8.9, 2.1, 3.6]
})

# Returns a NumPy array of unique city names
unique_cities = df["city"].unique()
print(unique_cities)
['Paris' 'London' 'Berlin']

Retaining Unique Rows Based on a Column

To get one row per unique value in a column while keeping all associated data, use drop_duplicates() with the subset parameter:

# Keep the first occurrence of each unique city
unique_rows = df.drop_duplicates(subset="city")
print(unique_rows)
     city  population
0   Paris         2.1
1  London         8.9
3  Berlin         3.6

Preserving the Last Occurrence Instead of the First

Control which duplicate row survives using the keep parameter:

# Retain the last row for each unique city
unique_rows_last = df.drop_duplicates(subset="city", keep="last")
print(unique_rows_last)
     city  population
1  London         8.9
2   Paris         2.1
3  Berlin         3.6
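The keep parameter also accepts False, which drops every row whose value is repeated instead of keeping one representative. A minimal sketch, rebuilding the same example frame so the block runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "London", "Paris", "Berlin"],
    "population": [2.1, 8.9, 2.1, 3.6],
})

# keep=False removes all rows for any city that appears more than once,
# so both "Paris" rows are dropped.
never_repeated = df.drop_duplicates(subset="city", keep=False)
print(never_repeated["city"].tolist())  # ['London', 'Berlin']
```

This is useful for isolating values that occur exactly once in the column.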

Sorting Results After Deduplication

Chain methods to deduplicate first, then reorder the results:

# Get unique cities then sort by population descending
unique_sorted = (
    df.drop_duplicates(subset="city")
      .sort_values("population", ascending=False)
)
print(unique_sorted)
     city  population
1  London         8.9
3  Berlin         3.6
0   Paris         2.1
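The order of the two steps matters when duplicate rows carry different values. As a sketch under that assumption, the hypothetical readings frame below has conflicting population figures per city; sorting first makes the largest figure the "first" occurrence that drop_duplicates() keeps.

```python
import pandas as pd

# Hypothetical data: the same city appears with different values.
readings = pd.DataFrame({
    "city": ["Paris", "London", "Paris", "Berlin", "London"],
    "population": [2.1, 8.9, 2.2, 3.6, 8.8],
})

largest_per_city = (
    readings.sort_values("population", ascending=False)
            .drop_duplicates(subset="city")  # first row per city is now the largest
            .reset_index(drop=True)          # renumber rows 0..n-1
)
print(largest_per_city)
```

Deduplicating first and sorting afterwards would instead keep whichever row happened to appear first in the original data.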

Summary

  • DataFrame.drop_duplicates(subset='col') returns complete rows for each unique value in the specified column, making it the correct choice for "unique on a whole DataFrame" operations.
  • Series.unique() extracts only the distinct values as a NumPy array, discarding all other column data.
  • The methods are defined in pandas/core/frame.py and pandas/core/series.py respectively, and both delegate to compiled hashtable routines for the actual work.
  • The keep parameter controls which duplicates survive: 'first' (the default), 'last', or False to drop every row whose value is repeated.
  • These operations return new objects and leave the original DataFrame unchanged, unless drop_duplicates() is called with inplace=True.

Frequently Asked Questions

What is the difference between Series.unique() and DataFrame.drop_duplicates()?

Series.unique() returns a NumPy array containing only the distinct values from a single column, while DataFrame.drop_duplicates() returns a new DataFrame containing entire rows. Use unique() when you need a list of values for lookup or iteration, and use drop_duplicates() when you need to preserve the full row context associated with each unique value.

How do I keep the last duplicate row instead of the first?

Pass keep='last' to the drop_duplicates() method. By default, pandas retains the first occurrence (keep='first'), but setting this parameter to 'last' ensures the final occurrence of each unique value survives the deduplication process.

Does drop_duplicates() modify the original DataFrame?

Not by default. drop_duplicates() returns a new DataFrame with duplicate rows removed based on the specified subset columns and leaves the original unchanged. The method, defined in pandas/core/frame.py, also accepts inplace=True, in which case it modifies the original DataFrame and returns None.
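A quick way to confirm this behavior is to compare row counts before and after the call. The sketch below also shows the inplace=True variant of drop_duplicates(), which mutates the frame and returns None:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "London", "Paris"],
    "population": [2.1, 8.9, 2.1],
})

# Default call: a new frame comes back, the original keeps all rows.
deduped = df.drop_duplicates(subset="city")
print(len(df), len(deduped))  # 3 2

# inplace=True: the original is modified and the call returns None.
result = df.drop_duplicates(subset="city", inplace=True)
print(result, len(df))  # None 2
```

Assigning the returned frame to a new name is usually the clearer choice, since it keeps the original data available for comparison.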

Can I get unique rows based on multiple columns?

Yes, pass a list of column names to the subset parameter: df.drop_duplicates(subset=['col1', 'col2']). This identifies uniqueness based on the combination of values across all specified columns, returning only the first occurrence of each distinct combination.
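As a minimal sketch of the multi-column case, the hypothetical visits frame below (column names are illustrative) has two rows sharing the same city and year; only the first of that pair is kept.

```python
import pandas as pd

# Hypothetical data; column names are assumptions made for this sketch.
visits = pd.DataFrame({
    "city": ["Paris", "Paris", "Paris", "London"],
    "year": [2020, 2020, 2021, 2020],
    "visitors": [10, 12, 15, 20],
})

# Rows 0 and 1 share ("Paris", 2020); row 1 is dropped.
unique_pairs = visits.drop_duplicates(subset=["city", "year"])
print(unique_pairs["visitors"].tolist())  # [10, 15, 20]
```

Rows that match on city alone, such as Paris in 2020 and Paris in 2021, are both retained because the combination differs.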
