How to Use Pandas Unique on a Whole DataFrame Based on a Column
Use DataFrame.drop_duplicates(subset='column') to retrieve complete rows for each unique value in a specific column, or use Series.unique() when you only need the distinct values as a NumPy array.
The pandas-dev/pandas repository provides optimized, C-level implementations for deduplication workflows. When working with tabular data, applying pandas unique on a whole dataframe based on a column typically requires preserving the full row context associated with each distinct value, which demands a different approach than simple value extraction.
Choosing Between Series.unique() and DataFrame.drop_duplicates()
Pandas offers two primary mechanisms for handling uniqueness, each defined in separate core modules:
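Series.unique() (implemented in pandas/core/series.py) returns an array containing only the distinct values from a single column, in order of first appearance. The result is a NumPy array for NumPy-backed dtypes and an ExtensionArray for extension dtypes. This method discards all other columns and does not preserve the original DataFrame structure.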
DataFrame.drop_duplicates() (implemented in pandas/core/frame.py) returns a new DataFrame containing the first (or last) occurrence of each unique value in the specified column while retaining all other columns intact. This matches the conventional interpretation of getting unique rows based on a column value.
Both methods rely on vectorized, hashtable-based routines that run in compiled code rather than Python-level loops, so they remain fast on datasets containing millions of rows.
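As a quick illustration of the distinction above, the following sketch runs both calls on the same small frame and inspects what comes back. The sample data mirrors the example used later in this article; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "London", "Paris", "Berlin"],
    "population": [2.1, 8.9, 2.1, 3.6],
})

# Distinct values only: a one-dimensional array, no row context
values = df["city"].unique()

# Full rows: one row per distinct city, all columns retained
rows = df.drop_duplicates(subset="city")

print(type(values).__name__, len(values))
print(type(rows).__name__, rows.shape)
```

The array has three entries, while the DataFrame has three rows and still carries both columns.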
Implementation Details in the Source Code
The underlying functionality resides in two critical files within the pandas source tree:
pandas/core/series.py: Contains the unique() method for Series objects, which delegates to hashtable-based algorithms for distinct value extraction.
pandas/core/frame.py: Houses the drop_duplicates() method, which manages row-wise deduplication while supporting complex subset logic and ordering controls.
Additionally, the index-level counterparts of these operations, Index.unique() and Index.drop_duplicates(), are defined in pandas/core/indexes/base.py.
Practical Code Examples
Extracting Unique Values from a Single Column
When you need only the distinct values without row context, access the column as a Series and call unique():
import pandas as pd
df = pd.DataFrame({
    "city": ["Paris", "London", "Paris", "Berlin"],
    "population": [2.1, 8.9, 2.1, 3.6],
})
# Returns NumPy array of unique city names
unique_cities = df["city"].unique()
print(unique_cities)
['Paris' 'London' 'Berlin']
Retaining Unique Rows Based on a Column
To perform pandas unique on a whole dataframe based on a column while keeping all associated data, use drop_duplicates() with the subset parameter:
# Keep the first occurrence of each unique city
unique_rows = df.drop_duplicates(subset="city")
print(unique_rows)
city population
0 Paris 2.1
1 London 8.9
3 Berlin 3.6
Preserving the Last Occurrence Instead of the First
Control which duplicate row survives using the keep parameter:
# Retain the last row for each unique city
unique_rows_last = df.drop_duplicates(subset="city", keep="last")
print(unique_rows_last)
city population
1 London 8.9
2 Paris 2.1
3 Berlin 3.6
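The keep parameter also accepts False, which drops every row whose value appears more than once instead of keeping one representative. A minimal sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "London", "Paris", "Berlin"],
    "population": [2.1, 8.9, 2.1, 3.6],
})

# keep=False removes all rows for any city that occurs more than once,
# so both "Paris" rows are dropped
only_singletons = df.drop_duplicates(subset="city", keep=False)
print(only_singletons)
```

Only the London and Berlin rows (index 1 and 3) remain, which is useful when duplicated keys indicate records you want to exclude entirely.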
Sorting Results After Deduplication
Chain methods to deduplicate first, then reorder the results:
# Get unique cities then sort by population descending
unique_sorted = (
    df.drop_duplicates(subset="city")
    .sort_values("population", ascending=False)
)
print(unique_sorted)
city population
1 London 8.9
3 Berlin 3.6
0 Paris 2.1
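The reverse order, sorting first and deduplicating second, is a common way to control which row survives for each value. The sketch below assumes a hypothetical year column and differing population figures, neither of which appears in the example above, and keeps the most recent record per city.

```python
import pandas as pd

# Hypothetical data: the "year" column and the 2010 figure are invented
# for illustration only
records = pd.DataFrame({
    "city": ["Paris", "London", "Paris", "Berlin"],
    "year": [2010, 2020, 2020, 2020],
    "population": [2.2, 8.9, 2.1, 3.6],
})

# Sort so the newest record for each city comes last, then keep that one.
# kind="stable" preserves the original order among rows with equal years.
latest = (
    records.sort_values("year", kind="stable")
    .drop_duplicates(subset="city", keep="last")
)
print(latest)
```

Because the sort runs before the deduplication, keep="last" selects the 2020 row for Paris rather than whichever row happened to appear last in the input.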
Summary
- DataFrame.drop_duplicates(subset='col') returns complete rows for each unique value in the specified column, making it the correct choice for "unique on a whole DataFrame" operations.
- Series.unique() extracts only the distinct values as a NumPy array, discarding all other column data.
- Both methods are implemented in C-accelerated code within pandas/core/frame.py and pandas/core/series.py respectively.
- The keep parameter controls whether to preserve the first, last, or no occurrences of duplicate values.
- These operations return new objects and do not modify the original DataFrame in-place.
Frequently Asked Questions
What is the difference between Series.unique() and DataFrame.drop_duplicates()?
Series.unique() returns a NumPy array containing only the distinct values from a single column, while DataFrame.drop_duplicates() returns a new DataFrame containing entire rows. Use unique() when you need a list of values for lookup or iteration, and use drop_duplicates() when you need to preserve the full row context associated with each unique value.
How do I keep the last duplicate row instead of the first?
Pass keep='last' to the drop_duplicates() method. By default, pandas retains the first occurrence (keep='first'), but setting this parameter to 'last' ensures the final occurrence of each unique value survives the deduplication process.
Does drop_duplicates() modify the original DataFrame?
No, by default drop_duplicates() returns a new DataFrame and leaves the original unchanged. According to the implementation in pandas/core/frame.py, this method creates a copy of the data with duplicate rows removed based on the specified subset columns. The exception is inplace=True, which modifies the DataFrame directly and returns None.
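A short sketch that checks this behavior, using the same sample data as earlier. It also shows the inplace=True option of drop_duplicates(), which changes the frame it is called on and returns None.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "London", "Paris", "Berlin"],
    "population": [2.1, 8.9, 2.1, 3.6],
})

# Default behavior: a new DataFrame comes back, df keeps all of its rows
deduped = df.drop_duplicates(subset="city")
print(len(df), len(deduped))

# inplace=True mutates the object it is called on and returns None
working = df.copy()
result = working.drop_duplicates(subset="city", inplace=True)
print(result, len(working))
```

Assigning the return value, as in the first call, is usually the clearer choice because it keeps the original data available for comparison.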
Can I get unique rows based on multiple columns?
Yes, pass a list of column names to the subset parameter: df.drop_duplicates(subset=['col1', 'col2']). This identifies uniqueness based on the combination of values across all specified columns, returning only the first occurrence of each distinct combination.
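For instance, with a hypothetical country column added (the data below is invented for illustration), rows count as duplicates only when both city and country match:

```python
import pandas as pd

# Hypothetical data: the "country" column and the Paris, USA row are
# illustrative and not part of the earlier examples
df = pd.DataFrame({
    "city": ["Paris", "Paris", "Paris", "London"],
    "country": ["France", "France", "USA", "UK"],
    "population": [2.1, 2.1, 0.025, 8.9],
})

# Uniqueness is judged on the (city, country) pair, not on city alone
unique_pairs = df.drop_duplicates(subset=["city", "country"])
print(unique_pairs)
```

The second Paris, France row is dropped, while Paris, USA survives because its combination of values is distinct.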