How to Filter for Distinct Values in a Pandas DataFrame Using the `unique` Function

Use df['column'].unique() to return an array of distinct values from a specific column, then combine the result with isin() or boolean indexing to filter the DataFrame.

The unique function provides a fast way to extract distinct values from a pandas Series. In the pandas-dev/pandas source code, the operation is backed by compiled hashtable routines (written in Cython), and it is a common starting point for distinct-value filtering in DataFrame workflows.

Understanding the unique Function Architecture

The unique operation in pandas is designed specifically for Series objects rather than entire DataFrames. When you call df['column'].unique(), pandas delegates the operation through multiple layers of optimized code.

Series-Level Entry Point

The public API for unique is implemented in pandas/core/series.py (line 2306 in the version examined; line numbers shift between releases). In simplified form, the Series.unique method extracts the underlying array data and passes it to the core algorithm:

def unique(self) -> ArrayLike:
    return algorithms.unique(self._values)

This design ensures that any Series—whether backed by NumPy arrays, ExtensionArrays, or categorical data—can leverage the same uniqueness logic.

Core Algorithm Implementation

The heavy lifting occurs in pandas/core/algorithms.py (line 322 in the version examined; line numbers shift between releases). The unique function detects the input array type and routes to specialized fast paths:

  • NumPy arrays: Delegates to hashtable-based uniqueness checks
  • ExtensionArrays: Uses type-specific implementations while preserving order
  • Categorical data: Leverages category codes for efficiency

The function returns a NumPy array or ExtensionArray containing each distinct value exactly once, preserving the order of first appearance.
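
As a rough illustration of the routing described above, the sketch below calls unique() on three Series with different backing arrays and inspects what comes back. The sample values and variable names are made up for this example; only the return types and ordering are the point.

```python
import pandas as pd

# NumPy-backed Series: the result is a plain ndarray
numpy_backed = pd.Series([3, 1, 3, 2, 1])
numpy_result = numpy_backed.unique()

# Nullable-integer Series (an ExtensionArray): the result keeps the extension type
nullable = pd.Series([1, 2, 1, None], dtype="Int64")
nullable_result = nullable.unique()

# Categorical Series: the result is itself a Categorical
categorical = pd.Series(pd.Categorical(["b", "a", "b"]))
categorical_result = categorical.unique()

print(type(numpy_result).__name__, numpy_result.tolist())
print(type(nullable_result).__name__, len(nullable_result))
print(type(categorical_result).__name__, list(categorical_result))
```

In each case the values come back in order of first appearance, and the container type follows the dtype of the input Series.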

How to Filter for Distinct Values in Practice

The unique function serves as the foundation for multiple distinct-value filtering patterns in pandas DataFrames.

Extracting Distinct Values from a Column

To retrieve distinct values from a specific column, access the column as a Series and call unique():

import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Berlin", "Paris", "London", "Berlin"],
    "population": [2_200_000, 3_600_000, 2_200_000, 8_900_000, 3_600_000],
})

# Get distinct city names
distinct_cities = df["city"].unique()
print(distinct_cities)
# Output: ['Paris' 'Berlin' 'London']

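For an object-dtype column, this returns a NumPy array containing ['Paris', 'Berlin', 'London'] in order of first appearance. Columns backed by an extension dtype return the matching ExtensionArray instead.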

Filtering Rows Using Distinct Values

Combine unique() with isin() to filter DataFrame rows based on distinct value membership:


# Illustrative only: every city belongs to its own distinct set, so this keeps all rows
distinct_cities = df["city"].unique()
filtered_df = df[df["city"].isin(distinct_cities)]

# More practical: filter using a subset of the distinct values
target_cities = df["city"].unique()[:2]  # First two distinct cities: Paris, Berlin
result = df[df["city"].isin(target_cities)]

This pattern is useful when you need to restrict a DataFrame to a chosen subset of its distinct values, or to check one dataset against the distinct values present in another.
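
One way to apply the unique() plus isin() combination to validation is sketched below. The `reference` and `incoming` frames, their columns, and their values are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical reference data whose distinct cities define the allowed set
reference = pd.DataFrame({"city": ["Paris", "Berlin", "Paris", "London"]})

# Hypothetical new data to check against the reference
incoming = pd.DataFrame({
    "city": ["Paris", "Madrid", "London", "Rome"],
    "visits": [10, 4, 7, 2],
})

known_cities = reference["city"].unique()
is_known = incoming["city"].isin(known_cities)  # boolean mask, one entry per row

accepted = incoming[is_known]
rejected = incoming[~is_known]
print(rejected["city"].tolist())
```

Keeping the boolean mask in its own variable makes it easy to split the data into matching and non-matching rows without recomputing the membership test.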

Alternative: Using drop_duplicates for Distinct Rows

When you need distinct rows rather than just distinct values from a single column, use drop_duplicates(). It works by flagging repeated rows with duplicated() rather than by calling unique():


# Get distinct rows based on the 'city' column
distinct_rows = df.drop_duplicates(subset=["city"], keep="first")

# Same result, selecting by the index labels of each first occurrence
first_occurrence_idx = df.drop_duplicates(subset="city", keep="first").index
df_distinct = df.loc[first_occurrence_idx]

In the pandas source code, drop_duplicates is built on DataFrame.duplicated(), which flags repeated rows across the chosen columns. It relies on hashtable-based routines similar to those behind unique, not on algorithms.unique itself.
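
The link between drop_duplicates() and duplicated() can be checked directly. The sketch below reuses the sample city data and compares the two routes; the variable names are chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Berlin", "Paris", "London", "Berlin"],
    "population": [2_200_000, 3_600_000, 2_200_000, 8_900_000, 3_600_000],
})

# True for every row whose city has already appeared earlier
repeat_mask = df.duplicated(subset=["city"], keep="first")

via_mask = df[~repeat_mask]
via_drop = df.drop_duplicates(subset=["city"], keep="first")

print(repeat_mask.tolist())
print(via_mask.equals(via_drop))
```

Both routes keep the original index labels of the surviving rows, so the cities in the result line up with the order returned by df["city"].unique().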

Summary

  • Series.unique is the primary method for extracting distinct values from a DataFrame column, implemented in pandas/core/series.py.
  • The core algorithm resides in pandas/core/algorithms.py and handles NumPy arrays, ExtensionArrays, and categorical data through optimized fast paths.
  • Use df['column'].unique() to return an array of distinct values in order of first appearance.
  • Combine unique() with isin() to filter DataFrame rows based on distinct value membership.
  • For distinct rows rather than distinct values, use drop_duplicates(), which removes repeated rows by way of duplicated().

Frequently Asked Questions

Does unique work on DataFrames directly?

No, the unique method is defined only for Series objects. To get distinct values from a DataFrame, you must select a specific column using df['column_name'].unique(). If you attempt to call df.unique() on an entire DataFrame, pandas will raise an AttributeError.
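
A short sketch of the error and two possible workarounds follows. The `origin` and `destination` columns are hypothetical, and flattening with to_numpy().ravel() is just one way to pool several columns, assumed here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "origin": ["Paris", "Berlin", "Paris"],
    "destination": ["Berlin", "London", "Berlin"],
})

# DataFrame has no unique method, so attribute lookup fails
try:
    df.unique()
    error_message = ""
except AttributeError as exc:
    error_message = str(exc)

# Option 1: distinct values per column
per_column = {name: list(df[name].unique()) for name in df.columns}

# Option 2: distinct values pooled across columns, flattened row by row
pooled = df[["origin", "destination"]].to_numpy().ravel()
across_columns = list(pd.unique(pooled))

print(error_message)
print(per_column)
print(across_columns)
```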

How does unique handle NaN values?

Series.unique() keeps missing values in its result. Repeated NaN (Not a Number) entries are treated as equal to one another, so they collapse to a single NaN at the position of its first appearance. For categorical data, NaN is never one of the categories, but a missing entry still appears once in the result of unique(). Call dropna() before unique() if you want missing values excluded.
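
A minimal sketch of how missing values show up in practice, using a made-up float Series: repeated NaN entries appear once in the result, and dropna() removes them beforehand.

```python
import numpy as np
import pandas as pd

readings = pd.Series([1.5, np.nan, 2.0, np.nan, 1.5])

with_missing = readings.unique()              # NaN is kept, but only once
without_missing = readings.dropna().unique()  # NaN removed before deduplicating

print(len(with_missing))
print(without_missing.tolist())
```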

What is the difference between unique and drop_duplicates?

unique operates on a single Series and returns an array of distinct values, while drop_duplicates operates on DataFrames and returns a subset of rows. The unique method is located in pandas/core/series.py and returns values in order of first appearance as a NumPy array or ExtensionArray. In contrast, drop_duplicates is a DataFrame method that can consider multiple columns and returns a DataFrame with duplicate rows removed, keeping the first or last occurrence based on the keep parameter.

Is unique faster than converting to a set?

For large Series, Series.unique() is generally faster than converting to a Python set because it uses compiled hashtable operations through pandas.core.algorithms.unique. It also preserves the order of first appearance and keeps pandas-specific data types (such as Categorical, datetime, and nullable integers) intact, whereas a Python set has no defined order and discards the original array type. Actual timings depend on the dtype and size of the data, so measure on your own workload.
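
To compare the two approaches on your own machine, a rough timing sketch along these lines may help. The Series contents and repeat counts are arbitrary choices for illustration, and no particular timing result is implied.

```python
import timeit

import pandas as pd

# Arbitrary sample: 500,000 integers with only three distinct values
values = pd.Series([3, 1, 3, 2, 1] * 100_000)

ordered = values.unique().tolist()  # order of first appearance
unordered = set(values)             # a set makes no ordering promise

unique_seconds = timeit.timeit(values.unique, number=5)
set_seconds = timeit.timeit(lambda: set(values), number=5)
print(f"unique: {unique_seconds:.4f}s, set: {set_seconds:.4f}s")
```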
