How to Find and Sort Pandas Unique Values in a Column
Use df['col'].unique() to extract distinct elements as a NumPy array preserving first-appearance order, then apply np.sort() for a sorted array (drop NaN first if the column holds strings, since NumPy cannot compare str with NaN), or wrap the result in pd.Series() and call sort_values() to stay in pandas and control where NaN values land.
The pandas-dev/pandas repository provides robust tools for data manipulation, where extracting and ordering distinct elements is a common requirement. When you need pandas unique values in a column, understanding the underlying implementation in Series.unique() and the available sorting strategies ensures efficient data processing.
How Series.unique() Extracts Distinct Values
In pandas/core/series.py at line 2302, the Series.unique() method returns the distinct values of the Series. This method delegates the heavy lifting to pandas/core/algorithms.py at line 322, where the low-level uniquification algorithm operates directly on the underlying data buffers. The result preserves the order of first appearance and includes NaN if it is present. For NumPy-backed dtypes the return value is a NumPy ndarray, not a pandas object; extension dtypes such as categorical return an ExtensionArray instead.
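As a quick check of the behavior described above, the following minimal sketch uses a small, made-up integer Series (the values are illustrative, not from the pandas source) to show the return type and the first-appearance ordering.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen so that sorted order differs from appearance order.
s = pd.Series([3, 1, 3, 2, 1])

uniques = s.unique()

print(type(uniques))  # <class 'numpy.ndarray'>
print(uniques)        # [3 1 2]  (order of first appearance, not sorted)
```

Because the result is a plain array for this dtype, it carries no index and no pandas methods until you wrap it in a Series again.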
Sorting Pandas Unique Values in a Column
Once you have extracted the unique values, you have two primary approaches to sort them, each with distinct performance and functionality characteristics.
Approach 1: Fast NumPy Sorting
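np.sort(series.unique()) is usually the quickest route to a sorted result. It operates directly on the NumPy array returned by unique(), using NumPy's compiled sorting routines without pandas overhead, and returns a sorted copy. It requires values that can be compared with one another: an object array that mixes strings with NaN raises a TypeError, so drop missing values first in that case. This method is ideal when you only need the sorted array for display or temporary computation.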
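A minimal sketch of this approach on a numeric column, assuming a hypothetical Series named quantity with no missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with repeated values and no NaN.
quantity = pd.Series([30, 10, 30, 20, 10])

unique_vals = quantity.unique()       # [30 10 20], first-appearance order
sorted_unique = np.sort(unique_vals)  # returns a new, sorted array

print(sorted_unique)  # [10 20 30]
print(unique_vals)    # [30 10 20]  (the original array is left unchanged)
```

np.sort() returns a copy, so the unsorted array stays available if you still need the appearance order.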
Approach 2: Pandas sort_values() with Missing Data Handling
pd.Series(unique_vals).sort_values() converts the array back to a Series and calls sort_values(), implemented in pandas/core/series.py at line 6182. This approach respects pandas-specific sorting rules, including the na_position parameter which places NaN values at the end by default. Use this when you need to chain pandas operations or require explicit control over missing value placement.
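The sketch below illustrates the na_position parameter on made-up string data. The ignore_index=True variant is an optional extra, not something the approach requires; it simply renumbers the result from 0.

```python
import numpy as np
import pandas as pd

# Hypothetical string column containing one missing value.
fruit = pd.Series(["pear", "fig", np.nan, "pear", "kiwi"])
unique_vals = fruit.unique()

# Default: NaN is placed last.
nan_last = pd.Series(unique_vals).sort_values()

# Explicitly place NaN first instead.
nan_first = pd.Series(unique_vals).sort_values(na_position="first")

# Optional: discard the old positions and renumber the sorted result 0..n-1.
tidy = pd.Series(unique_vals).sort_values(ignore_index=True)

print(nan_last.tolist())   # ['fig', 'kiwi', 'pear', nan]
print(nan_first.tolist())  # [nan, 'fig', 'kiwi', 'pear']
```

Without ignore_index=True, the sorted Series keeps the index labels from the unsorted array, which is why the printed index can look out of order.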
Handling Missing Values (NaN)
The two approaches differ in how they treat missing data. Series.unique() keeps NaN as one of the distinct values, and pandas sort_values() places it at the end of the sorted output by default. np.sort() also places NaN at the end for float arrays, but for object arrays that mix strings with NaN it raises a TypeError because Python cannot compare str with float. This makes the pandas approach more consistent for string or mixed-type columns; with NumPy, call dropna() before unique() in those cases.
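To see how np.sort() treats NaN in practice, here is a small sketch under stated assumptions: the two Series are made-up examples, and the string column is forced to dtype=object so the behavior shown is that of a plain object array.

```python
import numpy as np
import pandas as pd

# Float column: np.sort() moves NaN to the end.
floats = pd.Series([2.5, np.nan, 1.0, 2.5])
float_sorted = np.sort(floats.unique())
print(float_sorted.tolist())  # [1.0, 2.5, nan]

# Object column mixing strings and NaN: comparison between str and float fails.
mixed = pd.Series(["b", np.nan, "a"], dtype=object)
try:
    np.sort(mixed.unique())
    raised = False
except TypeError:
    raised = True
print(raised)  # True

# Workaround: drop missing values before extracting the unique values.
clean_sorted = np.sort(mixed.dropna().unique())
print(clean_sorted.tolist())  # ['a', 'b']
```

If NaN must remain in the result, wrapping the unique values in pd.Series() and calling sort_values() avoids the comparison error.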
Complete Code Example
import pandas as pd
import numpy as np
# Sample DataFrame
df = pd.DataFrame({
"category": ["apple", "banana", "apple", "orange", "banana", np.nan]
})
# 1️⃣ Retrieve unique values (unsorted)
unique_vals = df["category"].unique()
print("Unique (unsorted):", unique_vals)
# Output: ['apple' 'banana' 'orange' nan]
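# 2️⃣ Sort using NumPy (fast, returns array)
# Drop NaN first: np.sort() cannot compare str with float NaN in an
# object array and would raise a TypeError
sorted_unique_np = np.sort(df["category"].dropna().unique())
print("Sorted (NumPy):", sorted_unique_np)
# Output: ['apple' 'banana' 'orange']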
# 3️⃣ Sort using pandas (keeps pandas type, handles NaN)
sorted_unique_pd = pd.Series(unique_vals).sort_values()
print("Sorted (pandas):")
print(sorted_unique_pd)
# Output:
# 0     apple
# 1    banana
# 2    orange
# 3       NaN
# dtype: object
The df["category"].unique() call executes the unique() algorithm from pandas/core/algorithms.py, while sort_values() utilizes the implementation in pandas/core/series.py to provide index-aware sorting with NaN handling.
Summary
Series.unique() returns a NumPy array of distinct values in order of first appearance, with core logic in pandas/core/algorithms.py at line 322.
np.sort() is typically the fastest way to sort that array, provided the values can be compared with one another (drop NaN from string columns first).
pd.Series().sort_values() offers pandas-native sorting with missing value handling via the implementation in pandas/core/series.py at line 6182.
NaN is returned as one of the unique values and is sorted to the end by default in pandas operations.
Frequently Asked Questions
Does unique() return sorted values automatically?
No. According to the pandas source code in pandas/core/series.py, the unique() method returns values in the order of their first appearance in the Series. If you need sorted results, you must explicitly apply np.sort() or sort_values() to the output.
How does pandas handle NaN values when sorting unique results?
When using pandas sort_values(), NaN values are placed at the end of the Series by default. In the underlying implementation at pandas/core/series.py line 6182, the method respects the na_position parameter, allowing you to place missing values at the beginning or end of your sorted unique values.
Which sorting method is faster for large datasets?
NumPy sorting (np.sort(series.unique())) is generally faster because it operates directly on the array without creating intermediate pandas objects or building an index. However, it fails on object arrays that mix strings with NaN, so pandas sorting (pd.Series(unique_vals).sort_values()) is the better choice when the data may contain missing values, when you require pandas-specific features like na_position control, or when integrating the result into a pandas workflow.
Can I get unique values from multiple columns at once?
No, Series.unique() operates on a single column only. For multiple columns, use DataFrame.drop_duplicates() to remove duplicate rows based on specified columns, which internally uses similar hashing algorithms located in pandas/core/algorithms.py but applied across multiple Series simultaneously.
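A minimal sketch of the multi-column case, assuming a hypothetical DataFrame with category, color, and price columns (none of these names come from the pandas source):

```python
import pandas as pd

# Hypothetical data: the (category, color) pair "apple"/"red" appears twice.
df = pd.DataFrame({
    "category": ["apple", "banana", "apple", "banana"],
    "color": ["red", "yellow", "red", "green"],
    "price": [1, 2, 3, 4],
})

unique_pairs = (
    df[["category", "color"]]            # keep only the columns of interest
    .drop_duplicates()                   # one row per distinct combination
    .sort_values(["category", "color"])  # sort by category, then color
    .reset_index(drop=True)              # renumber rows 0..n-1
)

print(unique_pairs)
```

The result holds three distinct pairs: apple/red, banana/green, and banana/yellow. Selecting the columns before calling drop_duplicates() keeps unrelated columns such as price out of the output.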