How to Use pandas sort by column: A Complete Guide to DataFrame.sort_values
Use DataFrame.sort_values() to reorder rows by column values, specifying the by parameter for single or multiple columns, and control the sort order with ascending, kind, and key arguments.
When you need to organize tabular data, the sort_values() method, defined in the pandas-dev/pandas repository, is the standard tool for a pandas sort by column operation. It computes a sort permutation with NumPy-backed argsort routines and then reorders the rows with a single take operation. Whether you are ranking sales figures, ordering timestamps, or prioritizing categorical labels, understanding the underlying implementation helps you write faster, more memory-efficient code.
Understanding the pandas sort by column Implementation
The DataFrame.sort_values method works in two stages: it first computes an integer permutation that would sort the chosen columns, then applies that permutation to every row of the DataFrame.
The Core Architecture: From sort_values to the Sort Indexer
In recent pandas releases, the generic NDFrame base class in pandas/core/generic.py (lines 4868-4990) declares sort_values and carries its shared documentation, but leaves the method body abstract.
The DataFrame-specific type signatures and overloads reside in pandas/core/frame.py (lines 7923-7950), alongside the concrete implementation. That implementation validates arguments such as by, axis, ascending, and kind, then looks up the values behind each label you pass to the by parameter (a string or list of strings) before computing the sort order.
How the Sort Permutation Is Computed
Once the columns are identified, sort_values does not call safe_sort from pandas/core/algorithms.py (lines 1431-1500); that helper, with its _sort_mixed and _sort_tuples routines for mixed-type arrays, backs operations such as factorize. Instead, the sort order comes from the indexer helpers in pandas/core/sorting.py, which cover the following:
- Algorithm Selection: For a single column, nargsort builds on argsort, defaulting to quicksort unless you specify kind='mergesort', 'heapsort', or 'stable'. For multiple columns, lexsort_indexer is used and kind is ignored.
- Type Handling: Values within a column must be mutually comparable. An object column that mixes strings and integers raises a TypeError unless a key function normalizes the values first.
- NaN Management: The na_position argument is applied here, determining whether missing values float to the top or sink to the bottom.
- Key Function Application: If you provide a vectorized key function (e.g., lambda s: s.str.lower()), it is applied to the column values before the sort permutation is calculated.
After the permutation index is computed, the DataFrame's block manager applies it via self._mgr.take, reordering all rows in a single step.
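The permutation-then-take idea can be imitated from user code. The sketch below is an approximation rather than the pandas internals: it substitutes NumPy's argsort and DataFrame.iloc for the internal helpers, and the demo frame and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data, not taken from the pandas source.
demo = pd.DataFrame({"name": ["c", "a", "b"], "score": [30, 10, 20]})

# Step 1: compute the permutation that would sort the column.
order = np.argsort(demo["score"].to_numpy(), kind="stable")

# Step 2: reorder every row by position using that permutation.
manual = demo.iloc[order]

# The public API should yield the same rows in the same order.
builtin = demo.sort_values(by="score", kind="stable")
print(manual.equals(builtin))
```

Splitting the work this way means the comparison-heavy step touches only the sort column, while the remaining columns are simply reindexed.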
Practical Examples: pandas sort by column in Action
The following examples demonstrate how to leverage sort_values for common data organization tasks.
Sort by a Single Column
To perform a simple alphabetical pandas sort by column, pass the column name as a string to the by parameter:
import pandas as pd

df = pd.DataFrame(
    {
        "city": ["Paris", "Berlin", "London", "Tokyo", "New York"],
        "population": [2_200_000, 3_600_000, 8_900_000, 13_900_000, 8_300_000],
        "area_km2": [105, 891, 1572, 2194, 783],
    }
)
# Sort alphabetically by city name
sorted_by_city = df.sort_values(by="city")
print(sorted_by_city)
       city  population  area_km2
1    Berlin     3600000       891
2    London     8900000      1572
4  New York     8300000       783
0     Paris     2200000       105
3     Tokyo    13900000      2194
Sort by Multiple Columns with Different Orders
For complex ranking, supply a list to by and a matching list to ascending. This example sorts by descending population, then ascending area:
# Primary sort: population (high to low)
# Secondary sort: area (low to high) for ties
# kind is omitted: pandas applies it only when sorting on a single column
sorted_multi = df.sort_values(
    by=["population", "area_km2"],
    ascending=[False, True],
)
print(sorted_multi)
       city  population  area_km2
3     Tokyo    13900000      2194
2    London     8900000      1572
4  New York     8300000       783
1    Berlin     3600000       891
0     Paris     2200000       105
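The city data contains no duplicate populations, so the secondary key never comes into play in the output above. A small sketch with made-up data (the teams frame and its columns are illustrative only) shows how the second column breaks ties:

```python
import pandas as pd

# Hypothetical standings with tied win counts.
teams = pd.DataFrame(
    {
        "team": ["A", "B", "C", "D"],
        "wins": [10, 12, 10, 12],
        "losses": [5, 3, 2, 4],
    }
)

# Most wins first; among equal wins, fewest losses first.
ranked = teams.sort_values(by=["wins", "losses"], ascending=[False, True])
print(ranked["team"].tolist())  # ['B', 'D', 'C', 'A']
```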
Using a Custom Key Function
Apply vectorized transformations before sorting without modifying the original data. This example performs a case-insensitive sort:
# Sort ignoring case sensitivity
sorted_key = df.sort_values(
    by="city",
    key=lambda s: s.str.lower(),
)
print(sorted_key)
       city  population  area_km2
1    Berlin     3600000       891
2    London     8900000      1572
4  New York     8300000       783
0     Paris     2200000       105
3     Tokyo    13900000      2194
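Because every city name here starts with a capital letter, the key does not change the result. The effect becomes visible with mixed-case values, as in this sketch (the fruits frame is invented for illustration):

```python
import pandas as pd

# Hypothetical data mixing lowercase and capitalized strings.
fruits = pd.DataFrame({"fruit": ["banana", "Cherry", "apple", "Date"]})

# Default ordering compares code points, so uppercase letters sort first.
plain = fruits.sort_values(by="fruit")["fruit"].tolist()

# The key receives the whole Series and must return one of the same length.
folded = fruits.sort_values(by="fruit", key=lambda s: s.str.lower())["fruit"].tolist()

print(plain)   # ['Cherry', 'Date', 'apple', 'banana']
print(folded)  # ['apple', 'banana', 'Cherry', 'Date']
```

The key only influences the ordering; the values stored in the result keep their original casing.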
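In-Place Sorting
Passing inplace=True updates the existing DataFrame and returns None. Treat it as a convenience rather than a memory optimization: pandas still builds the sorted data before swapping it in, so peak memory use is generally comparable to the default call.
# Modify the DataFrame directly; the call returns None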
df.sort_values(by="population", inplace=True, ascending=False)
print(df)
       city  population  area_km2
3     Tokyo    13900000      2194
2    London     8900000      1572
4  New York     8300000       783
1    Berlin     3600000       891
0     Paris     2200000       105
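A common mistake is assigning the result of an in-place call back to a variable. The following sketch, using a made-up scores frame, checks the return value and contrasts it with the default, non-mutating form:

```python
import pandas as pd

# Hypothetical data for illustration.
scores = pd.DataFrame({"player": ["x", "y", "z"], "points": [7, 9, 8]})

# inplace=True mutates `scores` and returns None, so do not assign the result.
result = scores.sort_values(by="points", ascending=False, inplace=True)
print(result is None)  # True

# The default form returns a new DataFrame and leaves its input untouched.
ascending_copy = scores.sort_values(by="points")
print(scores["player"].tolist())          # ['y', 'z', 'x']
print(ascending_copy["player"].tolist())  # ['x', 'z', 'y']
```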
Performance Considerations for Large DataFrames
The efficiency of pandas sort by column operations stems from the design described above: the sort order is computed once as an integer array by NumPy-backed argsort routines, and every column is then reordered with a single take. The cost is therefore dominated by the sort itself rather than by Python-level row handling, even when processing millions of rows.
Key performance characteristics include:
- Algorithm Selection: Choose kind='mergesort' or kind='stable' when you need to preserve the relative order of equal elements; use kind='quicksort' (the default) for fast average-case performance on numeric data. The kind argument applies only when sorting on a single column.
- Memory Management: The inplace=True parameter still builds the sorted data through self._mgr.take before replacing the original, so it does not generally lower peak memory.
- Vectorized Keys: A key function receives the entire Series, so it can use vectorized string methods (e.g., .str.lower()), which is significantly faster than row-wise Python loops.
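Relative speed between the kind options depends on data size, dtype, and how ordered the input already is, so it is worth measuring on your own data. A minimal timing sketch, assuming a synthetic integer column (the frame, its size, and the repeat count are arbitrary choices):

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: 200,000 random integers with many repeated values.
rng = np.random.default_rng(0)
big = pd.DataFrame({"value": rng.integers(0, 1_000, size=200_000)})

# Time five sorts per algorithm; results vary by machine and pandas version.
timings = {}
for kind in ("quicksort", "stable"):
    timings[kind] = timeit.timeit(
        lambda: big.sort_values(by="value", kind=kind), number=5
    )

print(timings)
```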
Summary
- DataFrame.sort_values is the primary method for pandas sort by column operations; it is declared in pandas/core/generic.py and implemented in pandas/core/frame.py.
- The sort order is computed by argsort-based indexer helpers in pandas/core/sorting.py, which handle NaN positioning and, on request, stability. The safe_sort helper in pandas/core/algorithms.py is not part of this path.
- You can sort by single or multiple columns using the by parameter, control direction with ascending, and apply transformations with the key argument.
- For large datasets, select the algorithm (kind) based on your stability requirements, and treat inplace=True as a convenience rather than a way to save memory.
Frequently Asked Questions
How do I sort a pandas DataFrame by column values in descending order?
Pass ascending=False to the sort_values method. If sorting by multiple columns, provide a list of booleans matching the length of your by parameter, such as ascending=[False, True] to sort the first column descending and the second ascending.
What is the difference between sort_values and sort_index in pandas?
sort_values rearranges rows based on the data contained within one or more columns, while sort_index reorders rows or columns based on their index labels (row names) or column names. Use sort_values for value-based ranking and sort_index when you need to organize data by its positional or named indices.
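The difference is easiest to see on a frame whose index labels and values are ordered differently. A short sketch with invented data:

```python
import pandas as pd

# Hypothetical prices keyed by string labels.
prices = pd.DataFrame({"price": [3.0, 1.0, 2.0]}, index=["b", "c", "a"])

by_value = prices.sort_values(by="price")  # order rows by the data
by_label = prices.sort_index()             # order rows by the index labels

print(by_value.index.tolist())  # ['c', 'a', 'b']
print(by_label.index.tolist())  # ['a', 'b', 'c']
```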
How does pandas handle missing values when sorting by column?
By default, sort_values places NaN values at the end of the DataFrame regardless of the sort order. You can control this behavior using the na_position parameter, setting it to 'first' to float missing values to the top or 'last' to keep them at the bottom.
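A brief sketch of the three cases, using a made-up readings frame. The original row labels are printed so the position of the missing value (row 1) is easy to follow:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor data with one missing reading at row 1.
readings = pd.DataFrame({"temp": [21.5, np.nan, 19.0, 23.1]})

asc = readings.sort_values(by="temp")
desc = readings.sort_values(by="temp", ascending=False)
first = readings.sort_values(by="temp", na_position="first")

print(asc.index.tolist())    # [2, 0, 3, 1]  NaN last
print(desc.index.tolist())   # [3, 0, 2, 1]  NaN still last
print(first.index.tolist())  # [1, 2, 0, 3]  NaN first
```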
Is the pandas sort_values method stable?
Only when you ask for it on a single-column sort. Specifying kind='stable' or kind='mergesort' preserves the relative order of rows that have equal values in the specified column, and both options map to NumPy's stable sorting algorithm. The default kind='quicksort' makes no such guarantee, and the kind argument is ignored when sorting on multiple columns.
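As an illustration with invented data, a stable sort keeps tied rows in their original relative order:

```python
import pandas as pd

# Hypothetical orders: two customers share each priority level.
orders = pd.DataFrame(
    {
        "customer": ["ann", "bob", "cat", "dan"],
        "priority": [2, 1, 2, 1],
    }
)

# Within each priority, rows keep their original relative order.
stable = orders.sort_values(by="priority", kind="stable")
print(stable["customer"].tolist())  # ['bob', 'dan', 'ann', 'cat']
```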