How to Create a New Column in a Pandas DataFrame Using value_counts
To create a new column containing value counts in pandas, map the result of Series.value_counts() back to the original column using map(), or use groupby().transform('size') for an index-preserving alternative.
Creating a frequency count column is a common data preprocessing task when analyzing distributions within your dataset. The pandas library provides vectorized operations that eliminate the need for explicit Python loops, leveraging optimized Cython routines under the hood. This guide demonstrates how to attach value count results to a DataFrame using methods implemented in the pandas-dev/pandas repository.
Understanding value_counts in pandas
The value_counts functionality exists at two levels in the pandas API. Series.value_counts (implemented in pandas/core/series.py around line 2300) counts unique values within a single column, while DataFrame.value_counts (located in pandas/core/frame.py at lines 8383–8450) counts unique combinations of rows across multiple columns.
For the task of counting values in a single column and attaching those counts back to the original DataFrame, you need the Series implementation. This method returns a Series where the index contains unique values and the corresponding data contains their frequencies.
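As a minimal sketch of that return shape (the fruit values here are illustrative sample data, not taken from any real dataset), the counts Series can be inspected and indexed by value:

```python
import pandas as pd

# Illustrative sample column
fruit = pd.Series(["apple", "banana", "apple", "orange", "banana", "banana"])

# The index holds the unique values; the data holds their frequencies,
# sorted from most to least frequent by default.
counts = fruit.value_counts()
print(counts)

# Look up a single frequency by label
print(counts["banana"])
```

Because the unique values form the index, the result behaves like a lookup table from value to frequency.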
Under the hood, Series.value_counts delegates to value_counts_internal in pandas/core/algorithms.py (lines 839–934), which ultimately calls a Cython hash-table routine in pandas/_libs/hashtable.pyx for the performance-critical counting.
Method 1: Using map() with Series.value_counts()
The most direct approach involves computing the counts and then mapping them back to the original column. This works because the Series returned by value_counts uses the unique values as its index, making it perfectly suited for the map() operation.
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "banana", "apple", "orange", "banana", "banana"],
    "price": [1.2, 0.8, 1.3, 0.9, 0.85, 0.8],
})

# Compute value counts and map back to create new column
counts = df["fruit"].value_counts()
df["fruit_count"] = df["fruit"].map(counts)
print(df)
Output:
    fruit  price  fruit_count
0   apple   1.20            2
1  banana   0.80            3
2   apple   1.30            2
3  orange   0.90            1
4  banana   0.85            3
5  banana   0.80            3
This method is efficient because map() looks up each value in the index of the counts Series. That index is backed by a hash table, so each row's lookup takes O(1) time on average and the whole mapping scales linearly with the number of rows.
Method 2: Using groupby().transform()
An alternative that stays entirely within the DataFrame API uses groupby() combined with transform(). This approach preserves the original DataFrame's index automatically without requiring an explicit mapping step.
# Create new column using groupby transform
df["fruit_count_alt"] = df.groupby("fruit")["fruit"].transform("size")
Both methods produce the same counts for columns without missing values, but the groupby approach handles the alignment to the original index for you. This can be advantageous when working with complex indices or when you need to perform additional aggregations within the same groupby operation.
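A quick way to check that the two approaches agree is to compute both on the same column and compare them. This is a small sketch on illustrative data; the variable names are placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "banana", "apple", "orange", "banana", "banana"],
})

# Method 1: map the counts Series back onto the column
via_map = df["fruit"].map(df["fruit"].value_counts())

# Method 2: group by the column and broadcast each group's size
via_transform = df.groupby("fruit")["fruit"].transform("size")

# Both are aligned to df's index, so they can be compared row by row
print(via_map.tolist())
print(via_transform.tolist())
print(via_map.tolist() == via_transform.tolist())
```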
Handling Missing Values and Categorical Data
Missing Values
By default, value_counts excludes NaN values from the count (using dropna=True). To include missing values in your frequency counts, explicitly set dropna=False:
counts = df["fruit"].value_counts(dropna=False)
df["fruit_with_nan"] = df["fruit"].map(counts)
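The difference is easiest to see on a column that actually contains missing values. The sketch below uses a small illustrative Series with two NaN entries and compares the default behavior with dropna=False.

```python
import numpy as np
import pandas as pd

# Illustrative column with two missing entries
fruit = pd.Series(["apple", np.nan, "apple", "banana", np.nan])

# Default: NaN is not counted, so NaN rows have no match and stay NaN
default_counts = fruit.map(fruit.value_counts())

# dropna=False: NaN gets its own entry in the counts and maps like any value
counts_with_nan = fruit.map(fruit.value_counts(dropna=False))

print(default_counts.tolist())
print(counts_with_nan.tolist())
```

Note that the default result is a float column because it has to hold NaN, while the dropna=False result keeps integer counts.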
Categorical Data
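The mapping pattern works with categorical columns without requiring conversion to object dtype. The value_counts method handles categorical data efficiently by utilizing the underlying category codes. Two details are worth knowing: unobserved categories appear in the counts with a frequency of 0, and mapping a categorical column can return a categorical result when every category maps to a distinct count, so cast the new column if you need a plain integer dtype: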
df["category"] = pd.Categorical(["A", "B", "A", "C", "B", "B"])
df["cat_counts"] = df["category"].map(df["category"].value_counts())
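The following sketch applies the same pattern to a categorical column with one unobserved category ("D", added purely for illustration). Because the mapped result may come back with a categorical dtype, the example casts it to int64 explicitly; that cast assumes the column has no missing values.

```python
import pandas as pd

# Categorical column; "D" is declared as a category but never observed
cat = pd.Series(pd.Categorical(
    ["A", "B", "A", "C", "B", "B"],
    categories=["A", "B", "C", "D"],
))

counts = cat.value_counts()
# Unobserved categories are reported with a count of 0
print(counts.loc["D"])

# Map the counts back, then cast so the new column is a plain integer column
mapped = cat.map(counts).astype("int64")
print(mapped.tolist())
print(mapped.dtype)
```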
Performance and Implementation Details
The efficiency of these operations stems from pandas' layered architecture. When you call df["column"].value_counts(), the execution flows through these key files:
- pandas/core/series.py: Implements the high-level value_counts method for Series objects
- pandas/core/algorithms.py: Contains value_counts_internal (lines 839–934), which handles the algorithm selection and preprocessing
- pandas/_libs/hashtable.pyx: Provides the Cython-optimized hash-table counting implementation that processes the actual data
The map() method leverages the fact that the counts Series is already indexed by the unique values, so no separate join step is needed. For large datasets, this hash-based lookup generally outperforms row-by-row iteration and avoids the overhead of building a merged DataFrame.
Summary
- Use map() with value_counts() to create a frequency column by mapping the counts Series back to your original DataFrame column.
- Use groupby().transform('size') as a concise alternative that preserves index alignment automatically.
- Set dropna=False in value_counts() to include null values in your frequency calculations.
- Reference the source implementation in pandas/core/series.py and pandas/core/algorithms.py to understand the underlying optimization via Cython routines in pandas/_libs/hashtable.pyx.
Frequently Asked Questions
Why does my new column contain NaN values after using map() with value_counts?
This occurs when your original column contains values that are missing from the value_counts result, typically NaN values (since dropna=True by default) or values that were not present in the Series the counts were computed from, such as a filtered copy of the column. To include missing values in your counts, compute the frequencies using df["col"].value_counts(dropna=False) before mapping.
Is groupby().transform() faster than using map() with value_counts()?
Both methods utilize optimized pandas internals, but performance characteristics vary by dataset shape. The map() approach creates an intermediate Series and performs hash lookups, while groupby().transform() uses the grouping machinery. For most use cases, performance differences are negligible; choose based on code readability and whether you need additional groupby operations.
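If the choice matters for your workload, measuring on data shaped like yours is more reliable than a general rule. Below is a rough timing sketch using the standard-library timeit module; the row count, number of distinct keys, and repeat count are arbitrary assumptions, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: 100,000 rows drawn from 1,000 distinct integer keys
rng = np.random.default_rng(0)
df = pd.DataFrame({"key": rng.integers(0, 1_000, size=100_000)})

def with_map():
    return df["key"].map(df["key"].value_counts())

def with_transform():
    return df.groupby("key")["key"].transform("size")

# Confirm both approaches agree before comparing their speed
same = with_map().tolist() == with_transform().tolist()
print("identical results:", same)

t_map = timeit.timeit(with_map, number=5)
t_transform = timeit.timeit(with_transform, number=5)
print(f"map + value_counts: {t_map:.4f}s for 5 runs")
print(f"groupby transform:  {t_transform:.4f}s for 5 runs")
```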
Can I use value_counts on multiple columns to create a combined frequency column?
For counting combinations of multiple columns, use DataFrame.value_counts() (implemented in pandas/core/frame.py), but note that this returns a Series with a MultiIndex of combinations rather than a mappable result. To attach combined counts back to the original DataFrame, use groupby() on multiple columns with transform('size') instead.
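A short sketch of the multi-column case follows. The column names "fruit" and "store" and their values are hypothetical sample data chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "banana", "apple", "banana", "apple"],
    "store": ["north", "south", "north", "north", "south"],
})

# DataFrame.value_counts returns one count per combination, keyed by a MultiIndex
combo_counts = df[["fruit", "store"]].value_counts()
print(combo_counts)

# groupby on both columns broadcasts each combination's size back to every row
df["pair_count"] = df.groupby(["fruit", "store"])["fruit"].transform("size")
print(df)
```

The groupby result is already aligned to the original rows, which is why it is the simpler choice for attaching combined counts as a column.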
Does value_counts work with all data types in pandas?
Yes, value_counts supports any dtype that pandas can hash, including categorical, nullable integer (Int64), string, datetime, and object types. The underlying Cython hash-table implementation in pandas/_libs/hashtable.pyx handles type-specific optimizations automatically.