How to Use Pandas Groupby Multiple Columns to Count Items in Each Group
Use df.groupby(['col1', 'col2']).size() to count rows in each group, or df.groupby(['col1', 'col2'])['col3'].count() to count non-null values in a specific column.
Grouping by multiple columns is a core pandas feature, implemented in the pandas-dev/pandas repository, for aggregating data across combinations of categorical values. When you pass a list of column names to the groupby method, pandas forms one group per distinct combination of values, indexes the result with a MultiIndex, and relies on optimized Cython routines to compute counts efficiently, even on large datasets.
How Pandas Groupby Multiple Columns Works Internally
The DataFrame.groupby Entry Point
The public API for grouping operations begins in pandas/core/frame.py, where the DataFrame.groupby method is defined. When you call df.groupby(['city', 'year']), this method validates the column list and delegates to the internal grouping machinery.
GroupBy Object Construction
The actual grouping logic resides in pandas/core/groupby/generic.py, which implements the DataFrameGroupBy class. This class inherits from the base GroupBy class and provides methods like .size() and .count() that you call after grouping.
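As a quick check of the class names mentioned above, the following sketch inspects the object returned by groupby. The small DataFrame is made up for illustration, and the check relies only on the class names rather than on any internal import path.

```python
import pandas as pd

# Illustrative data, not taken from the article's main example
df = pd.DataFrame({
    "city": ["NY", "NY", "LA"],
    "year": [2020, 2020, 2021],
    "sales": [100, 150, 200],
})

grouped = df.groupby(["city", "year"])

# Grouping is lazy: nothing is counted until an aggregation is called
print(type(grouped).__name__)           # DataFrameGroupBy
print(type(grouped["sales"]).__name__)  # SeriesGroupBy
print(grouped.ngroups)                  # 2 distinct (city, year) pairs
```

Selecting a single column from the grouped object yields a SeriesGroupBy, which is why .count() on one column returns a Series rather than a DataFrame.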
MultiIndex Key Generation
In pandas/core/groupby/groupby.py, the base GroupBy class handles the mechanics of grouping that are shared by Series and DataFrame groupers. When you pass a list of column names, the aggregated result is indexed by a MultiIndex in which each level corresponds to one grouping column, so individual groups can be looked up or reshaped by key.
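To make the hierarchical index concrete, here is a minimal sketch that inspects the index of a two-column count and looks up one group by its key tuple. It uses only the city and year columns from the example data used later in this article.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "NY", "LA", "LA", "NY", "LA"],
    "year": [2020, 2020, 2021, 2021, 2021, 2020],
})

counts = df.groupby(["city", "year"]).size()

# The result index has one level per grouping column
print(isinstance(counts.index, pd.MultiIndex))  # True
print(list(counts.index.names))                 # ['city', 'year']

# Look up a single group with a tuple of key values
ny_2020 = counts.loc[("NY", 2020)]
print(ny_2020)  # 2

# Pivot the inner level into columns for a city-by-year table
wide = counts.unstack("year", fill_value=0)
print(wide)
```

The unstack call is optional; it is often convenient when the counts are meant to be read as a cross-tabulation.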
Optimized Counting Operations
Much of the speed comes from compiled code. pandas/_libs/hashtable.pyx is a Cython module that implements the hash tables pandas uses to map group keys to integer codes. Once every row carries an integer group code, .size() and .count() can tally rows per code in compiled loops rather than in Python, which keeps them fast even on millions of rows.
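The idea of hash-based grouping can be illustrated with public APIs. The sketch below is a conceptual illustration only and is not the code path pandas actually takes: it assumes pd.factorize as a stand-in for the internal key encoding and np.bincount as a stand-in for the compiled counting loop.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "NY", "LA", "LA", "NY", "LA"],
    "year": [2020, 2020, 2021, 2021, 2021, 2020],
})

# Step 1: encode each key column as integer codes (hash-based under the hood)
city_codes, city_labels = pd.factorize(df["city"])
year_codes, year_labels = pd.factorize(df["year"])

# Step 2: combine the per-column codes into a single group id per row
n_years = len(year_labels)
group_ids = city_codes * n_years + year_codes

# Step 3: count rows per group id in one vectorized pass
sizes = np.bincount(group_ids, minlength=len(city_labels) * n_years)

# Decode group ids back into (city, year) keys, skipping empty combinations
manual = {
    (city_labels[gid // n_years], int(year_labels[gid % n_years])): int(n)
    for gid, n in enumerate(sizes)
    if n > 0
}
print(manual)
```

The resulting dictionary holds the same counts that df.groupby(["city", "year"]).size() reports, which is the point of the sketch: grouping reduces to integer encoding followed by a tally.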
Practical Examples: Counting with Pandas Groupby Multiple Columns
Count Rows in Each Group with .size()
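The .size() method returns the number of rows for each combination of grouping columns. Rows are counted even when their non-key columns contain NaN. Rows whose grouping keys are NaN are dropped by default; pass dropna=False to groupby to keep them.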
import pandas as pd
df = pd.DataFrame({
"city": ["NY", "NY", "LA", "LA", "NY", "LA"],
"year": [2020, 2020, 2021, 2021, 2021, 2020],
"sales": [100, 150, 200, 250, 300, 400],
"product": ["A", "B", "A", "B", "A", "B"]
})
# Group by two columns and count rows per group
counts = df.groupby(["city", "year"]).size()
print(counts)
Output:
city  year
LA    2020    1
      2021    2
NY    2020    2
      2021    1
dtype: int64
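Missing values in the grouping keys themselves deserve a separate check. The following sketch assumes pandas 1.1 or later, where groupby accepts the dropna parameter, and uses a small made-up DataFrame with one missing city.

```python
import pandas as pd

# Illustrative data with a missing grouping key in the third row
df_keys = pd.DataFrame({
    "city": ["NY", "NY", None, "LA"],
    "year": [2020, 2020, 2021, 2021],
})

# Default behavior: rows with a missing key are excluded from every group
default_counts = df_keys.groupby(["city", "year"]).size()

# dropna=False keeps the missing key as its own group
with_missing = df_keys.groupby(["city", "year"], dropna=False).size()

print(default_counts.sum())  # 3 rows counted
print(with_missing.sum())    # 4 rows counted
```

Comparing the two totals is a quick way to detect rows that silently fall out of a grouped count.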
Count Non-Null Values in a Specific Column
Use .count() on a specific column to exclude NaN values from the count.
# Count non-null sales entries per city-year group
sales_counts = df.groupby(["city", "year"])["sales"].count()
print(sales_counts)
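The example DataFrame has no missing sales values, so .count() and .size() agree on it. The difference only shows up when the counted column contains NaN, as in this sketch with a made-up df_nan:

```python
import numpy as np
import pandas as pd

# Illustrative data: one NY/2020 row has a missing sales value
df_nan = pd.DataFrame({
    "city": ["NY", "NY", "LA", "LA"],
    "year": [2020, 2020, 2021, 2021],
    "sales": [100.0, np.nan, 200.0, 250.0],
})

grouped_sales = df_nan.groupby(["city", "year"])["sales"]

# Put both counts side by side, aligned on the (city, year) index
comparison = pd.DataFrame({
    "size": grouped_sales.size(),    # all rows in the group
    "count": grouped_sales.count(),  # only rows where sales is not NaN
})
print(comparison)
```

For the ("NY", 2020) group, size is 2 while count is 1; for ("LA", 2021) both are 2.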
Convert to Flat DataFrame with reset_index()
The .reset_index() method converts the MultiIndex result into a regular DataFrame with named columns.
# Convert Series to DataFrame with custom column name
result = counts.reset_index(name="group_count")
print(result)
Output:
  city  year  group_count
0   LA  2020            1
1   LA  2021            2
2   NY  2020            2
3   NY  2021            1
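An alternative that skips the MultiIndex entirely is to pass as_index=False to groupby. This sketch assumes pandas 1.1 or later, where .size() on such a grouper returns a DataFrame with the counts in a column named size.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "NY", "LA", "LA", "NY", "LA"],
    "year": [2020, 2020, 2021, 2021, 2021, 2020],
    "sales": [100, 150, 200, 250, 300, 400],
})

# Keys stay as regular columns instead of becoming index levels
flat = df.groupby(["city", "year"], as_index=False).size()
print(list(flat.columns))  # ['city', 'year', 'size']

# Rename the count column if "size" is not descriptive enough
flat = flat.rename(columns={"size": "group_count"})
print(flat)
```

Both routes produce the same table; reset_index(name=...) is handy when you already have the Series, while as_index=False avoids the intermediate step.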
Group by Three Columns and Aggregate
Extend the pattern to three or more columns by adding elements to the list.
# Group by three columns and sum sales
sum_sales = df.groupby(["city", "year", "product"])["sales"].sum()
print(sum_sales)
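Counts and other aggregates can also be computed in one pass with named aggregation. The sketch below is one possible way to do it; the output column names n_rows, n_sales, and total_sales are chosen here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "NY", "LA", "LA", "NY", "LA"],
    "year": [2020, 2020, 2021, 2021, 2021, 2020],
    "sales": [100, 150, 200, 250, 300, 400],
})

# Each keyword maps an output column to (input column, aggregation)
summary = df.groupby(["city", "year"]).agg(
    n_rows=("sales", "size"),      # rows per group
    n_sales=("sales", "count"),    # non-null sales per group
    total_sales=("sales", "sum"),  # summed sales per group
)
print(summary)
```

Because the example data has no missing sales, n_rows and n_sales are identical here; they diverge as soon as the column contains NaN.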
Summary
- Pandas groupby accepts a list of column names to group by multiple columns: df.groupby(['col1', 'col2']).
- The result is indexed by a MultiIndex with one level per grouping column.
- Use .size() to count all rows per group, including rows with NaN in non-key columns.
- Use .count() on a specific column to count only non-null values.
- Core implementation files include pandas/core/frame.py, pandas/core/groupby/groupby.py, and the Cython-optimized pandas/_libs/hashtable.pyx.
Frequently Asked Questions
What is the difference between .size() and .count() in pandas groupby?
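.size() returns the total number of rows in each group, regardless of whether the other columns in those rows contain NaN, and returns a Series with the group labels as the index. .count() returns the number of non-null values for each column (or for one column if you select it first), so rows with NaN in the counted column are left out of the tally. In both cases, rows whose grouping keys are NaN are excluded unless you pass dropna=False to groupby.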
How do I convert the MultiIndex result from groupby back to a regular DataFrame?
Call .reset_index() on the resulting Series or DataFrame. This converts the MultiIndex levels into regular columns. You can also pass name='count' to reset_index when working with a Series to name the values column appropriately.
Can I group by more than two columns using the same method?
Yes, you can group by any number of columns by extending the list passed to groupby. For example, df.groupby(['col1', 'col2', 'col3', 'col4']).size() works identically to the two-column case, creating a MultiIndex with four levels.
Why is pandas groupby with multiple columns fast even on large datasets?
Pandas optimizes grouping operations through Cython code such as pandas/_libs/hashtable.pyx, which uses hash tables to map group keys to integer codes without Python loop overhead. After that encoding, counting rows per group is a single pass over integer arrays, so the cost grows roughly linearly with the number of rows rather than with the number of Python-level operations.