How to Use pandas groupby count and mean to Compute Group Statistics
Use the agg() method with ['count', 'mean'] to calculate both non-null counts and arithmetic averages for each group in a single vectorized operation.
The pandas-dev/pandas repository implements grouping logic through the GroupBy family of classes located in pandas/core/groupby/. When you call DataFrame.groupby(), you receive a DataFrameGroupBy object that provides efficient, Cython-backed aggregation methods. Understanding how to combine pandas groupby count operations with statistical reductions like mean allows you to generate comprehensive group summaries without iterative Python loops.
Computing Count and Mean with agg()
The most direct way to obtain both statistics at once is to pass a list of function names to the agg() method. pandas applies each function to every non-key column and returns the results aligned on the group keys, with a two-level column index of (source column, statistic).
import pandas as pd
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B', 'C'],
    'value1': [10, 15, 10, None, 30, 25],
    'value2': [1, 2, 3, 4, 5, 6]
})
# Calculate count of non-null values and mean for each column per group
result = df.groupby('category').agg(['count', 'mean'])
print(result)
         value1       value2
          count  mean  count mean
category
A             2  12.5      2  1.5
B             2  20.0      3  4.0
C             1  25.0      1  6.0
As shown above, count excludes NaN values (category B shows 2 for value1 despite having 3 rows), while mean computes only on non-missing entries.
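To make the NaN handling concrete, the following sketch checks that dividing the per-group sum by the per-group non-null count reproduces mean(). It reuses the example data from above; the variable names manual_mean and builtin_mean are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B', 'C'],
    'value1': [10, 15, 10, None, 30, 25],
})

grouped = df.groupby('category')['value1']

# sum() skips NaN just like count() and mean() do, so
# sum / count gives the average over non-missing entries only
manual_mean = grouped.sum() / grouped.count()
builtin_mean = grouped.mean()

print(manual_mean.tolist())
print(builtin_mean.tolist())
```

For category B the missing entry is left out of both the numerator and the denominator, which is why the average is 20.0 rather than 40 / 3.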
count() vs size(): Choosing the Right Row Counter
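Groupby objects expose two distinct methods for counting rows. The locations below are approximate: line numbers shift between releases, and recent pandas versions define both methods on the GroupBy base class in pandas/core/groupby/groupby.py rather than in generic.py.
- count() (line ~1315): Returns the number of non-null values per column for each group, implemented via _groupby_agg with the "count" operation.
- size() (line ~1365): Returns the total number of rows per group regardless of missing values, including NaN entries.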
When you need the raw group size rather than valid observations, use size():
# Total rows per group (includes NaN rows)
group_sizes = df.groupby('category').size()
print(group_sizes)
category
A    2
B    3
C    1
dtype: int64
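Because size() counts every row and count() counts only non-null values, their difference gives the number of missing values per group. The sketch below assumes the same example data; the column names rows, valid_value1, and missing_value1 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B', 'C'],
    'value1': [10, 15, 10, None, 30, 25],
    'value2': [1, 2, 3, 4, 5, 6]
})

grouped = df.groupby('category')

# Put total rows and non-null counts side by side, aligned on the group key
summary = pd.DataFrame({
    'rows': grouped.size(),
    'valid_value1': grouped['value1'].count(),
})

# Rows in the group that have no usable value1
summary['missing_value1'] = summary['rows'] - summary['valid_value1']
print(summary)
```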
Named Aggregations for Custom Output Columns
For cleaner column names, use named aggregations (available since pandas 0.25) within agg(). Each keyword argument becomes an output column name, and its value is a (source column, function) tuple, so you choose the result names, the source columns, and the functions in one place.
custom_stats = df.groupby('category').agg(
    row_count=('value1', 'size'),     # Equivalent to size()
    valid_count=('value1', 'count'),  # Non-null count
    avg_value=('value1', 'mean')      # Arithmetic mean
)
print(custom_stats)
          row_count  valid_count  avg_value
category
A                 2            2       12.5
B                 3            2       20.0
C                 1            1       25.0
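The tuples above can also be written with pd.NamedAgg, a namedtuple with column and aggfunc fields, which some readers find more self-documenting. This is a small sketch on the same example data; the output names avg_value and max_value2 are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B', 'C'],
    'value1': [10, 15, 10, None, 30, 25],
    'value2': [1, 2, 3, 4, 5, 6]
})

# pd.NamedAgg(column=..., aggfunc=...) is interchangeable with the
# plain (column, function) tuple form
stats = df.groupby('category').agg(
    avg_value=pd.NamedAgg(column='value1', aggfunc='mean'),
    max_value2=pd.NamedAgg(column='value2', aggfunc='max'),
)
print(stats)
```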
Implementation Details in pandas Source Code
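According to the pandas source code, both count() and mean() leverage the same private method _groupby_agg defined around line 1191 in pandas/core/groupby/generic.py. This method builds a Cython-backed aggregation plan located in pandas/core/groupby/ops.py, which iterates over group indices and applies the requested reduction. Private method names and line numbers change between pandas releases, so treat these locations as approximate and check them against the version you have installed.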
The agg() method loops over your supplied function list, reuses this aggregation engine for each statistic, and concatenates results along a hierarchical column index. This architecture ensures that separate calls to count() and mean() share the same group boundary calculations, minimizing overhead when computing multiple statistics.
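The observable side of this behavior can be checked without reading the internals. The sketch below compares each column of the agg(['count', 'mean']) result against the corresponding standalone count() and mean() calls on the same GroupBy object. It only demonstrates that the outputs agree; it does not prove anything about which internal kernels are shared.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B', 'C'],
    'value1': [10, 15, 10, None, 30, 25],
    'value2': [1, 2, 3, 4, 5, 6]
})

grouped = df.groupby('category')
combined = grouped.agg(['count', 'mean'])

# Compare every (column, statistic) pair with the standalone method call
matches = {}
for col in ['value1', 'value2']:
    matches[(col, 'count')] = (
        combined[(col, 'count')].tolist() == grouped[col].count().tolist()
    )
    matches[(col, 'mean')] = (
        combined[(col, 'mean')].tolist() == grouped[col].mean().tolist()
    )

print(matches)
```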
Summary
- Use .agg(['count', 'mean']) to compute multiple statistics in one call, returning a DataFrame with hierarchical columns.
- count() tallies only non-null values per column, while size() counts all rows per group including those with missing data.
- Named aggregations (agg(name=('column', 'func'))) produce clean, flat column headers instead of MultiIndex columns.
- The underlying implementation in pandas/core/groupby/generic.py uses Cython-optimized kernels in ops.py for efficient group-wise calculations.
Frequently Asked Questions
What is the difference between count and size in pandas groupby?
count() returns the number of non-null values for each column within the group, excluding NaN entries. size() returns the total number of rows in each group regardless of missing values. Use count() when analyzing data completeness and size() when you need the raw group cardinality.
How do I calculate different statistics for different columns?
Pass a dictionary to agg() where keys are column names and values are functions or lists of functions. For example: df.groupby('key').agg({'col1': 'mean', 'col2': ['count', 'sum']}). This computes the mean for col1 and both count and sum for col2.
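A runnable version of that call might look like the following. The column names key, col1, and col2 come from the example in the answer; the data values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['x', 'x', 'y'],
    'col1': [1.0, 3.0, 5.0],
    'col2': [10, 20, 30],
})

# Mean for col1 only; count and sum for col2 only
out = df.groupby('key').agg({'col1': 'mean', 'col2': ['count', 'sum']})
print(out)

# Because one of the values is a list, the columns are (column, function) pairs
print(list(out.columns))
```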
Why does my groupby count return fewer rows than expected?
The count() method ignores NaN values by design. If your data contains missing values, the count will reflect only valid observations. Switch to size() if you need to count all rows including those with null data.
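A related cause, not covered above, is missing values in the grouping key itself: by default groupby() drops rows whose key is NaN, so even size() will not see them. The sketch below uses a small made-up frame and the dropna parameter of groupby() (available since pandas 1.1) to keep those rows.

```python
import pandas as pd

# Hypothetical data: one row has no category at all
df = pd.DataFrame({
    'category': ['A', None, 'B', 'B'],
    'value': [1.0, 2.0, None, 4.0],
})

# Default behavior: the row with a missing key is excluded from every group
default_sizes = df.groupby('category').size()

# dropna=False keeps it as its own NaN group
all_sizes = df.groupby('category', dropna=False).size()

print(default_sizes.sum(), all_sizes.sum())
```

If the two totals differ, some rows are being dropped because of missing keys rather than missing values.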
How can I flatten the MultiIndex columns after using agg?
After calling .agg(['count', 'mean']), the resulting columns are a MultiIndex. Flatten them by joining the levels: result.columns = ['_'.join(col).strip() for col in result.columns.values] or by using named aggregations to avoid creating a MultiIndex initially.
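As a short sketch of the first option, using the example data from earlier in the article:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B', 'C'],
    'value1': [10, 15, 10, None, 30, 25],
    'value2': [1, 2, 3, 4, 5, 6]
})

result = df.groupby('category').agg(['count', 'mean'])

# Each column label is a (source column, statistic) tuple; join it into one string
result.columns = ['_'.join(col) for col in result.columns]
print(list(result.columns))
```

The joined names follow the pattern column_statistic, for example value1_count and value1_mean.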