How to Use pandas groupby to Aggregate DataFrame Rows into Lists
To group DataFrame rows into lists using pandas groupby, use the agg(list) method on a GroupBy object, which collects all values in each group into Python lists for each column.
The pandas groupby functionality in the pandas-dev/pandas repository provides a powerful way to split DataFrames into subgroups and aggregate them. The GroupBy class, defined in pandas/core/groupby/groupby.py, is lazy: creating it only records how rows map to groups, and the per-group work happens when an aggregation function is called. Passing the Python built-in list as that function collects each group's values into a list.
Understanding the pandas groupby Architecture
The GroupBy machinery resides primarily in pandas/core/groupby/groupby.py and handles the heavy lifting of splitting, applying, and combining data. When you call df.groupby(keys), pandas builds an index of group labels and stores the positions of rows belonging to each group without immediately processing the data.
The aggregation flow works as follows:
- aggregate (alias agg): dispatches to the appropriate aggregation logic based on the supplied function(s)
- Generic Python callables such as list: applied to each column's values within each group, producing one list per group and column
- SeriesGroupBy / DataFrameGroupBy: subclasses in pandas/core/groupby/generic.py that provide the Series-specific and DataFrame-specific aggregation entry points
- String aliases such as 'sum' or 'mean': resolved to pandas' optimized implementations; list has no string alias and is passed as the callable itself
The names of the private helpers behind these steps differ between pandas versions, so treat them as implementation details. The GroupBy object is lazy, so no per-group work happens until you call an aggregation. Once you do, agg(list) builds a Python list for every group, so the result holds a copy of the grouped values as Python objects.
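The bookkeeping described above can be inspected through public attributes. The following is a minimal sketch, assuming a small sample frame invented for illustration; it relies only on the public indices and ngroups attributes, not on any private helper.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "value": [10, 20, 30, 40, 50],
})

# Creating the GroupBy object does not compute any aggregate yet
gb = df.groupby("category")
print(type(gb).__name__)  # DataFrameGroupBy

# indices maps each group label to the integer positions of its rows
positions = {key: idx.tolist() for key, idx in gb.indices.items()}
print(positions)
print(gb.ngroups)
```

Only when an aggregation such as agg(list) is called does pandas use these positions to pull out each group's values.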
Method 1: Aggregate All Columns into Lists with agg(list)
The most direct way to collect all rows per group into lists is calling agg(list) on the GroupBy object. Because list is an ordinary Python callable rather than one of the named reductions (such as sum or mean), pandas takes its generic aggregation path and calls list on each column's values within each group.
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "value": [10, 20, 30, 40, 50],
    "detail": ["x", "y", "z", "w", "v"],
})

# Aggregate entire rows into lists per group
grouped = df.groupby("category").agg(list)
print(grouped)
This produces a DataFrame where each cell contains a list of values from that group:
             value  detail
category
A         [10, 20]  [x, y]
B         [30, 40]  [z, w]
C             [50]     [v]
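To make the result above more concrete, here is a sketch of a plain-Python loop that produces the same lists by iterating over the groups. It is a conceptual equivalent, not the code path pandas actually runs; the names manual and flat are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "value": [10, 20, 30, 40, 50],
    "detail": ["x", "y", "z", "w", "v"],
})

grouped = df.groupby("category").agg(list)

# Conceptual equivalent: iterate over (key, sub-frame) pairs and
# convert each column of the sub-frame to a list
manual = {
    key: {col: sub[col].tolist() for col in ["value", "detail"]}
    for key, sub in df.groupby("category")
}
print(manual["A"])

# The group key is the index of the result; reset_index turns it
# back into a regular column
flat = grouped.reset_index()
print(flat.columns.tolist())
```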
Method 2: Aggregate a Single Column Using SeriesGroupBy
When you only need to aggregate one column, select the column before calling agg(list). This creates a SeriesGroupBy object that returns a Series with list values rather than a full DataFrame.
# Collect only a single column as a list per group
list_per_category = df.groupby("category")["value"].agg(list)
print(list_per_category)
Output:
category
A    [10, 20]
B    [30, 40]
C        [50]
Name: value, dtype: object
This approach is more memory-efficient than aggregating all columns when you only need specific data.
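A Series of lists is often an intermediate step. The sketch below shows two common follow-ups using public Series methods: to_dict to get a plain lookup table, and explode to go back to one row per value. The variable names lookup and restored are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "value": [10, 20, 30, 40, 50],
})

list_per_category = df.groupby("category")["value"].agg(list)

# Convert to a plain dict: group label -> list of values
lookup = list_per_category.to_dict()
print(lookup)

# explode reverses the aggregation: one row per list element,
# with the group label repeated in the index
restored = list_per_category.explode()
print(restored)
```

Note that explode returns an object-dtype Series, so cast it back (for example with astype(int)) if you need the original numeric dtype.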
Method 3: Mixed Aggregations with Column-Specific Functions
The agg API supports dictionaries that map different columns to different aggregation functions (in recent pandas versions the dictionary handling is shared with DataFrame.agg in pandas/core/apply.py). This allows you to collect some columns into lists while applying scalar aggregations (like sum or mean) to others.
# Use different aggregations for different columns
result = df.groupby("category").agg({
    "value": "sum",   # numeric aggregation
    "detail": list,   # collect details as list
})
print(result)
Output:
          value  detail
category
A            30  [x, y]
B            70  [z, w]
C            50     [v]
Here list is passed as the Python built-in callable itself, not as a string; pandas does not define a 'list' alias. The string 'sum', by contrast, is a recognized alias that dispatches to pandas' optimized (Cython) sum implementation.
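If you also want to rename the output columns, named aggregation is an alternative to the dictionary form. This is a sketch using the keyword syntax output_name=(column, function); the output names total and details are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "value": [10, 20, 30, 40, 50],
    "detail": ["x", "y", "z", "w", "v"],
})

# Named aggregation: each keyword is an output column, each tuple is
# (source column, aggregation function or alias)
result = df.groupby("category").agg(
    total=("value", "sum"),
    details=("detail", list),
)
print(result)
```

This form also lets the same source column appear more than once under different output names.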
Method 4: Custom List Building with apply
For scenarios requiring custom logic beyond simple list aggregation, use the apply method, which is defined on the GroupBy class in pandas/core/groupby/groupby.py. This method provides access to the full sub-DataFrame for each group, allowing complex transformations before returning lists.
def rows_to_list(sub_df):
    # Return the entire sub-frame as a list of row Series
    return [row for _, row in sub_df.iterrows()]

# apply calls rows_to_list once per group with that group's sub-DataFrame
custom = df.groupby("category").apply(rows_to_list)
print(custom)
Note that apply materializes each subgroup as a DataFrame, making it less performant than agg(list) for simple list collection, but necessary for custom row structures.
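As a variation, the sketch below builds each group's rows as a list of plain dictionaries instead of row Series, which is usually easier to serialize. Selecting the value columns before apply is a choice made here so that the function sees the same columns across pandas versions; the name records is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "value": [10, 20, 30, 40, 50],
    "detail": ["x", "y", "z", "w", "v"],
})

# Select the non-key columns explicitly, then turn each group's
# sub-frame into a list of {column: value} dictionaries
records = (
    df.groupby("category")[["value", "detail"]]
      .apply(lambda sub_df: sub_df.to_dict("records"))
)
print(records["A"])
```

The result is a Series indexed by category in which each element is a list with one dictionary per original row.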
Summary
- agg(list) is the most direct method for grouping DataFrame rows into lists; it applies the Python list callable to each column within each group
- Selective aggregation via df.groupby(key)[column].agg(list) minimizes memory overhead by processing only necessary columns
- Mixed aggregations allow combining list collection with scalar functions using dictionary syntax
- The GroupBy object is lazy, so no per-group work happens until an aggregation is called; the resulting lists are stored as Python objects, so aggregate only the columns you need on large datasets
- apply() offers flexibility for custom list constructions but incurs higher computational overhead than agg
Frequently Asked Questions
How does pandas groupby handle list aggregation internally?
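When you pass list to the agg method, pandas treats it as a generic Python callable rather than a named reduction. The aggregation entry points live on the SeriesGroupBy and DataFrameGroupBy classes in pandas/core/groupby/generic.py, built on the GroupBy base class in pandas/core/groupby/groupby.py. For each group, pandas calls list on each column's values and assembles the results into the output, indexed by the group keys. The private helpers involved differ between pandas versions.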
What is the difference between using agg(list) and apply(list) in pandas groupby?
agg(list) applies list to each column within each group and returns one list per group and column. apply passes each group's whole object to the function: a Series when a single column is selected, or the sub-DataFrame otherwise. Calling list on a DataFrame yields its column labels, so apply(list) matches agg(list) only on a SeriesGroupBy. apply is defined in pandas/core/groupby/groupby.py and offers more flexibility, but with higher per-group overhead.
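The difference is easiest to see side by side. This is a small sketch on sample data invented for illustration; it selects columns explicitly before apply so the behavior does not depend on how a given pandas version treats the grouping column.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "value": [10, 20, 30, 40, 50],
    "detail": ["x", "y", "z", "w", "v"],
})

# On a single selected column, both calls collect the values
by_agg = df.groupby("category")["value"].agg(list)
by_apply = df.groupby("category")["value"].apply(list)
print(by_agg.tolist() == by_apply.tolist())

# On several columns, apply hands list() a sub-DataFrame, and
# iterating a DataFrame yields its column labels, not its rows
labels = df.groupby("category")[["value", "detail"]].apply(list)
print(labels["A"])
```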
Can I group rows into lists while keeping other columns as scalar values?
Yes, use a dictionary with the agg method to specify different aggregation functions per column. For example, df.groupby('key').agg({'col1': list, 'col2': 'sum', 'col3': 'mean'}) collects col1 into lists while computing scalar sums and means for the other columns.
Is pandas groupby memory efficient when aggregating large datasets into lists?
Partly. The GroupBy object itself is lazy and cheap to create, but the source DataFrame is already in memory, and agg(list) builds a Python list for every group and column. The result is stored with object dtype, which typically uses more memory than the original typed columns. Selecting only the columns you need before aggregating keeps that overhead down.