How to Use pandas groupby to Aggregate DataFrame Rows into Lists

To group DataFrame rows into lists using pandas groupby, use the agg(list) method on a GroupBy object, which collects all values in each group into Python lists for each column.

The groupby functionality in the pandas-dev/pandas repository splits a DataFrame into subgroups and aggregates them. The GroupBy class in pandas/core/groupby/groupby.py is lazy: calling groupby only records how the data should be split, and the per-group work runs when you call an aggregation. Collecting each group's values into Python lists is then a single agg(list) call.

Understanding the pandas groupby Architecture

The GroupBy machinery resides primarily in pandas/core/groupby/groupby.py and handles the heavy lifting of splitting, applying, and combining data. When you call df.groupby(keys), pandas builds an index of group labels and stores the positions of rows belonging to each group without immediately processing the data.

The aggregation flow works as follows:

  • aggregate / agg: Entry point defined on SeriesGroupBy and DataFrameGroupBy in pandas/core/groupby/generic.py; dispatches based on the supplied function(s)
  • String aliases such as 'sum' or 'mean': Resolve to the GroupBy methods of the same name, which use optimized compiled code paths
  • Python callables such as list: Applied to each group's values, column by column, in a Python-level loop, producing one list per group
  • The built-in list: Must be passed as the callable itself; the string 'list' is not a recognized aggregation name

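Because aggregation is deferred, creating a GroupBy object is cheap and per-group work happens only when you call an aggregation. The resulting lists are ordinary Python objects stored in object-dtype columns, so the aggregated result can take more memory than the original numeric columns.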
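The deferred behaviour described above can be observed directly. The following is a minimal sketch, assuming a small hypothetical frame named df; ngroups and indices are public GroupBy attributes that expose group metadata without aggregating any column values.

```python
import pandas as pd

# Hypothetical example data, used only for illustration.
df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "value": [10, 20, 30, 40, 50],
})

# Creating the GroupBy object does not aggregate anything yet.
gb = df.groupby("category")

# Group metadata: number of groups and the row positions of each group.
n_groups = gb.ngroups
positions = {key: pos.tolist() for key, pos in gb.indices.items()}

# The aggregation itself only runs when agg is called.
lists = gb["value"].agg(list)

print(n_groups)
print(positions)
print(lists)
```

The same GroupBy object can be reused for several aggregations, since the split is described once and applied on demand.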

Method 1: Aggregate All Columns into Lists with agg(list)

The most direct way to collect all rows per group into lists is calling agg(list) on the GroupBy object. Because list is a plain Python callable, pandas takes its generic aggregation path: it loops over the groups and applies the list constructor to each column's values.

import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "value": [10, 20, 30, 40, 50],
    "detail": ["x", "y", "z", "w", "v"]
})

# Aggregate each column's values into one list per group
grouped = df.groupby("category").agg(list)
print(grouped)

This produces a DataFrame where each cell contains a list of values from that group:


             value  detail
category
A         [10, 20]  [x, y]
B         [30, 40]  [z, w]
C             [50]     [v]
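A common follow-up is turning the group key back into a regular column, or reversing the aggregation. The sketch below assumes the same example frame as above; reset_index and explode are standard DataFrame methods, and exploding several columns at once requires pandas 1.3 or newer.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "value": [10, 20, 30, 40, 50],
    "detail": ["x", "y", "z", "w", "v"],
})

grouped = df.groupby("category").agg(list)

# Move the group key from the index back into a regular column.
flat_keys = grouped.reset_index()

# Reverse the aggregation: one row per list element again.
# The lists in "value" and "detail" have matching lengths per row,
# which explode requires when given several columns.
restored = flat_keys.explode(["value", "detail"], ignore_index=True)

print(flat_keys)
print(restored)
```

After explode the columns have object dtype, so cast them back (for example with astype) if the original numeric dtype matters.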

Method 2: Aggregate a Single Column Using SeriesGroupBy

When you only need to aggregate one column, select the column before calling agg(list). This creates a SeriesGroupBy object that returns a Series with list values rather than a full DataFrame.


# Collect only a single column as a list per group
list_per_category = df.groupby("category")["value"].agg(list)
print(list_per_category)

Output:


category
A    [10, 20]
B    [30, 40]
C        [50]
Name: value, dtype: object

This approach is more memory-efficient than aggregating all columns when you only need specific data.
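When the goal is a plain lookup table rather than a pandas object, the Series of lists converts directly to a dictionary. A minimal sketch, assuming the same example data; lookup is a hypothetical variable name.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "value": [10, 20, 30, 40, 50],
})

list_per_category = df.groupby("category")["value"].agg(list)

# A Series of lists converts to a dict keyed by group label.
lookup = list_per_category.to_dict()
print(lookup)  # {'A': [10, 20], 'B': [30, 40], 'C': [50]}
```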

Method 3: Mixed Aggregations with Column-Specific Functions

The agg method, defined for DataFrameGroupBy in pandas/core/groupby/generic.py, accepts a dictionary that maps columns to aggregation functions. This allows you to collect some columns into lists while applying scalar aggregations (like sum or mean) to others.


# Use different aggregations for different columns
result = df.groupby("category").agg({
    "value": "sum",   # numeric aggregation
    "detail": list,   # collect details as a list
})
print(result)

Output:


          value  detail
category
A            30  [x, y]
B            70  [z, w]
C            50     [v]

Note that list is passed here as the built-in callable, not as the string 'list', which pandas does not recognize as an aggregation name. The string 'sum', by contrast, is an alias for the optimized built-in GroupBy sum.
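The dictionary form keeps the original column names. If you would rather name the output columns yourself, named aggregation is an alternative; the sketch below assumes the same example frame, and the output names total and details are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "value": [10, 20, 30, 40, 50],
    "detail": ["x", "y", "z", "w", "v"],
})

# Named aggregation: output_name=(input_column, function).
# "sum" is a string alias; list is passed as the callable itself.
result = df.groupby("category").agg(
    total=("value", "sum"),
    details=("detail", list),
)
print(result)
```

This form also lets you aggregate the same input column more than once under different output names.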

Method 4: Custom List Building with apply

For scenarios requiring custom logic beyond simple list aggregation, use the apply method, defined on the GroupBy class in pandas/core/groupby/groupby.py. It passes the full sub-DataFrame for each group to your function, allowing complex transformations before returning lists.

def rows_to_list(sub_df):
    # Return the entire sub-frame as a list of row Series
    return [row for _, row in sub_df.iterrows()]

custom = df.groupby("category").apply(rows_to_list)
print(custom)

Note that apply materializes each subgroup as a DataFrame, making it less performant than agg(list) for simple list collection, but necessary for custom row structures.
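If the custom row structure is simply one dictionary per row, plain iteration over the GroupBy object may be enough. Note that this swaps apply for a dictionary comprehension, so it is a different technique from the one above; records_by_category is a hypothetical name and the example frame is assumed to match the earlier one.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "value": [10, 20, 30, 40, 50],
    "detail": ["x", "y", "z", "w", "v"],
})

# Iterating a GroupBy yields (group_key, sub_dataframe) pairs.
# Each sub-frame is converted to a list of per-row dicts, with the
# grouping column dropped because the key already identifies it.
records_by_category = {
    key: sub_df.drop(columns="category").to_dict("records")
    for key, sub_df in df.groupby("category")
}
print(records_by_category["A"])
```

Each sub-frame is still materialized here, so this carries a per-group cost similar to apply.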

Summary

  • agg(list) is the most direct method for grouping DataFrame rows into lists; pandas applies the list constructor to each group's values for every column
  • Selective aggregation via df.groupby(key)[column].agg(list) minimizes memory overhead by processing only necessary columns
  • Mixed aggregations allow combining list collection with scalar functions using dictionary syntax; pass list as the callable, not as a string
  • GroupBy objects are lazy: no per-group work happens until an aggregation is called
  • apply() offers flexibility for custom list constructions but incurs higher computational overhead than agg

Frequently Asked Questions

How does pandas groupby handle list aggregation internally?

When you pass the built-in list to agg, the aggregate method defined in pandas/core/groupby/generic.py treats it as a generic Python callable. pandas applies the list constructor to each group's values, column by column, and assembles the per-group lists into the output DataFrame (or Series for a single column). The internal helpers involved vary between pandas versions.

What is the difference between using agg(list) and apply(list) in pandas groupby?

agg(list) applies list to each column's values within a group, so every output cell holds that group's values for one column. apply(list), implemented by GroupBy.apply in pandas/core/groupby/groupby.py, passes the whole sub-DataFrame to the function; because list(sub_df) iterates over column labels, on a DataFrameGroupBy it returns column names rather than row values. Use apply when you need the full sub-DataFrame for custom logic, and expect higher overhead than agg.
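The difference is easiest to see side by side. This is a minimal sketch, assuming the same example frame; the columns are selected explicitly so that the grouping column is not passed to the function in either call.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "value": [10, 20, 30, 40, 50],
    "detail": ["x", "y", "z", "w", "v"],
})

gb = df.groupby("category")[["value", "detail"]]

# agg calls list on each column of each group: lists of values.
by_agg = gb.agg(list)

# apply calls list on each sub-DataFrame: iterating a DataFrame
# yields its column labels, not its rows.
by_apply = gb.apply(list)

print(by_agg.loc["A", "value"])
print(by_apply["A"])
```

On a SeriesGroupBy the function receives a Series in both cases, so agg(list) and apply(list) both produce lists of values there.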

Can I group rows into lists while keeping other columns as scalar values?

Yes, use a dictionary with the agg method to specify different aggregation functions per column. For example, df.groupby('key').agg({'col1': list, 'col2': 'sum', 'col3': 'mean'}) collects col1 into lists while computing scalar sums and means for the other columns. The list entry must be the callable itself, not a string.

Is pandas groupby memory efficient when aggregating large datasets into lists?

Partly. Creating the GroupBy object is cheap because no per-group work happens until an aggregation is called, and selecting only the columns you need before agg(list) keeps the output small. The DataFrame itself must already be in memory, however, and the resulting lists are Python objects stored in object-dtype columns, which typically take more memory than the original numeric columns.
