How to Concatenate a List of pandas DataFrames Using pandas.concat
Use pandas.concat() to efficiently merge multiple DataFrames vertically or horizontally while controlling index alignment, column matching, and memory usage through parameters like axis, join, and ignore_index.
The pandas.concat() function serves as the primary method for combining collections of DataFrame objects in the pandas-dev/pandas library. Located in pandas/core/reshape/concat.py, this implementation handles complex edge cases such as mismatched columns, hierarchical indexing, and duplicate labels while leveraging C-extensions for high-performance data buffer operations.
Core Architecture and Source Implementation
The concatenation logic resides in pandas/core/reshape/concat.py, where the public concat() function validates its arguments and hands the work to internal helpers (in pandas 1.x and 2.x, a private _Concatenator class). These helpers normalize the inputs, compute the resulting axes, and build the final DataFrame by concatenating the underlying blocks. Most of the data movement happens in NumPy and compiled extension code, and all inputs are combined in a single pass, which makes one concat() call significantly faster than iterative approaches such as repeated DataFrame.append() calls.
Two additional internal modules support this process:
- pandas/core/internals/concat.py manages low-level block concatenation logic
- pandas/core/dtypes/concat.py reconciles dtype differences between input objects
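The effect of dtype reconciliation is visible from the public API. The sketch below uses made-up frames (ints, floats, texts) and only shows the resulting dtypes; it does not call the internal modules directly.

```python
import pandas as pd

# Illustrative inputs: the same column name with three different dtypes.
ints = pd.DataFrame({"x": [1, 2]})                      # int64
floats = pd.DataFrame({"x": [0.5, 1.5]})                # float64
texts = pd.DataFrame({"x": ["a", "b"]}, dtype=object)   # object

# int64 + float64 can be represented losslessly as float64.
numeric = pd.concat([ints, floats], ignore_index=True)

# int64 + object has no common numeric type, so the result falls back to object.
mixed = pd.concat([ints, texts], ignore_index=True)

print(numeric["x"].dtype)  # float64
print(mixed["x"].dtype)    # object
```

A silent fallback to object is often a sign that one input was parsed incorrectly, so checking dtypes after concatenation is a cheap sanity test.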
Essential Parameters for Concatenating DataFrames
Understanding the key parameters allows precise control over concatenation behavior:
- objs (required): An iterable of DataFrame or Series objects. Typically passed as a list: [df1, df2, df3].
- axis: The concatenation axis. Use 0 (default) for vertical stacking along rows, or 1 for horizontal concatenation along columns.
- join: String specifying how to handle non-matching labels on the other axis. 'outer' (default) takes the union of columns/indexes, while 'inner' keeps only the intersection.
- ignore_index: Boolean; when True, creates a new sequential integer index (0, 1, 2...) for the result, discarding the original index values. Useful when the original indices are meaningless after concatenation.
- keys: Sequence of labels, one per input, used as the outermost level of a hierarchical MultiIndex that tracks the source of each row.
- sort: Boolean; whether to sort the non-concatenation axis when it is not already aligned across inputs. Defaults to False to avoid unnecessary overhead.
- verify_integrity: Boolean; when True, checks for duplicate labels in the new axis and raises ValueError if duplicates exist.
- copy: Boolean; when False, asks pandas to avoid copying data where possible, so the output may share data with the inputs. This requires caution with mutable views. Under Copy-on-Write (the default from pandas 3.0), the keyword no longer has any effect.
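As a minimal sketch of how join and sort interact when stacking rows, the example below uses two small illustrative frames (left and right) whose columns only partly overlap.

```python
import pandas as pd

left = pd.DataFrame({"b": [1, 2], "a": [3, 4]})
right = pd.DataFrame({"a": [5, 6], "c": [7, 8]})

# join="outer" (default): union of columns, in order of first appearance.
outer = pd.concat([left, right], ignore_index=True)

# sort=True: same union, but with the column labels sorted.
outer_sorted = pd.concat([left, right], ignore_index=True, sort=True)

# join="inner": only the columns shared by every input.
inner = pd.concat([left, right], join="inner", ignore_index=True)

print(list(outer.columns))         # ['b', 'a', 'c']
print(list(outer_sorted.columns))  # ['a', 'b', 'c']
print(list(inner.columns))         # ['a']
print(outer["c"].isna().tolist())  # [True, True, False, False]
```

Cells that have no source value in the outer result are filled with NaN, which is why integer columns such as b and c come back as floating point.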
Practical Examples
Vertical Concatenation Along Axis 0
By default, pandas.concat() stacks DataFrames vertically along axis 0, preserving all columns from the union of inputs:
import pandas as pd
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})
result = pd.concat([df1, df2]) # axis=0 is default
print(result)
   A  B
0  1  3
1  2  4
0  5  7
1  6  8
Horizontal Concatenation with Different Columns
Specify axis=1 to concatenate DataFrames side-by-side. Use the join parameter to control whether to perform an outer or inner join on the indexes:
df3 = pd.DataFrame({"C": [9, 10]})
horiz = pd.concat([df1, df3], axis=1, join="outer")
print(horiz)
   A  B   C
0  1  3   9
1  2  4  10
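The example above has identical indexes, so outer and inner joins give the same result. The difference only shows when the row labels disagree, as in this sketch with two hypothetical frames (scores and ranks):

```python
import pandas as pd

scores = pd.DataFrame({"score": [90, 80, 70]}, index=["a", "b", "c"])
ranks = pd.DataFrame({"rank": [1, 2]}, index=["b", "c"])

# Outer join on the index: every row label is kept, gaps become NaN.
outer = pd.concat([scores, ranks], axis=1)

# Inner join on the index: only labels present in both frames survive.
inner = pd.concat([scores, ranks], axis=1, join="inner")

print(list(outer.index))                 # ['a', 'b', 'c']
print(pd.isna(outer.loc["a", "rank"]))   # True
print(list(inner.index))                 # ['b', 'c']
```

With axis=1, rows are matched by index label rather than by position, so frames that are merely the same length but carry different labels will not line up.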
Resetting Indexes with ignore_index
When original index values are meaningless after combination, set ignore_index=True to create a clean, continuous integer index:
ignore = pd.concat([df1, df2], ignore_index=True)
print(ignore)
   A  B
0  1  3
1  2  4
2  5  7
3  6  8
Creating Hierarchical Indexes with keys
Use the keys parameter to add a hierarchical level that identifies the source of each row:
keyed = pd.concat([df1, df2], keys=["first", "second"])
print(keyed)
          A  B
first  0  1  3
       1  2  4
second 0  5  7
       1  6  8
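Once the keys are in place, they can be used to pull one source back out or to turn the source label into an ordinary column. The sketch below also passes names to label the index levels; the level names "source" and "row" are arbitrary choices for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

keyed = pd.concat(
    [df1, df2],
    keys=["first", "second"],
    names=["source", "row"],  # labels for the two MultiIndex levels
)

# Select every row that came from df2, with its original index restored.
second_only = keyed.loc["second"]

# Move the source label out of the index and into a regular column.
flat = keyed.reset_index(level="source")

print(second_only["A"].tolist())  # [5, 6]
print(list(flat.columns))         # ['source', 'A', 'B']
```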
Efficiently Concatenating Large Lists
For long lists of DataFrames, pandas.concat() processes them in a single operation rather than iteratively copying data:
frames = [pd.DataFrame({"val": range(i, i+3)}) for i in range(0, 30, 3)]
big = pd.concat(frames, ignore_index=True)
print(big.head())
   val
0    0
1    1
2    2
3    3
4    4
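In practice the frames usually arrive one at a time from files or queries. A common pattern, sketched here with a hypothetical load_chunk() helper standing in for the real data source, is to collect them in a plain Python list and call pandas.concat() once at the end:

```python
import pandas as pd


def load_chunk(i):
    """Hypothetical stand-in for reading one file or one page of results."""
    return pd.DataFrame({"chunk": i, "val": range(i * 3, i * 3 + 3)})


frames = []
for i in range(10):
    # list.append is cheap: it stores a reference and copies no data.
    frames.append(load_chunk(i))

# One concatenation at the end, instead of one per loop iteration.
combined = pd.concat(frames, ignore_index=True)

print(combined.shape)  # (30, 2)
```

Concatenating inside the loop would re-copy all previously accumulated rows on every iteration, which is the cost this pattern avoids.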
Summary
- pandas.concat(), implemented in pandas/core/reshape/concat.py, is the recommended way to combine a list of DataFrames. A single call outperforms iterative append()-style code because all inputs are combined in one pass.
- The axis parameter controls vertical (0) versus horizontal (1) concatenation.
- Use ignore_index=True to generate a fresh sequential index when source indexes are irrelevant.
- The join parameter manages column alignment, with 'outer' preserving all columns and 'inner' keeping only shared columns.
- In pandas versions before 3.0, setting copy=False can reduce copying by sharing underlying data buffers, though this requires careful handling of mutable views.
Frequently Asked Questions
What is the difference between pandas.concat and DataFrame.append?
DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0. It was essentially a thin wrapper around pandas.concat() with limited functionality. concat() combines any number of DataFrames in a single pass, whereas calling append() in a loop built a new DataFrame, and copied all accumulated data, on every iteration.
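Code written against DataFrame.append() can usually be ported with a one-line change. The following is a minimal sketch for the common case of adding a single row held in a dict; the frame and row contents are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
new_row = {"A": 5, "B": 6}

# Old style, which no longer works after the removal in pandas 2.0:
#     df = df.append(new_row, ignore_index=True)

# Equivalent with concat: wrap the dict in a one-row DataFrame first.
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

print(df.shape)  # (3, 2)
```

If many rows are added this way, it is cheaper to gather the dicts in a list and build or concatenate the frame once.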
How do I concatenate DataFrames with different columns?
Use join='outer' (default) to include all columns from all DataFrames, filling missing values with NaN. Alternatively, use join='inner' to keep only columns present in every DataFrame. Control the axis by setting axis=1 for side-by-side concatenation.
Why should I use ignore_index when concatenating?
Set ignore_index=True when the original row identifiers from individual DataFrames become meaningless in the combined dataset. This creates a new integer index (0 to n-1) for the result, eliminating duplicate index values that might otherwise cause confusion or errors in downstream operations.
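The sketch below illustrates the kind of confusion duplicate labels can cause, and two ways to deal with it: failing fast with verify_integrity, or renumbering with ignore_index. Variable names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [5, 6]})

# Default behaviour keeps the original labels: 0, 1, 0, 1.
stacked = pd.concat([df1, df2])
rows_at_zero = stacked.loc[0]  # two rows come back, not one

# verify_integrity=True turns the duplicate labels into an explicit error.
overlap_error = None
try:
    pd.concat([df1, df2], verify_integrity=True)
except ValueError as exc:
    overlap_error = str(exc)

# ignore_index=True renumbers the rows 0..n-1, so every label is unique.
clean = pd.concat([df1, df2], ignore_index=True)

print(stacked.index.is_unique)  # False
print(rows_at_zero.shape)       # (2, 1)
print(clean.index.is_unique)    # True
```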
Is pandas.concat memory efficient?
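It is efficient relative to the alternatives: a single call builds the result in one pass instead of re-copying accumulated data at every step. The implementation in pandas/core/reshape/concat.py works on the underlying blocks and relies on NumPy and compiled code for buffer handling. In pandas versions before 3.0, copy=False asks pandas to avoid copies where it can; be cautious with it if you plan to modify the resulting DataFrame, as changes may propagate to the original inputs. Under Copy-on-Write (the default from pandas 3.0), the keyword has no effect.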