How to Combine Two DataFrames in pandas: concat, merge, and join Explained
Use pd.concat to stack DataFrames vertically or horizontally, DataFrame.merge for SQL-style joins on keys, or DataFrame.join for index-aligned operations.
The pandas library provides high-performance tools for combining datasets. Whether you need to stack tables, join on keys, or align by index, the pandas-dev/pandas repository implements these operations through optimized algorithms backed by the block manager. Understanding how to combine two DataFrames in pandas requires selecting the appropriate primitive based on your data alignment strategy.
Three Primary Methods to Combine DataFrames
Pandas offers three distinct approaches for combining DataFrames, each targeting a different alignment scenario.
pd.concat for Vertical and Horizontal Stacking
The pd.concat function stacks DataFrames along a specified axis. Its implementation lives in pandas/core/reshape/concat.py, where internal helpers (in many releases a private _Concatenator class) handle axis alignment, the keys argument, and hierarchical indexing.
- Axis 0 (vertical): Stacks rows, aligning on column labels; a column missing from one input is filled with NaN.
- Axis 1 (horizontal): Stacks columns, aligning on index values and generating NaN for mismatched keys.
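The hierarchical indexing mentioned above can be illustrated with the keys argument. This is a minimal sketch; the frame names q1 and q2 and the sales column are made up for illustration.

```python
import pandas as pd

# Two hypothetical frames with the same column layout.
q1 = pd.DataFrame({"sales": [10, 20]})
q2 = pd.DataFrame({"sales": [30, 40]})

# keys= labels each input; the labels become the outer level of a MultiIndex.
stacked = pd.concat([q1, q2], keys=["q1", "q2"])

print(stacked.index.nlevels)  # 2 levels: (key, original row label)

# Selecting one key recovers the rows that came from that input.
print(stacked.loc["q2"]["sales"].tolist())
```

Labelling inputs this way keeps track of where each row came from, which ignore_index=True would discard.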
DataFrame.merge for SQL-Style Relational Joins
Implemented in pandas/core/reshape/merge.py, the merge function (accessible as pd.merge or the DataFrame.merge method) performs relational joins. The underlying _MergeOperation class resolves join keys, applies suffixes to overlapping columns, and executes inner, left, right, outer, or cross joins.
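To see how the how parameter changes the result, the sketch below runs the same two small frames through four join types and counts the rows. The sample data is illustrative only.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": [1, 2, 3]})
right = pd.DataFrame({"key": ["K0", "K2", "K3"], "B": [4, 5, 6]})

# Number of result rows for each join type on identical inputs.
row_counts = {
    how: len(pd.merge(left, right, on="key", how=how))
    for how in ("inner", "left", "right", "outer")
}
print(row_counts)

# indicator=True adds a "_merge" column recording where each row came from:
# "both", "left_only", or "right_only".
outer = pd.merge(left, right, on="key", how="outer", indicator=True)
print(outer[["key", "_merge"]])
```

The indicator column is a convenient way to audit which keys failed to match before deciding on a join type.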
DataFrame.join for Index-Based Convenience
The join method, defined in pandas/core/frame.py, provides a simplified interface for joining on index values. When given a single DataFrame it delegates to merge, matching the other frame's index (right_index=True) against either the caller's index or the column named in the on argument. It is intended for the common case where the index serves as the join key.
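The relationship between join and merge can be checked directly. The sketch below compares an index-based join with the equivalent explicit merge call; how is passed explicitly on both sides because the defaults differ (join defaults to how='left', merge to how='inner').

```python
import pandas as pd

df_left = pd.DataFrame({"A": [1, 2]}, index=["a", "b"])
df_right = pd.DataFrame({"B": [3, 4]}, index=["b", "c"])

# Index-to-index join through the convenience method.
via_join = df_left.join(df_right, how="inner")

# The same operation spelled out with merge.
via_merge = pd.merge(
    df_left, df_right, left_index=True, right_index=True, how="inner"
)

print(via_join)
print(via_merge)
```

Both calls keep only the shared label b, so the two printed frames contain the same single row.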
Practical Code Examples
Vertical Concatenation with pd.concat
Stack DataFrames vertically while resetting the index:
```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'A': [3, 4], 'B': ['z', 'w']})

# ignore_index=True discards the original row labels and renumbers from 0
result = pd.concat([df1, df2], ignore_index=True)
print(result)
```
Output:
| | A | B |
|---|---|---|
| 0 | 1 | x |
| 1 | 2 | y |
| 2 | 3 | z |
| 3 | 4 | w |
Implementation path: pandas/core/reshape/concat.py → internal concat helpers → block manager concatenation in pandas/core/internals/concat.py.
Horizontal Concatenation with Different Indexes
Combine columns while aligning on indexes, generating NaN for mismatched keys:
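```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'B': [10, 20]}, index=['b', 'c'])

# axis=1 places the frames side by side, aligning rows on index labels
result = pd.concat([df1, df2], axis=1)
print(result)
```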
Result:
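| | A | B |
|---|---|---|
| a | 1.0 | NaN |
| b | 2.0 | 10.0 |
| c | NaN | 20.0 |

The integer columns are upcast to float because NaN cannot be stored in a plain integer column.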
Inner Join Using pd.merge
Perform a SQL-style inner join on a common key column:
```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                     'A': [1, 2, 3]})
right = pd.DataFrame({'key': ['K0', 'K2', 'K3'],
                      'B': [4, 5, 6]})

# Keep only the keys present in both frames
merged = pd.merge(left, right, on='key', how='inner')
print(merged)
```
Output:
| | key | A | B |
|---|---|---|---|
| 0 | K0 | 1 | 4 |
| 1 | K2 | 3 | 5 |
Implementation path: pandas/core/reshape/merge.py → merge function → _MergeOperation, which computes the join indexers and assembles the result.
Left Join with Suffixes for Overlapping Columns
Handle duplicate column names using the suffixes parameter:
```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1'],
                     'value': [1, 2]})
right = pd.DataFrame({'key': ['K0', 'K0'],
                      'value': [3, 4]})

# 'value' exists in both frames, so each copy gets its own suffix
joined = pd.merge(left, right, on='key', how='left', suffixes=('_L', '_R'))
print(joined)
```
Result:
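| | key | value_L | value_R |
|---|---|---|---|
| 0 | K0 | 1 | 3.0 |
| 1 | K0 | 1 | 4.0 |
| 2 | K1 | 2 | NaN |

K0 appears twice in the right frame, so its left row is repeated once per match; K1 has no match, so value_R is NaN and that column is upcast to float.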
Index-Based Join Using DataFrame.join
Merge on index values using the convenient join method:
```python
import pandas as pd

df_left = pd.DataFrame({'A': [1, 2]}, index=['a', 'b'])
df_right = pd.DataFrame({'B': [3, 4]}, index=['b', 'c'])

# how='outer' keeps every index label from both frames
joined = df_left.join(df_right, how='outer')
print(joined)
```
Output:
| | A | B |
|---|---|---|
| a | 1.0 | NaN |
| b | 2.0 | 3.0 |
| c | NaN | 4.0 |
For a single DataFrame, join delegates to merge in pandas/core/frame.py, matching the caller's index (or the column named in on) against the other frame's index.
Key Implementation Files
Understanding the source architecture helps optimize performance and debug edge cases:
| File | Purpose |
|---|---|
| pandas/core/reshape/concat.py | Core logic for pd.concat; handles axis alignment, keys, and hierarchical indexing. |
| pandas/core/reshape/merge.py | Implements relational join algorithms, parsing the how, on, and suffixes parameters. |
| pandas/core/frame.py | Defines DataFrame.join and the DataFrame.merge method that dispatches to the merge function. |
| pandas/core/internals/concat.py | Low-level block manager utilities that pd.concat uses to combine the underlying data blocks. |
All three methods align data automatically: concat and join match on index labels, while merge matches on key columns (or on indexes when left_index or right_index is set).
Summary
- Use pd.concat when stacking DataFrames vertically (axis=0) or horizontally (axis=1) along compatible axes, implemented in pandas/core/reshape/concat.py.
- Use DataFrame.merge (or pd.merge) for SQL-style joins on specific columns, supporting inner, left, right, outer, and cross joins via pandas/core/reshape/merge.py.
- Use DataFrame.join for index-based joins; it is defined in pandas/core/frame.py and delegates to merge when joining a single DataFrame.
- All three methods generate NaN for non-matching labels or keys unless an inner join is requested (how='inner' for merge and join, join='inner' for concat).
Frequently Asked Questions
What is the difference between merge and join in pandas?
merge is the general-purpose function for joining DataFrames on arbitrary columns or indexes, while join is a convenience method for index-based alignment. In pandas/core/frame.py, joining a single DataFrame delegates to merge with right_index=True and, unless on is given, left_index=True, making it syntactic sugar for the common case of merging on indexes. The defaults differ: join uses how='left', whereas merge uses how='inner'.
When should I use concat instead of merge to combine two DataFrames in pandas?
Use concat when you need to append DataFrames along an axis without matching keys—essentially stacking tables. Use merge when you need to align rows based on common key values. concat in pandas/core/reshape/concat.py handles vertical and horizontal stacking, while merge in pandas/core/reshape/merge.py performs relational algebra.
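The contrast is easiest to see when the same two frames go through both functions. In this sketch (sample data is illustrative), the key values are deliberately stored in a different row order in each frame.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1"], "A": [1, 2]})
right = pd.DataFrame({"key": ["K1", "K0"], "B": [3, 4]})

# concat(axis=1) pairs rows by index label (0 with 0, 1 with 1) and
# ignores the contents of "key", so K0 ends up next to K1.
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)

# merge pairs rows by the value stored in "key".
matched = pd.merge(left, right, on="key")
print(matched)
```

Note that the concat result contains two columns named key, one from each input, whereas merge returns a single key column.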
How do I handle overlapping column names when merging DataFrames?
Pass a tuple to the suffixes parameter in merge, such as suffixes=('_L', '_R'); the default is ('_x', '_y'). This disambiguates columns that exist in both DataFrames but are not used as join keys. The merge machinery in pandas/core/reshape/merge.py applies these suffixes when it builds the result.
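As a quick illustration, the sketch below merges two frames that share a non-key column named value, first with the default suffixes and then with custom ones. The suffix strings _left and _right are arbitrary choices.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1"], "value": [1, 2]})
right = pd.DataFrame({"key": ["K0", "K1"], "value": [3, 4]})

# Without suffixes=, pandas falls back to the defaults '_x' and '_y'.
default = pd.merge(left, right, on="key")
print(list(default.columns))  # ['key', 'value_x', 'value_y']

# Custom suffixes make the origin of each column explicit.
custom = pd.merge(left, right, on="key", suffixes=("_left", "_right"))
print(list(custom.columns))   # ['key', 'value_left', 'value_right']
```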
Why does combining DataFrames produce NaN values?
Missing values appear when index labels or join keys exist in one DataFrame but not the other. This occurs during outer alignment in concat (the default, join='outer') or during left, right, and outer joins in merge. Specify how='inner' in merge or join='inner' in concat to retain only matching keys. Note that ignore_index=True only renumbers the concatenation axis; it does not remove NaN values introduced by alignment.
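A small sketch of the difference between outer and inner alignment in concat, using two frames that share only the label b:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2]}, index=["a", "b"])
df2 = pd.DataFrame({"B": [10, 20]}, index=["b", "c"])

# Default join="outer": every label is kept and the gaps become NaN.
outer = pd.concat([df1, df2], axis=1)
print(int(outer.isna().sum().sum()))  # 2 missing cells

# join="inner": only labels present in both frames survive, so no NaN.
inner = pd.concat([df1, df2], axis=1, join="inner")
print(inner)
```

The inner result keeps only row b, at the cost of dropping rows a and c entirely, so the right choice depends on whether unmatched rows still carry useful data.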