How to Add Row to DataFrame Pandas: Best Practices for Large Datasets
The most efficient way to add rows to a pandas DataFrame is to accumulate data in a list or in small DataFrames and then perform a single construction or concatenation. This avoids repeated row-wise appends, each of which triggers an expensive BlockManager rebuild.
When working with large datasets, understanding the internal data structure implemented in the pandas-dev/pandas repository is crucial for performance. Pandas stores column data in contiguous NumPy arrays called blocks, managed by the low-level BlockManager. Each time you append a single row, pandas must reconstruct this entire manager, copying all n existing rows (an O(n) operation), which cripples performance at scale.
Why Row-by-Row Append Is Slow in Pandas
The naive approach of repeatedly calling df = df.append(row) (deprecated in pandas 1.4 and removed in 2.0) or similar row-wise methods triggers a deep internal rebuild. In pandas/core/frame.py, the private method _append_internal handles row appends by forwarding operations to the generic concatenation engine in pandas/core/reshape/concat.py【source】(https://github.com/pandas-dev/pandas/blob/main/pandas/core/frame.py#L14335).
This concatenation creates a brand-new DataFrame from the operands, allocating fresh memory and copying all existing data. When performed in a loop, this results in quadratic time complexity, because every iteration copies an ever-growing DataFrame.
The actual block-level concatenation occurs in pandas/core/internals/managers.py within concatenate_managers【source】(https://github.com/pandas-dev/pandas/blob/main/pandas/core/internals/managers.py#L1965), which stacks underlying NumPy buffers. While efficient for bulk operations, it cannot optimize repeated single-row calls.
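To make the quadratic growth concrete, here is a minimal sketch that tallies how many existing rows would be copied when a DataFrame is grown one row at a time. It assumes, as described above, that each concatenation copies every row already in the frame. The function name rows_copied_by_loop_concat is hypothetical and used only for illustration.

```python
import pandas as pd


def rows_copied_by_loop_concat(n_rows):
    """Grow a DataFrame row by row and count the rows re-copied along the way."""
    df = pd.DataFrame({"a": [0], "b": [0]})
    copied = 0
    for i in range(1, n_rows):
        new_row = pd.DataFrame({"a": [i], "b": [i * 2]})
        # Assumption: concat copies every existing row into the new result.
        copied += len(df)
        df = pd.concat([df, new_row], ignore_index=True)
    return df, copied


df_small, copied_small = rows_copied_by_loop_concat(100)
print(len(df_small), copied_small)  # 100 rows built, 4950 row copies tallied
```

The tally equals n(n-1)/2, so doubling the row count roughly quadruples the copying work, whereas a single construction touches each row once.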
Efficient Methods to Add Row to DataFrame Pandas
Collect Rows in a List and Construct Once
The most memory-efficient pattern avoids intermediate DataFrames entirely. Accumulate rows as dictionaries or Series in a Python list, then pass the entire list to the DataFrame constructor in a single operation.
import pandas as pd
# Efficient: Single construction from list of dicts
rows = [{'a': i, 'b': i * 2} for i in range(1_000_000)]
df = pd.DataFrame(rows)
This approach allows pandas to build the BlockManager structure in one pass through the data, eliminating the copy overhead of incremental growth.
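In practice rows often arrive one at a time rather than from a comprehension. The sketch below shows the same pattern inside a loop; parse_record and the sample input lines are hypothetical stand-ins for whatever produces your rows.

```python
import pandas as pd


def parse_record(line):
    """Hypothetical parser: turns 'name,score' into a row dictionary."""
    name, score = line.split(",")
    return {"name": name, "score": int(score)}


raw_lines = ["ada,90", "grace,85", "linus,77"]  # illustrative input only

rows = []
for line in raw_lines:
    # Appending to a Python list is cheap and does not touch any DataFrame.
    rows.append(parse_record(line))

# Build the DataFrame once, after all rows have been collected.
df = pd.DataFrame(rows, columns=["name", "score"])
print(df)
```

Passing columns explicitly fixes the column order and keeps the schema stable even if some dictionaries are missing keys.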
Accumulate DataFrames and Concatenate Once
When processing data in batches (such as chunked file reads or streaming API responses), accumulate small DataFrames in a list and perform a single pd.concat operation at the end.
# Efficient: Batch concatenation
chunks = []
for chunk in pd.read_csv('big_file.csv', chunksize=100_000):
    # Transformations applied per chunk
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)
The concat function in pandas/core/reshape/concat.py optimizes this by concatenating at the block level through concatenate_managers, reusing existing blocks where possible and minimizing memory copies compared to iterative appends.
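For a version of the batch pattern that runs without a file on disk, the following sketch reads an in-memory CSV in chunks. The sample data, the chunk size, and the even-number filter are illustrative assumptions, not part of the original example.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a large file.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Example per-chunk transformation: keep rows where 'a' is even.
    chunk = chunk[chunk["a"] % 2 == 0]
    chunks.append(chunk)

# One concatenation at the end; ignore_index rebuilds a clean 0..n-1 index.
df = pd.concat(chunks, ignore_index=True)
print(df)
```

Filtering inside the loop keeps each accumulated piece small, so the final concat only handles rows you actually need.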
Pre-allocate and Fill with .loc
When the final DataFrame size is known beforehand, pre-allocate the structure with a placeholder index and fill rows using .loc assignment. This avoids BlockManager reconstruction because the underlying arrays are already sized correctly.
import numpy as np
# Pre-allocate with target size and an integer dtype (avoids object-dtype columns)
N = 1_000_000
df = pd.DataFrame(0, index=range(N), columns=['a', 'b'])
# Fill via .loc (no reallocation when the index label already exists)
for i in range(N):
    df.loc[i] = [i, i * 2]
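This method writes into arrays that already exist rather than triggering the expensive reallocation path. It requires knowing the final row count in advance, and each .loc call still carries Python-level overhead, so it is best reserved for cases where rows genuinely must be filled one at a time.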
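A related variation, offered here as an alternative rather than as the method described above, is to pre-allocate a plain NumPy array, fill it, and wrap it in a DataFrame once at the end. The sketch assumes all columns share a single numeric dtype; N is kept small so the example runs quickly.

```python
import numpy as np
import pandas as pd

N = 1_000  # small size for illustration

# Pre-allocate one contiguous int64 buffer with a column per field.
buffer = np.empty((N, 2), dtype="int64")

for i in range(N):
    # Plain NumPy indexing avoids the per-call overhead of .loc.
    buffer[i, 0] = i
    buffer[i, 1] = i * 2

# Wrap the filled buffer in a DataFrame in a single step.
df = pd.DataFrame(buffer, columns=["a", "b"])
print(df.tail(3))
```

This only fits homogeneous numeric data; for mixed dtypes, one array per column passed as a dictionary would be the closer equivalent.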
Use Vectorized Column Assignment
For adding many rows that share column-wise patterns, use DataFrame.assign or direct column assignment rather than row-wise iteration. This leverages pandas' vectorized operations implemented in C.
# Vectorized column addition
N = 1_000_000
df = pd.DataFrame({'a': np.arange(N)})
df = df.assign(b=lambda x: x['a'] * 2) # Single vectorized operation
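When the new rows themselves follow a column-wise pattern, they can be generated as arrays and attached in one step. The sketch below is a small illustration under that assumption; the sizes and the names new_a and new_rows are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5)})
df["b"] = df["a"] * 2  # direct vectorized column assignment

# Generate three additional rows as whole columns, not one row at a time.
start = len(df)
new_a = np.arange(start, start + 3)
new_rows = pd.DataFrame({"a": new_a, "b": new_a * 2})

# Attach all new rows with a single concatenation.
df = pd.concat([df, new_rows], ignore_index=True)
print(df)
```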
What to Avoid
Never use DataFrame.append in performance-critical code. This method was deprecated in pandas 1.4 and removed in pandas 2.0, so it only exists in older releases. Internally, it is merely a thin wrapper around _append_internal that forces a full copy of the entire DataFrame on every call, resulting in quadratic execution time and a quadratic volume of memory copied over the life of the loop.
Similarly, avoid df.loc[len(df)] = ... without pre-allocation. While this pattern works for small data, it triggers the same BlockManager reconstruction as append when the index must be extended dynamically.
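If rows must be added to an existing DataFrame over time, one possible workaround is to queue them in a list and merge them with a single concat. The RowBuffer class below is a hypothetical helper written for this illustration; it is not part of pandas.

```python
import pandas as pd


class RowBuffer:
    """Hypothetical helper: queue rows in a list, merge them with one concat."""

    def __init__(self, df):
        self.df = df
        self._pending = []

    def add(self, row):
        # Cheap list append; the DataFrame is not touched here.
        self._pending.append(row)

    def flush(self):
        # Merge all queued rows at once, then clear the queue.
        if self._pending:
            new_rows = pd.DataFrame(self._pending, columns=self.df.columns)
            self.df = pd.concat([self.df, new_rows], ignore_index=True)
            self._pending = []
        return self.df


buf = RowBuffer(pd.DataFrame({"a": [0], "b": [0]}))
for i in range(1, 5):
    buf.add({"a": i, "b": i * 2})
df = buf.flush()
print(df)
```

Calling flush only when the combined frame is actually needed keeps the number of full copies proportional to the number of flushes rather than the number of rows.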
Summary
- Batch your operations: Collect rows in lists or small DataFrames and construct/concatenate once rather than appending iteratively.
- Pre-allocate when possible: Create DataFrames with final dimensions known and fill via .loc to avoid BlockManager rebuilds.
- Use vectorized operations: Leverage assign and column-wise operations instead of row-wise loops.
- Avoid deprecated methods: Never use DataFrame.append in production code; it forces expensive full copies on every call.
- Understand the internals: Row appends in pandas require BlockManager reconstruction in pandas/core/internals/managers.py, making single-row operations inherently expensive for large datasets.
Frequently Asked Questions
Why is appending a single row to a large DataFrame so slow in pandas?
Appending a single row forces pandas to rebuild the entire BlockManager structure that stores column data as contiguous NumPy arrays. In pandas/core/internals/managers.py, the concatenate_managers function creates new block arrays by copying data, resulting in O(n) time and memory usage per append operation. When repeated in a loop, this creates quadratic complexity.
What is the most memory-efficient way to add millions of rows to a DataFrame?
The most memory-efficient method is to accumulate rows as dictionaries or lists in a Python list, then create the DataFrame once using pd.DataFrame(rows). This avoids allocating intermediate DataFrames entirely. If processing in batches, accumulate small DataFrames in a list and call pd.concat(chunks, ignore_index=True) once at the end, which concatenates at the block level without Python-level loops.
Is df.loc[len(df)] = row faster than DataFrame.append?
Not in any way that matters. When the index must grow dynamically, df.loc[len(df)] = row incurs internal overhead similar to DataFrame.append, and both suffer from the same quadratic copying behavior in a loop. Assignment through .loc is only faster when you pre-allocate the DataFrame with a fixed index range and write to labels that already exist (df.loc[i] = row), because .loc then writes into existing NumPy arrays without rebuilding the BlockManager. DataFrame.append was removed in pandas 2.0 and should be avoided entirely.
How does pd.concat achieve better performance than iterative appends?
pd.concat in pandas/core/reshape/concat.py delegates to concatenate_managers in pandas/core/internals/managers.py, which stacks underlying NumPy buffers at the C level. By operating on blocks rather than individual rows, it minimizes Python interpreter overhead and memory allocations. When concatenating a list of DataFrames, pandas can reuse existing block structures and perform a single allocation for the result, rather than copying the entire dataset on every append operation.