How to Read Excel Files Faster in Pandas: Optimizing read_excel Performance
Use the calamine engine with python-calamine installed, limit data with usecols and nrows, and enable read_only mode in engine_kwargs to significantly speed up pandas.read_excel for large workbooks.
The pandas.read_excel function in the pandas-dev/pandas repository provides a convenient interface for loading Excel files into DataFrames, but its default settings often use pure-Python engines that become a bottleneck on large workbooks. Choosing a faster engine and limiting how much data is parsed can cut read times substantially, in some cases from minutes to seconds.
How pandas.read_excel Works Under the Hood
Under the hood, read_excel is a thin wrapper defined in pandas/io/excel/_base.py (lines 64-84) that instantiates an ExcelFile object and delegates parsing to engine-specific reader classes. The function automatically selects an engine based on the file format, typically defaulting to openpyxl for .xlsx files or xlrd for legacy .xls files. These default engines are pure Python implementations that can struggle with large datasets, whereas the calamine engine relies on a compiled Rust backend to reduce parsing overhead.
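The default selection described above can be pictured as a simple lookup from file extension to engine name. The sketch below is a simplified illustration, not pandas's actual code: the helper name guess_default_engine and the mapping table are hypothetical, and pandas itself also inspects the file contents rather than relying on the extension alone.

```python
from pathlib import Path
from typing import Optional

# Simplified, hypothetical mapping of file suffix to the engine that
# pandas documents as its default for that format.
_DEFAULT_ENGINE_BY_SUFFIX = {
    ".xlsx": "openpyxl",
    ".xlsm": "openpyxl",
    ".xls": "xlrd",
    ".xlsb": "pyxlsb",
    ".ods": "odf",
}


def guess_default_engine(path: str) -> Optional[str]:
    """Return the engine name most likely used by default for this file name.

    Returns None when the suffix is not a known spreadsheet extension.
    """
    return _DEFAULT_ENGINE_BY_SUFFIX.get(Path(path).suffix.lower())


print(guess_default_engine("report.XLSX"))  # openpyxl
print(guess_default_engine("legacy.xls"))   # xlrd
```

Passing engine= explicitly to pd.read_excel bypasses this automatic choice, which is what the sections below rely on.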
Faster Ways to Read Excel in Pandas
Use the Calamine Engine for Rust-Backed Performance
The fastest way to read Excel files in pandas is often the calamine engine, which calls into the compiled python-calamine library (Python bindings to the Rust calamine crate). Implemented in pandas/io/excel/_calamine.py (lines 20-90) and available from pandas 2.2 onward, this engine reads .xls, .xlsx, .xlsm, .xlsb, and .ods files and is typically much faster than the pure-Python alternatives.
Install the optional dependency and force the engine:
import pandas as pd

# Requires: pip install python-calamine
df = pd.read_excel(
    "large_file.xlsx",
    engine="calamine",
    usecols="A:D",  # Only columns A through D
    nrows=5000,     # First 5,000 rows only
)
print(df.shape)  # e.g. (5000, 4) if the sheet has at least 5,000 data rows
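Because python-calamine is optional, a script that hard-codes engine="calamine" will fail on machines where it is not installed. A minimal sketch of a fallback follows; the helper names pick_engine and read_fast are hypothetical, and the sketch assumes that passing engine=None lets pandas choose its default engine.

```python
import importlib.util
from typing import Optional


def pick_engine() -> Optional[str]:
    """Return "calamine" when python-calamine is importable, else None."""
    if importlib.util.find_spec("python_calamine") is not None:
        return "calamine"
    return None  # None lets pandas fall back to its default engine


def read_fast(path: str, **kwargs):
    """Hypothetical wrapper: read with calamine if available, else the default."""
    import pandas as pd  # imported lazily so pick_engine works without pandas

    return pd.read_excel(path, engine=pick_engine(), **kwargs)
```

A call such as read_fast("large_file.xlsx", usecols="A:D", nrows=5000) then behaves the same everywhere and only differs in speed.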
Limit Data Scope with usecols and nrows
Regardless of engine choice, reduce parsing overhead by reading only the data you need. The usecols parameter accepts column letters, indices, column names, or a callable, while nrows restricts the number of rows parsed. Limiting rows lets the reader stop early instead of walking the whole worksheet, and limiting columns reduces how much data is converted into the DataFrame.
import pandas as pd

# Read specific columns by index and limit rows
df = pd.read_excel(
    "data.xlsx",
    usecols=[0, 2, 4],  # First, third, and fifth columns
    nrows=1000,
)
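The callable form of usecols is useful when the wanted columns are easier to describe by name than by position: pandas evaluates the callable against each column name and keeps the column when it returns True. The sketch below assumes a hypothetical workbook with an id column and several sales_ columns; keep_column and load_sales are illustrative names, not part of pandas.

```python
def keep_column(name) -> bool:
    """Keep the identifier column and anything starting with 'sales_'."""
    label = str(name).lower()  # header cells are not always strings
    return label == "id" or label.startswith("sales_")


def load_sales(path: str, max_rows: int = 1000):
    """Read only the matching columns and at most max_rows rows."""
    import pandas as pd  # imported lazily so keep_column is usable on its own

    return pd.read_excel(path, usecols=keep_column, nrows=max_rows)
```

Writing the predicate as a named function rather than a lambda makes it easy to check in isolation before pointing it at a large file.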
Enable Read-Only Mode for Streaming
When using the openpyxl engine (the default for modern .xlsx files), read_only mode opens files in a streaming, low-memory configuration, which matters for very large spreadsheets that only need to be iterated once. In recent pandas releases the reader in pandas/io/excel/_openpyxl.py already passes read_only=True and data_only=True to openpyxl by default, so setting them through engine_kwargs mainly makes the intent explicit and guards against the defaults being overridden. The engine_kwargs parameter is forwarded to the underlying engine as shown in pandas/io/excel/_base.py (lines 197-210).
import pandas as pd

df = pd.read_excel(
    "large_file.xlsx",
    engine="openpyxl",
    engine_kwargs={"read_only": True, "data_only": True},
    usecols=[0, 2, 4],
)
Prefer Binary Engines for XLSB Files
For files saved in the binary Excel format (.xlsb), read them directly with the pyxlsb engine instead of first converting them to .xlsx. pyxlsb parses the binary records rather than XML, which can help with large binary workbooks. The calamine engine also reads .xlsb files, so it is worth comparing the two on your own data.
import pandas as pd

df = pd.read_excel(
    "big_file.xlsb",
    engine="pyxlsb",
    usecols="A:C",
)
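How much each of these options helps depends on the workbook, so it is worth measuring on your own files. The helper below is a minimal timing sketch using only the standard library; best_of is a hypothetical name, and the file name in the usage comment is a placeholder.

```python
import time
from typing import Any, Callable


def best_of(func: Callable[[], Any], repeat: int = 3) -> float:
    """Call func `repeat` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Usage sketch (requires pandas and the engines being compared):
# import pandas as pd
# for engine in ("openpyxl", "calamine"):
#     seconds = best_of(lambda: pd.read_excel("large_file.xlsx", engine=engine))
#     print(f"{engine}: {seconds:.2f}s")
```

Taking the minimum of several runs reduces noise from disk caching and other processes, which otherwise makes the first read look slower than it is.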
Summary
- pandas.read_excel delegates to engine-specific readers through pandas/io/excel/_base.py, with default pure-Python engines often creating performance bottlenecks.
- The calamine engine (from pandas/io/excel/_calamine.py) is often the fastest option because it relies on a compiled Rust backend.
- Reduce I/O by specifying usecols and nrows to parse only required data subsets.
- With openpyxl, keep read_only mode enabled (it can be set explicitly through engine_kwargs) to stream large files with lower memory overhead.
- Use pyxlsb for binary .xlsb files to avoid XML parsing overhead.
Frequently Asked Questions
What is the fastest engine for read_excel in pandas?
The Calamine engine is currently the fastest option for most Excel formats. Implemented in pandas/io/excel/_calamine.py, it uses the python-calamine library with a Rust backend to read .xlsx, .xls, .xlsm, .xlsb, and .ods files significantly faster than pure-Python alternatives like openpyxl or xlrd.
How do I install the calamine engine for pandas?
Calamine is an optional dependency. Install it with pip install python-calamine, and make sure you are running pandas 2.2 or later. Once installed, specify engine="calamine" in your pd.read_excel() call. Pandas will then use the CalamineReader class defined in pandas/io/excel/_calamine.py to parse the workbook.
Can I read only specific columns with read_excel?
Yes. Use the usecols parameter to limit which columns are parsed. You can pass column letters (e.g., "A:D"), indices (e.g., [0, 2, 4]), or a callable function. This prevents the engine from processing unnecessary data, significantly reducing memory usage and parsing time for large files.
What is the difference between openpyxl and calamine engines?
Openpyxl is the default pure-Python engine for modern .xlsx files, implemented in pandas/io/excel/_openpyxl.py. It supports writing workbooks and can expose either formula text or the cached formula results, but it can be slow with large datasets. Calamine is a Rust-backed engine (via python-calamine) implemented in pandas/io/excel/_calamine.py that prioritizes read performance and memory efficiency but is read-only. Choose Calamine for speed, Openpyxl for compatibility and write operations.