How to Read Excel Files Faster in Pandas: Optimizing read_excel Performance

Use the calamine engine with python-calamine installed, limit data with usecols and nrows, and keep openpyxl in read-only mode via engine_kwargs to speed up pandas.read_excel on large workbooks.

The pandas.read_excel function in the pandas-dev/pandas repository provides a convenient interface for loading Excel files into DataFrames, but the default engines are pure Python and often become the bottleneck on large workbooks. Choosing a faster engine and limiting how much of the sheet is parsed can cut read times substantially.
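Before changing engines or parameters, it helps to measure the current read time so that any improvement can be confirmed on your own workbook. The sketch below is a minimal timing harness; time_call and compare_engines are hypothetical helper names, and the engine list assumes both openpyxl and python-calamine are installed.

```python
import time
from statistics import median


def time_call(func, repeat=3):
    """Run func() `repeat` times and return the median wall-clock seconds."""
    durations = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return median(durations)


def compare_engines(path, engines=("openpyxl", "calamine")):
    """Time pd.read_excel on `path` once per engine; returns {engine: seconds}."""
    import pandas as pd  # imported lazily so time_call stays usable on its own

    return {
        engine: time_call(lambda e=engine: pd.read_excel(path, engine=e))
        for engine in engines
    }
```

Calling compare_engines("large_file.xlsx") returns a dictionary of median timings per engine. The median is used rather than the mean so that a single slow run caused by a cold disk cache does not dominate the result.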

How pandas.read_excel Works Under the Hood

Under the hood, read_excel is a thin wrapper defined in pandas/io/excel/_base.py (lines 64-84) that instantiates an ExcelFile object and delegates parsing to engine-specific reader classes. When no engine is given, pandas selects one from the file format, typically openpyxl for .xlsx files and xlrd for legacy .xls files. Both are pure-Python implementations that can struggle with large datasets, whereas calamine calls into a compiled Rust library and pyxlsb reads the binary .xlsb format directly.
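The default engine choice can be approximated by a lookup on the file extension. The sketch below is a simplified illustration, not the pandas implementation: guess_default_engine and DEFAULT_ENGINES are hypothetical names, and pandas itself also inspects the file contents and honors its own configuration options.

```python
from pathlib import Path

# Simplified extension -> default engine table (an assumption for illustration;
# pandas also inspects the file contents and honors its own config options).
DEFAULT_ENGINES = {
    ".xls": "xlrd",
    ".xlsx": "openpyxl",
    ".xlsm": "openpyxl",
    ".xlsb": "pyxlsb",
    ".ods": "odf",
}


def guess_default_engine(path):
    """Return the engine pandas would typically pick for this extension, or None."""
    return DEFAULT_ENGINES.get(Path(path).suffix.lower())
```

Note that calamine never appears in this table: it is only used when you pass engine="calamine" explicitly.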

Faster Ways to Read Excel in Pandas

Use the Calamine Engine for Rust-Backed Performance

For most workloads the fastest option is the calamine engine (available in pandas 2.2 and later), which calls into python-calamine, a Python binding for the compiled Rust calamine library. Implemented in pandas/io/excel/_calamine.py (lines 20-90), this engine reads .xls, .xlsx, .xlsm, .xlsb, and .ods files and is typically much faster than the pure-Python alternatives.

Install the optional dependency and force the engine:

# Requires: pip install python-calamine (and pandas 2.2 or later)
import pandas as pd

df = pd.read_excel(
    "large_file.xlsx",
    engine="calamine",
    usecols="A:D",      # Only columns A through D
    nrows=5000,         # First 5,000 rows only
)
print(df.shape)  # (5000, 4) if the sheet has at least 5,000 data rows
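Because python-calamine is optional, code that runs on several machines may want to fall back to the default engine when it is missing. The sketch below assumes the package installs a module named python_calamine; preferred_engine and read_fast are hypothetical helper names.

```python
from importlib.util import find_spec


def preferred_engine():
    """Return "calamine" if python-calamine is importable, else None.

    Passing engine=None lets pandas choose its default engine.
    """
    # python-calamine installs a module named `python_calamine`.
    return "calamine" if find_spec("python_calamine") is not None else None


def read_fast(path, **kwargs):
    """Hypothetical wrapper: read_excel with calamine when it is available."""
    import pandas as pd  # lazy import keeps preferred_engine() dependency-free

    return pd.read_excel(path, engine=preferred_engine(), **kwargs)
```

With this in place, read_fast("large_file.xlsx", usecols="A:D") uses calamine where it is installed and behaves like a plain read_excel call elsewhere. The check does not verify the pandas version, so on pandas older than 2.2 the calamine engine would still be rejected.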

Limit Data Scope with usecols and nrows

Regardless of engine choice, reduce parsing overhead by reading only the data you need. The usecols parameter accepts an Excel-style column range string, a list of indices or column names, or a callable, while nrows restricts the number of rows parsed. With nrows the reader can stop early instead of walking the entire worksheet, and usecols keeps unneeded columns out of the resulting DataFrame.

import pandas as pd

# Read specific columns by index and limit rows
df = pd.read_excel(
    "data.xlsx",
    usecols=[0, 2, 4],  # First, third, and fifth columns
    nrows=1000,         # First 1,000 data rows only
)
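The callable form of usecols is useful when column positions vary between files but header names are stable. In the sketch below, pandas calls the predicate once per column name and keeps the columns for which it returns True; the column names in WANTED and the helpers keep_column and load_subset are hypothetical.

```python
WANTED = {"order_id", "price", "quantity"}  # hypothetical column names


def keep_column(name):
    """usecols predicate: keep a column if its normalized header is wanted."""
    return str(name).strip().lower() in WANTED


def load_subset(path, max_rows=1000):
    """Read at most max_rows rows, keeping only the wanted columns."""
    import pandas as pd  # lazy import so keep_column can be tested on its own

    return pd.read_excel(path, usecols=keep_column, nrows=max_rows)
```

Normalizing the header with strip() and lower() makes the selection tolerant of stray whitespace and inconsistent capitalization in the spreadsheet.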

Enable Read-Only Mode for Streaming

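When using the openpyxl engine (the default for modern .xlsx files), the workbook should be opened in read_only mode, a streaming, low-memory configuration. Recent pandas versions already pass read_only=True and data_only=True to openpyxl by default, so setting them through engine_kwargs mainly makes the behavior explicit and guards against it being overridden. The reader is implemented in pandas/io/excel/_openpyxl.py, and read-only mode significantly reduces overhead for very large spreadsheets where you only need to iterate once. The engine_kwargs parameter is forwarded to the underlying engine as shown in pandas/io/excel/_base.py (lines 197-210).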

import pandas as pd

df = pd.read_excel(
    "large_file.xlsx",
    engine="openpyxl",
    engine_kwargs={"read_only": True, "data_only": True},
    usecols=[0, 2, 4],
)
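To reason about what openpyxl actually receives, it can help to model how engine_kwargs combines with the reader's defaults. The sketch below is an assumption based on reading recent pandas source, not a documented contract: the default values and the helper effective_openpyxl_kwargs are illustrative, so verify them against the pandas version you run.

```python
# Defaults that recent pandas versions appear to pass to openpyxl.load_workbook
# (an assumption based on reading the pandas source; verify for your version).
PANDAS_OPENPYXL_DEFAULTS = {"read_only": True, "data_only": True, "keep_links": False}


def effective_openpyxl_kwargs(engine_kwargs=None):
    """Merge user engine_kwargs over the assumed defaults; user values win."""
    return {**PANDAS_OPENPYXL_DEFAULTS, **(engine_kwargs or {})}
```

Under this model, passing {"read_only": True} changes nothing, while passing {"read_only": False} would switch openpyxl to loading the whole workbook into memory, which is the case to avoid for large files.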

Prefer Binary Engines for XLSB Files

For files saved in the binary Excel format (.xlsb), avoid conversion overhead by using the pyxlsb engine (pip install pyxlsb). It reads the binary format directly rather than parsing XML, so large binary workbooks do not need to be converted to .xlsx first.

import pandas as pd

df = pd.read_excel(
    "big_file.xlsb",
    engine="pyxlsb",
    usecols="A:C",
)

Summary

  • pandas.read_excel delegates to engine-specific readers defined in pandas/io/excel/_base.py, with default pure-Python engines often creating performance bottlenecks.
  • The calamine engine (from pandas/io/excel/_calamine.py) is usually the fastest read_excel option because it relies on a compiled Rust backend.
  • Reduce I/O by specifying usecols and nrows to parse only required data subsets.
  • Keep read_only mode on in engine_kwargs when using openpyxl so large files are streamed with lower memory overhead.
  • Use pyxlsb for binary .xlsb files to avoid XML parsing overhead.

Frequently Asked Questions

What is the fastest engine for read_excel in pandas?

The Calamine engine is currently the fastest option for most Excel formats. Implemented in pandas/io/excel/_calamine.py, it uses the python-calamine library with a Rust backend to read .xlsx, .xls, .xlsm, .xlsb, and .ods files significantly faster than pure-Python alternatives like openpyxl or xlrd.

How do I install the calamine engine for pandas?

Calamine is an optional dependency. Install it with pip install python-calamine, then specify engine="calamine" in your pd.read_excel() call (this engine requires pandas 2.2 or later). Pandas will use the CalamineReader class defined in pandas/io/excel/_calamine.py to parse the workbook.

Can I read only specific columns with read_excel?

Yes. Use the usecols parameter to limit which columns are parsed. You can pass a column range string (e.g., "A:D"), a list of indices (e.g., [0, 2, 4]), or a callable. This keeps unneeded columns out of the DataFrame, which reduces memory usage and conversion time for large files.

What is the difference between openpyxl and calamine engines?

Openpyxl is the default pure-Python engine for modern .xlsx files, implemented in pandas/io/excel/_openpyxl.py. It also backs pandas' Excel write support but can be slow with large datasets. Calamine is a Rust-backed engine (via python-calamine) implemented in pandas/io/excel/_calamine.py that prioritizes read performance and memory efficiency but is read-only. Choose Calamine for speed, Openpyxl for compatibility and write operations.
