Pandas Read JSON File: Efficient Methods for Nested Structures

The most efficient way to read a JSON file with nested structures in pandas is to parse it and pass the records to pd.json_normalize() without normalization arguments, which routes through the _simple_json_normalize fast path in pandas/io/json/_normalize.py. pd.read_json() on its own parses the file but leaves nested objects as dict values inside a single column, and complex extractions require explicit pd.json_normalize() with targeted record_path and meta parameters.

When working with the pandas-dev/pandas codebase, ingesting hierarchical JSON data involves two separate pieces. The read_json implementation delegates to a JsonReader class that handles parsing and optional chunking, while json_normalize inspects its own arguments and selects between a high-speed single-pass flattening routine and the general record-extraction algorithm.

How Pandas Selects the Flattening Path for Nested Structures

Inside pandas/io/json/_normalize.py, the json_normalize() function inspects its arguments to determine the processing route. When record_path, meta, meta_prefix, record_prefix, and max_level are all left at None, the implementation takes an early-return branch that builds the DataFrame from _simple_json_normalize rather than running the general extraction code. read_json() in pandas/io/json/_json.py does not perform this step: it parses the file and leaves nested objects as dict values. Exact line numbers for these branches shift between pandas releases.

The _simple_json_normalize helper (lines 558-572 in pandas/io/json/_normalize.py) performs a single recursive walk that flattens dictionaries, placing top-level scalar keys first and the flattened nested keys after them. Unlike the general path, it skips record pulling and metadata bookkeeping, making it ideal for the standard pattern pd.json_normalize(records).

The Automatic Fast Path

For most nested JSON objects where you simply need a flattened table, let pandas handle the optimization automatically. When json_normalize detects the "basic case" (no record_path, meta, prefix, or max_level arguments supplied), it routes data through _simple_json_normalize instead of the more general extraction code.

This fast path maintains key order using _normalize_json_ordered and completes the transformation in a single pass. Parsing the file itself is a separate step: read_json uses the vendored ujson C engine located in pandas/_libs/src/vendored/ujson/python/ujson.c by default.
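The difference between parsing and flattening can be seen in a minimal sketch. The sample records and the variable names (text, raw, flat) are made up for illustration; only pd.read_json and pd.json_normalize are real pandas calls.

```python
import io
import json

import pandas as pd

# Hypothetical sample data: two records with two levels of nesting.
text = json.dumps([
    {"id": 1, "info": {"author": "Alice", "tags": {"lang": "en"}}},
    {"id": 2, "info": {"author": "Bob", "tags": {"lang": "fr"}}},
])

# read_json parses the text; each nested object stays a dict in the "info" column.
raw = pd.read_json(io.StringIO(text), orient="records")

# json_normalize with no record_path/meta/prefix/max_level arguments
# flattens every nested dict into dot-separated columns.
flat = pd.json_normalize(json.loads(text))

print(list(raw.columns))
print(list(flat.columns))
```

If the data is already in memory as Python objects, passing it straight to pd.json_normalize avoids a second parse.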

When to Use Explicit json_normalize

If your data requires extracting specific sub-arrays (like an "items" list inside each record) while preserving parent metadata fields, call pd.json_normalize() directly. This function, implemented starting at line 300 in pandas/io/json/_normalize.py, builds upon the same low-level utilities but adds support for:

  • record_path: Target specific nested lists for extraction
  • meta: Include parent fields as columns in the final frame
  • record_prefix and meta_prefix: Prevent column name collisions
  • max_level: Limit flattening recursion depth

The implementation uses _pull_field and _pull_records to fetch nested data, then flattens each record with nested_to_record (lines 70-78 in _normalize.py).

Streaming Large Files

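For datasets exceeding available memory, use the chunking interface provided by JsonReader. Passing chunksize together with lines=True makes pd.read_json return a JsonReader instead of a DataFrame. In pandas/io/json/_json.py, the JsonReader class (lines 990-1014) implements an iterator protocol that yields DataFrame chunks without loading the entire file into memory.

Chunked reading only applies to line-delimited JSON, which is processed sequentially to keep the memory footprint small. The optional pyarrow engine (engine="pyarrow") likewise accepts only line-delimited input; check your pandas version before combining it with chunksize.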
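As a rough, self-contained sketch of the chunking interface, the snippet below writes a small line-delimited file to a temporary directory and aggregates it chunk by chunk. The file name events.jsonl, the column names, and the chunk size of 4 are arbitrary choices for illustration.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample rows, written as one JSON object per line.
rows = [{"id": i, "value": i * 10} for i in range(10)]

chunk_sizes = []
total = 0

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events.jsonl")
    with open(path, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")

    # chunksize requires lines=True; the result is a JsonReader, not a DataFrame.
    # Using it as a context manager closes the file handle when iteration ends.
    with pd.read_json(path, lines=True, chunksize=4) as reader:
        for chunk in reader:
            chunk_sizes.append(len(chunk))
            total += int(chunk["value"].sum())

print(chunk_sizes, total)
```

Keeping only running aggregates (here a sum) rather than the chunks themselves is what keeps memory bounded.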

Code Examples

Flat Line-Delimited JSON (Fastest Path)

import pandas as pd

# File contains one JSON object per line
df = pd.read_json("data/line_delimited.json", lines=True, orient="records")
print(df.head())

Behind the scenes: read_json hands the lines to JsonReader._read_ujson, which builds the DataFrame via FrameParser without normalization overhead. Without chunksize, the whole file is still read in one go.

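Nested JSON with Automatic Flattening

import json

import pandas as pd

# Parse the file first, then flatten; no extra arguments triggers the fast path
with open("data/nested.json", encoding="utf-8") as fh:
    records = json.load(fh)

df = pd.json_normalize(records)
print(df.head())

This call takes the early-return branch in json_normalize, calling _simple_json_normalize for single-pass recursive flattening. Calling pd.read_json("data/nested.json") alone would instead leave each nested object as a dict in one column.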

Extracting Sub-Lists with Metadata

import pandas as pd

data = [
    {"id": 1, "info": {"author": "Alice"}, "items": [{"sku": "A", "qty": 2},
                                                    {"sku": "B", "qty": 5}]},
    {"id": 2, "info": {"author": "Bob"},   "items": [{"sku": "C", "qty": 1}]}
]

df = pd.json_normalize(
    data,
    record_path="items",
    meta=["id", ["info", "author"]],
    record_prefix="item_",
    meta_prefix="meta_",
)

The function uses targeted recursive extraction via _pull_records, traversing only the required branches rather than fully flattening the entire hierarchy.
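To make the shape of the result concrete, here is the same call as a self-contained sketch with the resulting columns inspected. The data is the illustrative sample from above, not real output from any project.

```python
import pandas as pd

data = [
    {"id": 1, "info": {"author": "Alice"},
     "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 5}]},
    {"id": 2, "info": {"author": "Bob"},
     "items": [{"sku": "C", "qty": 1}]},
]

df = pd.json_normalize(
    data,
    record_path="items",              # one output row per element of "items"
    meta=["id", ["info", "author"]],  # parent fields repeated on each row
    record_prefix="item_",
    meta_prefix="meta_",
)

# Record columns come first, followed by the meta columns.
print(list(df.columns))
print(len(df))
```

Each parent's id and author are repeated once per item, so the first record contributes two rows and the second contributes one.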

Streaming with Chunk Processing

import pandas as pd

reader = pd.read_json("big_file.json", lines=True, chunksize=100_000)
for chunk in reader:
    # Process each chunk independently
    print(chunk.shape)

The JsonReader object yields DataFrame chunks through its __next__ method, so peak memory is bounded by the chunk size rather than the file size.

Summary

  • Default automatic flattening: Call pd.json_normalize(records) without normalization arguments to trigger the _simple_json_normalize fast path in pandas/io/json/_normalize.py (lines 558-572). pd.read_json(path) alone parses the file but does not flatten nested objects.
  • Complex extractions: Use pd.json_normalize() with explicit record_path and meta parameters for targeted recursive extraction of nested sub-arrays.
  • Memory efficiency: Process massive files using chunksize and lines=True with JsonReader to stream line-delimited JSON without loading the entire dataset.
  • Engine choice: read_json defaults to the ujson C engine in pandas/_libs/src/vendored/ujson/python/ujson.c; the optional pyarrow engine is available for line-delimited files.

Frequently Asked Questions

What makes _simple_json_normalize faster than regular json_normalize?

_simple_json_normalize (lines 558-572 in pandas/io/json/_normalize.py) performs a single recursive walk optimized for dictionary flattening without constructing intermediate metadata dictionaries or handling prefix arguments. The standard json_normalize supports complex field extraction and metadata propagation, which requires additional overhead for parameter processing and recursive record pulling.

When should I use lines=True with pd.read_json?

Use lines=True when your file contains line-delimited JSON (one JSON object per line). On its own, this setting still reads the whole file; combined with chunksize, it allows JsonReader to read the file sequentially, significantly reducing memory usage compared to loading the entire JSON array structure at once. This combination is the most memory-efficient configuration for large datasets in the pandas/io/json/_json.py implementation.

How does pandas handle deeply nested objects during flattening?

The general json_normalize path uses nested_to_record (located around lines 70-78 in pandas/io/json/_normalize.py) to recursively traverse nested dictionaries, while the _simple_json_normalize fast path does the same job through _normalize_json_ordered. Both convert nested keys into dot-separated column names (e.g., info.author), and the separator can be changed with the sep parameter. For json_normalize, you can control recursion depth using the max_level parameter, which switches to the general path; the automatic fast path flattens all levels unconditionally.
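The effect of max_level and sep can be illustrated with a short sketch. The record below is invented sample data; the parameters are the documented pd.json_normalize arguments.

```python
import pandas as pd

# Hypothetical record with two levels of nesting under "info".
record = {"id": 1, "info": {"author": "Alice", "address": {"city": "Paris"}}}

# No arguments: every level is flattened.
full = pd.json_normalize(record)

# max_level=1: only the first level of nesting is expanded, so
# "info.address" remains a dict-valued column.
shallow = pd.json_normalize(record, max_level=1)

# sep changes the separator used when joining nested keys.
underscored = pd.json_normalize(record, sep="_")

print(list(full.columns))
print(list(shallow.columns))
print(list(underscored.columns))
```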

Can I process JSON files larger than system memory?

Yes, by using the chunksize parameter together with lines=True in pd.read_json(), which returns a JsonReader iterator implemented in pandas/io/json/_json.py (lines 990-1014). This approach yields DataFrame chunks of the specified row count without loading the entire file, allowing you to process line-delimited JSON files far larger than RAM by iterating through chunks in a for loop, provided you keep only per-chunk results or running aggregates.
