how-to-guide

How to Spark Explode Multiple Columns Efficiently and Avoid Cartesian Products

February 16, 2026 apache/spark ↗

Use arrays_zip to combine multiple array columns into a single array of structs, then apply explode once to avoid the Cartesian product and performance degradation caused by multiple generators.

Exploding multiple columns in Spark SQL is a common requirement when normalizing nested data structures, but calling explode on each column separately creates separate generator nodes that force a Cartesian product of the results. According to the Apache Spark source code, the most efficient approach is to zip the arrays into a single structure and explode it once, leveraging the ArraysZip expression and ExplodeBase generator implementation.

Why the Naive Approach Fails

When you invoke explode on several columns within the same SELECT clause, Spark instantiates multiple generator expressions. The Catalyst optimizer must then construct a Cartesian product of these generators to combine their outputs, resulting in an exponential explosion of rows and severe performance degradation.

The ExplodeBase class in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala implements the core logic for all explode-style generators. When multiple generators exist in the same projection, the planner cannot fuse them into a single Whole-Stage Code Generation unit, forcing interpretive overhead and memory amplification.

The Solution: Zip Then Explode

The efficient pattern uses arrays_zip to merge multiple array columns element-wise into a single array of structs, then applies explode exactly once. This creates only one generator node, allowing the optimizer to produce a clean Generate(Explode, ...) plan without Cartesian products.

How arrays_zip Works Under the Hood

The arrays_zip function is defined in sql/api/src/main/scala/org/apache/spark/sql/functions.scala at line 9611. It registers the ArraysZip expression, which lives in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala around line 462.

At runtime, ArraysZip constructs a GenericArrayData of InternalRow objects, where each row contains the i-th element from every input array. This struct array maintains positional alignment between the original columns without requiring joins or shuffles.

Why a Single Generator Performs Better

By reducing the operation to a single explode call, you invoke the ExplodeBase implementation only once per input row. The Catalyst optimizer can fuse this generator into a single Whole-Stage Code Generation unit, producing tight Java bytecode loops that process the zipped array sequentially. This eliminates the interpretive overhead and memory amplification caused by multiple generator nodes.

Practical Code Examples for Spark Explode Multiple Columns

Scala DataFrame API

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{arrays_zip, explode, col}

val spark = SparkSession.builder.appName("MultiExplode").getOrCreate()
import spark.implicits._

// Sample data with two array columns
val df = Seq(
  (1, Array("a", "b"), Array(10, 20)),
  (2, Array("c", "d", "e"), Array(30, 40, 50))
).toDF("id", "letters", "numbers")

// Efficient multi-column explode using arrays_zip
val exploded = df
  .withColumn("zipped", arrays_zip(col("letters"), col("numbers")))
  .withColumn("exploded", explode(col("zipped")))
  .select(
    col("id"),
    col("exploded.letters").as("letter"),
    col("exploded.numbers").as("number")
  )

exploded.show()

Output:


+---+------+------+
| id|letter|number|
+---+------+------+
|  1|     a|    10|
|  1|     b|    20|
|  2|     c|    30|
|  2|     d|    40|
|  2|     e|    50|
+---+------+------+

PySpark Implementation

from pyspark.sql import SparkSession
from pyspark.sql.functions import arrays_zip, explode, col

spark = SparkSession.builder.appName("MultiExplode").getOrCreate()

df = spark.createDataFrame([
    (1, ["a", "b"], [10, 20]),
    (2, ["c", "d", "e"], [30, 40, 50])
], ["id", "letters", "numbers"])

exploded = (df
    .withColumn("zipped", arrays_zip(col("letters"), col("numbers")))
    .withColumn("exploded", explode(col("zipped")))
    .select(
        "id",
        col("exploded.letters").alias("letter"),
        col("exploded.numbers").alias("number")
    ))

exploded.show()

Pure SQL Syntax

SELECT
  id,
  exploded.letters AS letter,
  exploded.numbers AS number
FROM (
  SELECT
    id,
    explode(arrays_zip(letters, numbers)) AS exploded
  FROM sample_table
) t

Handling Map Columns

When working with map columns instead of arrays, convert them to array entries before zipping. The map_entries function transforms a map into an array of structs containing key and value fields.

import org.apache.spark.sql.functions.{map_entries, arrays_zip, explode, col}

val dfWithMaps = ... // DataFrame with map columns m1 and m2

val explodedMaps = dfWithMaps
  .withColumn("e1", map_entries(col("m1")))
  .withColumn("e2", map_entries(col("m2")))
  .withColumn("zipped", arrays_zip(col("e1"), col("e2")))
  .withColumn("exploded", explode(col("zipped")))
  .select(
    col("exploded.e1.key").as("key1"),
    col("exploded.e1.value").as("value1"),
    col("exploded.e2.key").as("key2"),
    col("exploded.e2.value").as("value2")
  )

Optimizing with inline for Arrays of Structs

If your data is already an array of structs rather than separate array columns, use inline or inline_outer to skip the zipping step entirely. This function expands the struct fields directly into columns without intermediate struct creation.

// If arr_col is ArrayType(StructType(...))
df.selectExpr("id", "inline(arr_col) as (col1, col2, col3)")

According to sql/api/src/main/scala/org/apache/spark/sql/functions.scala, inline utilizes the same ExplodeBase hierarchy as explode but eliminates the overhead of the ArraysZip transformation, making it the most efficient pattern for pre-structured array data.

Key Source Files in Apache Spark

The multi-column explode pattern relies on these implementation files in the Apache Spark repository:

sql/api/src/main/scala/org/apache/spark/sql/functions.scala – Defines the public API for explode (line 8775), arrays_zip (line 9611), and inline, providing the entry points used in DataFrame operations.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala – Contains ExplodeBase, the abstract class that implements the core logic for all explode-style generators, including type checking and runtime row expansion.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala – Implements the ArraysZip expression (around line 462), which constructs the GenericArrayData of InternalRow objects that power the arrays_zip function.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala – Registers the SQL function names (explode, arrays_zip, inline) and binds them to their corresponding expression classes at line 366.
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala – Provides the select, withColumn, and selectExpr methods used to construct the logical plans that invoke these expressions.

Summary

Never use multiple explode calls in the same projection, as this creates separate generator nodes that force a Cartesian product and exponential row expansion.
Use arrays_zip to merge multiple array columns into a single array of structs, then explode once to preserve positional alignment without joins.
Reference the source: The ArraysZip expression in collectionOperations.scala and ExplodeBase in generators.scala enable Whole-Stage Code Generation for optimal performance.
For maps, convert to entries using map_entries before zipping.
For existing struct arrays, use inline to skip the zip step entirely and eliminate struct creation overhead.

Frequently Asked Questions

What happens if I call explode on multiple columns without zipping?

Calling explode on multiple columns in the same SELECT clause creates multiple generator expressions in the logical plan. The Catalyst optimizer must compute the Cartesian product of these generators, resulting in a multiplicative explosion of output rows and significantly increased execution time and memory usage.

Can I use explode_outer with arrays_zip?

Yes. You can substitute explode_outer for explode when working with zipped arrays. This preserves rows where the zipped array is empty or null by returning NULL values for the struct fields, maintaining outer join semantics while still avoiding the Cartesian product penalty of multiple generators.

How do I handle columns of different lengths when exploding multiple columns?

The arrays_zip function aligns elements by position and stops at the length of the shortest array. To retain all elements from longer arrays, pad the shorter arrays with null values using concat and array functions before zipping, or handle the remaining elements in a separate operation.

When should I use inline instead of arrays_zip?

Use inline or inline_outer when your data is already an array of structs rather than separate array columns. This function expands the struct fields directly into columns without requiring the intermediate arrays_zip step, eliminating struct creation overhead and providing the most efficient path for pre-structured array data.

Have a question about this repo?

These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:

Share the following with your agent to get started:

curl -s "https://instagit.com/install.md"

Add to your MCP client configuration:

{
  "mcpServers": {
    "instagit": {
      "command": "npx",
      "args": ["-y", "instagit@latest"]
    }
  }
}

Ask your agent:

"Use Instagit MCP to understand how apache/spark works."

Works with

Claude Codex Cursor VS Code OpenClaw Any MCP Client

Maintain an open-source project? Get it listed too →